An efficient XML query pattern mining algorithm for ebXML applications in e-commerce



INTRODUCTION
XML (Cunningham, 2005) has become the de facto standard for data representation and exchange in e-commerce. Its self-describing property allows XML to represent data without losing semantics, and its semi-structured nature allows it to model a wide variety of data. As a result, many e-commerce applications use XML and follow the ebXML specifications (Bio, 2003) to exchange their data over the Internet. Consequently, the rapid growth of XML data in e-commerce has provided the impetus to design and develop systems that can efficiently store and query XML data for ebXML applications. ebXML (Bio, 2003) is a set of specifications designed by OASIS (Moberg, 2007) for companies to exchange data in e-commerce. Together, these specifications enable a modular electronic business framework based on XML technology. By following the ebXML specifications, companies have a standard method to exchange business messages, communicate data, and define business rules in e-commerce. These business messages, data, and rules are described in XML and share the same data frame across companies. Therefore, most XML data in ebXML applications has the same standard data structure, and as a result most queries over this XML data may also have the same structure.

E-mail: apple@teamail.ltu.edu.tw, apple@mail.ltu.edu.tw. Tel: 886-4-23892088. Fax: 886-4-23895293. The author agrees that this article remains permanently open access under the terms of the Creative Commons Attribution License 4.0 International License.
Since XML data in ebXML applications can be treated as trees of elements, attributes, and text, queries written in XPath (Clark, 1999) and XQuery (Boag, 2010) are tree patterns with selection predicates on multiple elements that specify tree-structured relationships. Thus, matching tree patterns against XML data is a core operation in XML query evaluation. This operation can be expensive since it involves navigating the tree structure of XML data. As a result, research efforts (Kwon et al., 2008; Lu et al., 2005; Raj et al., 2007) have focused on the efficient evaluation of tree paths in XML queries.
Another approach (Bei et al., 2009; Chen et al., 2006; Gu et al., 2007; Yang et al., 2008) to improving XML query performance is to discover frequent XML query patterns and then design an index mechanism for these patterns or cache their results. Bei et al. (2009) and Yang et al. (2008) design a transaction summary data structure (that is, the global tree) that merges all XML user query patterns. On the global tree, the XML candidate query subtrees are generated and their frequencies are counted by executing the tree-join process or database scans. As a result, the frequent XML query patterns are discovered on the processed global tree. In addition, in order to reduce the number of XML candidate query subtrees, Bei et al. (2009) and Yang et al. (2008) use the minimum support constraint to prune the infrequent XML query patterns on the global tree.
The existing approaches (Bei et al., 2009; Chen et al., 2006; Gu et al., 2007; Yang et al., 2008) may not be suitable for discovering the frequent XML query patterns in ebXML applications and may thus degrade system performance. Bei et al. (2009) and Yang et al. (2008) generate the XML candidate query subtrees from the global tree and use costly containment testing to prune the invalid candidates for the queries. However, in ebXML applications, most XML queries have the same structure, so largely the same query trees are processed repeatedly. Also, in order to correctly count the frequencies of XML candidate query subtrees, the tree-join process or database scans are executed in their mining process. As a result, Bei et al. (2009) and Yang et al. (2008) still follow the traditional generate-and-test paradigm for XML query pattern mining and may not be suitable for ebXML applications.
This paper presents a novel algorithm, ebX²Miner, to mine the frequent XML query patterns for ebXML applications in e-commerce. ebX²Miner has the following advantages over the existing approaches. First, ebX²Miner exploits the characteristic of ebXML applications (that is, most XML queries have the same structure) and thus discovers the frequent XML query patterns with at most one database scan in the mining process. Although the existing algorithms can mine the frequent query patterns by constructing a tree model, they still need two database scans in order to correctly count the frequencies of candidate subtrees, which degrades system performance. Second, ebX²Miner encodes an XML query tree and stores its nodes' codes to enhance the mining performance. The key concept in ebX²Miner is that the codes of the leaf nodes of a user query tree can preserve the tree's structure information. This greatly reduces the effort of exploring the search space and the computing time.
The rest of this paper is organized as follows. Section 2 discusses previous work related to ebXML applications and XML query pattern mining. Section 3 formalizes the frequent XML query pattern mining problem addressed in this paper. Section 4 describes the details of the ebX²Miner algorithm. Section 5 compares the ebX²Miner algorithm with other existing XML query pattern mining algorithms. Section 6 shows the results of the performance study, and Section 7 presents the conclusion and further work.

LITERATURE REVIEW
In this section, some related works are reviewed, including the papers of Bei et al. (2009), Bio (2003), Green et al. (2005), Kim (2002) and Yang et al. (2008) on ebXML applications and frequent XML query pattern mining.
ebXML provides a modular suite of specifications that enables enterprises of any size and in any geographical location to conduct business over the Internet (Green et al., 2005; Kim, 2002). It purports to support the exchange and query of structured business documents between the applications of trading enterprises so as to support business processes within the trading partner organizations. Indeed, OASIS, one of the joint developers of ebXML, claims that ebXML takes advantage of cost-effective Internet technology and is built on EDI experience with input from the EDI community. Therefore, to use ebXML over the Internet, an industry needs to define and collect its business processes, scenarios, and company business profiles, and make them available through an industry ebXML registry (typically defined using UDDI). Structured business documents can then be exchanged and queried between trading parties using the automated flow and sequence of interactions that ebXML prescribes.
Many XML query pattern mining algorithms (Bei et al., 2009; Yang et al., 2008) have been proposed to discover the frequent XML query patterns. Yang et al. (2008) collect all XML user queries to construct a global tree (T-GQPT) and then employ a rightmost expansion enumeration on the T-GQPT tree to generate XML candidate query subtrees. The main idea of rightmost expansion is that a query tree containing k nodes is generated by appending a new node to the rightmost path of a frequent subtree containing (k-1) nodes. Thus, many infrequent k-node trees are not enumerated if their (k-1)-node subtrees are infrequent. In addition, to compute the frequency of each candidate query subtree, Yang et al. (2008) scan the database only when the candidate is a single-branch tree. Among these algorithms, FastXMiner (Yang et al., 2003) is the most efficient, since the frequency of a non-single-branch tree can be computed by joining the ID lists of its proper rooted subtrees. On the other hand, 2PXMiner (Yang et al., 2008) extends FastXMiner to discover the frequent XML query patterns that contain sibling repetitions. In order to speed up mining, 2PXMiner computes the upper-bound frequencies of XML candidate query subtrees and uses the minimum support constraint to prune the infrequent query subtrees early.
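The rightmost expansion step described above can be sketched generically as follows. This is only an illustrative sketch of the enumeration idea, not the implementation of Yang et al. (2008): the preorder `(depth, label)` tree representation, the function name `rightmost_extensions`, and the label set are assumptions made for the example, and no frequency test or global tree is involved.

```python
from typing import List, Tuple

# Assumed representation: a tree is a preorder list of (depth, label) pairs,
# with the root at depth 0.
Tree = List[Tuple[int, str]]


def rightmost_extensions(tree: Tree, labels: List[str]) -> List[Tree]:
    """Return every k-node tree obtained by appending one new node to the
    rightmost path of the given (k-1)-node tree."""
    last_depth = tree[-1][0]  # depth of the rightmost leaf
    candidates = []
    # The new node may hang under any node on the rightmost path, so its depth
    # ranges from 1 (child of the root) to last_depth + 1 (child of the leaf).
    for depth in range(1, last_depth + 2):
        for label in labels:
            candidates.append(tree + [(depth, label)])
    return candidates


# Hypothetical 2-node seed tree: book -> title.
seed = [(0, "book"), (1, "title")]
candidates = rightmost_extensions(seed, ["author"])
for candidate in candidates:
    print(candidate)
```

In a miner of this family, only the candidates whose (k-1)-node seed is frequent would be generated, which is what keeps many infrequent k-node trees from being enumerated.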
The VBUXMiner algorithm (Bei et al., 2008; Bei et al., 2009) also maintains a tree-like data structure, the CGTG tree, which merges all XML queries in order to discover the frequent XML query patterns. In Bei et al. (2008), all XML candidate query subtrees are enumerated based on the CGTG tree, whereas in Bei et al. (2009) only the candidates whose frequencies exceed the minimum support value are enumerated. Thus, in Bei et al. (2009), the infrequent nodes in the CGTG tree are pruned before the candidate subtrees are generated. Also, the nodes in the CGTG tree are joined with their ancestor nodes that have the same IDs. Therefore, VBUXMiner generates candidate subtrees directly from the CGTG tree without scanning the database. In sum, it discovers the frequent XML query patterns on the processed CGTG tree.
Bei et al. (2008, 2009) and Yang et al. (2008) still follow the traditional generate-and-test paradigm to mine the frequent XML query patterns and thus have the following drawbacks for ebXML applications in e-commerce. First, they employ the rightmost expansion technique to enumerate all XML candidate query subtrees on the global trees (that is, the T-GQPT and CGTG trees). This approach merges all path and subtree information of a user query tree into the global tree and thus incurs the high cost of the tree-join process or database scans during mining. Second, these algorithms use a great deal of system space to process XML query trees, which degrades their mining performance. Unlike Yang et al. (2008), Bei et al. (2009) accumulate the frequencies of XML candidate query subtrees directly from the CGTG tree by executing the tree-join process. Therefore, the approach of Bei et al. (2009) is more efficient than that of Yang et al. (2008). However, Bei et al. (2009) still spend a lot of system time executing the tree-join process to merge the path and subtree information when generating frequent XML query patterns on the CGTG tree.

PROBLEM STATEMENT
In this section, the problem to be solved is stated. The section begins by defining XML query trees, their corresponding rooted subtrees, XML query tree databases, and frequent XML query trees. Definition 1 defines an XML query tree. Definition 2 describes a rooted subtree of an XML query tree. Definition 3 describes an XML query tree database, and Definition 4 defines the problem addressed in this paper.
Definition 1: An XML query can be modeled as an unordered tree $T_i = \langle N_i, E_i \rangle$, where $N_i$ is the node set and $E_i$ is the edge set. Nodes $n \in N_i$ represent the elements, attributes, and string values in an XML query, and edges $e \in E_i$ represent the parent-child relationships denoted by "/".
Definition 2: Given an XML query tree $T_i = \langle N_i, E_i \rangle$ and an XML query rooted subtree $t_{ij}$, $t_{ij}$ is considered to be a rooted subtree of $T_i$ iff: (1) $Root(t_{ij}) = Root(T_i)$, where $Root(t_{ij})$ and $Root(T_i)$ are the functions that return the root nodes of $t_{ij}$ and $T_i$ respectively; and (2) the edges of $t_{ij}$ belong to those of $T_i$.
Definition 3: An XML tree database $D$ contains multiple XML query trees, each of which represents a transaction associated with its transaction ID.
Definition 4: Given an XML tree database $D$ and a minimum support value $m$ in the range $(0, 1]$, the frequent XML query pattern mining problem is to find the set $S$ of rooted subtrees $t_{ij}$ such that $sup(t_{ij}) \geq m$ holds for each $t_{ij}$ in $S$, where $sup(t_{ij})$ = (the number of occurrences of $t_{ij}$ in $D$) / (the number of XML query trees in $D$).
Definition 1 defines an XML query as a tree. For example, Figure 1 shows an XML query tree $T_i$ for the query that retrieves the author elements that have the string value "john" and are descendants of book elements that have a child title element whose value is "XML". Definition 2 defines an XML query rooted subtree. Figure 1 also shows the rooted subtrees $t_{ij}$ of the query tree $T_i$. These rooted subtrees have the same root as $T_i$, and their edges belong to those of $T_i$. Note that, in this paper, a rooted subtree $t_{ij}$ with k edges is called a k-edge $t_{ij}$. As a result, subtrees (a) and (b) are 1-edge subtrees, (c), (d), and (e) are 2-edge subtrees, and (f) is a 3-edge subtree.
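To make Definition 2 concrete, the following minimal sketch enumerates the rooted subtrees of a small query tree and groups them by edge count. The three-edge tree used here is a hypothetical example (it is not the tree of Figure 1), node labels are assumed to be unique, and the brute-force enumeration is only meant for illustration.

```python
from itertools import combinations


def rooted_subtrees(parent, root):
    """Yield each rooted subtree as a frozenset of (parent, child) edges.

    `parent` maps every non-root node to its parent node.
    """
    nodes = list(parent)  # each non-root node contributes exactly one edge
    for k in range(1, len(nodes) + 1):
        for chosen in combinations(nodes, k):
            kept = set(chosen)
            # The edges form a subtree sharing the root iff every kept node's
            # parent is either the root or another kept node.
            if all(parent[n] == root or parent[n] in kept for n in kept):
                yield frozenset((parent[n], n) for n in kept)


# Hypothetical query tree: book -> title -> "XML", and book -> author.
parent = {"title": "book", "XML": "title", "author": "book"}

by_edges = {}
for subtree in rooted_subtrees(parent, "book"):
    by_edges.setdefault(len(subtree), []).append(subtree)

counts = {k: len(v) for k, v in by_edges.items()}
print(counts)  # {1: 2, 2: 2, 3: 1}
```

For this tree there are two 1-edge, two 2-edge, and one 3-edge rooted subtrees; the edge set {title -> XML} alone is rejected because it does not contain the root.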
Definition 3 describes an XML tree database $D$, which contains multiple XML query trees. Each query tree in $D$ represents a transaction associated with its transaction ID. For example, in Figure 2, the database $D = \langle T_1, T_2, T_3, T_4, T_5 \rangle$, where $T_1$, $T_2$, $T_3$, $T_4$, and $T_5$ are the query trees with transaction IDs 1, 2, 3, 4, and 5 respectively. Finally, Definition 4 defines the frequent XML query pattern mining problem addressed in this paper.
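The support measure of Definition 4 can be illustrated with a short sketch. It assumes that each query tree and each pattern is stored as a set of (parent, child) edges with unique node labels and a shared root, so that containment of a rooted subtree reduces to a subset test. The five-tree database below is hypothetical and is not the database of Figure 2.

```python
def support(pattern, database):
    """Fraction of query trees in `database` that contain `pattern`.

    Assumption: trees and patterns are frozensets of (parent, child) edges
    over unique labels, so a rooted subtree is simply an edge subset.
    """
    hits = sum(1 for tree in database if pattern <= tree)
    return hits / len(database)


# Hypothetical database of five query trees, all rooted at "book".
database = [
    frozenset({("book", "title"), ("title", "XML"), ("book", "author")}),
    frozenset({("book", "title"), ("title", "XML")}),
    frozenset({("book", "title"), ("book", "author")}),
    frozenset({("book", "chapter")}),
    frozenset({("book", "chapter"), ("chapter", "head")}),
]

m = 0.4  # minimum support value
p_title = frozenset({("book", "title")})
p_title_xml = frozenset({("book", "title"), ("title", "XML")})
p_head = frozenset({("book", "chapter"), ("chapter", "head")})

for name, pattern in [("title", p_title), ("title/XML", p_title_xml), ("chapter/head", p_head)]:
    s = support(pattern, database)
    print(name, s, "frequent" if s >= m else "infrequent")
```

With m = 0.4, the first two patterns occur in three and two of the five trees and are frequent, while the last occurs in only one tree and is not.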

FREQUENT XML QUERY PATTERN MINING FOR ebXML APPLICATIONS
In this section, the study proposes an encoding scheme (XCode) to represent an XML tree with its corresponding query trees, a data structure (XList) to store the codes of XML nodes produced by the XCode scheme, and a mining algorithm (ebX²Miner), based on XCode and XList, to discover the frequent XML query patterns for ebXML applications in e-commerce.

An encoding scheme: XCode
XCode encodes the nodes of an XML tree in an xy coordinate system, that is, as coordinates in a two-dimensional space. The symbols $T_i$, $r$, $k$, $p$, $l$, $fc$, and $nc$ are used to represent the nodes in an XML tree.
Symbol $T_i$ represents an XML tree, $r$ indicates the root node of $T_i$, $k$ represents a node in $T_i$, $p$ indicates the parent node of $k$, $l$ represents the left sibling node of $k$, $fc$ denotes a first child node, and $nc$ denotes a child node other than the first child $fc$. The encoding rules for the nodes in an XML tree $T_i$ are as follows:
(1) For an XML tree $T_i$, the root node $r$ is set at the origin, so its coordinates $(x, y)$ are $(0, 0)$.
(2) For any node $k$ in the tree $T_i$, if $k$ is the fc node of its parent node $p$ and $p$'s coordinates are $(x_p, y_p)$, then $k$'s coordinates are $(x_p + 1, y_p + 1)$.
(3) For any node $k$ in the tree $T_i$, if $k$ is an nc node of its parent node $p$, and its left sibling node $l$ has the coordinates $(x_l, y_l)$ and $m$ descendant nodes, then $k$'s coordinates are $(x_l + m, y_l)$.
Note that, for simplicity, the coordinates of a node in an XML tree under the XCode scheme are hereafter called the xcode of the node.
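The three encoding rules can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions rather than the author's implementation: the `Node` class and the function names are hypothetical, node labels are assumed to be unique, and `m` in Rule (3) is taken literally as the number of descendants of the left sibling. The small tree is chosen so that it reproduces the xcodes that Example 1 gives for book, title, XML, and allauthor.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical tree node: a label plus an ordered list of children."""
    label: str
    children: list = field(default_factory=list)


def count_descendants(node):
    """Number of nodes strictly below `node`."""
    return sum(1 + count_descendants(child) for child in node.children)


def encode(root):
    """Return {label: (x, y)} following Rules (1)-(3) as stated in the text."""
    codes = {}

    def visit(node, code):
        codes[node.label] = code
        left, left_code = None, None
        for index, child in enumerate(node.children):
            if index == 0:
                # Rule (2): the first child is one step right and one level down.
                child_code = (code[0] + 1, code[1] + 1)
            else:
                # Rule (3): shift right from the left sibling by its m descendants.
                m = count_descendants(left)
                child_code = (left_code[0] + m, left_code[1])
            visit(child, child_code)
            left, left_code = child, child_code

    visit(root, (0, 0))  # Rule (1): the root sits at the origin.
    return codes


tree = Node("book", [Node("title", [Node("XML")]), Node("allauthor")])
codes = encode(tree)
print(codes)
```

Here title is the first child of book and receives (1, 1), its only descendant XML receives (2, 2), and allauthor, whose left sibling title has one descendant, receives (1 + 1, 1) = (2, 1).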
Example 1. Consider the XML tree in Figure 1. Suppose that all nodes in the tree are encoded by the rules of the proposed XCode scheme. The xcodes of these nodes are shown in Figure 3. According to Rule (1), the root node book is set at the origin and its xcode is (0, 0). According to Rule (2), the nodes title, XML, author$_1$, john, jane, 2000, head$_1$, origins, and head$_2$ are fc nodes and their xcodes are (1, 1), (2, 2), (3, 2), (5, 3), (4, 3), (5, 2), (6, 2), (7, 3), and (8, 3) respectively. Also, by Rule (3), the nodes allauthor, year, chapter, author$_2$, section$_1$, and section$_2$ are nc nodes and their xcodes are (2, 1), (4, 1), (5, 1), (4, 2), (7, 2), and (9, 3) respectively.
Derived from the XCode encoding rules, Lemmas 1, 2, 3 and 4 show the features of the xcodes of an XML tree. Lemma 1 describes that an xcode reveals the level of a node in an XML tree, Lemmas 2 and 3 illustrate the relationship between two xcodes of nodes in an XML tree, and Lemma 4 illustrates that the values of an xcode are greater than or equal to 0.
Lemma 1: For any two nodes $f_1$ and $f_2$ in an XML tree $T_i$ with the xcodes $(x_1, y_1)$ and $(x_2, y_2)$ respectively, if node $f_2$ is a child node of $f_1$, then $y_2 = y_1 + 1$.
Proof: If $f_2$ is the first child node of $f_1$, then according to Rule (2) the xcode $(x_2, y_2)$ of $f_2$ is equal to $(x_1 + 1, y_1 + 1)$; otherwise, it is equal to $(x_s + m, y_s)$, where $(x_s, y_s)$ is the xcode of $f_1$'s first child node $f_s$ and $f_s$ has $m$ descendant nodes. Thus, if $f_2$ is the first child node of $f_1$, $y_2 = y_1 + 1$. Otherwise, $y_2 = y_s$ and $y_s = y_1 + 1$, which gives $y_2 = y_s = y_1 + 1$. As a result, $y_2 = y_1 + 1$.
Lemma 2: For any node $f$ in an XML tree $T_i$, if $f$'s xcode is $(x, y)$, then the value of $y$ is equal to the level $l$ of the node $f$ in $T_i$.
Proof: We prove the lemma by showing that the value of $y$ is equal to that of $l$. There are three cases, depending on whether node $f$ is the root, an fc node, or an nc node in $T_i$.
Case 1: Suppose that node $f$ is the root node in $T_i$. According to Rule (1), the xcode of $f$ is (0, 0). Thus, the value of $y$ is equal to 0. Also, since $f$ is the root node, $f$'s level $l$ is equal to 0. As a result, the value of $y$ is equal to that of $l$.
Case 2: Suppose that $f$ is an fc node in $T_i$. Since $f$ is not the root node and has level $l$, it has the ancestor nodes $p_0, p_1, \ldots, p_{l-1}$, where $p_{l-1}$ is $f$'s parent node, $p_{l-2}$ is $p_{l-1}$'s parent node, and so on, and $p_0$ is the root node. According to Rule (1), the xcode of $p_0$ is (0, 0). Thus, $y_{p_0}$ is equal to 0. Also, according to Lemma 1, $y_{p_1} = y_{p_0} + 1 = 0 + 1 = 1$. Consequently, $y_{p_2} = y_{p_1} + 1 = 1 + 1 = 2$. Continuing in this way, $y_{p_{l-1}} = l - 1$, and therefore $y = y_{p_{l-1}} + 1 = l$. As a result, the value of $y$ is equal to $f$'s level $l$.
Case 3: Suppose that $f$ is an nc node and thus has a sibling node fc in $T_i$. According to Case 2, the fc node's xcode satisfies $y_{fc} = l$. Consequently, according to Rule (3), $f$'s $y$ is equal to $y_{fc}$. As a result, $y = y_{fc} = l$ and the value of $y$ is equal to $f$'s level $l$. Based on Case 1, Case 2, and Case 3, we thus prove this lemma.
Lemma 3: For any two nodes $f_1$ and $f_2$ in an XML tree $T_i$ with the xcodes $(x_1, y_1)$ and $(x_2, y_2)$ respectively, if node $f_2$ is a descendant node of $f_1$, then $x_2 > x_1$ and $y_2 > y_1$.
Proof: There are two cases, depending on whether node $f_2$ is a child node of $f_1$.
Case 1: Suppose that node $f_2$ is a child node of $f_1$. If $f_2$ is the first child node of $f_1$, then according to Rule (2) the xcode $(x_2, y_2)$ of $f_2$ is equal to $(x_1 + 1, y_1 + 1)$; otherwise, it is equal to $(x_s + m, y_s)$, where $(x_s, y_s)$ is the xcode of $f_1$'s first child node $f_s$ and $f_s$ has $m$ descendant nodes. Thus, if $f_2$ is the first child node of $f_1$, $x_2 = x_1 + 1$ and $y_2 = y_1 + 1$, which give $x_2 > x_1$ and $y_2 > y_1$ respectively. Otherwise, $x_2 = x_s + m$, $y_2 = y_s$, $x_s = x_1 + 1$, and $y_s = y_1 + 1$, which give $x_2 \geq x_s > x_1$ and $y_2 = y_s > y_1$. As a result, $x_2 > x_1$ and $y_2 > y_1$.
Case 2: Suppose that node $f_2$ is not a child node of $f_1$ and has a parent node $f_a$ which is a child node of $f_1$. According to Case 1, node $f_a$'s xcode satisfies $x_{f_a} > x_{f_1}$ and $y_{f_a} > y_{f_1}$. Also, since $f_2$'s xcode satisfies $x_{f_2} > x_{f_a}$ and $y_{f_2} > y_{f_a}$, it follows that $x_{f_2} > x_{f_1}$ and $y_{f_2} > y_{f_1}$.
Based on Case 1 and Case 2, we thus prove this lemma.
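The property proved above, that a descendant's xcode is strictly larger than its ancestor's in both coordinates, can be checked on the xcodes quoted in Example 1. The sketch below is only a sanity check: the dictionary names and the helper `may_be_ancestor` are hypothetical, and the ancestor-descendant pairs are limited to those that the text itself states (book is the parent of title, title and allauthor are ancestors of XML and john, and chapter is an ancestor of head 1, head 2, and section 2).

```python
# xcodes quoted from Example 1.
XCODE = {
    "book": (0, 0), "title": (1, 1), "XML": (2, 2),
    "allauthor": (2, 1), "john": (5, 3), "chapter": (5, 1),
    "head1": (6, 2), "head2": (8, 3), "section2": (9, 3),
}

# (ancestor, descendant) pairs stated in the text.
ANCESTOR_PAIRS = [
    ("book", "title"), ("title", "XML"), ("allauthor", "john"),
    ("chapter", "head1"), ("chapter", "head2"), ("chapter", "section2"),
]


def may_be_ancestor(a, b):
    """Necessary condition for `a` being an ancestor of `b`:
    both coordinates of `b` are strictly larger."""
    return a[0] < b[0] and a[1] < b[1]


holds = all(may_be_ancestor(XCODE[a], XCODE[b]) for a, b in ANCESTOR_PAIRS)
print(holds)

# The condition is necessary but not sufficient: title is not an ancestor of
# john (john sits under title's sibling allauthor), yet the test still passes.
converse_counterexample = may_be_ancestor(XCODE["title"], XCODE["john"])
print(converse_counterexample)
```

The second check suggests that the coordinate comparison alone can only rule out ancestor-descendant candidates; confirming a relationship needs additional structural information.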
Lemma 4: For any node $f$ in an XML tree $T_i$, the values in $f$'s xcode $(x, y)$ are greater than or equal to 0.
Proof: There are three cases, depending on whether node $f$ is the root, an fc node, or an nc node in $T_i$.
Case 1: Suppose that node $f$ is the root node in $T_i$. According to Rule (1), $f$'s xcode $(x, y)$ is (0, 0). As a result, the values in $f$'s xcode are equal to 0.
Case 2: Suppose that $f$ is an fc node and $f$ has ancestor nodes $p_0, p_1, \ldots, p_n$ in $T_i$, where $p_n$ is $f$'s parent node, $p_{n-1}$ is $p_n$'s parent node, and so on, and $p_0$ is the root node. According to Case 1, the values of $p_0$'s xcode are equal to 0. Also, according to Rule (2) or (3), the values of $p_1$'s xcode are obtained by adding 1, or the number of descendant nodes of its sibling node, to the values of $p_0$'s xcode. Therefore, the values of $p_1$'s xcode are greater than 0. Consequently, according to Rule (2) or (3), the values of the xcodes of $p_2, p_3, \ldots, p_n$ are also greater than 0. According to Rule (2), the values in $f$'s xcode are those of $p_n$'s xcode plus 1. As a result, the values in $f$'s xcode are greater than 0.
Case 3: Suppose that $f$ is an nc node and thus has a sibling node fc in $T_i$. According to Case 2, the values of the fc node's xcode are greater than 0. Consequently, according to Rule (3), the values in $f$'s xcode are obtained by adding 1, or the number of the fc node's descendant nodes, to the values of the fc node's xcode. As a result, the values in $f$'s xcode are greater than 0.
Based on Case 1, Case 2, and Case 3, the study proves this lemma.

XList
In this subsection, the data structure XList, which plays an important role in the design of our mining algorithm, is described. XList is designed to record the xcodes of nodes in XML query trees. To store an XML node in XList, a new node (called an xNode) with two variables and two pointers is created. Figure 4 (a, b) presents an XML node stored in an xNode of XList. The variable code stores an XML node's xcode, and the variable count stores the number of occurrences of the XML node in the user query trees of a database. Also, the two pointers parent and sibling link to the XML node's parent and sibling nodes respectively. Furthermore, the sibling pointer has a variable s-count that records the number of occurrences of the relationship between the two XML nodes. For example, the title node appears in the query trees $T_1$, $T_2$, and $T_3$ in the database $D$. Under the XCode scheme, the xcode of the title node is (1, 1) and it can be stored in an xNode of XList; the title node's parent and sibling nodes are the book and allauthor nodes, which are linked by its parent and sibling pointers respectively. The xcodes of the nodes book and allauthor are (0, 0) and (2, 1) respectively, while the numbers of occurrences of those nodes are 5 and 3 respectively. In addition, the s-count variable between the title and allauthor nodes is 2.
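The xNode layout described above (two variables, two pointers, and an s-count on the sibling link) can be sketched as a small record type. The class name `XNode` and the field names are assumptions chosen to mirror the text; the values are those quoted in the title example, with the count of title taken as 3 because the node appears in three query trees.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class XNode:
    """Hypothetical rendering of an xNode in XList."""
    code: Tuple[int, int]               # xcode of the XML node
    count: int = 0                      # occurrences of the node in the database
    parent: Optional["XNode"] = None    # pointer to the parent xNode
    sibling: Optional["XNode"] = None   # pointer to the sibling xNode
    s_count: int = 0                    # occurrences of the sibling relationship


# Values quoted in the text for the title example.
book = XNode(code=(0, 0), count=5)
allauthor = XNode(code=(2, 1), count=3, parent=book)
title = XNode(code=(1, 1), count=3, parent=book, sibling=allauthor, s_count=2)

print(title.code, title.parent.code, title.sibling.code, title.s_count)
```

Keeping s-count on the node that owns the sibling pointer is one possible design; the text only states that the sibling pointer carries this variable.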
In the mining scheme, XList is constructed to store the nodes of XML query trees, including their xcodes and their numbers of occurrences in an XML query tree database. The construction of XList consists of two steps. The first step handles the path information of an XML query tree (the XL-Path algorithm), while the second step handles the subtree information of an XML query tree (the XL-Subtree algorithm). In the XL-Path algorithm, the leaf nodes of XML query trees are used to record the path information of an XML query tree. If no xNode exists in XList, these leaf nodes are stored in newly created xNodes of XList; otherwise, their xcodes are compared with the variable code of the existing xNodes. In the XL-Subtree algorithm, the relationship between a pair of leaf nodes of an XML query tree is used to capture the subtree information of the query tree. If the relationship is not yet recorded in XList, the sibling pointers of the xNodes are used to record it; otherwise, its number of occurrences is accumulated in the existing variable s-count. The symbols $T_i$, $l_i$, $(l_x, l_y)$, $t_i$, $a_i$, $n_i$, and $d_i$ are used in the XL-Path and XL-Subtree algorithms to describe how the information of XML query trees is recorded in XList. Symbol $T_i$ represents an XML query tree, $l_i$ indicates a leaf node of $T_i$, and $(l_x, l_y)$ denotes the xcode of $l_i$. For the data structure XList, symbol $n_i$ represents a newly created xNode, $t_i$ represents an xNode that is not linked to by the parent pointer of any xNode, $a_i$ indicates an ancestor node of $t_i$, and $d_i$ denotes a descendant node of an xNode.
In the XL-Path algorithm (Figure 5), Lines 2-5 store all of $T_i$'s leaf nodes in newly created xNodes when there is no xNode in XList. Lines 7-28 compare the xcode $(l_x, l_y)$ with the variable code of $t_i$ in XList. Line 10 adds 1 to the variable count of $t_i$ and of all of $t_i$'s ancestor nodes $a_i$ when $t_i$'s code is the same as the xcode of $l_i$. Lines 13-15 store $l_i$ in a newly created xNode $n_i$ and link $t_i$'s parent pointer to $n_i$ when $l_i$ is an ancestor node of $t_i$ and $t_i$ has no ancestor node. Line 17 adds 1 to the variable count of node $a_i$ and of all of $a_i$'s ancestors when $a_i$ is the same as $l_i$. Lines 19-22 find an xNode $a_i$ that is a descendant node of $l_i$, store $l_i$ in a newly created xNode $n_i$, and insert $n_i$ between $a_i$ and $a_i$'s parent node. Lines 24-25 store $l_i$ in a newly created xNode $n_i$ and link $n_i$'s parent pointer to $t_i$ when $l_i$ is a descendant node of $t_i$. Finally, Line 27 stores $l_i$ in a newly created xNode $n_i$ when $l_i$ and $t_i$ have no ancestor-descendant relationship.
For example, suppose that the query trees $T_1, T_2, \ldots, T_5$ are read sequentially and processed by the XL-Path algorithm, as shown in Figure 6. First, $T_1$ is read and Lines 2-5 are executed since there is no xNode in XList. Thus, the leaf nodes XML and john of $T_1$ are stored in the new xNodes $n_1$ and $n_2$ of XList. Then, $T_2$ is read and Line 10 is executed since the leaf node XML of $T_2$ is the same as the xNode $n_1$. Therefore, 1 is added to the variable count of $n_1$, which becomes 2. Next, $T_3$ is read and Lines 13-15 are executed since $T_3$'s leaf nodes title and allauthor are the ancestors of xNodes $n_1$ and $n_2$ respectively. Thus, two new xNodes $n_3$ and $n_4$ are created to store the two leaf nodes, and the parent pointers of xNodes $n_1$ and $n_2$ are linked to $n_3$ and $n_4$ respectively. Also, the variables count of $n_3$ and $n_4$ are set to 3 and 2, which are the sums of 1 and the values of the variables count of $n_1$ and $n_2$ respectively. After $T_4$ is read, Lines 2-5 are executed and the new xNode $n_5$ is created for $T_4$'s leaf node chapter. Finally, $T_5$ is read and Lines 24-25 are executed. The new xNodes $n_6$, $n_7$, and $n_8$ are created for $T_5$'s leaf nodes head$_1$, head$_2$, and section$_2$. Also, the parent pointers of $n_6$, $n_7$, and $n_8$ are linked to $n_5$.
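The end state of this walk-through can be reproduced with a short batch computation. This sketch is deliberately not the incremental XL-Path procedure: it assumes that every leaf comes with its full root-to-leaf chain of xcodes (derived here from the xcodes of Example 1 by applying Rules (2) and (3) backwards), keeps an xNode only for codes that occur as a leaf in some query tree, and counts each xNode once per query tree whose leaf paths pass through it. All names are hypothetical.

```python
from collections import Counter

# Root-to-leaf xcode chains, derived from Example 1 (an assumption of this sketch).
PATHS = {
    "XML": [(0, 0), (1, 1), (2, 2)],
    "john": [(0, 0), (2, 1), (4, 2), (5, 3)],
    "title": [(0, 0), (1, 1)],
    "allauthor": [(0, 0), (2, 1)],
    "chapter": [(0, 0), (5, 1)],
    "head1": [(0, 0), (5, 1), (6, 2)],
    "head2": [(0, 0), (5, 1), (7, 2), (8, 3)],
    "section2": [(0, 0), (5, 1), (7, 2), (9, 3)],
}

# Leaf nodes of the query trees T1..T5 as listed in the walk-through.
QUERIES = [
    ["XML", "john"],
    ["XML"],
    ["title", "allauthor"],
    ["chapter"],
    ["head1", "head2", "section2"],
]


def build_xlist(queries, paths):
    """Return (count per xcode, parent pointer per xcode) for leaf-derived xNodes."""
    leaf_codes = {paths[leaf][-1] for query in queries for leaf in query}
    counts, parent = Counter(), {}
    for query in queries:
        touched = set()
        for leaf in query:
            # Keep only the codes on the path that are stored as xNodes.
            chain = [code for code in paths[leaf] if code in leaf_codes]
            touched.update(chain)
            for upper, lower in zip(chain, chain[1:]):
                parent[lower] = upper  # nearest stored ancestor
        counts.update(touched)  # each xNode is counted once per query tree
    return counts, parent


counts, parent = build_xlist(QUERIES, PATHS)
for code in sorted(counts):
    print(code, counts[code], parent.get(code))
```

Under these assumptions the counts for XML, john, title, allauthor, and chapter come out as 2, 1, 3, 2, and 2, and the xNodes for head 1, head 2, and section 2 all point to chapter, which agrees with the walk-through.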
In Figure 7, Line 3 links the sibling pointers between the xNodes $n_i$ and $n_j$ in XList that correspond to the two leaf nodes $l_i$ and $l_j$. Lines 5-10 add 1 to the variable s-count between xNodes $n_i$ and $n_j$.
For example, suppose that the query trees $T_1, T_2, \ldots, T_5$ are read sequentially and processed by the XL-Subtree algorithm, as shown in Figure 7. First, $T_1$ is read and Lines 3-8 are executed since the relationship between the leaf nodes XML and john is not recorded in their corresponding xNodes $n_1$ and $n_2$. Thus, the sibling pointer of $n_1$ is linked to $n_2$ and the variable s-count is set to 1. Then, $T_2$ is read but not processed since it has no pair of leaf nodes. Next, $T_3$ is read and Lines 7-8 are executed since $T_3$'s leaf nodes title and allauthor are the ancestors of xNodes $n_1$ and $n_2$ respectively. Thus, the sibling pointer between xNodes $n_3$ and $n_4$ is created. Also, the variable s-count is set to the sum of 1 and the value of $d_i$'s s-count. Then, $T_4$ is read but not processed since it has no pair of leaf nodes. Finally, $T_5$ is read and Lines 3-5 are executed; the result is shown in Figure 8.
In Figure 9, all XML user query trees in $D$ are first read and encoded by the proposed XCode scheme to construct XList. This step is carried out by the XL-Path and XL-Subtree algorithms. Second, the infrequent query trees in XList are pruned by executing Lines 6-13. Finally, the frequent XML query patterns are enumerated from XList by executing Lines 14-26.
For example, suppose that the database $D$ has five query trees $T_1, T_2, \ldots, T_5$ and the value of $m$ is 0.4. First, the content of XList after executing Lines 2-5 is shown. Then, Figure 10 shows the results after executing Lines 6-13. Finally, the sets fp and fs after executing Lines 14-26 are shown.
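The pruning step for this example can be sketched as a simple filter over the xNode counts. The sketch assumes that an xNode is kept when its count divided by the number of query trees reaches the minimum support m; the dictionary layout and the function name `prune` are hypothetical. The counts for XML, john, title, allauthor, and chapter are those of the XL-Path example, and the three leaf nodes of $T_5$ are given count 1 because they occur only in that tree.

```python
# xNode name -> (xcode, count), taken from the XL-Path example.
XLIST = {
    "XML": ((2, 2), 2),
    "john": ((5, 3), 1),
    "title": ((1, 1), 3),
    "allauthor": ((2, 1), 2),
    "chapter": ((5, 1), 2),
    "head1": ((6, 2), 1),
    "head2": ((8, 3), 1),
    "section2": ((9, 3), 1),
}


def prune(xlist, n_trees, m):
    """Keep the xNodes whose relative frequency is at least m (assumed criterion)."""
    return {
        name: (code, count)
        for name, (code, count) in xlist.items()
        if count / n_trees >= m
    }


frequent = prune(XLIST, n_trees=5, m=0.4)
print(sorted(frequent))
```

With five query trees and m = 0.4, an xNode needs a count of at least 2, so XML, title, allauthor, and chapter remain while john and the three leaf nodes of $T_5$ are removed.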

Comparing with VBUXMiner
ebX²Miner is more suitable for ebXML applications in e-commerce than the VBUXMiner algorithm. First, most XML queries in ebXML applications have the same data structure. However, the VBUXMiner algorithm does not consider this characteristic of the XML queries in ebXML applications and merges all queries into the CGTG tree. Therefore, to obtain the frequent XML query trees, the incomplete information of an XML query tree on the CGTG tree has to be collected by executing the tree-join process. In contrast, ebX²Miner considers the characteristic of ebXML applications and encodes the nodes of XML user query trees. As a result, the path and subtree information of an XML query tree are preserved in the leaf nodes' codes, and the tree-join process for producing the frequent query trees can be omitted. For example, the query trees are merged by the VBUXMiner algorithm into the CGTG tree shown in Figure 11. In Figure 11, only incomplete information of a frequent XML query tree is available, which forces the VBUXMiner algorithm to execute the tree-join process or database scans. In contrast, the complete information (that is, path and subtree) of a frequent query tree is preserved by the XCode and XList schemes in ebX²Miner. Therefore, the tree-join process and database scans are not needed in ebX²Miner for generating frequent XML query trees.

Comparing with XQPMiner, XQPMinerTID, and 2PXMiner
One reason suggests that ebX²Miner may outperform XQPMiner, XQPMinerTID, and 2PXMiner. These three algorithms construct the T-GQPT tree to summarize all query trees in database $D$ and then generate all single-branch candidate subtrees from the T-GQPT tree. Through the tree-join process (that is, constructing the ECTree data structure), the single-branch candidate subtrees are merged to produce the frequent query trees. Therefore, for ebXML applications, more XML query trees are processed on the T-GQPT tree, and producing the frequent XML query trees takes a lot of time. In contrast, ebX²Miner encodes the nodes of an XML query tree and thus preserves the path and subtree information of the query tree in the system, which reduces time and space costs.

PERFORMANCE STUDY
Two experiments are performed to compare the performance of the ebX²Miner and VBUXMiner algorithms. The parameters and their settings in the simulation are listed in Table 1. The parameter n denotes the number of XML query trees in the database $D$, while the parameter s represents the value of the minimum support in the system.
The first experiment (Figures 12 and 13) observes the execution time and memory space (Y-axis) of these algorithms under different numbers of XML query trees (X-axis). The memory space used by ebX²Miner and VBUXMiner is measured by the number of nodes they create in XList and the CGTG tree respectively. The minimum support s is set to 5%. ebX²Miner outperforms VBUXMiner on execution time. The curves for both VBUXMiner and ebX²Miner rise as the number of XML query trees increases. However, the execution time of ebX²Miner changes only slightly as the number of XML query trees increases, whereas that of VBUXMiner changes sharply. One reason could be the high efficiency and stability of ebX²Miner. VBUXMiner does not keep the path and subtree information of XML user query trees in its CGTG tree. Thus, the tree-join process and database scans are executed to combine this information. As a result, VBUXMiner uses more execution time to generate the frequent XML query patterns, which is consistent with the experimental result. The number of nodes generated by ebX²Miner in XList is smaller than the number generated by VBUXMiner in the CGTG tree. A possible reason is that the XCode scheme encodes the path and subtree information in the nodes of XList, so that only a few of the XML nodes in the query trees are stored in XList.
The second experiment (Figure 14) observes the execution time (Y-axis) of ebX²Miner and VBUXMiner under different minimum supports (X-axis). The number of XML query trees is set to 30000. ebX²Miner outperforms VBUXMiner on execution time. The curves for both VBUXMiner and ebX²Miner change only slightly as the minimum support increases. A possible reason is that, when the minimum support increases, most of the candidate subtrees of ebX²Miner and VBUXMiner are still produced from XList and the CGTG tree respectively. The execution time of ebX²Miner is less than that of VBUXMiner because VBUXMiner spends a lot of time executing the tree-join process to produce the frequent XML query patterns.
The two experiments show that ebX²Miner has higher mining performance than VBUXMiner. This is because, with the XCode and XList schemes, the path and subtree information are preserved in the leaf nodes of query trees, which results in lower space and time costs in ebX²Miner.

CONCLUSION
This paper presents an efficient mining algorithm, ebX²Miner, to discover frequent XML query patterns. Unlike the existing algorithms, the study proposes a new idea: encoding XML user query trees (XCode) and storing these codes (XList) to preserve the path and subtree information of query trees. With this idea, ebX²Miner does not need to maintain all of the user queries and thus takes less execution time and memory space to produce frequent XML query patterns for ebXML applications. Future work includes extending the algorithm to XML query patterns with repeating siblings, since ebX²Miner cannot currently mine frequent XML query patterns with sibling repetitions.

Figure 1. The rooted subtrees of the XML query tree.

Figure 2. The XML query trees in the database D.

Figure 3. The xcodes of the nodes in the XML tree in Figure 2(a).

Figure 4. The structures and contents of xNodes in XList: an example of the XML nodes stored in xNodes in XList.

Figure 6. The XList for the XML query trees in Figure 4 after executing the XL-Path algorithm.

Figure 8. The XList for the XML query trees in Figure 4 after executing the XL-Subtree algorithm.

Figure 10. The frequent query patterns for the XML query trees.

Figure 11. The CGTG tree of the query trees in database D.

Figure 13. The enumerated nodes with varying number of XML query trees.

Figure 14. The execution time with varying minimum supports.
… xcode (l_x, l_y) with the variable code of each t_i in XList
9 if xcode (l_x, l_y) is the same as t_i's code then
10 add value 1 to the count variables of t_i and all of t_i's ancestor nodes a_i
… l_i is the ancestor node of t_i and t_i has no ancestor node a_i
13 store the node l_i into a new created xNode n_i
14 link the parent pointer of t_i to n_i
15 set the variable count of n_i to the count of t_i plus 1
16 if l_i is an ancestor of t_i and t_i has an ancestor a_i which is the same as l_i
17 add value 1 to the variable count of a_i and all of a_i's ancestor nodes
18 if l_i is an ancestor of t_i and all of t_i's ancestors a_i are different from l_i
19 find the xNode a_i which is a descendant node of l_i
20 store node l_i into a new created xNode n_i
21 link the parent pointer of n_i to a_i's parent node
22 link the parent pointer of a_i to n_i
23 if l_i is a descendant node of t_i
24 store node l_i into a new created xNode n_i

Table 1. Simulation parameters and settings.
Figure 12. The execution time with varying number of XML query trees.