In February 1998, XML 1.0 was introduced by the W3C as the new recommendation for data exchange on the Internet. At that time, the simplest querying solution was to store XML data in relational databases and to convert XML queries into SQL queries. Because this solution leads to poor performance, storing and querying XML became an important research domain. The issue has been addressed by several research papers, the most famous of which presents the TwigStack algorithm.
The first section gives an overview of current solutions for storing XML documents in well-known relational databases. The second section shows the progression towards the TwigStack algorithm through the two previously published papers on which TwigStack is based.
One of the first ideas when it comes to XML query processing is to store XML documents in a relational database system and to convert XPath queries into SQL queries. With this solution, twenty years of work on RDBMS query optimization, query execution, scalability, concurrency control and recovery immediately extend to XML query processing. Commercial products such as Oracle, IBM DB2 and Microsoft SQL Server all provide specific features to store and query XML data.
RDBMS | Oracle 11g | IBM DB2 9 | Microsoft SQL Server 2008 |
---|---|---|---|
Large object storage data type suitable for XML documents | XMLType¹ STORE AS CLOB | XMLCLOB² or XMLVARCHAR³ or XMLFILE⁴ | [n]varchar(max) or varbinary(max) |
Automatic mapping/shredding of XML into relational storage | XMLType¹ STORE AS OBJECT RELATIONAL | XML Collections | XML View⁵ |
Native XML storage data type | XMLType¹ STORE AS BINARY XML | xml (pureXML) | xml |
¹ XMLType cannot exceed 4 GB.
² XMLCLOB cannot exceed 2 GB.
³ XMLVARCHAR cannot exceed 32 KB.
⁴ More flexible, but does not benefit from database-managed persistence and integrity.
⁵ The schema cannot be recursive unless the maximum recursion depth is known.
Most commercial relational database vendors claimed XML support in their products through XML extenders as soon as XML was released. Two storage options were available: plain-text storage as a simple text string, or shredding/mapping into standard relational tables. As can be seen in the table above, the three main vendors all eventually included native XML storage in the latest versions of their products, since XML data does not fit that well into relational tables: pureXML is a new feature of IBM DB2 9 [DB2-XML], (native) xml storage is a new feature of Microsoft SQL Server 2005 [SQLSRV-XML], and BINARY XML storage is a new feature of Oracle 11g [GRALIKE10].
The main difference between the relational and XML data models is that the former is structured while the latter is semi-structured or unstructured. The relational data model is suitable for storing highly structured data whose schema is already well known at database design time. If the structure of the data is not known, or if it may change significantly in the future, the XML data model offers more flexibility. Thanks to its tree structure, the XML data model excels at representing containment hierarchies (especially recursive ones). Finally, the XML data model allows queries based on the structure of the data rather than on its value.
Even though XML solves many problems by providing a standard format for data interchange, other problems remain, such as storing XML documents in a centralized repository, quickly searching for information, or triggering automatic data changes when a particular action occurs. These kinds of issues can only be addressed by a database management system (DBMS). Such systems are legitimate for XML documents too, but they have to process them differently from relational tables. This is why research in that field was required.
This section introduces the evolution of XML query processing research through three papers and their respective algorithms. XML data preprocessing is presented first, when explaining the first article. It will not be repeated afterwards, since this task is identical and required for all three algorithms. It is also required for the GPU version explained in the following chapters.
This figure introduces the main improvement made by each article over the work of the previous one. Numerous successors to the last article of this figure exist, but they do not introduce major changes.
The use of the Multi Predicate MerGe JoiN (MPMGJN) algorithm for XML queries is explained in the following research paper:
Starting from the observation that the inverted lists of Information Retrieval engines are well suited to XML queries, this paper compares two possible querying architectures for an RDBMS: a separate, "loosely-coupled" IR engine, or native tables and SQL queries. Because of the several advantages of native relational database storage and querying, the authors' purpose was to achieve the same or better performance using the second option.
The article shows that standard join algorithms, and their use of the hardware cache, are not efficient when processing XML queries over relational tables.
The authors predicted that a substantial number of XML documents would be stored in relational databases in the future. Nine years later, the major DBMS vendors had indeed included a native XML storage option in their products. In retrospect, the key idea of this article was not the join algorithm, but the efficient representation of XML as an extended inverted index.
The following simple example is a well-formed XML document. The first line is the prolog of the document (a kind of header); the content goes from the second line to the last line. There are three kinds of tokens: elements (opening and closing tags), attributes and text. The most common way to represent XML content is the tree data structure.
<?xml version="1.0" encoding="UTF-8" ?>
<library>
<category name="France">
<book>
<title language="English">The Little Prince</title>
</book>
</category>
</library>
An XML document can be formalized as a triple (V, E, r) consisting of a vertex set, an edge set and a root vertex, where V = {v1, …, vn} is the set of nodes, which contains elements (XML tags, mandatory), strings (optional) and attributes (tag attributes, optional), E = {(vj, vk)} is the set of edges between tree nodes, and r ∈ V is the root node.
In order to simplify the processing of attributes, an XML document is instead formalized as a tuple (El, At, St, Ed, r) consisting of an element node set, an attribute node set, a string node set, an edge set and a root element node, where El = {el1, …, eln}, At = {at1, …, atm} and St = {st1, …, stp}. Ed = {(x, y)} where (x, y) ∉ {(atj, atk), (atj, elk), (stj, stk), (stj, atk), (stj, elk)}, i.e. attribute nodes may only point to string nodes, and string nodes are always leaves.
<?xml version="1.0" encoding="UTF-8" ?>
<library>
<category>
<@name>France</@name>
<book>
<title>
<@language>English</@language>
The Little Prince
</title>
</book>
</category>
</library>
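As a small illustration of this transformation (my own sketch, not taken from a paper), the following Python function turns attributes into child nodes prefixed with `@` and splits text into word nodes, so that the resulting tree only contains element, attribute and string nodes:

```python
# My own sketch of the attribute/text flattening used in the example above.
import xml.etree.ElementTree as ET

def to_node_tree(elem):
    """Return (label, children) where children lists attribute, word and element nodes."""
    children = []
    # attributes become child nodes, each holding its value as word nodes
    for name, value in elem.attrib.items():
        children.append(("@" + name, [(word, []) for word in value.split()]))
    # text directly inside this element becomes word (string) nodes
    children.extend((word, []) for word in (elem.text or "").split())
    for child in elem:
        children.append(to_node_tree(child))
        children.extend((word, []) for word in (child.tail or "").split())
    return (elem.tag, children)

doc = '<title language="English">The Little Prince</title>'
print(to_node_tree(ET.fromstring(doc)))
# ('title', [('@language', [('English', [])]), ('The', []), ('Little', []), ('Prince', [])])
```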
The XML document is not kept in the original tree structure produced by DOM parsers. The underlying structure is a set of streams: sequences of XML nodes represented by 3-tuples (document number, left:right positions, depth). The following example is still based on the same XML document and also features a path column, which is not mandatory.
The classic inverted index data structure maps words or phrases only. In order to store XML documents, it is extended into three indexes: an element index (E-index), an attribute index (A-index) and a term index (T-index).
value | document number | left position | right position | depth | path |
---|---|---|---|---|---|
E-index (XML elements) | | | | | |
library | 1 | 1 | 17 | 0 | / |
category | 1 | 2 | 16 | 1 | /library/ |
book | 1 | 6 | 15 | 2 | /library/category/ |
title | 1 | 7 | 14 | 3 | /library/category/book/ |
A-index (attributes of XML elements) | | | | | |
name | 1 | 3 | 5 | 2 | /library/category/ |
language | 1 | 8 | 10 | 4 | /library/category/book/title/ |
T-index (text words nested in XML elements) | | | | | |
France | 1 | 4 | 4 | 3 | /library/category/name/ |
English | 1 | 9 | 9 | 5 | /library/category/book/title/language/ |
The | 1 | 11 | 11 | 4 | /library/category/book/title/ |
Little | 1 | 12 | 12 | 4 | /library/category/book/title/ |
Prince | 1 | 13 | 13 | 4 | /library/category/book/title/ |
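The following Python sketch is my own illustration of this preprocessing step (not code from the paper): it walks an XML tree, treats attributes and text words as nodes, and assigns (document number, left position, right position, depth) values. Applied to the example document, it reproduces the entries of the three indexes above.

```python
# My own sketch of the position assignment; one global counter is incremented
# for every token (element start/end, attribute start/end, text word).
import xml.etree.ElementTree as ET

def build_streams(xml_text, doc_no=1):
    counter = 0      # global position counter
    records = []     # (index, value, doc_no, left_pos, right_pos, depth)

    def next_pos():
        nonlocal counter
        counter += 1
        return counter

    def visit(elem, depth):
        left = next_pos()
        # attributes are handled as child nodes, as in the formalization above
        for name, value in elem.attrib.items():
            a_left = next_pos()
            for word in value.split():
                p = next_pos()
                records.append(("T", word, doc_no, p, p, depth + 2))
            records.append(("A", name, doc_no, a_left, next_pos(), depth + 1))
        # text words nested directly inside this element
        for word in (elem.text or "").split():
            p = next_pos()
            records.append(("T", word, doc_no, p, p, depth + 1))
        for child in elem:
            visit(child, depth + 1)
            for word in (child.tail or "").split():
                p = next_pos()
                records.append(("T", word, doc_no, p, p, depth + 1))
        records.append(("E", elem.tag, doc_no, left, next_pos(), depth))

    visit(ET.fromstring(xml_text), 0)
    return records

doc = """<library><category name="France"><book>
<title language="English">The Little Prince</title>
</book></category></library>"""
for rec in sorted(build_streams(doc), key=lambda r: r[3]):
    print(rec)
```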
This inverted index representation was chosen because it eases the discovery of the containment properties (or structural relationships) on which XML queries are based. Papers use different terms: containment between two nodes of the XML tree can be an "ancestor-descendant" (indirect containment) or a "parent-child" (direct containment) relationship. I find the terms between quotes easier to understand than the ones between parentheses, therefore I will use them.
Using the notation of the table example, "A is a descendant of B" is equivalent to the condition `doc_noA = doc_noB AND left_posA > left_posB AND right_posA < right_posB`. This condition matches all descendants. If only children have to be matched, it suffices to append `depthA = depthB + 1` to the previous condition. Because of the strict nesting structure of XML, `right_posA < right_posB` can then be omitted for the child relationship.
A point worth noting about this representation of XML documents is that checking an "ancestor-descendant" relationship is as easy as checking a "parent-child" relationship. This is the main advantage of this representation over the tree representation.
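As a minimal sketch (my own code, assuming the positional fields shown in the table above), these two checks can be written directly as:

```python
from collections import namedtuple

Node = namedtuple("Node", "doc_no left_pos right_pos depth")

def is_descendant(a, b):
    """True if node a is (directly or indirectly) nested inside node b."""
    return (a.doc_no == b.doc_no
            and a.left_pos > b.left_pos
            and a.right_pos < b.right_pos)

def is_child(a, b):
    """True if node a is a direct child of node b: one extra depth comparison suffices."""
    return is_descendant(a, b) and a.depth == b.depth + 1

# example: the title element is a descendant (but not a child) of library
library = Node(1, 1, 17, 0)
title = Node(1, 7, 14, 3)
print(is_descendant(title, library), is_child(title, library))   # True False
```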
An XML query can be seen as a set of structural relationships. Using the XPath abbreviated syntax, `/` and `//` (short for `/descendant-or-self::node()/`) represent parent-child and ancestor-descendant relationships respectively.
The query of the figure means "all English titles of books in categories named France of the library". The XPath query string is parsed in order to create a tree representation of it. Brackets allow several conditions in the query, which translate into several branches in the tree representation. This query tree can also be viewed as a set of query paths.
Brackets can also have another role. Please compare the following queries:
/library/category[@name=France]/book/title[@language=English]
/library[/category[@name=France]][/book/title[@language=English]]
These two XPath queries produce the same query tree and the same set of query paths; the difference lies only in what is returned. The first query outputs the XML data between matching `<title>` tags, while the second query outputs all data between `<library>` tags that contain a matching `title` (if the library contains many matching titles, the whole library is output as many times as there are matching titles).
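The following Python sketch is my own illustration of the query tree shared by the two queries above and of its decomposition into root-to-leaf query paths:

```python
# My own sketch of a query tree and its root-to-leaf query paths.
from dataclasses import dataclass, field

@dataclass
class QueryNode:
    label: str                              # element name, @attribute or text value
    axis: str = "/"                         # "/" parent-child, "//" ancestor-descendant
    children: list = field(default_factory=list)

def query_paths(node, prefix=()):
    """Enumerate the root-to-leaf paths of a query tree."""
    prefix = prefix + (node.axis + node.label,)
    if not node.children:
        yield prefix
    for child in node.children:
        yield from query_paths(child, prefix)

query_tree = QueryNode("library", children=[
    QueryNode("category", children=[
        QueryNode("@name", children=[QueryNode("France")]),
        QueryNode("book", children=[
            QueryNode("title", children=[
                QueryNode("@language", children=[QueryNode("English")]),
            ]),
        ]),
    ]),
])

for path in query_paths(query_tree):
    print(path)
# ('/library', '/category', '/@name', '/France')
# ('/library', '/category', '/book', '/title', '/@language', '/English')
```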
The MPMGJN algorithm is a variant of the well-known inverted list merge algorithm of Information Retrieval. For example, if the query to process is `B//"A"`, the inverted lists of the element `B` and the term `A` are retrieved into `List1` and `List2` (the function inputs).
# List1: outer list (candidate containers), List2: inner list (candidate contained nodes)
function containmentMerge(List1, List2)
  set cursor1 at beginning of List1
  set cursor2 at beginning of List2
  while cursor1 != "end of List1" and cursor2 != "end of List2"
    # join only within the same document
    if cursor1.docno < cursor2.docno
      cursor1++
    else if cursor2.docno < cursor1.docno
      cursor2++
    else
      # mark remembers where the inner record scan starts; no scan yet, so mark = cursor2
      mark = cursor2
      # seek: skip inner records that start before the outer record starts
      while cursor2.position < cursor1.position and cursor2 != "end of List2"
        cursor2++
      end while
      if cursor2 == "end of List2"
        # no start record found before the end of the inner list
        cursor1++
        cursor2 = mark
      else if cursor1.val contains cursor2.val
        # start record found; mark will remember it while scanning
        mark = cursor2
        do
          # output the matching pair
          merge cursor1 and cursor2 values
          cursor2++
        # stop scanning as soon as the join is no longer possible ("stop record")
        while cursor1.val contains cursor2.val and cursor2 != "end of List2"
        # proceed to the next outer record
        cursor1++
        # the inner scan will restart at the "start record" on the next seek
        cursor2 = mark
      else
        # the inner record starts after the outer record ends:
        # no match for this outer record, try the next one
        cursor1++
        cursor2 = mark
      end if
    end if
  end while
end
When applied to the `B` and `A` streams, the algorithm above performs a join equivalent to the following SQL query (the relational translation of `B//"A"`):
SELECT *
FROM elements e, texts t
WHERE e.value = 'B'
AND t.value = 'A'
AND e.doc_no = t.doc_no
AND e.left_pos < t.left_pos
AND e.right_pos > t.right_pos
B//"A"
The RDBMS query plan generator can choose between the index nested-loop join and merge join algorithms to process this SQL query. The first choice is very close to the MPMGJN algorithm: in both, inner rows are selectively examined using start and stop keys, but it still suffers from its use of binary trees, since binary search is efficient but performs unpredictable memory accesses. The improvement of MPMGJN is that the seeks are made sequential compared with a standard index nested loop, hence better hardware cache usage and a lower cache miss rate.
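For illustration, here is a runnable Python sketch of the containment merge described by the pseudocode above (my own code, assuming records carry a document number and left/right positions, and that both lists are sorted by document number and then left position):

```python
from collections import namedtuple

# Records follow the extended inverted index above (depth omitted: only the
# ancestor-descendant relationship is checked here).
Rec = namedtuple("Rec", "doc_no left_pos right_pos")

def containment_merge(outer, inner):
    """Pair every inner record with the outer records that contain it.
    Both lists must be sorted by (doc_no, left_pos)."""
    results = []
    c1 = c2 = 0
    while c1 < len(outer) and c2 < len(inner):
        if outer[c1].doc_no < inner[c2].doc_no:       # join only within the same document
            c1 += 1
        elif inner[c2].doc_no < outer[c1].doc_no:
            c2 += 1
        else:
            mark = c2                                  # where the inner scan starts
            # seek: skip inner records that start before the outer record starts
            while (c2 < len(inner) and inner[c2].doc_no == outer[c1].doc_no
                   and inner[c2].left_pos < outer[c1].left_pos):
                c2 += 1
            # scan: output every inner record nested inside the outer record
            while (c2 < len(inner) and inner[c2].doc_no == outer[c1].doc_no
                   and inner[c2].right_pos < outer[c1].right_pos):
                results.append((outer[c1], inner[c2]))
                c2 += 1
            c1 += 1                                    # proceed to the next outer record
            c2 = mark                                  # restart the inner scan at the mark
    return results

# toy usage on the example document: which text words are nested inside <title>?
elements = [Rec(1, 7, 14)]                                # title
words = [Rec(1, 11, 11), Rec(1, 12, 12), Rec(1, 13, 13)]  # The, Little, Prince
print(containment_merge(elements, words))
```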
Despite its good cache usage, the MPMGJN algorithm still has a big drawback: its worst case is quadratic. The following document illustrates it: for a parent-child query such as `a/d`, MPMGJN re-scans the list of `d` nodes for every nested `a` element, so with deeper nesting the number of comparisons grows quadratically even though only a few pairs actually match.
<?xml version="1.0" encoding="UTF-8" ?>
<a1>
<d1 />
<a2>
<d2 />
<a3>
<d3 />
<d4 />
</a3>
</a2>
</a1>
The Stack-Tree algorithm is explained in the following research paper:
This paper takes advantage of the inverted list representation of XML data while introducing a novel stack-based algorithm. The Stack-Tree algorithm achieves linear worst-case complexity, whereas the MPMGJN algorithm was quadratic.
# AList: sorted list of potential ancestors
# DList: sorted list of potential descendants
function stackTreeDesc(AList, DList)
  a = AList.firstNode
  d = DList.firstNode
  outputList = NULL
  while "the input lists are not empty" or "the stack is not empty"
    if a.StartPos > stack.top.EndPos and d.StartPos > stack.top.EndPos
      # the stack top cannot contain any further a or d: no longer a candidate -> remove it
      # example: </stackTopElem> ... <nextA> ... <nextD>
      stack.pop()
    else if a.StartPos < d.StartPos
      # a starts before d: it may be an ancestor of upcoming d nodes
      # (and it is itself a descendant of the stack top if it starts before the top ends)
      stack.push(a)
      a = a.nextNode
    else
      # a.StartPos > d.StartPos and the stack top contains d,
      # so d cannot be nested any deeper -> output the stack as result
      for a1 = stack.bottom; a1 != NULL; a1 = a1.up
        append (a1, d) to outputList
      end for
      # d is completed
      d = d.nextNode
    end if
  end while
end
This join algorithm provides a more efficient set-at-a-time strategy that performs no unnecessary comparisons, whereas the MPMG join algorithm used a node-at-a-time strategy (especially costly for parent-child relationships).
The problem of the previous algorithm is solved, since the Stack-Tree algorithm does not restart after examining `d1`, but uses the stack to go back later (for `d2`).
Thanks to its stack, the Stack-Tree algorithm guarantees a linear worst-case complexity (in both CPU and I/O).
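The following runnable Python sketch is my own illustration of Stack-Tree-Desc (not the paper's code), assuming a single document, nodes given as (left position, right position) pairs, and both lists sorted by left position:

```python
def stack_tree_desc(alist, dlist):
    """Ancestor-descendant join over (left_pos, right_pos) pairs."""
    output = []           # (ancestor, descendant) pairs
    stack = []            # chain of nested candidate ancestors
    a = d = 0
    while d < len(dlist) and (a < len(alist) or stack):
        if (stack and dlist[d][0] > stack[-1][1]
                and (a == len(alist) or alist[a][0] > stack[-1][1])):
            # the stack top ends before the next a and the next d start: pop it
            stack.pop()
        elif a < len(alist) and alist[a][0] < dlist[d][0]:
            # a starts before d: it may be an ancestor of upcoming d nodes
            stack.append(alist[a])
            a += 1
        else:
            # every ancestor currently on the stack contains d
            for anc in stack:                      # bottom to top, as in the pseudocode
                output.append((anc, dlist[d]))
            d += 1
    return output

# positions for the nested example document above (query a//d): each d node is
# combined directly with the ancestors on the stack, without rescanning the a list
a_nodes = [(1, 14), (4, 13), (7, 12)]              # a1, a2, a3
d_nodes = [(2, 3), (5, 6), (8, 9), (10, 11)]       # d1, d2, d3, d4
print(stack_tree_desc(a_nodes, d_nodes))
```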
The paper presents an efficient join algorithm for binary relationships, but complex queries contain several binary relationships. Finding the optimal join ordering was outside the scope of this paper and is addressed by the next one.
Twig joins are a family of algorithms for processing XML query patterns. They are referred to as holistic because they match structural relationships holistically (i.e., as a whole), thus reducing the number of unnecessary temporary results. This strategy is opposed to the previously presented join algorithms, which only solved the problem of binary relationships while the join ordering of complex queries remained outside their scope.
Like Stack-Tree, the TwigStack algorithm uses a set-at-a-time strategy.
The original TwigStack algorithm is explained in the following research paper; many other algorithms have followed, such as Twig2Stack, TwigList or TwigMix.
Like all algorithms presented in this section, TwigStack makes use of the extended inverted index representation of XML data. The XML query string is also viewed as a tree but, unlike in previous solutions, this tree is not divided into root-to-leaf query paths and binary relationships. The inverted list (referred to as a stream in the article) of each query node is retrieved and linked to the query tree.
Algorithm execution is divided into two steps. In the first phase, the query node streams are compared in order to find root-to-leaf path solutions. In the second phase, those paths are merged to rebuild full query tree solutions.
As said before, the algorithm does not simply solve every root-to-leaf path matching a query path, because that would lead to many intermediate results which may not be part of the final answer. The `getNext()` function ensures that, when the query twig pattern has only ancestor-descendant edges, each solution to each individual query root-to-leaf path is guaranteed to be merge-joinable with at least one solution to each of the other root-to-leaf paths. This function is the key feature of the TwigStack algorithm.
# q: query twig pattern (q = root of the query tree)
function twigStack(q)
  # PHASE 1
  # The end of the first phase is reached when
  # ∀qi ∈ subtreeNodes(q) : isLeaf(qi) ⇒ eof(Tqi)
  while !end(q)
    # getNext() ensures that before a node hq from stream Tq is pushed on its stack:
    # - hq has a descendant hqi in each of the streams Tqi of the query nodes qi below q
    # - each of the nodes hqi recursively satisfies this property too
    q_act = getNext(q)
    if !isRoot(q_act)
      cleanStack(parent(q_act), nextL(q_act))
    end if
    if isRoot(q_act) or !empty("stack of parent(q_act)")
      cleanStack(q_act, nextL(q_act))
      moveStreamToStack("stream of q_act",
                        "stack of q_act",
                        pointer to top("stack of parent(q_act)"))
      if isLeaf(q_act)
        showSolutionsWithBlocking("stack of q_act", 1)
        pop("stack of q_act")
      end if
    else
      # the parent stack is empty: this node cannot contribute to a solution, skip it
      advance("stream of q_act")
    end if
  end while
  # PHASE 2
  mergeAllPathSolutions()
end
function getNext(q)
  if isLeaf(q)
    return q
  end if
  foreach q_i in children(q)
    # recursive call
    n_i = getNext(q_i)
    if n_i != q_i
      return n_i
    end if
  end foreach
  # note: n is the array of the query nodes n_i returned for the children of q
  n_min = minNextL(n)
  n_max = maxNextL(n)
  while nextR(q) < nextL(n_max)
    advance(q)
  end while
  if nextL(q) < nextL(n_min)
    return q
  else
    return n_min
  end if
end
# T_q: stream of q, S_q: stack of q, p: pointer to the top of the parent's stack
function moveStreamToStack(T_q, S_q, p)
  push(S_q, couple(next(T_q), p))
  advance(T_q)
end
# S: a stack, actL: left position of the current query node
function cleanStack(S, actL)
  # pop every stack element that ends before the current node starts
  while !empty(S) and topR(S) < actL
    pop(S)
  end while
end
The holistic twig join algorithm can perform multiple scans over its input streams simultaneously, while optimally reducing redundant root-to-leaf path solutions and skipping stream nodes that do not contribute to any solution. The function `showSolutionsWithBlocking()` does not actually display the solutions (the name comes from the research article): the complete intermediate path solutions, stored in the compact stack encoding, are built and appended to the list of intermediate path solutions found so far. The second phase uses this list as input to the merge-join process.
The second phase merges the intermediate root-to-leaf path solutions. Thanks to the blocking feature of the `showSolutionsWithBlocking()` function, the path solutions are already sorted in root-to-leaf order. Since the input is sorted on the common prefix, the second phase is linear in the sum of its input and output sizes. This is another key feature of the efficiency of the TwigStack algorithm.
The compact stack encoding is built in the first phase; `showSolutionsWithBlocking()` acts as a kind of decoder. During its execution, the TwigStack algorithm may find a huge number of partial intermediate paths that match the query. If all of those partial paths were stored individually, the intermediate storage could be bigger than the input and output combined.
The following example shows how a set of root-to-leaf path solutions is encoded using stacks. In this example, the query is `//a//b//c`. Each node of the query has its own stack, which stores the positional information retrieved from the inverted list (stream) corresponding to the node, together with a pointer to an element of its parent query node's stack.
<?xml version="1.0" encoding="UTF-8" ?>
<a>
<a>
<b>
<b>
<c />
</b>
</b>
</a>
</a>
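The following Python sketch is my own illustration (not the paper's code) of how this compact stack encoding can be expanded into explicit root-to-leaf path solutions; each stack entry is stored here with the index of the top of its parent query node's stack at the time it was pushed:

```python
# My own sketch of decoding the compact stack encoding for //a//b//c.
def expand(stacks, level, index):
    """Yield the path solutions that end at stacks[level][index]."""
    label, parent_index = stacks[level][index]
    if level == 0:
        yield [label]
        return
    # the entry extends every solution ending at or below its parent pointer
    for i in range(parent_index + 1):
        for prefix in expand(stacks, level - 1, i):
            yield prefix + [label]

# snapshot of the stacks for //a//b//c on the document above, at the moment
# the single c node is processed (outer a/b pushed before inner a/b)
stacks = [
    [("a_outer", -1), ("a_inner", 0)],   # stack of query node a
    [("b_outer", 1), ("b_inner", 1)],    # stack of query node b
    [("c", 1)],                          # stack of query node c
]
for solution in expand(stacks, 2, 0):
    print(solution)
# 4 root-to-leaf path solutions are encoded with only 5 stored stack entries
```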