XML query processing algorithms

Vincent Jordan (KDE lab.)

In February 1998, XML 1.0 has been introduced by the W3C to be the new recommendation for data exchange on internet. At this time, the most simple querying solution was to store XML data into relational databases and to convert XML queries into SQL queries. Storing and querying became an important research domain because this solution leads to poor performances. This issue has been addressed by several research papers. The most famous is the TwigStack algorithm.
The first section makes an overview of the current solution for XML documents storage into well-known relational databases. The progress to TwigStack algorithm will be shown in the second section through two previously published papers on which TS is based.

Commercial relational database management systems §

One of the first ideas when it comes to XML query processing is to store XML documents into relational database system and convert XPath queries into SQL queries. Using this solution, twenty years of work on RDBMS query optimization, query execution, scalability, concurrency control and recovery immediately extend to XML query processing. Commercial products such as Oracle, IBM DB2 and Microsoft SQL Server all provides specific features to store and query XML data.

RDMSOracle 11gIBM DB2 9Microsoft SQL Server 2008
Large object storage data type suitable for XML documentsXMLTYPE1 store AS CLOBXMLCLOB2 or XMLVARCHAR3 or XMLFILE4[n]varchar(max) or varbinary(max)
Automatique mapping/shredding XML into relational storageXMLTYPE1 store AS OBJECT RELATIONALXML CollectionsXML View5
Native XML storage data typeXMLTYPE1 store AS BINARY XMLxml (pureXML)xml
Storage options of an XML document into relational database management system

1XMLType cannot exceed 4GB.
2XMLCLOB cannot exceed 2Gb.
3XMLVARCHAR cannot exceed 32kb.
4More flexible but does not benefit from database-managed persistency and integrity.
5Schema cannot be recursive or the maximum recursion depth is known.

Most of the commercial relational database vendors claimed XML support in their products through XML extenders as soon as XML was released. Two storage options were available: plain-text storage as simple text string or shredding/mapping into standard relational tables. As it can be noticed in the table above, the three main vendors all finally included a native XML storage in the latest version of their product since XML data does not fit that well in relational databases: PureXML is a new feature of IBM DB2 9 [DB2-XML], (native) xml storage is a new feature of Microsoft SQL Server 2005 [SQLSRV-XML], BINARY XML storage is a new feature of Oracle 11g [GRALIKE10].

The main difference between relational and XML data model is that the first is structured and the second is semi-structured or unstructured. Relational data model is suitable for the storage of highly structured data having already well-known schema at database design. If the structure of the data is not known or if it may change significantly in the future, XML data model offers more flexibility. Due to its tree structure, XML data model excels at representing containment hierarchies (especially recursive ones), finally XML data model allows the creation of queries based on the structure of the data (rather than its value).

Even though XML solves many of the problems by providing a standard format for data interchange, there are other problems, such as storing the XML documents in a centralized repository, as well as the ability to quickly search for information or to trigger automatic data change when a particular action occurs. These kinds of issues can be addresses only by a database management system (DBMS). Those systems have a legitimacy for XML documents too, but they have to process them in a different way than relational tables. This is why some research in that field was required.

Research achievements on XML query processing §

This section will introduce the evolution of XML query processing research through three papers and their respective algorithms. XML data preprocessing is presented first when explaining the first article. It will not be repeated since this task is the same and is required for the three algorithms. This task is also required for the GPU version explained in following chapters.

Research progress on XML query processing
fig Research progress on XML query processing

This figure introduces the main improvement made by each article based on the work of the previous. Numerous successors exist to the last article of this figure, but do not introduce major changes.

MPMGJN algorithm §

Utilization of the Multi Predicate MerGe JoiN algorithm for XML query is explained in the following research paper:

On Supporting Containment Queries in Relational Database Management Systems
Chun Zhang, Jeffrey F. Naughton, David J. DeWitt, Qiong Luo and Guy Lohman
SIGMOD 2001

From the observation that the inverted lists of Information Retrieval engines is well-suited to XML queries, this paper compares two possibilities of querying architecture for RDBMS: separated "loosely-coupled" IR engine or native tables and SQL queries. Because of the several advantages of native relational database storage and querying, the authors purpose was to achieve same or better performances using the second option.
The article shows that join algorithms and hardware cache utilization are not efficient when processing XML queries from relational tables.

The author predicted that a substantial amount of XML documents would be stored in relational databases in the future. Nine years later, major DBMS vendors included a native XML storage option in their product. In retrospect, the good idea of this article was not the join algorithm, but the efficient XML representation as an extended inverted index.

XML data preprocessing §

The following simple example is a well-formed and valid XML document. The first line is the prolog of the document (a kind of header), the content goes from the second line to the end line. There are three kinds of token: element (opening and closing), attribute and text. The most common way to represent XML content is the tree data structure.

<?xml version="1.0" encoding="UTF-8" ?>
<library>
	<category name="France">
		<book>
			<title language="English">The Little Prince</title>
		</book>
	</category>
</library>
Example of XML document viewed as plain text
XML data  viewed as a tree
fig Same XML document viewed as a tree

An XML document can be formalized as a set of a vertice set, an edge set and a root vertice: (V, E, r) where V = {v1, …, vn} is the set of nodes which contains elements (XML tag, mandatory), strings (optional) and attributes (tag attributes, optional), E = {(vj, vk)} is the set of edges between two tree nodes and r ∈ V is the root node.

In order to simply the process of attributes, an XML document was formalized as a set of element node set, attribute node set, string node set, edge set and root element node: (El, At, St, Ed, r). El = {el1, …, eln}, At = {at1, …, atm} and St = {st1, …, stp}. Ed = {(x, y)} where (x, y) ∉ {(atj, atk), (atj, elk), (stj, stk), (stj, atk), (stj, elk)}.

<?xml version="1.0" encoding="UTF-8" ?>
<library>
	<category>
		<@name>France</@name>
		<book>
			<title>
				<@language>English</@language>
				The Little Prince
			</title>
		</book>
	</category>
</library>
Example of XML document without element attributes
XML data  viewed as a tree
fig Same XML document viewed as a tree

XML document is not kept in its original tree structure from DOM parsers. The underlying structure is in the form of streams. Streams are sequences of XML nodes represented in a 3-ary tuple representation: document number, left and right positions and depth. The following example is still based on the same XML document and also feature a path column that is not mandatory.
The classic inverted index data structure maps words or phrases only. In order to store XML documents, it can be extended into: an element index (E-index), an attribute index (A-index) and a term index (T-index).

value document
number
left
position
right
position
depth path
E-index (XML elements)
library 1 1 17 0 /
category 1 2 16 1 /library/
book 1 6 15 2 /library/category/
title 1 7 14 3 /library/category/book/
A-index (attributes of XML elements)
name 1 3 5 2 /library/category/
language 1 8 10 4 /library/category/book/title/
T-index (text words nested in XML elements)
France 1 4 4 3 /library/category/name/
English 1 9 9 5 /library/category/book/title/language/
The 1 11 11 4 /library/category/book/title/
Little 1 12 12 4 /library/category/book/title/
Prince 1 13 13 4 /library/category/book/title/
Example of the extended inverted index of XML data

This inverted indexes representation was chosen because it improves the discovery of containment properties (or structural relationship) on which XML queries are based. Paper use different terms. Containment properties between two nodes of the XML tree can be "ascendant-descendant" (indirect containment) or "parent-child" relationships (direct containment). I find the term between quotes easier to understand than the one between parentheses therefore I will use them.

Using the notation of the table example, "A is a descendant of B" is equal to this condition doc_noA = doc_noB AND left_posA > left_posB AND right_posA < right_posB. This condition matches all descendants. If only children have to be match, it only requires to append depthA = depthB + 1 to the previous condition. Because of the strict nesting structure of XML, right_posA < right_posB can be omitted for child relationship.
A worth noting point about this representation of XML document is that checking an "ascendant-descendant" relationship is as easy as checking "parent-child" relationship. This is the main advantage of this representation over the tree representation.

Query parsing §

An XML query can be seen as a set of structural relationships. Using XPath abbreviated syntax, '/' and '//' (abbr. of /descendant-or-self::node()/) represent parent-child and ascendant-descendant relationships.

from XPath query string to query paths
fig Example of XPath query string to query paths conversion.
Please note that the tree representation of this figure always feature "parent-child" relationship between nodes.

The query of the figure means "all English titles of books in categories named France of the library". XPath query string is parsed in order to create a tree representation of it. The brackets allows to create several conditions in the query which means several branches in the tree representation. This query tree can also be viewed as a set of query paths.
Brackets can also have another role. Please compare the following queries:

  1. /library/category[@name=France]/book/title[@language=English]
  2. /library[/category[@name=France]][/book/title[@language=English]]

These two XPath queries do produce the same query tree and the same query paths set. The difference is only at display level. First query will output XML data between matching <title> while second query will output all data between <library> which contain a matching title (if library contains many matching titles, the whole library will be output as many times as the number of matching titles).

Algorithm §

The MPMGJN algorithm is a variant of the well-known inverted list algorithm of Information Retrieval. For example, if the query to proceed is B//"A", the inverted lists of the element B and the term A are retrieved into List1 and List2 (function input).

 1 # List1: Outer list, List2: Inner list
 2 function containmentMerge(List1, List2)
 3 	set cursor1 at beginning of List1
 4 	set cursor2 at beginning of List2
 5 	while cursor1 != “end of List1” and cursor2 != “end of List2”
 6 		# join only on the same document
 7 		if cursor1.docno < cursor2.docno
 8 			cursor1++
 9 		else if cursor2.docno < cursor1.docno
10 			cursor2++
11 		else
12 			# mark contains the start of the record scan. mark = cursor2 since no scan yet
13 			mark = cursor2
14 			while cursor2.position < cursor1.position and cursor2 != “end of List2”
15 				cursor2++
16 				# if no start record found before end of list
17 				if cursor2 == “end of List2”
18 					cursor1++
19 					cursor2 = mark
20 				# if start record found mark will remember this start while scanning
21 				else if cursor1.val contains cursor2.val
22 					mark = cursor2
23 					do
24 						# ~output
25 						merge cursor1 and cursor2 values
26 						cursor2++
27 					# stop scanning if join no more possible ("stop record")
28 					while cursor1.val contains cursor2.val and cursor2 != “end of List2”
29 					# next outer will be proceeded
30 					cursor1++
31 					# will restart at "start record" on next seek
32 					cursor2 = mark
33 				# end of scan
34 				end if
35 			# end of seek
36 			end while
37 		# end of document search
38 		end if
39 	# end of join merge
40 	end while
41 end
The inverted list containment merging algorithm with comments

When applied to B and A streams, the algorithm above performs a join operation like the following SQL query.

SELECT *
FROM elements e, texts t
WHERE e.value = 'B'
AND t.value = 'A'
AND e.doc_no = t.doc_no
AND e.left_pos < t.left_pos
AND e.right_pos > t.right_pos
Example of SQL query for B//"A"

The RDBMS query plan generator can choose between index nested-loop join and merge join algorithms to process the SQL query. The first choice is very close to the MPMGJN algorithm: inner rows are selectively examined using start and stop keys in both, but it still suffer from binary trees utilization. Binary search is efficient but performs unpredictable memory access. The improvement of MPMGJN is that the seeks are made serial in comparison with standard index nested loop, thus it has a better hardware cache usage and lower cache miss rate.
Despite its good cache usage, MPMGJN algorithm still has a big drawback: the worst case is quadratic.

<?xml version="1.0" encoding="UTF-8" ?>
<a1>
	<d1 />
	<a2>
		<d2 />
		<a3>
			<d3 />
			<d4 />
		</a3>
	</a2>
</a1>
Example of MPMGJN worst case, viewed as plain text
MPMGJN worst case
fig Example of MPMGJN worst case, viewed as tree

Stack-Tree algorithm §

Stack-Tree algorithm is explained in the following research paper:

Structural Joins: A Primitive for Efficient XML Query Pattern Matching
Shurug Al-Khalifa, H. V. Jagadish, Nick Koudas, Jignesh M. Patel, Divesh Srivastava, Yuqing Wu
ICDE 2002

This paper takes advantage of the inverted list representation of XML data while introducing a novel stack-featured algorithm. Stack-tree algorithm achieves linear worst-case complexity while MPMGJN algorithm was quadratic.

 1 # AList: list of potential ancestors sorted
 2 # DList: list of potential descendants sorted
 3 function stackTreeDesc(AList, DList)
 4 	a = AList.firstNode
 5 	d = DList.firstNode
 6 	outputList = NULL
 7 	while “the input lists are not empty” or “the stack is not empty”
 8 		if a.StartPos > stack.top.EndPos and d.StartPos > stack.top.EndPos
 9 		# cannot contain any d or a: not a solution -> remove
10 		# example: </stackTopElem> ... <nextA> ... <nextD>
11 			tuple = stack.pop()
12 		else if a.StartPos < d.StartPos
13 		# ...and a could be a descendant of top stack element if it isn't going
14 		# to be closed before d starts
15 			stack.push(a)
16 			a = a.nextNode
17 		# (a.StartPos > d.StartPos) but top stack element contains d
18 		# therefore d can't be nested more deeper -> return stack as result
19 		else
20 			for a1=stack.bottom; a1!=NULL; a1=a1.up
21 				append(a1,d) to outputList
22 			end for
23 			# d completed
24 			d = d.nextNode
25 		end if
26 	end while
27 end
Stack-Tree algorithm with comments

This join algorithm provides for a more efficient set-at-a-time strategy performing no unnecessary comparisons while MPMG join algorithm used a node-at-a-time strategy (especially for parent-child relationship).

MPMGJN worst-case (short version)
fig MPMGJN will perform
unnecessary comparisons
while Stack-tree will not.

The problem of the previous algorithm is solved since Stack-tree algorithm will not restart after examining d1, but will use the stack to go back later (for d2).
Thanks to its stack, Stack-tree algorithm can guarantee a linear worst-case complexity (in both CPU and I/O).
The paper presents an efficient join algorithm for binary relationships, but complex queries contains several binary relationships. Finding the optimal join ordering was outside the scope of this paper and will be addressed by the next one.

Holistic Twig Joins algorithms §

Twig Joins is a family of algorithms for processing XML query patterns. They are refered as holistic because they allow to match structural relationships holistically (i.e., as a whole), thus reducing the number of not required temporary results. This strategy is opposed to previously presented join algorithms which only solved the problem of binary relationships while the join ordering of complex queries remained outside the scope of them.
Like Stack-tree, TwigStack algorithm uses a set-at-a-time strategy. The original TwigStack algorithm is explained in the following research paper and many other algorithms have followed such as Twig2Stack, TwigList or TwigMix.

Holistic Twig Joins: Optimal XML Pattern Matching
Nicolas Bruno, Nick Koudas and Divesh Srivastava
SIGMOD 2002

Algorithm overview §

Like all algorithms presented in this section, TwigStack makes use of the extended inverted index representation of XML data. XML query string is also viewed as a tree, but unlike previous solutions, this tree is not divided into query root-to-leaf paths and binary relationships. The inverted list (refered as stream in the article) of each query node is retrieved and linked to the query tree.

TwigStack query processing algorithm overview
fig TwigStack query processing algorithm overview

Algorithm execution is divided into two different steps: phase 1 and phase 2.
In the first phase, query node streams are compared in order to find root-to-leaf paths. In the second phase, those paths are merged to create full query tree again.
As said before, the algorithm does not just solve all root-to-leaf path matching a query path because it would lead to many intermediate results which may not be part of the final answer. The getNext() function ensures that when query twig pattern has only ancestor-descendant edges, each solution to each individual query root-to-leaf path is guaranteed to be merge-joinable with at least one solution to each of the other root-to-leaf paths. This function is key feature of the TwigStack algorithm.

 1 # q: query twig pattern (q = root of the tree)
 2 function twigStack(q)
 3 	# PHASE1
 4 	# End of first phase is reached when
 5 	# ∀qi ∈ subtreeNodes(q) : isLeaf(qi) ⇒ eof(Tqi)
 6 	while !end(q)
 7 		# getNext() call ensures that before a node hq from stream Tq
 8 		# is pushed on its stack:
 9 		# - hq has a descendent h_q_i in each of the streams Tqi.
10 		# - each of the nodes hqi recursively satisfies this too.
11 		q_act = getNext(q)
12 		if !isRoot(q_act)
13 			cleanStack(parent(q_act), nextL(q_act))
14 		end if
15 		if isRoot(q_act) or !empty(“stack of parent(q_act)”)
16 			cleanStack(q_act, nextL(q_act))
17 			moveStreamToStack(“stream of q_act”,
18 			                  “stack of q_act”,
19 			                  “pointer to top(“stack of parent(q_act)”)”)
20 			if isLeaf(q_act)
21 				showSolutionsWithBlocking(“stack of q_act”, 1)
22 				pop(“stack of q_act”)
23 			else
24 				advance(“stream of q_act”)
25 			end if
26 		end if
27 	end while
28 	# PHASE2
29 	mergeAllPathSolutions()
30 end
 1 function getNext(q)
 2 	if isLeaf(q)
 3 		return q
 4 	end if
 5 	foreach q_i in children(q)
 6 		# Recursive call
 7 		n[i] = getNext(q_i)
 8 		if n_i != q_i
 9 			return n_i
10 		end if
11 	end foreach
12 	# Please note that n is an array of query nodes (n=children of q)
13 	n_min = minNextL(n)
14 	n_max = maxNextL(n)
15 	while nextR(q) < nextL(n_max)
16 		advance(q)
17 	end while
18 	ifnextL(q) < nextL(n_min)
19 		return q
20 	else
21 		return n_min
22 	end if
23 end
24 	
1 # T_q: stream of q, S_q: stack of q, p: pointer to parent
2 function moveStreamToStack(T_q, S_q, p)
3 	push(S_q, couple(next(T_q), p))
4 	advance(“stream of q”)
5 end
1 # S: stack, actL: left position of the actual query node
2 function cleanStack(S, actL)
3 	while !empty(S) and topR(S) < actL
4 	pop(S)
5 end
Holistic TwigStack algorithm with comments
Phase 1 §

The holistic twig joins algorithm can perform multiple scans over stream inputs simultaneously while reducing redundant query root-to-leaf path solutions optimally and skipping stream nodes that do not contribute to the solutions. The function showSolutionWithBlocking() does not show the solutions actually (the name comes from the research article). The complete intermediate path solution(s) which are stored in compact stack encoding are built and appended to the list of intermediate path solutions found so far. The second phase will use this list as input to the merge-join process.

Phase 2 §

The second phase merge the intermediate root-to-leaf path solutions. Thanks to the blocking feature of the showSolutionsWithBlocking() function, path solutions are already sorted in root-to-leaf order. Since the input is sorted in order of the common prefix, the second phase is linear in the sum of its input and output. This is also a key feature of the efficiency of the TwigStack algorithm.

Compact stack encoding §

The compact stack encoding is built in the first phase. showSolutionsWithBlocking() is a kind of decoder. During its execution, TwigStack algorithm might find a huge amount partial intermediate path that match the query. If all of those partial paths were stored individually, this intermediate storage could be bigger than input and output storage sizes.
The following example shows how a set of root-to-leaf path solution are encoded using stacks. In this example, the query is //a//b//c. A stack belongs to each node of the query. This stack stores the positional information retrieved from the inverted list (stream) corresponding to the node and a pointer to the parent stack element in the query.

<?xml version="1.0" encoding="UTF-8" ?>
<a>
	<a>
		<b>
			<b>
				<c />
			</b>
		</b>
	</a>
</a>
XML document
Stack encoding
fig Stack encoding of the result
  1. a1 b1 c1
  2. a1 b2 c1
  3. a2 b1 c1
  4. a2 b2 c1
Query result
xhtml valid? | css valid? | last update on September 2010