The main challenge of implementing XML query processing on a GPU is to overcome the GPU's stream-processing design and its non-divergent, multiprocessor-based threading model, because current query processing algorithms do not fit this paradigm.
Two solutions were possible to solve this issue: adapt the existing Parallel TwigStack algorithm, designed for multi-core CPUs, to the GPU, or design a new query processing algorithm tailored to the GPU execution model.
Given the limited time (a 6-month internship) and my initial lack of knowledge in XML query processing, the first possibility was chosen. The first solution seemed to have a more progressive learning curve, while the second one promised more efficient processing and better results.
Several frameworks are available for general-purpose GPU computing. CUDA was chosen because it is the most stable and best documented framework, and it features a hardware debugger. OpenCL support was experimental at the beginning of my internship, so my implementation depends directly on the CUDA library. The price of this choice is being bound to nVIDIA products, whereas an OpenCL-based implementation would have been compatible with many more execution platforms (nVIDIA, ATI and VIA GPUs, high-end x86 CPUs and more), since OpenCL is supported by several processor makers.
CUDA and OpenCL are nevertheless very close, so switching from one to the other would not be a difficult task.
As regards the choice of CUDA library (driver API or runtime API), the runtime API is easier to use. Since the latest CUDA toolkit, driver API (cu*) and runtime API (cuda*) functions can be mixed in the same application. This convergence was probably made to let developers start with the runtime API, which is easier; when they require more low-level control, they can call the driver API without having to rewrite the whole program, as they previously would have had to.
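As a minimal sketch of this mixing (buffer size is illustrative, signatures as in recent CUDA toolkits), a runtime API allocation can be followed by a driver API query against the same implicit context:

    #include <stdio.h>
    #include <cuda.h>          /* driver API (cu*)    */
    #include <cuda_runtime.h>  /* runtime API (cuda*) */

    int main(void) {
        /* Runtime API call: implicitly creates a context on device 0. */
        float *d_buf;
        cudaMalloc((void **)&d_buf, 1024 * sizeof(float));

        /* Driver API call against the same implicit context:
         * query free and total device memory. */
        size_t free_mem, total_mem;
        cuMemGetInfo(&free_mem, &total_mem);
        printf("free: %zu / total: %zu bytes\n", free_mem, total_mem);

        cudaFree(d_buf);
        return 0;
    }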
The new Fermi architecture was released during my internship, along with several big changes in the CUDA toolkit. My implementation underwent several upgrades because of these toolkit updates, but the version at the time of writing is designed for the Tesla architecture. Fermi is not binary-compatible with Tesla, but the video card drivers can compile the PTX intermediate language just-in-time for both nVIDIA architectures.
The software architecture underwent several deep changes, and several prototypes were developed. The following figure shows the evolution toward the current prototype. It was created from my daily, linear schedule of the whole internship.
The unlinked grey blocks are the three initial research articles. The red blocks featuring a question mark point out implementation questions that arise from the research articles. Of course, research articles are not implementation handbooks; furthermore, none of these documents addresses implementation issues in a GPGPU environment.
The implementation process is divided into two main prototypes, represented by blue blocks. The big orange block marks the discovery of software design problems.
The first prototype was aimed at improving my knowledge of the TwigStack algorithm and was designed for CPU only. It used GLib (especially its doubly linked list implementation, GList). The LibICU library was used for its Unicode word breaker engine, LibXML2 with its DOM interface as the XML parser, and SQLite3 as the database back-end. A homemade XPath parser was written using regular expression matching. The metadata generator fed the database directly.
A direct implementation of the TwigStack algorithm was not obvious, since the full algorithm pseudocode is not given, and what is given is written in mathematical-style pseudocode. In particular, I had trouble distinguishing vector variables from scalar variables.
Several implementation flaws were discovered in this very first version. The generation of the metadata database was very slow, and a parser built from regular expressions is difficult to maintain: even though the parser is very simple, adding a new parsing feature is not easy.
After trying MySQL and SQLite in asynchronous mode [SQLITE-ASYNC] without success, it was decided to split the process into two phases. On a suggestion of professor Amagasa, the first step generates simple TSV files (TSV stands for Tab Separated Values and is similar to CSV), while the second step builds the database by loading the TSV files. With this new strategy, performance improved dramatically.
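A minimal sketch of the loading phase, under illustrative assumptions (the TSV columns are simplified to name, left, right): the key point is wrapping all inserts in a single transaction, which avoids one synchronous commit per row.

    #include <stdio.h>
    #include <sqlite3.h>

    /* Load one TSV file (name<TAB>left<TAB>right per line) into an
     * already-created table, inside a single transaction. */
    static int load_tsv(sqlite3 *db, const char *path) {
        FILE *f = fopen(path, "r");
        if (!f) return 1;

        sqlite3_stmt *stmt;
        sqlite3_exec(db, "BEGIN", NULL, NULL, NULL);
        sqlite3_prepare_v2(db,
            "INSERT INTO ELEMENTS(name, left, right) VALUES(?, ?, ?)",
            -1, &stmt, NULL);

        char line[4096];
        while (fgets(line, sizeof line, f)) {
            char name[1024];
            int left, right;
            if (sscanf(line, "%1023[^\t]\t%d\t%d", name, &left, &right) == 3) {
                sqlite3_bind_text(stmt, 1, name, -1, SQLITE_TRANSIENT);
                sqlite3_bind_int(stmt, 2, left);
                sqlite3_bind_int(stmt, 3, right);
                sqlite3_step(stmt);
                sqlite3_reset(stmt);
            }
        }
        sqlite3_finalize(stmt);
        sqlite3_exec(db, "COMMIT", NULL, NULL, NULL);
        fclose(f);
        return 0;
    }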
I was also asked to introduce XML attribute support in XPath queries. Adding a new syntax feature to my homemade parser was too time-consuming, so the XPath parser was restarted from scratch using tokenizer and parser generators.
Since the Unicode word breaker slowed down overall performance, it was replaced by a simple homemade word breaker. This implementation only uses space, tab and newline characters for word breaking, and is therefore not suitable for all languages (e.g. Japanese).
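A minimal sketch of such a word breaker (the callback type and function names are hypothetical): it scans the text once, treats space, tab and newline as the only separators, and assigns one sequential position per word.

    typedef void (*word_cb)(const char *word, int len, int position, void *ctx);

    /* Returns the next free position so the caller can keep word
     * positions unique across the whole document. */
    static int break_words(const char *text, int next_pos,
                           word_cb emit, void *ctx) {
        const char *p = text;
        while (*p) {
            while (*p == ' ' || *p == '\t' || *p == '\n')
                p++;                              /* skip separators */
            const char *start = p;
            while (*p && *p != ' ' && *p != '\t' && *p != '\n')
                p++;                              /* consume one word */
            if (p > start)
                emit(start, (int)(p - start), next_pos++, ctx);
        }
        return next_pos;
    }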
Query processing from an XML file is done in three steps:
1. the metadata generator parses the XML document and writes one TSV file per table;
2. the sqlite3 command line tool reads the database creation scripts: a new SQLite database file is created, containing three tables (ELEMENTS, ATTRIBUTES and TEXT), then the metadata of each table is loaded from the corresponding TSV file;
3. the query processor parses the XPath query and runs the TwigStack algorithm against the database.

Official CUDA samples from nVIDIA are built using the GNU make building tool on GNU/Linux and Visual Studio project files on Microsoft Windows.
The SCons tool makes multi-platform build scripts easier to write. The host compilers (GNU compiler and Visual Studio compiler) do not use the same syntax for compilation options; SCons was designed from the beginning for platform-dependent command generation and provides an efficient and modular framework to achieve this. I succeeded in adding simple CUDA support to the build process for both Windows and GNU/Linux.
Furthermore, I created Visual Studio project files which fall back on the SCons build scripts for compilation. In my opinion, this is the most comfortable development process: a centralized build script, but platform-specific development tools.
The W3C specification provides a modified Extended Backus–Naur Form (EBNF) grammar but does not contain any information about how to process XML documents. Several kinds of XML parsers exist; the categories are often named after their programming interfaces (APIs), such as the Simple API for XML (SAX) and the Document Object Model (DOM). SAX is stream-oriented while DOM is tree-oriented.
The SAX API is a much better choice for metadata generation, because this process does not require keeping the XML document in its original tree structure. The DOM API builds the whole XML tree in memory and can exceed the available memory when processing big XML documents of several gigabytes. Such big documents (e.g., the Wikipedia XML export) are the main target of this project.
Metadata generation is a simple pre-order tree traversal. Since XML elements are already stored in document order, metadata generation using SAX is straightforward: elements get their positions in the same order as they appear in the document. This is also an advantage of the inverted list representation. A stack of opening tags is maintained so as to match each ending tag when it is found; opening (left) and ending (right) positions are stored in the metadata base. When a text string is encountered, the word breaker function is called and each word gets a unique position. This practice comes from the Information Retrieval tradition: it helps to find the distance between words more efficiently.
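A minimal sketch of this traversal using the LibXML2 SAX interface (the state structure, fixed stack depth and TSV output are simplified compared to the real metadata generator):

    #include <string.h>
    #include <libxml/parser.h>

    /* Simplified state: a global position counter and a stack of the
     * left positions of the currently open elements. */
    typedef struct { int pos; int left[1024]; int depth; } State;

    static void on_start(void *ctx, const xmlChar *name, const xmlChar **attrs) {
        State *s = (State *)ctx;
        s->left[s->depth++] = ++s->pos;        /* opening (left) position */
        (void)name; (void)attrs;
    }

    static void on_end(void *ctx, const xmlChar *name) {
        State *s = (State *)ctx;
        int left  = s->left[--s->depth];
        int right = ++s->pos;                  /* ending (right) position */
        /* The real generator would emit one TSV line of the ELEMENTS
         * table here: (name, left, right, depth). */
        (void)name; (void)left; (void)right;
    }

    static void on_chars(void *ctx, const xmlChar *ch, int len) {
        /* The word breaker would assign one position per word here. */
        (void)ctx; (void)ch; (void)len;
    }

    int main(int argc, char **argv) {
        xmlSAXHandler h;
        memset(&h, 0, sizeof h);
        h.startElement = on_start;
        h.endElement   = on_end;
        h.characters   = on_chars;
        State s = {0, {0}, 0};
        return argc > 1 ? xmlSAXUserParseFile(&h, &s, argv[1]) : 1;
    }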
The XPath parser has been implemented using well-known tools: a lex-compliant tokenizer generator [FLEX] and a yacc-compliant parser generator [LEMON].
Flex is a tool for generating tokenizers (also known as scanners or lexers). A tokenizer matches lexical patterns (tokens) in text. Another way to match lexical patterns is to use regular expressions, but a specifically generated tokenizer is far more efficient. By default, flex-generated tokenizers are not Unicode-compliant, but with the UTF-8 encoding, the source code can be treated as a byte stream compatible with the ASCII encoding.
The input data of the query processor are the XML document and the XPath query; both have to be parsed first. XML parsing is done by the metadata generator, while the XPath query is parsed by the query processor itself.
As for the XPath parser, the W3C specification provides a modified EBNF grammar. This context-free grammar can be used by a parser generator, for example one of the well-known Look-Ahead Left-to-right Rightmost derivation (LALR(1)) parser generators such as GNU Bison.
An XPath query can use abbreviated or unabbreviated syntax. For example, the following query:
title[@language="français"]
is written in abbreviated syntax. The unabbreviated syntax of the same query is:
child::title[attribute::language="français"]
Since the TwigStack algorithm only supports simple queries containing parent-child and ancestor-descendant relationships, the abbreviated syntax is sufficient; thus the implemented XPath grammar is based on a subset of the abbreviated syntax only.
The TwigStack algorithm is divided into two phases: the first phase matches root-to-leaf paths and the second phase merges those paths into trees according to the query. Only the first phase was parallelized on the GPU, because it fits a data-parallel model. The second phase is performed on the CPU; it could be parallelized as well, but following a task-parallel model that is not easy to implement on a GPU.
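A minimal host-side sketch of this split (types, buffer sizes and the CPU merge function are hypothetical; the kernel itself is sketched further below): phase 1 runs on the GPU over all partitions, then the matched paths are copied back and merged on the CPU.

    #include <cuda_runtime.h>

    struct Partition;                       /* hypothetical input partition */
    struct PathMatch { int nodes[8]; };     /* hypothetical matched path    */

    __global__ void phase1_kernel(Partition *parts, int n_parts);  /* GPU side */
    void merge_paths_on_cpu(PathMatch *paths, int n);              /* CPU side */

    void run_query(Partition *d_parts, int n_parts,
                   PathMatch *d_paths, PathMatch *h_paths, int max_paths) {
        /* Phase 1 (GPU, data-parallel): match root-to-leaf paths,
         * one warp per partition (see the kernel sketch below). */
        int threads = 128;                                /* 4 warps per block */
        int blocks  = (n_parts * 32 + threads - 1) / threads;
        phase1_kernel<<<blocks, threads>>>(d_parts, n_parts);
        cudaDeviceSynchronize();

        /* Phase 2 (CPU): copy the matched paths back and merge them
         * into result trees according to the query. */
        cudaMemcpy(h_paths, d_paths, max_paths * sizeof(PathMatch),
                   cudaMemcpyDeviceToHost);
        merge_paths_on_cpu(h_paths, max_paths);
    }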
Stream partitioning is based on the SPX model. The SPX model was created to allow a thread to process both the first and the second phase of the algorithm, therefore the XML document has to be partitioned in a way that preserves the whole query tree inside each partition. Since this feature is not needed in my implementation, the SPX partitioning was simplified and only preserves root-to-leaf paths.
The TwigStack algorithm makes intensive use of stacks. This was a main point of the implementation: how to handle a stack data structure on a GPU?
This question led to the v_array library. The next section presents this library, on which the TwigStack implementation is based.
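For illustration only, here is a minimal fixed-capacity stack usable from device code; this is not the actual v_array interface, only the kind of structure such a library has to provide.

    /* Fixed-capacity stack living in device memory (data is
     * pre-allocated with cudaMalloc by the host). */
    typedef struct {
        int *data;
        int  top;   /* index of the next free slot */
        int  cap;
    } DevStack;

    __device__ int stack_push(DevStack *s, int v) {
        if (s->top >= s->cap) return 0;   /* overflow: report failure */
        s->data[s->top++] = v;
        return 1;
    }

    __device__ int stack_pop(DevStack *s, int *v) {
        if (s->top == 0) return 0;        /* underflow */
        *v = s->data[--s->top];
        return 1;
    }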
Because of the choice to implement on the GPU the same Parallel TwigStack as on multi-core CPUs, each thread is fully divergent and executes its own TwigStack algorithm on its own partition. This is not a problem on a CPU, but it is on a GPU: CPU and GPU thread definitions are not the same, and a fully divergent thread cannot be implemented using CUDA. Warps, however, can be independent, therefore a whole warp of 32 GPU threads is used as one CPU-style thread.
In the current implementation, only the first thread of each warp executes the TwigStack algorithm and the others do nothing. This is indeed a huge waste of GPU resources.
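A minimal sketch of this execution pattern (the partition type and per-partition function are hypothetical): each warp stands in for one CPU-style thread, and only its first lane does the work.

    struct Partition;                                       /* hypothetical */
    __device__ void twigstack_on_partition(Partition *p);   /* hypothetical */

    __global__ void phase1_kernel(Partition *parts, int n_parts) {
        int gtid = blockIdx.x * blockDim.x + threadIdx.x;
        int warp = gtid / 32;          /* one "CPU thread" per warp */
        int lane = gtid % 32;
        if (lane != 0 || warp >= n_parts)
            return;                    /* 31 of 32 lanes idle by design */
        twigstack_on_partition(&parts[warp]);
    }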
Tesla and Fermi GPUs have different warp execution strategies. On Tesla, since one multiprocessor executes one warp at a time, TwigStack parallelism only exists among multiprocessors, and 7 cores out of 8 are unused.
On Fermi, a multiprocessor can execute two warps at the same time, but 15 cores out of 16 are wasted.
As a consequence of the different core grouping, a Tesla GPU has more multiprocessors than a Fermi GPU, but each one contains fewer cores. If n is the number of CUDA cores, Tesla can execute n/8 fully divergent threads while Fermi can execute (n/32)*2 = n/16.
High-end Tesla cards have 30 multiprocessors (GeForce GTX 285, 240 cores) and high-end Fermi cards have 15 multiprocessors (GeForce GTX 480, 480 cores). Since a Fermi multiprocessor can process two divergent threads, the most advanced Tesla and Fermi cards can ultimately execute the same number of divergent threads in parallel (30), despite the big difference in available cores.
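The comparison can be summed up in a small helper (parameter names are illustrative):

    /* Number of fully divergent "CPU-style" threads a card can run,
     * following the reasoning above. */
    int divergent_threads(int cuda_cores, int cores_per_mp, int warps_per_mp) {
        int mps = cuda_cores / cores_per_mp;
        return mps * warps_per_mp;
    }

    /* Tesla GeForce GTX 285: divergent_threads(240,  8, 1) == 30 */
    /* Fermi GeForce GTX 480: divergent_threads(480, 32, 2) == 30 */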
The Fermi architecture introduces other improvements, such as more cache and faster thread context switches, so it is still expected to perform better than Tesla.
What about GPU vs CPU?
This is the main question of my thesis and, unfortunately, I did not manage to execute my GPU implementation: all design issues have been overcome, but some bugs remain. As explained in the introduction, even if the GPU is not faster than the CPU, it could still be used as an "XML query coprocessor" and relieve the CPU workload.
Non-uniform parallelism using CUDA has already been studied in [LERNER08]. The last slide of that presentation is a list of wishes for an improved CUDA toolkit (especially about debugging support). CUDA has improved since then, but non-uniform parallelism is still not obvious to implement.