
The v_array data-structure

Introduction §

v_array is an implementation of the unrolled doubly-linked list data-structure, written in the C language. This variation of the standard linked list is actually a list of arrays, since each node contains several data elements. The data-structure was created for dynamic memory allocation: dynamic memory allocation makes it possible to implement and experiment with, using CUDA, algorithms that do not fit the current programming model.
It was designed to be very generic and to be used for CUDA development. Its main purpose is to enable developers to create, from the same source code, several versions that use different kinds of memory access (e.g. CPU or GPU main memory).

Why a new data-structure? The library has been created because of the limitations of nVIDIA GPUs with compute capability 1.1 to 1.3 (a.k.a. the Tesla architecture):

No function pointers required (not supported by GPUs of the Tesla architecture)
    many macro functions are used instead
No dynamic memory allocation required (not supported by GPUs of the Tesla architecture)
    a homemade memory manager can be found in v_mem_manager.h
Can use offsets instead of pointers
    easier comparison of CPU and GPU results (offsets are the same on both sides whereas pointers are not)
Source can be included directly
    can be used in CUDA kernels

Features §

One of the main issues with a data-structure is its complexity. The numerous data-structures available in computer science show that a trade-off between flexibility and efficiency is always required; each solution has strengths and weaknesses. The purpose of this section is to explain why v_array can be a better trade-off than others (and this is not restricted to GPU processing).

A good trade-off between indexing and insertion costs §

The most well-known solutions for storing a dynamic sequence of data are the linked list and the dynamic array. As explained previously, the unrolled linked list is a variation of the standard linked list. It is not a mix between a linked list and a dynamic array, since the nodes of the list do not double in size each time the previous array is full.
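To make the layout concrete, here is a minimal sketch of what an unrolled doubly-linked list node can look like; the field names, the row type and the value of m are illustrative assumptions, not the actual v_array definitions.

#define V_NODE_ROWS 32                    /* m: number of rows per node */

struct v_node {
    struct v_node *prev;                  /* previous node in the list */
    struct v_node *next;                  /* next node in the list */
    unsigned int   used;                  /* how many of the m row slots are filled */
    float          rows[V_NODE_ROWS];     /* the rows stored in this node */
};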

v_array memory representation
fig v_array memory representation

CPU architectures are optimized for loops over linear arrays. Branch prediction is designed to prevent the processor pipeline from being emptied at each loop iteration. Cache memory is designed to avoid fetching from main memory when accessing the rows of an array. Random memory access lowers performance since the cache-miss rate is high.
Although GPU architectures do not rely on a cache to optimize memory access, random memory access remains inefficient since it prevents memory coalescing.
The following table sums up the advantages and drawbacks of several memory structures according to four criteria:

Indexing cost is defined as the number of memory accesses needed to reach the nth row of the data-structure. Some memory structures, such as the linear array, allow the pointer of any row to be computed from the base pointer of the array. Other structures are scattered in memory.

Insertion cost is defined as the number of memory allocations needed to insert n rows (one by one) into the data-structure. For example, each time a row is inserted into a linked list, there is a new memory allocation. This is not the case for the dynamic array, since additional free rows are allocated in advance. The values in the table are not the cost of insertion in the middle of the sequence.

Memory cost is defined as the number of memory units that do not contain row data. For example, each node of a doubly-linked list has to store two pointers.

Cache-miss cost is defined by the moments when the next memory access is not located right beside the previous one (spatial locality). Memory structures with high data scattering also have a higher cache-miss rate.

                      | indexing                | insertion                  | additional memory space                                          | cache miss
                      | (avg. memory accesses)  | (avg. memory allocations)  | (per row)                                                        | (when?)
linked list           | (n+1)/2                 | 1                          | 2*sizeof(pointer)                                                | at each row
unrolled linked list  | (n+m)/(2m)              | 1/m                        | (ceil(n/m)*2*sizeof(pointer) + (ceil(n/m)*m - n)*sizeof(row))/n  | at each node (m rows)
dynamic array         | 1                       | (log2(n)+1)/n              | (2^ceil(log2(n)) - n)*sizeof(row)/n                              | at each array reallocation
array                 | 1                       | N/A                        | 0                                                                | at first array access only
Complexity of different data-structures. n is the number of rows and m is the number of rows in each node of the unrolled linked list.
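For example, with nodes of m = 32 rows and a sequence of n = 1024 rows, the table gives an average indexing cost of (n+m)/(2m) = 1056/64 = 16.5 memory accesses for the unrolled linked list, against (n+1)/2 = 512.5 for a plain linked list, while the insertion cost drops from 1 allocation per row to 1/32.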

The following plots are a graphical representation of the table above. Unfortunately, I was unable to draw discontinuous plots with gnuplot. At each step, the relevant value is the one on the right (at the right corner of the step). Please keep in mind that the represented values are the average costs.

Indexing costs comparison
fig Indexing costs comparison (Gnuplot file)
Inserting costs comparison
fig Inserting costs comparison (Gnuplot file)
Memory costs comparison
fig Memory costs comparison (Gnuplot file)

The dynamic array is famous for its amortized insertion cost, at the price of an increased memory cost. This data-structure is not easy to implement on top of a simple memory manager because of memory fragmentation and massive reallocations. On nVIDIA GPUs, the unrolled linked list is likely to use m = 32, since this allows computation on a node of the list to be performed by a whole GPU warp. Memory allocations are always of the same size, which enables a very simple memory allocator. The unrolled linked list also makes it easier to create a manual paging system in the shared memory of the GPU (list node = memory page).

Dynamic memory structure §

On the Tesla architecture, memory cannot be allocated dynamically from GPU code. Allocating memory in advance is one solution. Another solution is to use a CPU callback when a GPU memory allocation is required.

GPU to CPU callback
fig GPU to CPU callback

GPU-to-CPU callbacks were experimented with using pools of zero-copy memory and polling. This method is designed so that the GPU can issue any CPU function callback. When the GPU needs to execute a CPU function, it writes the function parameters into the zero-copy memory and sets a flag. Since the CPU is polling this memory area in the meantime, setting the flag triggers the execution of the appropriate CPU function, which then unsets the flag. The same flag polling is performed on the GPU while the CPU executes the callback function; unsetting the flag makes the GPU thread resume.
A CPU callback would be an elegant solution for dynamic memory allocation if polling were replaced by signals/interrupts (a sleep-based mechanism) [STUART10]. Unfortunately, neither the Tesla nor the Fermi architecture supports such a feature.
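Below is a minimal sketch of this polling protocol, assuming mapped (zero-copy) memory is available; the structure, the kernel and the "callback" itself are illustrative, not part of the v_array API.

/* Minimal sketch of the polling-based GPU-to-CPU callback described above. */
#include <cuda_runtime.h>
#include <stdio.h>

struct callback_slot {
    volatile int flag;    /* 0 = idle, 1 = request pending */
    int          param;   /* "function parameter" written by the GPU */
    int          result;  /* value written back by the CPU */
};

__global__ void gpu_side(struct callback_slot *slot)
{
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        slot->param = 42;             /* write the callback parameters */
#if __CUDA_ARCH__ >= 200
        __threadfence_system();       /* make them visible to the host */
#else
        __threadfence();              /* best available fence on Tesla-class GPUs */
#endif
        slot->flag = 1;               /* set the flag: request the callback */
        while (slot->flag != 0)       /* poll until the CPU unsets the flag */
            ;
    }
}

int main(void)
{
    struct callback_slot *slot, *d_slot;

    cudaSetDeviceFlags(cudaDeviceMapHost);
    cudaHostAlloc((void **)&slot, sizeof(*slot), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&d_slot, slot, 0);
    slot->flag = 0;

    gpu_side<<<1, 1>>>(d_slot);       /* kernel runs asynchronously */

    while (slot->flag == 0)           /* CPU polls the shared memory area */
        ;
    slot->result = slot->param + 1;   /* the "CPU callback" itself */
    slot->flag = 0;                   /* unset the flag: the GPU thread resumes */

    cudaDeviceSynchronize();
    printf("callback result: %d\n", slot->result);
    cudaFreeHost(slot);
    return 0;
}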

v_array contains a basic memory manager in order to simulate dynamic memory allocation on the GPU using the other option (i.e., one big memory allocation made in advance). The implementation is straightforward: the allocated chunk size is fixed and should match the node size of the unrolled linked list. The header of the memory pool is a simple bitfield which stores the state of each memory chunk: free or allocated.

memory pool for dynamic memory allocation
fig Memory pool for dynamic memory allocation

This simple solution was studied specifically in [HUANG10] and several efficiency improvements over this solution are shown.
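For illustration, here is a minimal sketch of such a fixed-size-chunk pool with a bitfield header; the names, sizes and layout are assumptions made for the example and do not reproduce the actual v_mem_manager.h implementation.

#include <stdint.h>
#include <string.h>

#define POOL_CHUNKS 256               /* number of fixed-size chunks */
#define CHUNK_SIZE  1024              /* bytes per chunk (node size of the list) */

struct mem_pool {
    uint32_t bitmap[POOL_CHUNKS / 32];        /* 1 bit per chunk: 1 = allocated */
    uint8_t  data[POOL_CHUNKS][CHUNK_SIZE];   /* the chunks themselves */
};

static void pool_init(struct mem_pool *p)
{
    memset(p->bitmap, 0, sizeof(p->bitmap));  /* all chunks start free */
}

/* Returns the offset of a free chunk inside p->data, or -1 when the pool
 * is exhausted. Returning an offset rather than a pointer keeps the value
 * valid on both the CPU and GPU copies of the pool (see below). */
static int pool_allocate_chunk(struct mem_pool *p)
{
    for (int i = 0; i < POOL_CHUNKS; i++) {
        uint32_t mask = 1u << (i % 32);
        if (!(p->bitmap[i / 32] & mask)) {
            p->bitmap[i / 32] |= mask;        /* mark the chunk as allocated */
            return i * CHUNK_SIZE;            /* offset from p->data */
        }
    }
    return -1;
}

static void pool_free_chunk(struct mem_pool *p, int offset)
{
    int i = offset / CHUNK_SIZE;
    p->bitmap[i / 32] &= ~(1u << (i % 32));   /* mark the chunk as free */
}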

Generic but function pointers not required §

In order to allow both CPU and GPU usage, all memory accesses of the library (allocation, copy, deallocation) were made generic through function pointers. Unfortunately, GPUs of the Tesla architecture cannot execute such code since all function calls have to be inlined, so memory accesses were also made generic through macro functions.

v_array memory access possibilities
fig v_array memory access possibilities

The v_array library allows 3 kinds of memory access to be used in the same program:

standard_{malloc, memcpy, free}_func
use the malloc(), memcpy() and free() functions provided by the operating system. These functions can only be used in CPU code.
cuda_{malloc, memcpyH2D, memcpyD2H, free}_func
use the cudaMalloc(), cudaMemcpy() and cudaFree() functions provided by the CUDA runtime library. These functions can only be used in CPU code.
(cuda_)mem_{malloc, memcpy, free}_func
use v_mem_manager_allocate_chunk() and v_mem_manager_free_chunk() provided by the v_array library. These functions can be used by both CPU and GPU code (GPU code uses the 'cuda_'-prefixed version).
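The sketch below illustrates the two genericity mechanisms side by side: run-time selection through function pointers for CPU code, and compile-time selection through macro functions for code that must be fully inlined. All identifiers are illustrative and do not reproduce the actual v_array macros.

#include <stdlib.h>

/* CPU side: the memory back-end is chosen at run time through pointers. */
struct mem_ops {
    void *(*mem_alloc)(size_t size);
    void  (*mem_free)(void *ptr);
};

static const struct mem_ops standard_ops = { malloc, free };

/* GPU side: no function pointers on the Tesla architecture, so the
 * back-end is chosen at compile time through macro functions that the
 * compiler can fully inline. */
#ifdef V_USE_POOL_BACKEND
#  define V_MEM_ALLOC(pool, size)  pool_allocate_chunk(pool)   /* pool back-end (see sketch above) */
#  define V_MEM_FREE(pool, off)    pool_free_chunk(pool, off)
#else
#  define V_MEM_ALLOC(pool, size)  malloc(size)                /* plain CPU back-end */
#  define V_MEM_FREE(pool, ptr)    free(ptr)
#endif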

Offsets instead of pointers §

Even with pinned memory, GPU and CPU never share the same memory address space. In the case of stream processing, memory structures are static and are not likely to contain any pointers. The case of a linked list is different, since the links are made of pointers. Of course, dereferencing on the GPU a pointer to CPU memory will lead to a crash, or worse, to reading memory from an unexpected area (an issue that is hard to debug).
Since there is no memory protection, reading or writing unallocated memory does not trigger any segmentation fault. If the GPU is also used for display, you can even accidentally scribble over video memory, which produces artistic screen results and forces you to reboot the computer.

Using a memory pool and offsets from the base pointer of the pool, instead of pointers, has the advantage of creating identical memory images on both CPU and GPU; the memory area can therefore easily be swapped in a single memory copy.

recommended usage of v_array
fig Recommended usage of v_array

In such a case, memory alignment has to be taken into account so as not to reduce GPU performance.
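A minimal sketch of offset-based links is shown below; the node layout and helper function are illustrative assumptions, not the actual v_array definitions.

/* Nodes reference each other by offset from the pool base, so the same
 * pool image is valid in both the CPU and the GPU address space. */
#include <stdint.h>

#define INVALID_OFFSET ((uint32_t)-1)

struct v_node_off {
    uint32_t next;       /* offset of the next node in the pool, or INVALID_OFFSET */
    uint32_t prev;       /* offset of the previous node */
    float    rows[32];   /* m = 32 rows stored in this node */
};

/* Resolve an offset against the base pointer of whichever copy of the
 * pool (CPU or GPU) is currently in use. */
static inline struct v_node_off *node_at(void *pool_base, uint32_t offset)
{
    return (struct v_node_off *)((char *)pool_base + offset);
}

/* The whole list can then be moved between address spaces in one copy,
 * e.g. cudaMemcpy(gpu_pool, cpu_pool, pool_size, cudaMemcpyHostToDevice),
 * and the offsets stay valid on the GPU side. */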

Memory spaces and iterators §

Since the GPU architecture contains several addressable memory spaces whose access costs differ by two orders of magnitude, it is important to avoid expensive memory accesses as much as possible. Another issue is related to the compiler's inference of pointer targets. An iterator structure has been introduced to address these issues.
Moving through the sequence of rows is done relative to the iterator's current position rather than from the beginning or end of the sequence. Since the iterator is always stored in local memory, there is no possible confusion when passing it as an argument to a function.
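Here is a minimal sketch of what such an iterator can look like, reusing the hypothetical offset-based node layout sketched above; field and function names are illustrative, not the actual v_array iterator API.

#include <stdint.h>

struct v_iter {
    void    *pool_base;    /* base pointer of the pool holding the nodes */
    uint32_t node_offset;  /* offset of the current node inside the pool */
    int      row;          /* index of the current row within the node */
};

/* Advance by one row, relative to the current position; when the current
 * node is exhausted, follow its 'next' offset (stored first in the
 * offset-based node layout sketched earlier). */
static void v_iter_next(struct v_iter *it, int rows_per_node)
{
    if (++it->row == rows_per_node) {
        uint32_t next = *(uint32_t *)((char *)it->pool_base + it->node_offset);
        it->node_offset = next;
        it->row = 0;
    }
}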

Memory fragmentation
fig Memory fragmentation

Memory space confusion is specific to the Tesla architecture since the Fermi architecture solves this issue thanks to its unified pointer feature.

Other data-structures based on v_array §

Stack §

The v_array implementation already meets all the requirements for an efficient double-ended queue data-structure (deque for short). A stack can be seen as a special case of a deque in which only one end is used.

Tree §

Implementing a tree data-structure on top of a v_array is less obvious than a stack. In order to keep the advantages of the v_array, some limitations on the v_tree operations were decided. Insertion is limited to the right side of the tree, since insertion/deletion at an arbitrary position of the list is not implemented in v_array.

Tree example
fig A tree example

On the other hand, some operations are very efficient, such as tree traversal (if the in-memory order is the same as the tree traversal order). Tree nodes are serialized into a sequence of nodes.
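One possible serialization, given as an illustration only (the actual v_tree layout may differ), stores the nodes in traversal order with enough structural information to rebuild the tree during a linear scan:

#include <stdint.h>

/* Hypothetical serialized tree node: nodes are stored in pre-order, and
 * the depth field is enough to recover parent/child relationships while
 * scanning the sequence from left to right. */
struct v_tree_node {
    uint32_t depth;    /* depth of this node in the tree */
    float    payload;  /* row data carried by the node */
};

/* A pre-order traversal is then a plain linear walk over the underlying
 * v_array, which matches the in-memory order. */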

v_tree storage
fig v_tree storage into v_array

API documentation §

The API documentation generated with Doxygen can be found at http://kde.cs.tsukuba.ac.jp/~vjordan/docs/v_array/api/. Doxygen generates the documentation from comments in the source code, thereby keeping it always up to date.

created by Vincent Jordan | last update: September 2010