Since GPU software is uploaded to an external device, one gets very little feedback when something goes wrong. As a GPU program grows more complex, the lack of GPU debugging solutions becomes a real issue.
At the beginning, the CUDA toolkit did not feature any GPU debugger. The nVIDIA solution was to provide a minimalistic emulation layer, enabled through a compiler option, in order to generate CPU-only executables from CUDA source code; standard C debugging tools can then be used on those programs since they no longer execute on the GPU. This solution is known as device emulation and could be activated with the -deviceemu compiler option. This option was deprecated in the CUDA 3.0 toolkit and has not been available since the 3.2 toolkit. nVIDIA recommends using the new CUDA hardware debugger instead. This debugger is very useful, but may crash. This section presents the other debugging solutions available for CUDA.
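As a reminder of what device emulation looked like, the workflow on a pre-3.2 toolkit amounted to two commands (a sketch; program.cu is a placeholder name):

    nvcc -deviceemu -g program.cu -o program    # build a CPU-only executable
    gdb ./program                               # debug it with a standard C debugger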
A simple way to get more feedback from a piece of software is to make it more verbose. This method is known as "printf" debugging because the printf C function is used to trace program execution. Developers often implement several levels of verbosity and enable them at compile time through compilation parameters, as sketched below.
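A minimal sketch of this technique, assuming the verbosity level is passed on the compiler command line (e.g. -DVERBOSITY=2); the TRACE macro name is illustrative:

    #include <stdio.h>

    #ifndef VERBOSITY
    #define VERBOSITY 0                /* silent unless overridden at build time */
    #endif

    /* Print a trace message only if its level is enabled. */
    #define TRACE(level, ...) \
        do { if ((level) <= VERBOSITY) fprintf(stderr, __VA_ARGS__); } while (0)

    int main(void)
    {
        TRACE(1, "coarse trace: program started\n");
        TRACE(2, "fine trace: entering main loop\n");
        return 0;
    }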
At the beginning of CUDA development, "printf" debugging was only available when using device emulation on the CPU, since GPUs of the Tesla architecture were not able to perform function calls (unlike CPUs). This solution was not efficient because the GPU emulation was too straightforward and led to extremely slow execution (usually by an order of magnitude). Atomic functions and race conditions among the threads of a warp were not emulated, nor were the incompatible address spaces of global and local pointers.
An nVIDIA employee created a hack to do a 'sort-of' printf while executing code on the GPU, but this function has never been released in the official CUDA toolkit. The cuPrintf function writes strings into a fixed-size buffer located in GPU global memory. When the program executes on the CPU again, it reads this buffer and prints out all the text strings using the standard printf.
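A minimal usage sketch, assuming the cuPrintf.cu source file from nVIDIA's cuPrintf package is available alongside the project (the package also provides the host-side cudaPrintfInit, cudaPrintfDisplay and cudaPrintfEnd helpers):

    #include <cstdio>
    #include "cuPrintf.cu"            /* from the cuPrintf package, not the toolkit */

    __global__ void kernel(int n)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid < n)
            cuPrintf("thread %d is alive\n", tid);  /* goes to a global-memory buffer */
    }

    int main(void)
    {
        cudaPrintfInit();                 /* allocate the device-side buffer */
        kernel<<<2, 4>>>(8);
        cudaPrintfDisplay(stdout, true);  /* read the buffer back and print it */
        cudaPrintfEnd();
        return 0;
    }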
This library suffers from many limitations:

- cuPrintf is limited to 10 arguments.
- cuPrintf calls can lead to overflow of the fixed-size buffer.
- Its capabilities remain far behind those of the standard printf.
A real CUDA hardware debugger was introduced with the 2.2 toolkit (cuda-gdb in the Linux/Unix toolkit) and became really usable with the 3.0 toolkit (in March 2010). As for Visual Studio, the Parallel Nsight debugger was made publicly available in July 2010.
The following figure sums up the different debugging solutions.
As explained previously, CUDA kernels can be debugged while executing on the GPU using the CUDA hardware debuggers for Linux or Windows. Those kernels could also be debugged while emulated on the CPU (but this feature is not available anymore) using standard debuggers. There is a third solution: Ocelot code translation.
PTX kernels can be emulated or translated just-in-time to a CPU target. Since the Ocelot infrastructure performs a deep analysis of the CUDA kernel, it provides an accurate "bug-to-bug" emulation and enables efficient debugging [DIAMOS10].
Ocelot is the result of a research project on a dynamic compilation framework for heterogeneous systems (quoting the Ocelot project website [OCELOT]). Several back-ends were created, but the ones linked to CUDA are the most active. Other research projects addressed the same target, such as the Barra project [BARA], but Ocelot is the only one still active. At the time of writing, PTX 2.0 is not yet fully supported.
The main limitation of this solution is that it works at the PTX level. The ocelot-gdb tool cannot read cuda-gdb debugging symbols. Tracing faulty CUDA code back from its PTX is not an easy task, and it supposes a deep knowledge of PTX assembly.
The v_array library was designed to allow yet another debugging approach. Since the CUDA language is mainly the C language with a small extension, it can be translated into plain C using preprocessor macros. The idea is to preprocess the same code twice: once for the CPU (plain C) and once for the GPU (extended C); the CUDA toolchain makes this dual preprocessing possible. Since both versions can then be executed within one executable, it becomes possible to compare the in-memory results of each version. Thanks to the v_array memory manager, which was designed to hold the same data on both sides, the whole memory state can be compared at the byte level.
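The following sketch illustrates the idea under stated assumptions: the macro names (KERNEL, THREAD_ID) and the cpu_/gpu_ prefixing scheme are illustrative, not the actual v_array macros. The same function body is preprocessed once as CUDA C and once as plain C:

    /* Kernel body shared by both targets (illustrative macros, not v_array's). */
    #ifdef __CUDACC__
      #define KERNEL(name)  __global__ void gpu_##name
      #define THREAD_ID     (blockIdx.x * blockDim.x + threadIdx.x)
    #else
      static int current_thread;           /* set by the CPU driver loop below */
      #define KERNEL(name)  void cpu_##name
      #define THREAD_ID     current_thread
    #endif

    KERNEL(scale)(float *data, int n, float factor)
    {
        int i = THREAD_ID;
        if (i < n)
            data[i] *= factor;
    }

    #ifndef __CUDACC__
    /* Plain C driver: the GPU grid becomes a sequential loop. */
    void run_cpu_scale(float *data, int n, float factor)
    {
        for (current_thread = 0; current_thread < n; current_thread++)
            cpu_scale(data, n, factor);
    }
    #endif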
This process can be automated into unit tests. The reference version for the CPU (a.k.a. the gold version) can be automatically compared with the GPU version. Using this process, the developer is able to immediately tell general bugs apart from GPU-specific bugs. The unit tests of the v_array library are written this way: the SCons build script automatically preprocesses two versions of each CUDA kernel function, and function names are prefixed according to each version.
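A unit test built on this scheme might look as follows (a sketch reusing the hypothetical cpu_scale/gpu_scale pair from above; the byte-level comparison is done with memcmp):

    #include <stdio.h>
    #include <string.h>
    #include <cuda_runtime.h>

    /* Hypothetical pair produced by the dual preprocessing sketched above;
     * the CPU version is compiled as plain C in a separate object file. */
    extern "C" void run_cpu_scale(float *data, int n, float factor);
    __global__ void gpu_scale(float *data, int n, float factor);

    int test_scale(void)
    {
        enum { N = 1024 };
        static float gold[N], result[N];
        for (int i = 0; i < N; i++)
            gold[i] = result[i] = (float)i;

        run_cpu_scale(gold, N, 2.0f);          /* reference ("gold") version */

        float *d;                              /* same computation on the GPU */
        cudaMalloc((void **)&d, sizeof result);
        cudaMemcpy(d, result, sizeof result, cudaMemcpyHostToDevice);
        gpu_scale<<<(N + 255) / 256, 256>>>(d, N, 2.0f);
        cudaMemcpy(result, d, sizeof result, cudaMemcpyDeviceToHost);
        cudaFree(d);

        /* Byte-level comparison of the two memory states. */
        return memcmp(gold, result, sizeof gold) == 0;
    }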