Since GPU software is uploaded to an external device, one gets very little feedback when something goes wrong. As a GPU program grows more complex, the lack of GPU debugging solutions becomes a real issue.
At the beginning, the CUDA toolkit did not feature any GPU debugger. The nVIDIA solution was to provide a minimalistic emulation layer, enabled through a compiler option, in order to generate CPU-only executables from CUDA source code; standard C debugging tools can then be used on those programs since they no longer execute on the GPU. This solution is known as device emulation and could be activated with the -deviceemu compiler option. This option was deprecated in the CUDA 3.0 toolkit and has not been available since the 3.2 toolkit. nVIDIA recommends using the new CUDA hardware debugger instead. This debugger is very useful, but may crash. This section presents the other debugging solutions available for CUDA.
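As a reminder of what device emulation looked like, the workflow on a pre-3.2 toolkit amounted to two commands (a sketch; program.cu is a placeholder name):

    nvcc -deviceemu -g program.cu -o program    # build a CPU-only executable
    gdb ./program                               # debug it with a standard C debugger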
A simple way to get more feedback from a piece of software is to make it more verbose. This method is known as "printf" debugging because the printf C function is used to trace program execution. Developers often implement several levels of verbosity and enable them at compile time through compilation parameters, as sketched below.
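A minimal sketch of this technique, assuming the verbosity level is passed on the compiler command line (e.g. -DVERBOSITY=2); the TRACE macro name is illustrative:

    #include <stdio.h>

    #ifndef VERBOSITY
    #define VERBOSITY 0                /* silent unless overridden at build time */
    #endif

    /* Print a trace message only if its level is enabled. */
    #define TRACE(level, ...) \
        do { if ((level) <= VERBOSITY) fprintf(stderr, __VA_ARGS__); } while (0)

    int main(void)
    {
        TRACE(1, "coarse trace: program started\n");
        TRACE(2, "fine trace: entering main loop\n");
        return 0;
    }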
At the beginning of CUDA development, "printf" debugging was only available when using device emulation on the CPU, since GPUs of the Tesla architecture were not able to perform function calls (unlike CPUs). This solution was not efficient because the GPU emulation was too straightforward and led to extremely slow execution (usually by an order of magnitude). Atomic functions and race conditions among the threads of a warp were not emulated, nor were the incompatible address spaces of global and local pointers.
An nVIDIA employee created a hack to do a 'sort-of' printf while executing code on the GPU, but this function has never been released in the official CUDA toolkit. The cuPrintf function writes strings into a fixed-size buffer located in GPU global memory. When the program executes on the CPU again, it reads this buffer and prints out all the text strings using the standard printf.
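A minimal usage sketch, assuming the cuPrintf.cu source file from nVIDIA's cuPrintf package is available alongside the project (the package also provides the host-side cudaPrintfInit, cudaPrintfDisplay and cudaPrintfEnd helpers):

    #include <cstdio>
    #include "cuPrintf.cu"            /* from the cuPrintf package, not the toolkit */

    __global__ void kernel(int n)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid < n)
            cuPrintf("thread %d is alive\n", tid);  /* goes to a global-memory buffer */
    }

    int main(void)
    {
        cudaPrintfInit();                 /* allocate the device-side buffer */
        kernel<<<2, 4>>>(8);
        cudaPrintfDisplay(stdout, true);  /* read the buffer back and print it */
        cudaPrintfEnd();
        return 0;
    }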
This library suffers from many limitations:

- cuPrintf is limited to 10 arguments.
- cuPrintf calls can lead to overflow of the fixed-size buffer.
- Its capabilities remain far behind those of the standard printf.
A real CUDA hardware debugger was introduced with the 2.2 toolkit (cuda-gdb in the Linux/Unix toolkit) and became really usable with the 3.0 toolkit (in March 2010). As for Visual Studio, the Parallel Nsight debugger was made publicly available in July 2010.
The following figure sums up the different debugging solutions.
As explained previously, CUDA kernels can be debugged while executing on the GPU using the CUDA hardware debuggers for Linux or Windows. Those kernels could also be debugged while emulated on the CPU (but this feature is not available anymore) using standard debuggers. There is a third solution: Ocelot code translation.
PTX kernels can be emulated or translated just-in-time to a CPU target. Since the Ocelot infrastructure performs a deep analysis of the CUDA kernel, it provides an accurate "bug-to-bug" emulation and enables efficient debugging [DIAMOS10].
Ocelot is the result of a research project on a dynamic compilation framework for heterogeneous systems (quoting the Ocelot project website [OCELOT]). Several back-ends were created, but the ones linked to CUDA are the most active. Other research projects addressed the same target, such as the Barra project [BARA], but Ocelot is the only one still active. At the time of writing, PTX 2.0 is not yet fully supported.
The main limitation of this solution is that it works at the PTX level. The ocelot-gdb tool cannot read cuda-gdb debugging symbols. Tracing faulty CUDA code back from its PTX is not an easy task, and it supposes a deep knowledge of PTX assembly.
The v_array library was designed to allow yet another debugging approach. Since the CUDA language is mainly the C language with a small extension, it can be translated into plain C using preprocessor macros. The idea is to preprocess the same code twice: once for the CPU (plain C) and once for the GPU (extended C); the CUDA toolchain makes this dual preprocessing possible. Since both versions can then be executed within one executable, it becomes possible to compare the in-memory results of each version. Thanks to the v_array memory manager, which was designed to hold the same data on both sides, the whole memory state can be compared at the byte level.
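The following sketch illustrates the idea under stated assumptions: the macro names (KERNEL, THREAD_ID) and the cpu_/gpu_ prefixing scheme are illustrative, not the actual v_array macros. The same function body is preprocessed once as CUDA C and once as plain C:

    /* Kernel body shared by both targets (illustrative macros, not v_array's). */
    #ifdef __CUDACC__
      #define KERNEL(name)  __global__ void gpu_##name
      #define THREAD_ID     (blockIdx.x * blockDim.x + threadIdx.x)
    #else
      static int current_thread;           /* set by the CPU driver loop below */
      #define KERNEL(name)  void cpu_##name
      #define THREAD_ID     current_thread
    #endif

    KERNEL(scale)(float *data, int n, float factor)
    {
        int i = THREAD_ID;
        if (i < n)
            data[i] *= factor;
    }

    #ifndef __CUDACC__
    /* Plain C driver: the GPU grid becomes a sequential loop. */
    void run_cpu_scale(float *data, int n, float factor)
    {
        for (current_thread = 0; current_thread < n; current_thread++)
            cpu_scale(data, n, factor);
    }
    #endif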
This process can be automated into unit tests. The reference version for the CPU (a.k.a. the gold version) can be automatically compared with the GPU version. Using this process, the developer is able to immediately tell general bugs apart from GPU-specific bugs. The unit tests of the v_array library are written this way: the SCons build script automatically preprocesses two versions of each CUDA kernel function, and function names are prefixed according to each version.
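A unit test built on this scheme might look as follows (a sketch reusing the hypothetical cpu_scale/gpu_scale pair from above; the byte-level comparison is done with memcmp):

    #include <stdio.h>
    #include <string.h>
    #include <cuda_runtime.h>

    /* Hypothetical pair produced by the dual preprocessing sketched above;
     * the CPU version is compiled as plain C in a separate object file. */
    extern "C" void run_cpu_scale(float *data, int n, float factor);
    __global__ void gpu_scale(float *data, int n, float factor);

    int test_scale(void)
    {
        enum { N = 1024 };
        static float gold[N], result[N];
        for (int i = 0; i < N; i++)
            gold[i] = result[i] = (float)i;

        run_cpu_scale(gold, N, 2.0f);          /* reference ("gold") version */

        float *d;                              /* same computation on the GPU */
        cudaMalloc((void **)&d, sizeof result);
        cudaMemcpy(d, result, sizeof result, cudaMemcpyHostToDevice);
        gpu_scale<<<(N + 255) / 256, 256>>>(d, N, 2.0f);
        cudaMemcpy(result, d, sizeof result, cudaMemcpyDeviceToHost);
        cudaFree(d);

        /* Byte-level comparison of the two memory states. */
        return memcmp(gold, result, sizeof gold) == 0;
    }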