The CUDA GPU computing development framework is available for three operating systems: Microsoft Windows, GNU/Linux and Mac OS X. The last two share the same UNIX-like architecture, so the CUDA toolkit is quite similar on both of them. The Nvidia CUDA buildchain falls back on the default compiler available on the operating system for host (CPU) compilation: the GNU Compiler Collection (gcc), the Microsoft Visual Studio compiler (cl) or the Intel C++ Compiler (icc) can all be used by nvcc. The first CUDA SDK released to the public was the 1.1 Beta version in June 2007. At the time of writing, the latest version is 3.1; despite the numerous improvements to the toolkit, the compilation workflow has remained the same.
Before describing each step of building CUDA software, let me recall the main stages of building any C/C++ software. To start off, the preprocessing stage matches text strings and replaces them with others according to macro rules. Then, the compilation stage translates the source code into assembly code. Next, the assembly code is converted into machine code. Finally, the linking stage connects the program to the operating system primitives. This includes adding the runtime library, which mainly consists of memory management routines.
This process applies to the CUDA language as well, since it takes place after the C++ and CUDA language extensions have been converted into regular ANSI C.
The buildchain consists of several different tools. The following figure shows the complete compilation process, from the CUDA source file through the intermediate files to the final executable file.
The first part is performed by cudafe, which splits device (GPU) code from host (CPU) code. The device code is then compiled by nvopencc into Parallel Thread eXecution (PTX) code, an intermediate assembly language. This assembly code is in turn compiled into a CUDA binary (Cubin) by the proprietary ptxas tool. The Cubin format is the machine code of the targeted GPU instruction set; it is proprietary, undocumented and subject to change.
cudafe stands for CUDA front end and has two purposes: preprocessing (with the -E option) and CUDA source code analysis. This tool is based on gcc. Unlike in the standard compilation scheme, the preprocessing stage is performed three times. Please note that the .ii extension refers to C++ preprocessed files while .i refers to C preprocessed files.
The figure shows that the CUDA front end is invoked twice.
Actually, the compilation stage of the CUDA toolchain is divided into two parts: high-level and low-level compilation. The intermediate language between these two parts is the PTX assembly. Unlike well-known assembly languages (ARM, x86, ...), PTX is not simply translated one-to-one into Cubin machine code. PTX defines a virtual machine and an ISA (Instruction Set Architecture) for general-purpose parallel thread execution. This compilation stage was introduced to provide a stable ISA that spans multiple GPU generations.
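As an illustration, a trivial kernel such as __global__ void inc(float *a) { a[0] += 1.0f; } compiles to PTX along these lines. This is a hand-written sketch in the style of the PTX 1.x ISA; the exact output of nvopencc depends on the toolkit version and compilation target:

```
.entry _Z3incPf (.param .u32 __cudaparm__Z3incPf_a)
{
    .reg .u32 %r<3>;                          // virtual 32-bit registers
    .reg .f32 %f<4>;                          // virtual float registers
    ld.param.u32  %r1, [__cudaparm__Z3incPf_a];
    ld.global.f32 %f1, [%r1+0];
    add.f32       %f2, %f1, 0f3F800000;       // 0f3F800000 is the hex encoding of 1.0f
    st.global.f32 [%r1+0], %f2;
    exit;
}
```

Note that PTX uses an unbounded set of virtual registers; the mapping to the fixed register file of a real GPU is only decided later, by the low-level compilation stage.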
It is worth noting that nvcc is different from nvopencc: nvcc refers to the whole compilation process (preprocessing plus the first or both compilation stages), while nvopencc refers only to the first compilation stage, producing PTX code.
There are several options for the low-level compilation stage:
- nvcc can generate PTX code only. In that case nvcc = cudafe + nvopencc. The PTX code will then be compiled just-in-time by the graphics driver. This solution is the most flexible since it allows the graphics driver to optimize the CUDA software for each GPU architecture (even future architectures, not yet known at development time).
- nvcc can generate one or several Cubin codes. In that case nvcc = cudafe + nvopencc + n*ptxas. This solution is more restrictive, but allows specific optimization for a specific GPU architecture.
- The generated PTX or Cubin code can be embedded inside the final binary (fatbin) or kept outside it as an external .ptx or .cubin file. In the latter case, the host code will contain the extra code needed to load and launch the most appropriate file. This feature is useful because Cubin files can be added, modified or removed just like any file, with no need to recompile the host part.
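These options map to nvcc command-line flags. The following invocations are a sketch based on the CUDA 3.x toolkit; kernel.cu and the sm_13 target are arbitrary examples:

```
nvcc --ptx kernel.cu                                  # PTX only, JIT-compiled by the driver
nvcc --cubin -arch=sm_13 kernel.cu                    # one Cubin for one specific architecture
nvcc -gencode arch=compute_10,code=sm_10 \
     -gencode arch=compute_13,code=sm_13 kernel.cu    # fat binary embedding several Cubins
```

With -gencode, PTX for a given compute capability can also be embedded alongside the Cubins (code=compute_13), giving the driver a JIT fallback for architectures that were unknown at build time.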
nvopencc (high-level) and ptxas or the graphics driver (low-level) both perform some compilation tasks.
nvopencc is a fork of a subset of the open-source Open64 compiler [OPEN64] developed by the Computer Architecture and Parallel Systems Laboratory (CAPSL) of the University of Delaware. According to Mike Murphy from Nvidia in [MURPHY08], Open64 was chosen for the strength of its optimizations over GCC.
The high-level compiler only uses a subset of Open64 because its input is always C language. Another simplification is that nvopencc does not perform any cross-file inter-procedural analysis (IPA); therefore the whole kernel source code has to be included in a single source file (this might change in the future).
Low-level compilation is performed by Nvidia's proprietary Optimized Code Generator (OCG). PTX provides a virtual machine model and is independent of the underlying processor. OCG allocates registers and schedules instructions according to the targeted GPU chip, producing the Cubin format. The decuda/cudasm tools [DECUDA], built through reverse engineering, can disassemble and assemble these files for the G8X and G9X architectures, even though they are not supported by Nvidia.