The CUDA GPU computing development framework is available for three operating systems: Microsoft Windows, GNU/Linux and Mac OS X. The last two share the same UNIX-like architecture, so the CUDA toolkit is quite similar on both of them. The Nvidia CUDA buildchain falls back on the default compiler available on the operating system for host (CPU) compilation: the GNU Compiler Collection (gcc), the Microsoft Visual Studio compiler (cl) or the Intel C++ Compiler (icc) can all be used by nvcc. The first CUDA SDK released to the public was the 1.1 Beta version in June 2007. At the time of writing, the latest version is 3.1; despite the numerous improvements to the toolkit, the compilation workflow has remained the same.
Before describing each step of building CUDA software, let me recall the main stages of building any C/C++ software. To start off, the preprocessing stage matches text strings and replaces them with others according to macro rules. Then, the compilation stage translates the source code into assembly code. Next, the assembly code is converted into machine code. Finally, the linking stage connects the program to the operating system primitives. This includes adding the runtime library, which mainly consists of memory management routines.
This process applies to the CUDA language as well, since it takes place after the C++ and CUDA language extensions have been converted into regular ANSI C.
The buildchain consists of several different tools. The following figure shows the complete compilation process, from the CUDA source file through the intermediate files to the final executable file.
The first part is performed by cudafe, which splits device (GPU) code from host (CPU) code. The device code is then compiled by nvopencc into Parallel Thread eXecution (PTX) code, an intermediate assembly language. This assembly code is in turn compiled into a CUDA binary (Cubin) by the proprietary ptxas tool. The Cubin format is the machine code of the targeted GPU instruction set; it is proprietary, undocumented and subject to change.
cudafe stands for CUDA front end and has two purposes: preprocessing (with the -E option) and CUDA source code analysis. This tool is based on gcc. Unlike in the standard compilation scheme, the preprocessing stage is performed three times. Please note that the .ii extension refers to C++ preprocessed files while .i refers to C preprocessed files.
The figure shows that the CUDA front end is invoked twice.
Actually, the compilation stage of the CUDA toolchain is divided into two parts: high-level and low-level compilation. The intermediate language between these two parts is the PTX assembly. Unlike well-known assembly languages (ARM, x86, ...), PTX is not simply translated one-to-one into Cubin machine code. PTX defines a virtual machine and an ISA (Instruction Set Architecture) for general-purpose parallel thread execution. This compilation stage was introduced to provide a stable ISA that spans multiple GPU generations.
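As an illustration, a trivial kernel such as __global__ void inc(float *a) { a[0] += 1.0f; } compiles to PTX along these lines. This is a hand-written sketch in the style of the PTX 1.x ISA; the exact output of nvopencc depends on the toolkit version and compilation target:

```
.entry _Z3incPf (.param .u32 __cudaparm__Z3incPf_a)
{
    .reg .u32 %r<3>;                          // virtual 32-bit registers
    .reg .f32 %f<4>;                          // virtual float registers
    ld.param.u32  %r1, [__cudaparm__Z3incPf_a];
    ld.global.f32 %f1, [%r1+0];
    add.f32       %f2, %f1, 0f3F800000;       // 0f3F800000 is the hex encoding of 1.0f
    st.global.f32 [%r1+0], %f2;
    exit;
}
```

Note that PTX uses an unbounded set of virtual registers; the mapping to the fixed register file of a real GPU is only decided later, by the low-level compilation stage.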
It is worth noting that nvcc is different from nvopencc: nvcc refers to the whole compilation process (preprocessing plus the first or both compilation stages), while nvopencc refers only to the first compilation stage, producing PTX code.
There are several options for the low-level compilation stage:
- nvcc can generate PTX code only. In that case nvcc = cudafe + nvopencc. The PTX code will then be compiled just-in-time by the graphics driver. This solution is the most flexible since it allows the graphics driver to optimize the CUDA software for each GPU architecture (even future architectures, not yet known at development time).
- nvcc can generate one or several Cubin codes. In that case nvcc = cudafe + nvopencc + n*ptxas. This solution is more restrictive, but allows specific optimization for a specific GPU architecture.
- The generated PTX or Cubin code can be embedded inside the final binary (fatbin) or kept outside it as an external .ptx or .cubin file. In the latter case, the host code will contain the extra code needed to load and launch the most appropriate file. This feature is useful because Cubin files can be added, modified or removed just like any file, with no need to recompile the host part.
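These options map to nvcc command-line flags. The following invocations are a sketch based on the CUDA 3.x toolkit; kernel.cu and the sm_13 target are arbitrary examples:

```
nvcc --ptx kernel.cu                                  # PTX only, JIT-compiled by the driver
nvcc --cubin -arch=sm_13 kernel.cu                    # one Cubin for one specific architecture
nvcc -gencode arch=compute_10,code=sm_10 \
     -gencode arch=compute_13,code=sm_13 kernel.cu    # fat binary embedding several Cubins
```

With -gencode, PTX for a given compute capability can also be embedded alongside the Cubins (code=compute_13), giving the driver a JIT fallback for architectures that were unknown at build time.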
nvopencc (high-level) and ptxas or the graphics driver (low-level) both perform some compilation tasks.
nvopencc is a fork of a subset of the open-source Open64 compiler [OPEN64] developed by the Computer Architecture and Parallel Systems Laboratory (CAPSL) of the University of Delaware. According to Mike Murphy from Nvidia in [MURPHY08], Open64 was chosen for the strength of its optimizations over GCC.
The high-level compiler only uses a subset of Open64 because its input is always C language. Another simplification is that nvopencc does not perform any cross-file inter-procedural analysis (IPA); therefore the whole kernel source code has to be included in a single source file (this might change in the future).
Low-level compilation is performed by Nvidia's proprietary Optimized Code Generator (OCG). PTX provides a virtual machine model and is independent of the underlying processor. OCG allocates registers and schedules instructions according to the targeted GPU chip, producing the Cubin format. The decuda/cudasm tools [DECUDA], built through reverse engineering, can disassemble and assemble these files for the G8X and G9X architectures, even though they are not supported by Nvidia.