GPGPU (Many-Core) promises speedups that can reach an order of magnitude over current CPU (Multicore) architectures. GPU computing is a quite recent phenomenon (as of 2010). Previously these processing units were dedicated to 2D/3D rendering and some specifically wired video acceleration. Each new version of the 3D rendering standard specifications (OpenGL and DirectX) added new features, especially regarding shader capabilities, and GPUs thus became suitable for general purpose stream processing.
Memory accesses are among the slowest operations of a processor, because Moore's law has increased instruction performance at a much greater rate than memory performance. This growing gap means that memory operations have become increasingly expensive compared to simple register-to-register instructions. Modern CPUs therefore sport large caches in order to reduce the overhead of these expensive memory accesses.
GPUs use another strategy to cope with this issue: massive parallelism can "feed" the GPU with enough computational operations while pending memory operations finish. This different execution strategy calls for new implementation ideas.
The GPU is especially well suited to problems that can be expressed as data-parallel computations (the same program is executed on many data elements in parallel) with high arithmetic intensity (the ratio of arithmetic operations to memory operations). This architecture was designed for image rendering (3D) and processing (video playback), but data-parallel processing can also be found in physics simulation, signal processing, computational finance or biology. A data-parallel algorithm is also referred to as embarrassingly parallel. Such algorithms can be accelerated radically using a GPU.
Since a GPU implementation is still highly dependent on the underlying hardware, the purpose of this chapter is to show the evolution of graphics hardware, which explains its current limitations.
This section explains how graphics processing units became a suitable architecture for general purpose parallel processing.
The graphics adapter in PC-compatible computers started as a simple memory-mapped frame buffer. These adapters became devices with 2D and 3D hardware acceleration, containing their own memory for the frame buffer. This memory has two accesses: the host system can read/write pictures into it, and the video output hardware reads it in order to create a signal for the display device(s).
A quick historical review is shown in the following timeline figure (the evolution of the Direct3D pipeline is based on [THOMSON06]):
NV1 is the first product of nVIDIA (released in 1995) and uses its own homemade graphics pipeline based on quadratic surfaces.

NV3 is the second product of nVIDIA (1997) and uses a totally different pipeline based on polygon surfaces so that it can match Microsoft DirectX 5.0 requirements.

NV10 is the third generation of nVIDIA GPUs (1999). New fixed functions have been added for DirectX 7.0 compliance, such as hardware Transform and Lighting (T&L) functions.
| shader version | 1.1 | 2.0 | 2.x | 3.0 |
|---|---|---|---|---|
| instruction count | 128 | 256 | ≥ 256 | ≥ 512 |
| shader version | 2.0 | 2.x | 3.0 |
|---|---|---|---|
| call nesting | 1 | 1-4 | 4 |
| static conditions | 16 | 16 | 24 |
| dynamic conditions | n/a | 0-24 | 24 |
| loop nesting | 1 | 1-4 | 4 |
| static flow count | 16 | 16 | ∞ |
The "Little law" from Sylvain Collange: data = throughput × latency
The evolution of GPU capabilities toward more general purpose computation and the evolution of CPU capabilities toward more parallel and specialized computation lead to the situation shown in the figure beside: a convergence of CPU and GPU.
CPUs achieve good latency thanks to wide and coherent caches, while GPUs achieve good throughput thanks to massive, hardware-handled parallelism.
Tesla is the product name of the first GPGPU-only device made by nVIDIA. It also became the name of the first CUDA-compliant GPU architecture. All GPUs having compute capability 1.0 to 1.3 belong to the Tesla architecture, even though they are sold under other product names than Tesla (e.g. GeForce, Quadro).
Unlike current CPU dies, GPU dies do not feature big cache memory areas: most of the transistors are used for computation. Processing units are divided and nested into several groups and subgroups. From a die point of view, the main division of the Tesla architecture is the Thread Processing Cluster (TPC) [WONG10]. Each TPC is linked to an interconnection network (probably of crossbar type according to [COLLANGE10]) which also links the DRAM memory controllers. Memory is partitioned across 1 to 8 controllers, and the memory chips are located outside the GPU chip. Each TPC contains 2 or 3 streaming multiprocessors (SM).
Using the GPU from the host for general purpose computation is generally performed in three steps: data copy from host to device memory, execution on the device, and data copy from device back to host. This reveals the first issue of GPGPU computing: the time wasted in memory transfers.
A GPU is connected to the host through a high-speed bus such as PCI Express 16x. This bus is mainly used for DMA transfers between CPU and GPU memories, since neither is able to directly address the memory space of the other.
There are several options for data transfer. When host memory is allocated with the standard malloc() function, the memory is pageable: the OS may allocate more memory than physically available, and pages are swapped on-the-fly between the hard drive and main host memory. When performing a memory copy between host and device from such pageable memory, the device has to poll the CPU to know when the transfer has ended (source [FARIAS]).
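A minimal sketch of the two host-allocation options, assuming the CUDA runtime (buffer names and sizes are placeholders): pageable memory from malloc() versus page-locked (pinned) memory from cudaMallocHost(), the latter being the kind of memory mentioned again in the Fermi copy-engine discussion below.

```
/* sketch: pageable vs page-locked host buffers for host<->device transfers */
#include <cuda_runtime.h>
#include <stdlib.h>

void transferSketch(size_t bytes) {
    float *h_pageable = (float*)malloc(bytes);    /* pageable host memory */
    float *h_pinned;
    cudaMallocHost((void**)&h_pinned, bytes);     /* page-locked (pinned) host memory */

    float *d_buf;
    cudaMalloc((void**)&d_buf, bytes);

    /* both copies are valid; the pinned one can be DMA'd without extra staging */
    cudaMemcpy(d_buf, h_pageable, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_buf, h_pinned,   bytes, cudaMemcpyHostToDevice);

    cudaFree(d_buf);
    cudaFreeHost(h_pinned);
    free(h_pageable);
}
```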
Once data and program have been loaded into GPU memory, execution can start, since the number of stream multiprocessors and of threads per SM is decided in advance. On the Tesla architecture, an SM can have 768 or 1024 simultaneously active threads depending on the chip. The number of SMs depends on the GPU: an entry-level GPU has a few SMs while a high-end GPU has many of them. The GPU of the running example has 24 multiprocessors, therefore it can deal with 24576 active threads. Of course, not all threads are executed at the same time since a multiprocessor has only eight cores, so-called CUDA cores. Following the same example, 192 threads are executed at each clock cycle; in the worst case, 128 cycles are required to execute one instruction of all active threads. The GPU can perform zero-cost hardware-based context switching and immediately switch to another thread to process.
A big difference between CPU and GPU is the memory hierarchy. Registers and shared memory are extremely fast while global memory is several hundred times slower. It is worth noting that the shared memory is not a hardware cache but a scratchpad memory [W-SCRPAD]. Each SM has such a local high-speed internal memory. It can be seen as an L1 cache, but there are crucial differences: explicit instructions are required to move data from global memory to shared memory, and there is no coherency between the scratchpads and global memory. When used as a cache, the scratchpad's content will not be updated if global memory changes. Shared memory is considered as fast as registers as long as there are no bank conflicts among threads.
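The following hypothetical kernel is a minimal sketch of that explicit data movement: each block copies a tile of global memory into its scratchpad, synchronizes, and then reads the tile at shared-memory speed (the kernel name and tile size are illustrative, not part of the running example).

```
#define TILE 256  /* threads per block in this sketch */

/* reverse each TILE-sized segment of the input, staging it in shared memory;
   assumes the array length is a multiple of TILE */
__global__ void reverseTile(float* out, const float* in) {
    __shared__ float tile[TILE];            /* per-SM scratchpad, not a hardware cache */
    int i = blockIdx.x * TILE + threadIdx.x;

    tile[threadIdx.x] = in[i];              /* explicit copy: global -> shared */
    __syncthreads();                        /* the whole block must see the filled tile */

    out[i] = tile[TILE - 1 - threadIdx.x];  /* fast read, no bank conflict here */
}
```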
Global memory is linked to the GPU chip through a very wide data path: up to 512 bits. With such a bus width, sixteen consecutive 32-bit words can be fetched from global memory in a single cycle. The Quadro FX4800 of my running example has a 384-bit bus width and GDDR3 memory at 1.6 GHz, allowing a theoretical memory bandwidth of 76.8 GB/s. In real programs, this bandwidth is difficult to reach since memory accesses have to be aligned and thread accesses have to be coalesced. Coalescing merges the independent memory accesses of the threads into one big access; several thread access patterns are recognized by the coalescing hardware. It also means there is severe bandwidth degradation for strided accesses.
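As a sketch of the difference (hypothetical kernels, not part of the running example), consecutive threads reading consecutive words are coalesced into a few wide transactions, while a stride between threads defeats the coalescing hardware:

```
/* coalesced: thread k reads word k -> merged into wide transactions */
__global__ void coalescedCopy(float* out, const float* in) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];
}

/* strided: thread k reads word k*stride -> many separate transactions */
__global__ void stridedCopy(float* out, const float* in, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    out[i] = in[i];
}
```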
CUDA cores can handle integer and single-precision floating-point operations. When a thread encounters a transcendental or double-precision operation, CUDA cores cannot be used: each SM has two special function units (SFU) for transcendental operations and one double-precision unit. Transcendental operations are, for example, sine, cosine or logarithm instructions. The SFUs also compute integer multiplications when 32-bit precision is required (instead of 24-bit). Obviously, in such a case, the SM cannot perform more than two (SFU) or one (double-precision) operations per cycle. Tesla-generation GPUs which support double precision are eight times slower in double than in single precision [COLLANGE10]. According to nVIDIA documentation [GTX200], "double-precision performance of all 10 TPCs of a GeForce GTX 280 GPU is roughly equivalent to an eight-core Intel Xeon CPU, yielding up to 78 gigaflops".
While the product name of the CUDA-only devices remained Tesla, the name of the latest architecture changed to Fermi. All GPUs having compute capability 2.0 and higher (and probably below 3.0) are based on the Fermi architecture.
The Fermi architecture introduces a 768 KB fully coherent L2 cache common to all SMs (referred to as a unified cache). This new cache has the same purpose as a CPU L2 cache and is a significant change from the previous architecture. This key feature shows a real general purpose orientation of the nVIDIA GPU: this cache does not accelerate graphics computation.
Another non-graphical feature is that all memories (from registers to DRAM) can be protected by ECC (Error-Correcting Code).
The Thread Processing Cluster division disappeared.
A unified address space enables full C++ support [FERMI]. In the Tesla architecture, linear memory was addressed using a 32-bit address space and load/store instructions were specific to each memory area. The Fermi architecture uses a 40-bit unified address space, and load/store instructions can use a 64-bit address width for future growth.
The former shared memory space became a configurable scratchpad/L1 cache. The Fermi architecture offers two configuration options: 48 KB of shared memory with 16 KB of L1 cache, or 16 KB of shared memory with 48 KB of L1 cache.
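A minimal sketch of how this split can be requested per kernel through the CUDA runtime; someKernel is a hypothetical kernel and the setting is only a preference hint to the driver:

```
__global__ void someKernel(float* data);  /* hypothetical kernel, defined elsewhere */

void configureCache(void) {
    /* prefer 48 KB of shared memory and 16 KB of L1 cache for someKernel */
    cudaFuncSetCacheConfig(someKernel, cudaFuncCachePreferShared);

    /* or prefer 16 KB of shared memory and 48 KB of L1 cache */
    /* cudaFuncSetCacheConfig(someKernel, cudaFuncCachePreferL1); */
}
```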
This new architecture upgrades memory from GDDR3 to GDDR5. Despite a narrower memory bus (256-bit versus 384-bit), the card of the figure has a theoretical memory bandwidth of 102.6 GB/s.
Devices of compute capability 2.x can perform a copy from page-locked host memory to device memory concurrently with a copy from device memory to page-locked host memory, because GPUs of the Fermi architecture have two copy engines. Since the PCI Express bus is full duplex, the total bandwidth is doubled in such a case.
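A minimal sketch of this overlap using asynchronous copies in two CUDA streams; it assumes page-locked host buffers and device buffers have already been allocated (the names below are placeholders):

```
/* overlap H2D and D2H copies; host buffers must be page-locked */
void overlappedCopies(float* d_in, const float* h_in,
                      float* h_out, const float* d_out, size_t bytes) {
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    /* on a 2.x device each direction can be served by its own copy engine */
    cudaMemcpyAsync(d_in,  h_in,  bytes, cudaMemcpyHostToDevice, s0);
    cudaMemcpyAsync(h_out, d_out, bytes, cudaMemcpyDeviceToHost, s1);

    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
}
```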
CUDA stands for Compute Unified Device Architecture. CUDA is an extension of the C language for GPGPU computing. The programming model is tightly coupled with the architecture of nVIDIA graphics cards: concepts of the programming model can be mapped to the hardware implementation.
1 /* GPU kernel */
2 __global__ void vecAdd(float* a, float* b, float* c) {
3 int i = blockDim.x * blockIdx.x + threadIdx.x;
4 c[i] = a[i] + b[i];
5 }
6
7 /* function relying on GPU */
8 void parallelAdd(int size, float* h_a, float* h_b, float* h_c) {
9 size_t vector_byte_size = size * sizeof(float);
10 float *d_a, *d_b, *d_c;
11
12 /* STEP 1: memory allocation for GPU exec. */
13 /* allocate some GPU memory and copy input vectors */
14 cudaMalloc(&d_a, vector_byte_size);
15 cudaMalloc(&d_b, vector_byte_size);
16 cudaMalloc(&d_c, vector_byte_size);
17 cudaMemcpy(d_a, h_a, vector_byte_size, cudaMemcpyHostToDevice);
18 cudaMemcpy(d_b, h_b, vector_byte_size, cudaMemcpyHostToDevice);
19
20 /* STEP 2: execution on device */
21 vecAdd<<<2,size/2>>>(d_a, d_b, d_c);
22
23 /* STEP 3: results retrieval */
24 /* copy GPU result to CPU memory and free GPU memory */
25 cudaMemcpy(h_c, d_c, vector_byte_size, cudaMemcpyDeviceToHost);
26 cudaFree(d_a);
27 cudaFree(d_b);
28 cudaFree(d_c);
29 }
30
31 /* function relying on CPU */
32 void serialAdd(int size, float* a, float* b, float* c) {
33 int i;
34 for(i=0; i<size; i++) {
35 c[i] = a[i] + b[i];
36 }
37 }
38
39 int main()
40 {
41 int vector_size = 102;
42 size_t vector_byte_size = vector_size * sizeof(float);
43 float *a, *b, *c;
44
45 a = (float*)malloc(vector_byte_size);
46 b = (float*)malloc(vector_byte_size);
47 c = (float*)malloc(vector_byte_size);
48
49 /* put some values into a and b vectors */
50 [...]
51
52 /* serialAdd(vector_size, a, b, c); */
53 parallelAdd(vector_size, a, b, c);
54
55 /* now c = a + b and c can be used */
56 [...]
57
58 free(a);
59 free(b);
60 free(c);
61
62 return 0;
63 }
The code example on the right will be used to show the mapping between the CUDA programming model and the GPU hardware execution model. This example also shows how part of a program can be accelerated on a GPGPU using the CUDA toolkit.
From line 1 to 5 there is the GPU code (aka. device code); everything else is executed on the CPU (aka. host). The parallelAdd() function is an ad hoc replacement of the serialAdd() function: parallelAdd() deals with all GPU specificities. As explained previously, the execution is performed in three steps. In the first step, cudaMalloc() and cudaMemcpy() allocate and load the required input data in GPU global memory; in this parallelAdd() function there are both host (prefixed with h_) and device (prefixed with d_) pointers. The second step is the kernel call. A kernel is a program executed on the GPU. The calling syntax uses a CUDA-specific notation: <<<2, size/2>>> specifies the number of blocks (2) and the number of threads per block (size/2).
Like processing units in hardware, threads are also divided into groups. The following schema shows one grouping possibility. Grid and block sizes are defined by the developer at kernel launch. The warp size is fixed by hardware and is 32 for both Tesla and Fermi, but this size might change in future architectures according to the nVIDIA CUDA documentation.
Grouping possibilities for 102 threads are <<<1, 102>>>, <<<2, 51>>>, <<<3, 34>>> and <<<6, 17>>>. Since the basic unit of execution flow in a multiprocessor is a warp of 32 threads, it is useless to execute fewer than 32 threads in a block. Actually, if there were a condition preventing out-of-bound accesses to the vector arrays when the number of threads exceeds the vector size, we could use any <<<x, ⌈size/x⌉>>>, as the sketch below shows.
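A minimal sketch of such a bounds-checked variant (hypothetical kernel name; the extra size parameter is not part of the original example):

```
__global__ void vecAddBounded(float* a, float* b, float* c, int size) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < size) {          /* surplus threads of the last block simply do nothing */
        c[i] = a[i] + b[i];
    }
}

/* e.g. two warps per block:
   vecAddBounded<<<(size + 63) / 64, 64>>>(d_a, d_b, d_c, size); */
```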
The main difference between the Tesla and Fermi architectures is in the way they execute warps of threads.
On a Tesla multiprocessor, the eight cores execute a warp in four clock cycles (8 threads at a time). On a Fermi multiprocessor, each group of 16 cores executes a different warp; therefore, two warps are executed in two clock cycles (twice 16 threads at a time).
A question about divergent threads arises from this execution strategy: what happens when threads within a warp do not execute the same code?
In the following example, the execution path depends on the thread id. This case is handled differently on Tesla and Fermi.
1 /* GPU kernel */
2 __global__ void vecAdd(float* a, float* b, float* c) {
3 int i = blockDim.x * blockIdx.x + threadIdx.x;
4
5 if(i & 1) { /* if i is odd */
6 c[i] = a[i] + b[i];
7 } else {
8 c[i] = a[i] - b[i];
9 }
10 }
On the Tesla architecture, the conditional branches are serialized. According to [WONG10], the "else" clause is always executed first while the threads taking the other path are disabled, then the "if" clause is executed (with the "else" threads disabled).
The Fermi architecture mitigates this issue because each multiprocessor features a dual warp scheduler: each group of 16 cores can execute a different conditional branch. Our divergent thread example would therefore be executed in parallel on the Fermi architecture only. Of course, this works at half-warp granularity only.
On devices of compute capability 2.x, function calls are available. The size of the call stack can be queried using cudaThreadGetLimit() and set using cudaThreadSetLimit().
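A minimal sketch of that query/set pair (the 4 KB value is an arbitrary illustration):

```
size_t stack_size;
cudaThreadGetLimit(&stack_size, cudaLimitStackSize);  /* current per-thread stack size */
cudaThreadSetLimit(cudaLimitStackSize, 4 * 1024);     /* request a 4 KB stack per thread */
```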
Despite using the same name, the word thread has different meanings on CPU and on GPU, which can lead to misunderstanding. Unlike CPU threads, GPU threads are managed by hardware, and classical thread programming techniques do not match GPU thread design: translating POSIX thread (Linux) or Windows thread models to CUDA threads is neither obvious nor recommended. I quote this analogy since I think it helped me grasp the difference more clearly: "the definition that applies to CUDA is threads in a fabric running lengthwise" [PPBLOG]. In other words, CUDA threads should not diverge for optimal performance. Divergent threads are not impossible to implement, but they can dramatically lower performance.
Single-Instruction Multiple-Thread (SIMT) is the name given by nVIDIA to this execution strategy. Every thread in the same warp executes the same instruction in lockstep, but threads can branch separately, although this leads to extremely bad performance even on the Fermi architecture.
Both the Tesla and Fermi architectures have several memory areas with different memory consistency models. The CUDA toolkit is very system-centric (according to the [GHARACH95] definition) and memory management can quickly become a real nightmare for programmers.
A state space is a storage area with particular characteristics.
name | addressable | access | cost | lifetime | description |
---|---|---|---|---|---|
.reg | No | R/W | 0 | thread | registers |
.sreg | No | RO | 0 | thread | special registers (platform-specific) |
.const | Yes | RO | 0* | application | constant memory |
.global | Yes | R/W | > 100 | application | global memory |
.local | Yes | R/W | > 100 | thread | local memory (private to each thread) |
.param | Yes | RO | 0 | | user parameters for a program |
.shared | Yes | R/W | 0 | block | shared memory |
.surf | via surface instruction | R/W | > 100 | application | surface memory (global memory) |
.tex | via texture instruction | RO | > 100 | application | texture memory (global memory) |
Each addressable state space has its own address space, which means the memory load instruction differs according to the memory space. For each pointer dereference in the code, the nVIDIA compiler infers at compilation time which memory space is the right one. This process may fail, as in the example below.
5 __device__ void myCopyFunction(void* dest_ptr,
void* src_ptr, size_t size) {
6 char* dest = (char*)dest_ptr;
7 char* src = (char*)src_ptr;
8
9 while(size-- > 0) {
10 *dest++ = *src++;
11 }
12 }
13
14 __global__ void testKernel1(int* arg, void* mem)
15 {
16 int temp = 20;
17 *(void**)mem = (void*)&temp;
18 }
19
20 __global__ void testKernel2(int* arg, void* mem)
21 {
22 myCopyFunction(arg, *(void**)mem, sizeof(int));
23 }
46 $LDWbegin__Z11testKernel2PiPv:
47 .loc 17 6 0
48 ld.param.u64 %rd1,
[__cudaparm__Z11testKernel2PiPv_arg];
49 .loc 17 7 0
50 ld.param.u64 %rd2,
[__cudaparm__Z11testKernel2PiPv_mem];
51 ld.global.u64 %rd3, [%rd2+0];
52 mov.s64 %rd4, 3;
53 $Lt_1_1794:
54 //<loop> Loop body line 7, nesting depth: 1,
estimated iterations: unknown
55 .loc 17 10 0
56 add.u64 %rd3, %rd3, 1;
57 add.u64 %rd1, %rd1, 1;
58 ld.global.s8 %rh1, [%rd3+-1];
59 st.global.s8 [%rd1+-1], %rh1;
60 .loc 17 9 0
61 sub.u64 %rd4, %rd4, 1;
62 mov.u64 %rd5, -1;
63 setp.ne.u64 %p1, %rd4, %rd5;
64 @%p1 bra $Lt_1_1794;
65 .loc 17 23 0
66 exit;
67 $LDWend__Z11testKernel2PiPv:
68 } // _Z11testKernel2PiPv
The following warning is generated at compilation time:
./test.cu(10): Warning: Cannot tell what pointer points to, assuming global memory space
The $Lt_1_1794 token at line 53 of test.ptx marks the beginning of the while loop of the inlined myCopyFunction() function. From line 56 to 59 there is the body of the loop, and from line 61 to 63 there is the looping condition. @%p1 bra $Lt_1_1794; is a conditional branch instruction because @%p1 denotes a condition on the predicate register %p1.
The faulty instruction is at line 58: ld.global.s8. As the compiler warned, it assumed global memory, while the correct instruction would have been ld.local.s8. This problem is solved in the Fermi architecture thanks to unified pointers. The following example is the same as the previous one except for the PTX 2 target.
5 __device__ void myCopyFunction(void* dest_ptr,
void* src_ptr, size_t size) {
6 char* dest = (char*)dest_ptr;
7 char* src = (char*)src_ptr;
8
9 while(size-- > 0) {
10 *dest++ = *src++;
11 }
12 }
13
14 __global__ void testKernel1(int* arg, void* mem)
15 {
16 int temp = 20;
17 *(void**)mem = (void*)&temp;
18 }
19
20 __global__ void testKernel2(int* arg, void* mem)
21 {
22 myCopyFunction(arg, *(void**)mem, sizeof(int));
23 }
88 $LDWbegin__Z11testKernel2PiPv:
89 .loc 17 6 0
90 ld.param.u64 %rd1,
[__cudaparm__Z11testKernel2PiPv_arg];
91 .loc 17 7 0
92 ld.param.u64 %rd2,
[__cudaparm__Z11testKernel2PiPv_mem];
93 ldu.global.u64 %rd3, [%rd2+0];
94 mov.s64 %rd4, 3;
95 $Lt_2_1794:
96 //<loop> Loop body line 7, nesting depth: 1,
estimated iterations: unknown
97 .loc 17 10 0
98 add.u64 %rd3, %rd3, 1;
99 add.u64 %rd1, %rd1, 1;
100 ld.s8 %r1, [%rd3+-1];
101 st.global.s8 [%rd1+-1], %r1;
102 .loc 17 9 0
103 sub.u64 %rd4, %rd4, 1;
104 mov.u64 %rd5, -1;
105 setp.ne.u64 %p1, %rd4, %rd5;
106 @%p1 bra $Lt_2_1794;
107 .loc 17 23 0
108 exit;
109 $LDWend__Z11testKernel2PiPv:
110 } // _Z11testKernel2PiPv
In the PTX 2.1 version (see [PTX21]), at line 100 the previous ld.global.s8 was changed into a generic ld.s8, and the compiler no longer complains about the pointer.