GPGPU (Many-Core) promises speedups that can reach an order of magnitude over current CPU (Multicore) architectures. GPU computing is a quite recent phenomenon (as of 2010). Previously these processing units were dedicated to 2D/3D rendering and some specifically wired video acceleration. Each new version of the 3D rendering standard specifications (OpenGL and DirectX) added new features, especially regarding shader capabilities, and GPUs thus became suitable for general purpose stream processing.
Memory accesses are among the slowest operations of a processor, because Moore's law has increased instruction performance at a much greater rate than memory performance. This growing gap means that memory operations have become increasingly expensive compared to simple register-to-register instructions. Modern CPUs therefore sport large caches in order to reduce the overhead of these expensive memory accesses.
GPUs use another strategy to cope with this issue: massive parallelism can "feed" the GPU with enough computational operations while pending memory operations finish. This different execution strategy calls for new implementation ideas.
The GPU is especially well suited to problems that can be expressed as data-parallel computations (the same program is executed on many data elements in parallel) with high arithmetic intensity (the ratio of arithmetic operations to memory operations). This architecture was designed for image rendering (3D) and processing (video playback), but data-parallel processing can also be found in physics simulation, signal processing, computational finance or biology. A data-parallel algorithm is also referred to as embarrassingly parallel. Such algorithms can be accelerated radically using a GPU.
Since a GPU implementation is still highly dependent on the underlying hardware, the purpose of this chapter is to show the evolution of graphics hardware, which explains its current limitations.
This section explains how graphics processing units became a suitable architecture for general purpose parallel processing.
The graphics adapter in PC-compatible computers started as a simple memory-mapped frame buffer. These adapters became devices with 2D and 3D hardware acceleration, containing their own memory for the frame buffer. This memory has two accesses: the host system can read/write pictures into it, and the video output hardware reads it in order to create a signal for the display device(s).
A quick historical review is shown in the following timeline figure (the evolution of the Direct3D pipeline is based on [THOMSON06]):
NV1 is the first product of nVIDIA (released in 1995) and uses its own homemade graphics pipeline based on quadratic surfaces.

NV3 is the second product of nVIDIA (1997) and uses a totally different pipeline based on polygon surfaces so that it can match Microsoft DirectX 5.0 requirements.

NV10 is the third generation of nVIDIA GPUs (1999). New fixed functions have been added for DirectX 7.0 compliance, such as hardware Transform and Lighting (T&L) functions.
| shader version | 1.1 | 2.0 | 2.x | 3.0 |
|---|---|---|---|---|
| instruction count | 128 | 256 | ≥ 256 | ≥ 512 |
| shader version | 2.0 | 2.x | 3.0 |
|---|---|---|---|
| call nesting | 1 | 1-4 | 4 |
| static conditions | 16 | 16 | 24 |
| dynamic conditions | n/a | 0-24 | 24 |
| loop nesting | 1 | 1-4 | 4 |
| static flow count | 16 | 16 | ∞ |
The "Little law" from Sylvain Collange: data = throughput × latency
The evolution of GPU capabilities toward more general purpose computation and the evolution of CPU capabilities toward more parallel and specialized computation lead to the situation shown in the figure beside: a convergence of CPU and GPU.
CPUs achieve good latency thanks to wide and coherent caches, while GPUs achieve good throughput thanks to massive, hardware-handled parallelism.
Tesla is the product name of the first GPGPU-only device made by nVIDIA. It also became the name of the first CUDA-compliant GPU architecture. All GPUs having compute capability 1.0 to 1.3 belong to the Tesla architecture, even though they are sold under other product names than Tesla (e.g. GeForce, Quadro).
Unlike current CPU dies, GPU dies do not feature big cache memory areas: most of the transistors are used for computation. Processing units are divided and nested into several groups and subgroups. From a die point of view, the main division of the Tesla architecture is the Thread Processing Cluster (TPC) [WONG10]. Each TPC is linked to an interconnection network (probably of crossbar type according to [COLLANGE10]) which also links the DRAM memory controllers. Memory is partitioned across 1 to 8 controllers, and the memory chips are located outside the GPU chip. Each TPC contains 2 or 3 streaming multiprocessors (SM).
Using the GPU from the host for general purpose computation is generally performed in three steps: data copy from host to device memory, execution on the device, and data copy from device back to host. This reveals the first issue of GPGPU computing: the time wasted in memory transfers.
A GPU is connected to the host through a high-speed bus such as PCI Express 16x. This bus is mainly used for DMA transfers between CPU and GPU memories, since neither is able to directly address the memory space of the other.
There are several options for data transfer. When host memory is allocated with the standard malloc() function, the memory is pageable: the OS may allocate more memory than physically available, and pages are swapped on-the-fly between the hard drive and main host memory. When performing a memory copy between host and device from such pageable memory, the device has to poll the CPU to know when the transfer has ended (source [FARIAS]).
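A minimal sketch of the two host-allocation options, assuming the CUDA runtime (buffer names and sizes are placeholders): pageable memory from malloc() versus page-locked (pinned) memory from cudaMallocHost(), the latter being the kind of memory mentioned again in the Fermi copy-engine discussion below.

```
/* sketch: pageable vs page-locked host buffers for host<->device transfers */
#include <cuda_runtime.h>
#include <stdlib.h>

void transferSketch(size_t bytes) {
    float *h_pageable = (float*)malloc(bytes);    /* pageable host memory */
    float *h_pinned;
    cudaMallocHost((void**)&h_pinned, bytes);     /* page-locked (pinned) host memory */

    float *d_buf;
    cudaMalloc((void**)&d_buf, bytes);

    /* both copies are valid; the pinned one can be DMA'd without extra staging */
    cudaMemcpy(d_buf, h_pageable, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_buf, h_pinned,   bytes, cudaMemcpyHostToDevice);

    cudaFree(d_buf);
    cudaFreeHost(h_pinned);
    free(h_pageable);
}
```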
Once data and program have been loaded into GPU memory, execution can start, since the number of stream multiprocessors and of threads per SM is decided in advance. On the Tesla architecture, an SM can have 768 or 1024 simultaneously active threads depending on the chip. The number of SMs depends on the GPU: an entry-level GPU has a few SMs while a high-end GPU has many of them. The GPU of the running example has 24 multiprocessors, therefore it can deal with 24576 active threads. Of course, not all threads are executed at the same time since a multiprocessor has only eight cores, so-called CUDA cores. Following the same example, 192 threads are executed at each clock cycle; in the worst case, 128 cycles are required to execute one instruction of all active threads. The GPU can perform zero-cost hardware-based context switching and immediately switch to another thread to process.
A big difference between CPU and GPU is the memory hierarchy. Registers and shared memory are extremely fast while global memory is several hundred times slower. It is worth noting that the shared memory is not a hardware cache but a scratchpad memory [W-SCRPAD]. Each SM has such a local high-speed internal memory. It can be seen as an L1 cache, but there are crucial differences: explicit instructions are required to move data from global memory to shared memory, and there is no coherency between the scratchpads and global memory. When used as a cache, the scratchpad's content will not be updated if global memory changes. Shared memory is considered as fast as registers as long as there are no bank conflicts among threads.
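The following hypothetical kernel is a minimal sketch of that explicit data movement: each block copies a tile of global memory into its scratchpad, synchronizes, and then reads the tile at shared-memory speed (the kernel name and tile size are illustrative, not part of the running example).

```
#define TILE 256  /* threads per block in this sketch */

/* reverse each TILE-sized segment of the input, staging it in shared memory;
   assumes the array length is a multiple of TILE */
__global__ void reverseTile(float* out, const float* in) {
    __shared__ float tile[TILE];            /* per-SM scratchpad, not a hardware cache */
    int i = blockIdx.x * TILE + threadIdx.x;

    tile[threadIdx.x] = in[i];              /* explicit copy: global -> shared */
    __syncthreads();                        /* the whole block must see the filled tile */

    out[i] = tile[TILE - 1 - threadIdx.x];  /* fast read, no bank conflict here */
}
```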
Global memory is linked to the GPU chip through a very wide data path: up to 512 bits. With such a bus width, sixteen consecutive 32-bit words can be fetched from global memory in a single cycle. The Quadro FX4800 of my running example has a 384-bit bus width and GDDR3 memory at 1.6 GHz, allowing a theoretical memory bandwidth of 76.8 GB/s. In real programs, this bandwidth is difficult to reach since memory accesses have to be aligned and thread accesses have to be coalesced. Coalescing merges the independent memory accesses of the threads into one big access; several thread access patterns are recognized by the coalescing hardware. It also means there is severe bandwidth degradation for strided accesses.
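As a sketch of the difference (hypothetical kernels, not part of the running example), consecutive threads reading consecutive words are coalesced into a few wide transactions, while a stride between threads defeats the coalescing hardware:

```
/* coalesced: thread k reads word k -> merged into wide transactions */
__global__ void coalescedCopy(float* out, const float* in) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];
}

/* strided: thread k reads word k*stride -> many separate transactions */
__global__ void stridedCopy(float* out, const float* in, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    out[i] = in[i];
}
```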
CUDA cores can handle integer and single-precision floating-point operations. When a thread encounters a transcendental or double-precision operation, CUDA cores cannot be used: each SM has two special function units (SFU) for transcendental operations and one double-precision unit. Transcendental operations are, for example, sine, cosine or logarithm instructions. The SFUs also compute integer multiplications when 32-bit precision is required (instead of 24-bit). Obviously, in such a case, the SM cannot perform more than two (SFU) or one (double-precision) operations per cycle. Tesla-generation GPUs which support double precision are eight times slower in double than in single precision [COLLANGE10]. According to nVIDIA documentation [GTX200], "double-precision performance of all 10 TPCs of a GeForce GTX 280 GPU is roughly equivalent to an eight-core Intel Xeon CPU, yielding up to 78 gigaflops".
While the product name of the CUDA-only devices remained Tesla, the name of the latest architecture changed to Fermi. All GPUs having compute capability 2.0 and higher (and probably below 3.0) are based on the Fermi architecture.
The Fermi architecture introduces a 768 KB fully coherent L2 cache common to all SMs (referred to as a unified cache). This new cache has the same purpose as a CPU L2 cache and is a significant change from the previous architecture. This key feature shows a real general purpose orientation of the nVIDIA GPU: this cache does not accelerate graphics computation.
Another non-graphical feature is that all memories (from registers to DRAM) can be protected by ECC (Error-Correcting Code).
The Thread Processing Cluster division disappeared.
A unified address space enables full C++ support [FERMI]. In the Tesla architecture, linear memory was addressed using a 32-bit address space and load/store instructions were specific to each memory area. The Fermi architecture uses a 40-bit unified address space, and load/store instructions can use a 64-bit address width for future growth.
The former shared memory space became a configurable scratchpad/L1 cache. The Fermi architecture offers two configuration options: 48 KB of shared memory with 16 KB of L1 cache, or 16 KB of shared memory with 48 KB of L1 cache.
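A minimal sketch of how this split can be requested per kernel through the CUDA runtime; someKernel is a hypothetical kernel and the setting is only a preference hint to the driver:

```
__global__ void someKernel(float* data);  /* hypothetical kernel, defined elsewhere */

void configureCache(void) {
    /* prefer 48 KB of shared memory and 16 KB of L1 cache for someKernel */
    cudaFuncSetCacheConfig(someKernel, cudaFuncCachePreferShared);

    /* or prefer 16 KB of shared memory and 48 KB of L1 cache */
    /* cudaFuncSetCacheConfig(someKernel, cudaFuncCachePreferL1); */
}
```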
This new architecture upgrades memory from GDDR3 to GDDR5. Despite a narrower memory bus (256-bit versus 384-bit), the card of the figure has a theoretical memory bandwidth of 102.6 GB/s.
Devices of compute capability 2.x can perform a copy from page-locked host memory to device memory concurrently with a copy from device memory to page-locked host memory, because GPUs of the Fermi architecture have two copy engines. Since the PCI Express bus is full duplex, the total bandwidth is doubled in such a case.
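A minimal sketch of this overlap using asynchronous copies in two CUDA streams; it assumes page-locked host buffers and device buffers have already been allocated (the names below are placeholders):

```
/* overlap H2D and D2H copies; host buffers must be page-locked */
void overlappedCopies(float* d_in, const float* h_in,
                      float* h_out, const float* d_out, size_t bytes) {
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    /* on a 2.x device each direction can be served by its own copy engine */
    cudaMemcpyAsync(d_in,  h_in,  bytes, cudaMemcpyHostToDevice, s0);
    cudaMemcpyAsync(h_out, d_out, bytes, cudaMemcpyDeviceToHost, s1);

    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
}
```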
CUDA stands for Compute Unified Device Architecture. CUDA is an extension of the C language for GPGPU computing. The programming model is tightly coupled with the architecture of nVIDIA graphics cards: concepts of the programming model can be mapped to the hardware implementation.
1 /* GPU kernel */
2 __global__ void vecAdd(float* a, float* b, float* c) {
3 int i = blockDim.x * blockIdx.x + threadIdx.x;
4 c[i] = a[i] + b[i];
5 }
6
7 /* function relying on GPU */
8 void parallelAdd(int size, float* h_a, float* h_b, float* h_c) {
9 size_t vector_byte_size = size * sizeof(float);
10 float *d_a, *d_b, *d_c;
11
12 /* STEP 1: memory allocation for GPU exec. */
13 /* allocate some GPU memory and copy input vectors */
14 cudaMalloc(&d_a, vector_byte_size);
15 cudaMalloc(&d_b, vector_byte_size);
16 cudaMalloc(&d_c, vector_byte_size);
17 cudaMemcpy(d_a, h_a, vector_byte_size, cudaMemcpyHostToDevice);
18 cudaMemcpy(d_b, h_b, vector_byte_size, cudaMemcpyHostToDevice);
19
20 /* STEP 2: execution on device */
21 vecAdd<<<2,size/2>>>(d_a, d_b, d_c);
22
23 /* STEP 3: results retrieval */
24 /* copy GPU result to CPU memory and free GPU memory */
25 cudaMemcpy(h_c, d_c, vector_byte_size, cudaMemcpyDeviceToHost);
26 cudaFree(d_a);
27 cudaFree(d_b);
28 cudaFree(d_c);
29 }
30
31 /* function relying on CPU */
32 void serialAdd(int size, float* a, float* b, float* c) {
33 int i;
34 for(i=0; i<size; i++) {
35 c[i] = a[i] + b[i];
36 }
37 }
38
39 int main()
40 {
41 int vector_size = 102;
42 size_t vector_byte_size = vector_size * sizeof(float);
43 float *a, *b, *c;
44
45 a = (float*)malloc(vector_byte_size);
46 b = (float*)malloc(vector_byte_size);
47 c = (float*)malloc(vector_byte_size);
48
49 /* put some values into a and b vectors */
50 [...]
51
52 /* serialAdd(vector_size, a, b, c); */
53 parallelAdd(vector_size, a, b, c);
54
55 /* now c = a + b and c can be used */
56 [...]
57
58 free(a);
59 free(b);
60 free(c);
61
62 return 0;
63 }
The code example on the right will be used to show the mapping between the CUDA programming model and the GPU hardware execution model. This example also shows how part of a program can be accelerated on a GPGPU using the CUDA toolkit.
From line 1 to 5 there is the GPU code (aka. device code); everything else is executed on the CPU (aka. host). The parallelAdd() function is an ad hoc replacement of the serialAdd() function: parallelAdd() deals with all GPU specificities. As explained previously, the execution is performed in three steps. In the first step, cudaMalloc() and cudaMemcpy() allocate and load the required input data in GPU global memory; in this parallelAdd() function there are both host (prefixed with h_) and device (prefixed with d_) pointers. The second step is the kernel call. A kernel is a program executed on the GPU. The calling syntax uses a CUDA-specific notation: <<<2, size/2>>> specifies the number of blocks (2) and the number of threads per block (size/2).
Like processing units in hardware, threads are also divided into groups. The following schema shows one grouping possibility. Grid and block sizes are defined by the developer at kernel launch. The warp size is fixed by hardware and is 32 for both Tesla and Fermi, but this size might change in future architectures according to the nVIDIA CUDA documentation.
Grouping possibilities for 102 threads are <<<1, 102>>>, <<<2, 51>>>, <<<3, 34>>> and <<<6, 17>>>. Since the basic unit of execution flow in a multiprocessor is a warp of 32 threads, it is useless to execute fewer than 32 threads in a block. Actually, if there were a condition preventing out-of-bound accesses to the vector arrays when the number of threads exceeds the vector size, we could use any <<<x, ⌈size/x⌉>>>, as the sketch below shows.
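A minimal sketch of such a bounds-checked variant (hypothetical kernel name; the extra size parameter is not part of the original example):

```
__global__ void vecAddBounded(float* a, float* b, float* c, int size) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < size) {          /* surplus threads of the last block simply do nothing */
        c[i] = a[i] + b[i];
    }
}

/* e.g. two warps per block:
   vecAddBounded<<<(size + 63) / 64, 64>>>(d_a, d_b, d_c, size); */
```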
The main difference between the Tesla and Fermi architectures is in the way they execute warps of threads.
On a Tesla multiprocessor, the eight cores execute a warp in four clock cycles (8 threads at a time). On a Fermi multiprocessor, each group of 16 cores executes a different warp; therefore, two warps are executed in two clock cycles (twice 16 threads at a time).
A question about divergent threads arises from this execution strategy: what happens when threads within a warp do not execute the same code?
In the following example, the execution path depends on the thread id. This case is handled differently on Tesla and Fermi.
1 /* GPU kernel */
2 __global__ void vecAdd(float* a, float* b, float* c) {
3 int i = blockDim.x * blockIdx.x + threadIdx.x;
4
5 if(i & 1) { /* if i is odd */
6 c[i] = a[i] + b[i];
7 } else {
8 c[i] = a[i] - b[i];
9 }
10 }
On the Tesla architecture, the conditional branches are serialized. According to [WONG10], the "else" clause is always executed first while the threads taking the other path are disabled, then the "if" clause is executed (with the "else" threads disabled).
The Fermi architecture mitigates this issue because each multiprocessor features a dual warp scheduler: each group of 16 cores can execute a different conditional branch. Our divergent thread example would therefore be executed in parallel on the Fermi architecture only. Of course, this works at half-warp granularity only.
On devices of compute capability 2.x, function calls are available. The size of the call stack can be queried using cudaThreadGetLimit() and set using cudaThreadSetLimit().
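A minimal sketch of that query/set pair (the 4 KB value is an arbitrary illustration):

```
size_t stack_size;
cudaThreadGetLimit(&stack_size, cudaLimitStackSize);  /* current per-thread stack size */
cudaThreadSetLimit(cudaLimitStackSize, 4 * 1024);     /* request a 4 KB stack per thread */
```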
Despite using the same name, the word thread has different meanings on CPU and on GPU, which can lead to misunderstanding. Unlike CPU threads, GPU threads are managed by hardware, and classical thread programming techniques do not match GPU thread design: translating POSIX thread (Linux) or Windows thread models to CUDA threads is neither obvious nor recommended. I quote this analogy since I think it helped me grasp the difference more clearly: "the definition that applies to CUDA is threads in a fabric running lengthwise" [PPBLOG]. In other words, CUDA threads should not diverge for optimal performance. Divergent threads are not impossible to implement, but they can dramatically lower performance.
Single-Instruction Multiple-Thread (SIMT) is the name given by nVIDIA to this execution strategy. Every thread in the same warp executes the same instruction in lockstep, but threads can branch separately, although this leads to extremely bad performance even on the Fermi architecture.
Both the Tesla and Fermi architectures have several memory areas with different memory consistency models. The CUDA toolkit is very system-centric (according to the [GHARACH95] definition) and memory management can quickly become a real nightmare for programmers.
A state space is a storage area with particular characteristics.
name | addressable | access | cost | lifetime | description |
---|---|---|---|---|---|
.reg | No | R/W | 0 | thread | registers |
.sreg | No | RO | 0 | thread | special registers (platform-specific) |
.const | Yes | RO | 0* | application | constant memory |
.global | Yes | R/W | > 100 | application | global memory |
.local | Yes | R/W | > 100 | thread | local memory (private to each thread) |
.param | Yes | RO | 0 | | user parameters for a program |
.shared | Yes | R/W | 0 | block | shared memory |
.surf | via surface instruction | R/W | > 100 | application | surface memory (global memory) |
.tex | via texture instruction | RO | > 100 | application | texture memory (global memory) |
Each addressable state space has its own address space, which means the memory load instruction differs according to the memory space. For each pointer dereference in the code, the nVIDIA compiler infers at compilation time which memory space is the right one. This process may fail, as in the example below.
5 __device__ void myCopyFunction(void* dest_ptr,
void* src_ptr, size_t size) {
6 char* dest = (char*)dest_ptr;
7 char* src = (char*)src_ptr;
8
9 while(size-- > 0) {
10 *dest++ = *src++;
11 }
12 }
13
14 __global__ void testKernel1(int* arg, void* mem)
15 {
16 int temp = 20;
17 *(void**)mem = (void*)&temp;
18 }
19
20 __global__ void testKernel2(int* arg, void* mem)
21 {
22 myCopyFunction(arg, *(void**)mem, sizeof(int));
23 }
46 $LDWbegin__Z11testKernel2PiPv:
47 .loc 17 6 0
48 ld.param.u64 %rd1,
[__cudaparm__Z11testKernel2PiPv_arg];
49 .loc 17 7 0
50 ld.param.u64 %rd2,
[__cudaparm__Z11testKernel2PiPv_mem];
51 ld.global.u64 %rd3, [%rd2+0];
52 mov.s64 %rd4, 3;
53 $Lt_1_1794:
54 //<loop> Loop body line 7, nesting depth: 1,
estimated iterations: unknown
55 .loc 17 10 0
56 add.u64 %rd3, %rd3, 1;
57 add.u64 %rd1, %rd1, 1;
58 ld.global.s8 %rh1, [%rd3+-1];
59 st.global.s8 [%rd1+-1], %rh1;
60 .loc 17 9 0
61 sub.u64 %rd4, %rd4, 1;
62 mov.u64 %rd5, -1;
63 setp.ne.u64 %p1, %rd4, %rd5;
64 @%p1 bra $Lt_1_1794;
65 .loc 17 23 0
66 exit;
67 $LDWend__Z11testKernel2PiPv:
68 } // _Z11testKernel2PiPv
The following warning is generated at compilation time:
./test.cu(10): Warning: Cannot tell what pointer points to, assuming global memory space
The $Lt_1_1794 token at line 53 of test.ptx marks the beginning of the while loop of the inlined myCopyFunction() function. From line 56 to 59 there is the body of the loop, and from line 61 to 63 there is the looping condition. @%p1 bra $Lt_1_1794; is a conditional branch instruction because @%p1 denotes a condition on the predicate register %p1.
The faulty instruction is at line 58: ld.global.s8. As the compiler warned, it assumed global memory, while the correct instruction would have been ld.local.s8. This problem is solved in the Fermi architecture thanks to unified pointers. The following example is the same as the previous one except for the PTX 2 target.
5 __device__ void myCopyFunction(void* dest_ptr,
void* src_ptr, size_t size) {
6 char* dest = (char*)dest_ptr;
7 char* src = (char*)src_ptr;
8
9 while(size-- > 0) {
10 *dest++ = *src++;
11 }
12 }
13
14 __global__ void testKernel1(int* arg, void* mem)
15 {
16 int temp = 20;
17 *(void**)mem = (void*)&temp;
18 }
19
20 __global__ void testKernel2(int* arg, void* mem)
21 {
22 myCopyFunction(arg, *(void**)mem, sizeof(int));
23 }
88 $LDWbegin__Z11testKernel2PiPv:
89 .loc 17 6 0
90 ld.param.u64 %rd1,
[__cudaparm__Z11testKernel2PiPv_arg];
91 .loc 17 7 0
92 ld.param.u64 %rd2,
[__cudaparm__Z11testKernel2PiPv_mem];
93 ldu.global.u64 %rd3, [%rd2+0];
94 mov.s64 %rd4, 3;
95 $Lt_2_1794:
96 //<loop> Loop body line 7, nesting depth: 1,
estimated iterations: unknown
97 .loc 17 10 0
98 add.u64 %rd3, %rd3, 1;
99 add.u64 %rd1, %rd1, 1;
100 ld.s8 %r1, [%rd3+-1];
101 st.global.s8 [%rd1+-1], %r1;
102 .loc 17 9 0
103 sub.u64 %rd4, %rd4, 1;
104 mov.u64 %rd5, -1;
105 setp.ne.u64 %p1, %rd4, %rd5;
106 @%p1 bra $Lt_2_1794;
107 .loc 17 23 0
108 exit;
109 $LDWend__Z11testKernel2PiPv:
110 } // _Z11testKernel2PiPv
In the PTX 2.1 version (see [PTX21]), at line 100 the previous ld.global.s8 was changed into a generic ld.s8, and the compiler no longer complains about the pointer.