CUDA is a proprietary NVIDIA parallel computing technology and programming language for their GPUs.
GPUs are highly parallel machines capable of running thousands of lightweight threads in parallel. Each GPU thread is usually slower to execute and its context is smaller. On the other hand, a GPU is able to run several thousand threads in parallel, and even more concurrently (precise numbers depend on the actual GPU model). CUDA is a C++ dialect designed specifically for the NVIDIA GPU architecture. However, due to the architecture differences, most algorithms cannot simply be copy-pasted from plain C++: they would run, but would be very slow.
The CUDA-enabled GPU processor has the following physical structure:

- the chip - the whole GPU processor, containing a number of streaming multiprocessors;
- a streaming multiprocessor (SM) - each SM executes one or more blocks and has its own registers, caches and shared memory;
- a CUDA core - a single scalar compute unit within an SM; the number of cores per SM depends on the architecture.
In addition, each SM features one or more warp schedulers. Each scheduler dispatches a single instruction to several CUDA cores. This effectively causes the SM to operate in 32-wide SIMD mode.
The physical structure of the GPU directly influences how kernels are executed on the device, and how one programs them in CUDA. A kernel is invoked with a call configuration which specifies how many parallel threads are spawned.
Each thread is identified by a block index blockIdx and a thread index within the block threadIdx. These numbers can be checked at any time by any running thread and are the only way of distinguishing one thread from another.
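As a minimal sketch (the kernel and variable names here are illustrative, not from the original text), the call configuration is passed between triple angle brackets, and each thread combines its block index and thread index to find the element it should work on:

```cuda
// Hypothetical kernel: each thread derives a unique global index
// from its block index and its thread index within the block.
__global__ void fillWithGlobalIndex(int *out, int n)
{
    int globalIdx = blockIdx.x * blockDim.x + threadIdx.x;
    if (globalIdx < n)            // guard against out-of-range threads
        out[globalIdx] = globalIdx;
}

int main()
{
    const int n = 1000;
    int *d_out;
    cudaMalloc(&d_out, n * sizeof(int));

    // Call configuration: 4 blocks of 256 threads = 1024 threads,
    // slightly more than n, hence the range check inside the kernel.
    fillWithGlobalIndex<<<4, 256>>>(d_out, n);
    cudaDeviceSynchronize();

    cudaFree(d_out);
    return 0;
}
```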
In addition, threads are organized into warps, each containing exactly 32 threads. Threads within a single warp execute in perfect sync, in SIMD fashion. Threads from different warps within the same block can execute in any order, but can be forced to synchronize by the programmer. Threads from different blocks cannot be synchronized or interact directly in any way.
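The usual way to force that block-wide synchronization is __syncthreads(). The kernel below is a hypothetical sketch (it assumes blocks of exactly 256 threads) showing why the barrier matters when threads read each other's results:

```cuda
__global__ void shiftWithinBlock(const int *in, int *out)
{
    __shared__ int buffer[256];      // visible to all threads of the block
    int tid = threadIdx.x;

    buffer[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                 // block-wide barrier: every write above
                                     // is visible to every thread below

    // Reading a neighbour's value is safe only because of the barrier.
    int neighbour = buffer[(tid + 1) % blockDim.x];
    out[blockIdx.x * blockDim.x + tid] = neighbour;
}
```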
In normal CPU programming the memory organization is usually hidden from the programmer. Typical programs act as if there was just RAM. All memory operations, such as managing registers, using L1/L2/L3 caches, swapping to disk, etc., are handled by the compiler, operating system or hardware itself.
This is not the case with CUDA. While newer GPU models partially hide the burden, e.g. through Unified Memory in CUDA 6, it is still worth understanding the organization for performance reasons. The basic CUDA memory structure is as follows:

- host memory - the regular RAM, used by the host code and normally not directly accessible by device code;
- device (global) memory - the GPU's main memory; large and visible to every thread, but with relatively high latency;
- shared memory - a small on-chip memory, shared by all threads of a block and much faster than global memory;
- registers and local memory - private to each thread.
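As a hedged sketch of this explicit model (identifiers are illustrative), a typical program allocates device memory, copies the input in, runs the kernel, and copies the result back out:

```cuda
#include <cstdio>

__global__ void doubleElements(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

int main()
{
    const int n = 256;
    float host[n];
    for (int i = 0; i < n; ++i) host[i] = (float)i;

    float *device;
    cudaMalloc(&device, n * sizeof(float));      // device global memory
    cudaMemcpy(device, host, n * sizeof(float), cudaMemcpyHostToDevice);

    doubleElements<<<1, n>>>(device, n);

    cudaMemcpy(host, device, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(device);

    printf("host[3] = %f\n", host[3]);           // expected: 6.0
    return 0;
}
```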
Blocks in CUDA operate semi-independently. There is no safe way to synchronize them all. However, this does not mean that they cannot interact with each other in any way.
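One well-established way for blocks to interact is through atomic operations on global memory. The kernel below is an illustrative sketch: threads from all blocks safely increment a single shared counter:

```cuda
__global__ void countPositive(const float *in, int n, int *counter)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && in[i] > 0.0f)
        atomicAdd(counter, 1);   // safe even across different blocks
}
```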
A parallel reduction algorithm typically refers to an algorithm which combines an array of elements, producing a single result. Typical problems that fall into this category are:

- summing up all elements of an array;
- finding the maximum element of an array.
In general, the parallel reduction can be applied for any binary associative operator, i.e. (A*B)*C = A*(B*C).
With such an operator *, the parallel reduction algorithm repeatedly groups the array arguments in pairs.
Each pair is computed in parallel with others, halving the overall array size in one step.
The process is repeated until only a single element exists.
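A minimal single-block sketch of that pairing scheme (assuming the whole input fits in one block and blockDim.x is a power of two; identifiers are illustrative):

```cuda
// Launch as: reducePairwise<<<1, n, n * sizeof(float)>>>(d_data, d_result);
__global__ void reducePairwise(const float *data, float *result)
{
    extern __shared__ float sdata[];
    int tid = threadIdx.x;
    sdata[tid] = data[tid];
    __syncthreads();

    // Combine adjacent pairs; the live part of the array halves each step.
    for (int stride = 1; stride < blockDim.x; stride *= 2) {
        if (tid % (2 * stride) == 0)
            sdata[tid] += sdata[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        *result = sdata[0];
}
```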
If the operator is commutative (i.e. A*B = B*A) in addition to being associative, the algorithm can pair the elements in a different pattern. From a theoretical standpoint it makes no difference, but in practice it gives a better memory access pattern.
Not all associative operators are commutative - take matrix multiplication for example.
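For a commutative operator like addition, the loop can instead walk strides downward, so that the active threads always stay packed in the low indices and read adjacent shared-memory cells. This is a sketch under the same single-block assumptions as above:

```cuda
// Launch as: reduceSequential<<<1, n, n * sizeof(float)>>>(d_data, d_result);
__global__ void reduceSequential(const float *data, float *result)
{
    extern __shared__ float sdata[];
    int tid = threadIdx.x;
    sdata[tid] = data[tid];
    __syncthreads();

    // Each step pairs element tid with element tid + stride; this reorders
    // the operands, which is only valid because the operator commutes.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            sdata[tid] += sdata[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        *result = sdata[0];
}
```

Because active threads stay contiguous, whole warps retire early instead of idling scattered across the block, and the shared-memory accesses of each step are adjacent.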
To install the CUDA toolkit on Windows, first you need to install a proper version of Visual Studio. Visual Studio 2013 should be installed if you're going to install CUDA 7.0 or 7.5. Visual Studio 2015 is supported for CUDA 8.0 and beyond.
When you have a proper version of VS on your system, it's time to download and install the CUDA toolkit. Follow this link to find the version of the CUDA toolkit you're looking for: CUDA toolkit archive
On the download page you should choose the version of Windows on the target machine, and the installer type (choose local).
After downloading the exe file, extract it and run setup.exe. When the installation is complete, open a new project and choose NVIDIA > CUDA X.X from the templates.
Remember that the CUDA source file extension is .cu. You can write both host and device code in the same source file.
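As a first sanity check after installation, a minimal .cu file mixing host and device code might look like the sketch below (build it from the CUDA project template or with nvcc):

```cuda
#include <cstdio>

// Device code: runs on the GPU.
__global__ void helloKernel()
{
    printf("Hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

// Host code: runs on the CPU and launches the kernel.
int main()
{
    helloKernel<<<2, 4>>>();
    cudaDeviceSynchronize();   // wait for the GPU and flush device printf
    return 0;
}
```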