GPU architecture

Published by fvazquez on Sunday, December 13, 2009

The GPU architecture is composed of a set of processing units called streaming multiprocessors (SMs); each SM contains 8 scalar processors (SPs), or cores. The set of SMs makes up the GPU and is called the device.

Each SM contains:

  1. A set of 32-bit registers per SP
  2. A read/write memory area shared among all the SPs, called shared memory
  3. Two read-only memory areas (constant memory and texture memory) shared among all the SPs of all the SMs

In addition, all the SMs belonging to the device share a global memory area called device memory.

The following figure shows a diagram of the GPU architecture, consisting of 16 SM units with 8 SPs per SM.

gpu_architecture.jpg

The thread processor is the unit responsible for coordinating execution across the SPs of an SM, leading to an execution model called SIMT (Single Instruction, Multiple Threads), which is an extension of the SIMD model (Single Instruction, Multiple Data) in which each thread is responsible for accessing its own data.
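As a minimal sketch of the SIMT idea (the kernel and its arguments are illustrative, not from the original post), every thread of the following kernel executes the same instruction stream, but computes its own index and therefore operates on different data:

    __global__ void scale(float *data, float factor, int n)
    {
        // All threads run this same code; the per-thread index makes
        // each one touch a different element (SIMT).
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)              // threads past the end simply do nothing
            data[i] *= factor;  // same instruction, different operands
    }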

The execution of a CUDA program is based on the simultaneous execution of a batch of threads organized into one-, two-, or three-dimensional units called thread blocks, or simply blocks.

These blocks are in turn organized into larger units (also with one, two, or three dimensions) called grids. Internally, for execution, the threads are organized into warps, sets of 32 threads, so that the architecture completes the same operation on 32 different operands in 4 clock cycles (8 cores × 4 cycles).

The figure below shows an example of the organization of 72 threads into 6 blocks of 12 threads each, and these in turn into a two-dimensional grid of 2 rows × 3 columns; a matching launch configuration is sketched after the figure.

threadblocks.jpg
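As an illustrative sketch (the kernel name is hypothetical), the layout of the figure corresponds to the following launch configuration:

    // 2 rows x 3 columns of blocks, 12 threads per block: 72 threads in total
    dim3 grid(3, 2);   // gridDim.x = 3 columns, gridDim.y = 2 rows
    dim3 block(12);    // one-dimensional blocks of 12 threads each
    myKernel<<<grid, block>>>(/* arguments */);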

Each thread is identified by an index relative to the block and to the grid to which it belongs, yielding identifiers that are unique within each block and among all the threads defined. These identifiers determine the instructions that each thread must execute.
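As a sketch of how a kernel obtains these identifiers for the two-dimensional grid of one-dimensional blocks used above (variable names are illustrative):

    // Index of the block within the 2 x 3 grid, then the unique
    // identifier of the thread among all 72 threads defined.
    int blockId  = blockIdx.y * gridDim.x + blockIdx.x;   // 0 .. 5
    int threadId = blockId * blockDim.x + threadIdx.x;    // 0 .. 71
    int localId  = threadIdx.x;  // identifier within the block, 0 .. 11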

Once the grids and blocks of the application have been defined, the execution of all the blocks is scheduled and distributed among the SMs, so that a complete block runs on a single SM.

Each block is further divided internally into warps, which are the largest unit of concurrent threads that an SM manages. The SM creates, manages, and executes these threads concurrently with no scheduling overhead (there is no cost for context switching between warps).

As the warps within a block finish executing, the SM is assigned new blocks until all of them have been executed.

Memory is allocated as follows: each thread has its own private local memory, organized in 32-bit registers. In total there are 8 K 32-bit registers per SM to be distributed among all the threads of the blocks running on it (devices with compute capability 1.2 or higher expand this number to 16 K).

The threads within a block share a read/write memory area with high bandwidth and low latency called shared memory; this memory has a capacity of 16 KB per SM. In addition, there are two cache memories called constant memory and texture memory.
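As a minimal sketch of how a kernel uses this area (the kernel, its size, and its purpose are illustrative): the __shared__ qualifier places an array in the SM's shared memory, and __syncthreads() ensures all writes by the block are visible before any thread reads them. Here each block of 12 threads reverses its segment of the input:

    __global__ void reverseBlock(float *data)
    {
        __shared__ float tile[12];                    // one element per thread
        int i = threadIdx.x;
        tile[i] = data[blockIdx.x * blockDim.x + i];  // stage through shared memory
        __syncthreads();                              // wait for the whole block
        data[blockIdx.x * blockDim.x + i] = tile[blockDim.x - 1 - i];
    }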

The constant memory cache has a size of 8 KB per SM, and the constant memory itself totals 64 KB. The texture memory cache varies from 6 KB to 8 KB per SM. Both the constant memory and the texture memory are read-only and accessible by all the threads in the grid. Finally, all the SMs share a read/write memory area called device memory, which is used to share information among all the threads.
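A minimal host-side sketch of these two areas (array sizes and names are illustrative): device memory is allocated and copied explicitly with cudaMalloc and cudaMemcpy, while constant memory is declared with the __constant__ qualifier and filled with cudaMemcpyToSymbol:

    __constant__ float coeffs[16];       // read-only for all threads in the grid

    int main(void)
    {
        float h_data[256], h_coeffs[16];
        float *d_data;
        /* ... fill h_data and h_coeffs on the host ... */

        cudaMalloc((void **)&d_data, sizeof(h_data));             // device memory
        cudaMemcpy(d_data, h_data, sizeof(h_data),
                   cudaMemcpyHostToDevice);
        cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));   // constant memory

        /* ... launch kernels that read coeffs and read/write d_data ... */

        cudaMemcpy(h_data, d_data, sizeof(h_data), cudaMemcpyDeviceToHost);
        cudaFree(d_data);
        return 0;
    }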

The figure below shows the memory hierarchy of this architecture.


memory.jpg