Blocks in CUDA operate semi-independently. There is no safe way to synchronize them all. However, this does not mean that they cannot interact with each other at all.
Consider a grid working on some task, e.g. a parallel reduction over a large array. Initially, each block can do its work independently, producing a partial result. At the end, however, the partial results need to be combined and merged together.
A typical approach is to invoke two kernels, one for the partial computation and the other for merging.
However, if the merging can be done efficiently by a single block, only one kernel call is required.
This is achieved by a lastBlock guard defined as:
__device__ bool lastBlock(int* counter) {
    __threadfence(); //ensure that partial result is visible by all blocks
    int last = 0;
    if (threadIdx.x == 0)
        last = atomicAdd(counter, 1); //counter must be zero-initialized before the kernel launch
    return __syncthreads_or(last == gridDim.x-1);
}
An alternative version, which does not rely on __syncthreads_or (available only on devices of compute capability 2.0 and higher), keeps the flag in shared memory instead:
__device__ bool lastBlock(int* counter) {
    __shared__ int last;
    __threadfence(); //ensure that partial result is visible by all blocks
    if (threadIdx.x == 0) {
        last = atomicAdd(counter, 1);
    }
    __syncthreads();
    return last == gridDim.x-1;
}
With such a guard the last block is guaranteed to see all the results produced by all other blocks and can perform the merging.
__device__ void computePartial(T* out) { ... }
__device__ void merge(T* partialResults, T* out) { ... }

__global__ void kernel(int* counter, T* partialResults, T* finalResult) {
    computePartial(&partialResults[blockIdx.x]);
    if (lastBlock(counter)) {
        //this is executed by all threads of the last block only
        merge(partialResults, finalResult);
    }
}
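Both variants count blocks as they finish, so the guard works only if *counter is zero when the kernel starts. A minimal host-side sketch is shown below; the names are hypothetical, T is assumed to be float, and numBlocks/threadsPerBlock are chosen by the caller:
int* dCounter;
float* dPartials; //one partial result per block
float* dResult;
cudaMalloc(&dCounter, sizeof(int));
cudaMalloc(&dPartials, numBlocks * sizeof(float));
cudaMalloc(&dResult, sizeof(float));
cudaMemset(dCounter, 0, sizeof(int)); //reset before every launch - the guard counts up to gridDim.x-1
kernel<<<numBlocks, threadsPerBlock>>>(dCounter, dPartials, dResult);

float result;
cudaMemcpy(&result, dResult, sizeof(float), cudaMemcpyDeviceToHost);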
Assumptions:
- The lastBlock function is invoked uniformly by all threads in all blocks.
- T names any type you like, but the example is not intended to be a template in the C++ sense.

Consider an array of work items. The time needed for each work item to complete varies greatly. In order to balance the work distribution between blocks, it may be prudent for each block to fetch the next item only when the previous one is complete. This is in contrast to assigning items to blocks a priori.
class WorkQueue {
private:
    WorkItem* gItems;
    unsigned long long totalSize;
    unsigned long long current; //unsigned long long to match the device atomicAdd overload
public:
    __device__ WorkItem& fetch() {
        __shared__ WorkItem item;
        if (threadIdx.x == 0) {
            unsigned long long itemIdx = atomicAdd(&current, 1ULL); //atomicAdd takes a pointer to the counter
            if (itemIdx < totalSize)
                item = gItems[itemIdx];
            else
                item = WorkItem::none();
        }
        __syncthreads();
        return item; //returning reference to smem - ok
    }
};
Assumptions:
- WorkItem is a small representation of the work assignment, e.g. a pointer to another object.
- WorkItem::none() is a static member function that creates a WorkItem object representing no work at all.
- WorkQueue::fetch() must be called uniformly by all threads in the block.
- There must be no two calls to WorkQueue::fetch() without another __syncthreads() in between. Otherwise a race condition will appear!
- The example does not include how to initialize the WorkQueue or populate it. This is done by another kernel or CPU code and should be pretty straight-forward.
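For completeness, a consumer kernel might look roughly like the sketch below. WorkItem::isNone() and doWork() are hypothetical helpers that test for the "no work" sentinel and process one item with the whole block; note the extra __syncthreads() before the next fetch(), required by the assumptions above:
__global__ void processAll(WorkQueue* queue) {
    for (;;) {
        WorkItem& item = queue->fetch(); //uniform call by all threads of the block
        if (item.isNone())               //hypothetical test for the "no work" sentinel
            return;
        doWork(item);                    //hypothetical: the whole block cooperates on one item
        __syncthreads();                 //mandatory before the next fetch() (see assumptions)
    }
}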