Category Archives: CUDA

CUDA Image Compaction

Left: import image; Middle: compacted non-black pixels before sort; Right: compacted pixels sorted by descending luminance values.

This is a test performing stream compaction using CUDA. Stream Compaction is the process to locate sparsely and randomly populated useful elements within a large data stream, then relocate them into a tightly packed and better order.

The process can be described by a few steps:

1. Divide all elements into a number of groups. For each group, count the valid elements inside it.

2. Perform prefix sum (scan all groups) to find out start location of each group in the compacted data.

3. For each group, deduce the location of each valid element inside it, then relocate.

4. Sort the packed valid elements if necessary.

Each step is processed by a different CUDA kernel. For example:

__global__ void
 count_valid_elements(uchar4 *color, uint *num_valid, uint group_size)
 {
 uint pos = blockIdx.x * blockDim.x + threadIdx.x;
 uint count = 0;
 for(uint i = 0; i < group_size; i++)
 {
 uint pix_loc = pos * group_size + i;
 if(color[pix_loc].w > some_threshold)
 count++;
 }
 num_valid[pos] = count;
 }

This kernel is launch per-group, and counting is done sequentially. No shared memory is involved. Another example:

__global__ void
 relocate(uint *prefix_sum, uchar4 *color, uchar4 *data, uint group_size)
 {
 uint pos = blockIdx.x * blockDim.x + threadIdx.x;
 uint offset = prefix_sum[pos];
 for(uint i = 0; i < group_size; i++)
 {
 uint pix_loc = pos * group_size + i;
 if(color[pix_loc].w > some_threshold)
 {
 data[offset] = color[pix_loc];
 offset++;
 }
 }
 }

In the relocate kernel, each group already knows where it should start in the result data, so simply add up the offset for each valid element and write the memory. Collision (different threads write to the same location) will not happen.

Here I use luminance value of the pixel to decide if the element is valid. Pixels too dark will be eliminated, and the rest will be packed. If most pixels are valid, you can still see parts from the original image before sort.

Again, luminance value is used as the key of sort. To find out how many valid elements need to be sort is a bit tricky. The clue is offered by the last group. Add last valid count and last prefix sum, and you have the number of valid elements. Device memory access is provided by Thrust:

thrust::device_ptr<uint>scanPtr(scan);
thrust::device_ptr<uint>countPtr(count);
uint num_valid_element = scanPtr[num_group-1] + countPtr[num_group-1];

Those mostly dark images will be heavily reduced and packed.

So through all those difficulties, you have reduced a few beautiful images from HST into a mush of meaningless dots, what’s the point? Well, stream compaction is essential to many more interesting applications. You can use it to reduce irrelevant elements, say particles outside view frustum, so expensive calculation will be wasted on them. It is also useful to mesh subdivision/tessellation, collision detection, hierarchical data construction.  A few hints here.

Advertisements

Leave a comment

Filed under CUDA