When implementing high-performance algorithms on NVIDIA GPUs, the CUB library is essential. However, its traditional 'two-phase' API, which separates the temporary storage size query from allocation and execution, often leads to cluttered code. This post dives into the new single-call API available from CUDA 13.1 and how it improves the developer experience. For the original announcement, see the source material on the NVIDIA developer blog.

The Problem with the Two-Phase API

The old pattern required two calls: the first to query the required temporary storage size, and the second, after the caller allocates that storage, to run the algorithm.

// FIRST CALL: query the required temporary storage size (storage pointer is nullptr)
void *d_temp_storage = nullptr;
size_t temp_storage_bytes = 0;
cub::DeviceScan::ExclusiveSum(d_temp_storage, temp_storage_bytes, d_input, d_output, num_items);
// Allocate the required temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);
// SECOND CALL: run the actual scan
cub::DeviceScan::ExclusiveSum(d_temp_storage, temp_storage_bytes, d_input, d_output, num_items);
// Free the temporary storage
cudaFree(d_temp_storage);

While this approach offered flexibility, it produced repetitive boilerplate and left it ambiguous which parameters were allowed to change between the two calls.

Benefits of the New Single-Call API

The new API condenses the whole pattern into a single call. Temporary storage is handled internally, with zero performance overhead compared to the two-phase form.
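
As a minimal end-to-end sketch (the exact overload is an assumption here: the same problem parameters as before, minus the two temporary-storage arguments):

#include <cub/cub.cuh>
#include <cstdio>

int main() {
    constexpr int num_items = 8;
    int h_input[num_items] = {1, 2, 3, 4, 5, 6, 7, 8};
    int *d_input = nullptr, *d_output = nullptr;
    cudaMalloc(&d_input, num_items * sizeof(int));
    cudaMalloc(&d_output, num_items * sizeof(int));
    cudaMemcpy(d_input, h_input, num_items * sizeof(int), cudaMemcpyHostToDevice);

    // One call: no size query, no explicit temporary-storage allocation
    cub::DeviceScan::ExclusiveSum(d_input, d_output, num_items);

    int h_output[num_items];
    cudaMemcpy(h_output, d_output, num_items * sizeof(int), cudaMemcpyDeviceToHost);
    for (int v : h_output) printf("%d ", v);  // expected: 0 1 3 6 10 15 21 28
    printf("\n");

    cudaFree(d_input);
    cudaFree(d_output);
    return 0;
}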

| Feature              | Legacy Two-Phase API        | New Single-Call API        |
|----------------------|-----------------------------|----------------------------|
| Call count           | 2 (size query + execution)  | 1                          |
| Explicit allocation  | Required (cudaMalloc)       | Not required (internal)    |
| Code readability     | Low (boilerplate)           | High                       |
| Performance overhead | None                        | None                       |
| Execution control    | Limited                     | Flexible via env argument  |

The new API maintains flexibility through the env argument, allowing you to pass memory resources or a CUDA stream.
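
For example, a stream can be supplied as the environment. A sketch, assuming cuda::stream_ref (from libcu++'s <cuda/stream_ref> header) is accepted as the env argument, with d_input, d_output, and num_items set up as in the earlier examples:

cudaStream_t stream;
cudaStreamCreate(&stream);

// Run the scan on a user-provided stream instead of the default stream;
// treating cuda::stream_ref as the env type here is an assumption
cub::DeviceScan::ExclusiveSum(d_input, d_output, num_items, cuda::stream_ref{stream});

cudaStreamSynchronize(stream);
cudaStreamDestroy(stream);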

Practical Tips and Conclusion

Key algorithms such as cub::DeviceReduce::Sum and cub::DeviceScan::ExclusiveSum now support the new interface. Through the env argument, you can control the execution environment precisely, combining a custom memory pool with a specific stream. This new standard lets you write cleaner code without sacrificing performance, so start using it today with CUDA 13.1 or later.
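
Under the same assumptions, a sum reduction collapses the same way (d_sum is a hypothetical device pointer with room for a single int result):

// Single-call sum reduction into d_sum; the overload shape is assumed
// to mirror the scan above
cub::DeviceReduce::Sum(d_input, d_sum, num_items);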