When implementing high-performance algorithms on NVIDIA GPUs, the CUB library is essential. However, its traditional two-phase API, which separates temporary-storage sizing from allocation and execution, often leads to cluttered code. This post dives into the new single-call API available starting with CUDA Toolkit 13.1 and how it improves the developer experience. For the original announcement, see the source material on the NVIDIA developer blog.

The Problem with the Two-Phase API
The old pattern required two calls: the first determines the temporary storage size, then the caller allocates that storage and invokes the algorithm a second time to execute it.
```cpp
// FIRST CALL: query the required temporary storage size.
// Passing nullptr as d_temp_storage makes CUB report the size only.
void*  d_temp_storage     = nullptr;
size_t temp_storage_bytes = 0;
cub::DeviceScan::ExclusiveSum(d_temp_storage, temp_storage_bytes,
                              d_input, d_output, num_items);

// Allocate the required temporary storage.
cudaMalloc(&d_temp_storage, temp_storage_bytes);

// SECOND CALL: run the actual scan.
cub::DeviceScan::ExclusiveSum(d_temp_storage, temp_storage_bytes,
                              d_input, d_output, num_items);
```
While this approach offered flexibility, it produced repetitive boilerplate and left it ambiguous which parameters were allowed to change between the two calls.

Benefits of the New Single-Call API
The new API condenses sizing, allocation, and execution into a single call. Temporary storage is handled internally, with no performance overhead compared to the two-phase pattern.
| Feature | Legacy Two-Phase API | New Single-Call API |
|---|---|---|
| Call Count | 2 (Size Query + Execution) | 1 |
| Explicit Allocation | Required (cudaMalloc) | Not Required (Internal) |
| Code Readability | Low (Boilerplate) | High |
| Performance Overhead | None | None |
| Execution Control | Limited | Flexible via env argument |
The new API maintains flexibility through the env argument, allowing you to pass memory resources or a CUDA stream.
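As a rough sketch of what the single-call form looks like, here is the scan from the earlier example rewritten without the size-query step. The overload shown (temporary-storage parameters simply dropped) is an assumption based on the blog's description; check the CUB documentation shipped with CUDA 13.1 for the exact signature.

```cpp
// Hedged sketch of the single-call API: one invocation, no explicit
// temporary-storage management. Signature assumed from the description
// above, not taken verbatim from the CUB headers.
#include <cub/device/device_scan.cuh>

void run_scan(const int* d_input, int* d_output, int num_items)
{
    // CUB sizes and allocates temporary storage internally.
    cub::DeviceScan::ExclusiveSum(d_input, d_output, num_items);
}
```

Compared with the two-phase version, the call site shrinks to a single line and the `d_temp_storage` / `temp_storage_bytes` boilerplate disappears entirely.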

Practical Tips and Conclusion
Key algorithms such as cub::DeviceReduce::Sum and cub::DeviceScan::ExclusiveSum now support the new interface. Through the env argument, you can finely control the execution environment, for example by combining a custom memory pool with a specific stream. This new standard lets you write cleaner code without sacrificing performance. Start using it today with CUDA 13.1 or later.
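To illustrate the env argument, the sketch below forwards a CUDA stream to a reduction. The use of `cuda::stream_ref` as the env value is an assumption for illustration; the blog only states that streams and memory resources can be passed via env, so consult the CUB 13.1 documentation for the precise types and spelling.

```cpp
// Hedged sketch: passing a CUDA stream through the env argument.
// The exact env construction is assumed here, not confirmed by the source.
#include <cub/device/device_reduce.cuh>
#include <cuda/stream_ref>

void sum_on_stream(const float* d_in, float* d_out, int num_items,
                   cudaStream_t stream)
{
    // Assumed: the stream is wrapped in cuda::stream_ref and forwarded
    // as the trailing env argument, enqueueing the reduction on `stream`.
    cub::DeviceReduce::Sum(d_in, d_out, num_items,
                           cuda::stream_ref{stream});
}
```

The same slot is where a custom memory resource would go, so temporary allocations can be served from a pool you control rather than the default allocator.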