cuda.compute: Parallel Computing Primitives#
The cuda.compute library provides composable primitives for building custom
parallel algorithms on the GPU—without writing CUDA kernels directly.
Algorithms#
Algorithms are the core of cuda.compute. They operate on arrays or
iterators and can be composed to build specialized
GPU operations—reductions, scans, sorts, transforms, and more.
Typical usage of an algorithm looks like this:
cuda.compute.reduce_into(
d_in=..., # input array or iterator
d_out=..., # output array or iterator
op=..., # binary operator (built-in or user-defined)
num_items=..., # number of input elements
h_init=..., # initial value for the reduction
)
API conventions#
Naming — The d_ prefix denotes device memory (e.g., CuPy arrays, PyTorch tensors); h_ denotes host memory (NumPy arrays). Some scalar values must be passed as host arrays.
Output semantics — Algorithms write results into a user-provided array or iterator rather than returning them. This keeps memory ownership explicit and lifetimes under your control.
Operators — Many algorithms accept an op parameter. This can be a built-in OpKind value or a user-defined function. When possible, prefer built-in operators (e.g., OpKind.PLUS) over the equivalent user-defined operation (e.g., lambda a, b: a + b) for better performance; see the sketch after this list.
Iterators — Inputs and outputs can be iterators instead of arrays, enabling lazy evaluation and operation fusion.
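For instance, the same sum reduction can be written with either kind of operator. This is only a sketch: d_in, d_out, num_items, and h_init are assumed to be set up as in the full example below.
# Preferred: the built-in operator avoids compiling a user-defined function.
cuda.compute.reduce_into(d_in, d_out, OpKind.PLUS, num_items, h_init)
# Equivalent, but typically slower to build: a user-defined binary function.
cuda.compute.reduce_into(d_in, d_out, lambda a, b: a + b, num_items, h_init)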
Full Example#
The following example uses reduce_into
to compute the sum of a sequence of integers:
"""
Sum all values in an array using reduction with PLUS operation.
"""
import cupy as cp
import numpy as np
import cuda.compute
from cuda.compute import OpKind
# Prepare the input and output arrays.
dtype = np.int32
h_init = np.array([0], dtype=dtype)
d_input = cp.array([1, 2, 3, 4, 5], dtype=dtype)
d_output = cp.empty(1, dtype=dtype)
# Perform the reduction.
cuda.compute.reduce_into(d_input, d_output, OpKind.PLUS, len(d_input), h_init)
# Verify the result.
expected_output = 15
assert (d_output == expected_output).all()
result = d_output[0]
print(f"Sum reduction result: {result}")
Controlling temporary memory#
Many algorithms allocate temporary device memory for intermediate results. For finer
control over allocation—or to reuse buffers across calls—use the object-based API.
For example, make_reduce_into
returns a reusable reduction object that lets you manage memory explicitly.
# create a reducer object:
reducer = cuda.compute.make_reduce_into(d_in, d_out, op, h_init)
# get the temporary storage size by passing None as the first argument:
temp_storage_bytes = reducer(None, d_in, d_out, num_items, h_init)
# allocate the temporary storage as any array-like object
# (e.g., CuPy array, Torch tensor):
temp_storage = cp.empty(temp_storage_bytes, dtype=np.uint8)
# perform the reduction, passing the temporary storage as the first argument:
reducer(temp_storage, d_in, d_out, num_items, h_init)
User-Defined Operations#
A powerful feature is the ability to use algorithms with user-defined operations.
For example, to compute the sum of only the even values in a sequence,
we can use reduce_into with a custom binary operation:
"""
Sum only even values in an array using reduction with custom operation.
"""
import cupy as cp
import numpy as np
import cuda.compute
# Prepare the input and output arrays.
dtype = np.int32
h_init = np.array([0], dtype=dtype)
d_input = cp.array([1, 2, 3, 4, 5], dtype=dtype)
d_output = cp.empty(1, dtype=dtype)
# Define the binary operation for the reduction.
def add_op(a, b):
return (a if a % 2 == 0 else 0) + (b if b % 2 == 0 else 0)
# Perform the reduction.
cuda.compute.reduce_into(d_input, d_output, add_op, len(d_input), h_init)
# Verify the result.
expected_output = 6
assert (d_output == expected_output).all()
result = d_output[0]
print(f"Custom sum reduction result: {result}")
Features and Restrictions#
User-defined operations are compiled into device code using Numba CUDA, so they inherit many of the same features and restrictions as Numba CUDA functions:
Python features and atomic operations supported by Numba CUDA are also supported within user-defined operations.
Nested functions must be decorated with @numba.cuda.jit.
Variables captured in closures or globals follow Numba CUDA semantics: scalars and host arrays are captured by value (as constants), while device arrays are captured by reference.
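For example, a user-defined operator that calls a helper function and captures a scalar from the enclosing scope might look like the sketch below (clamp and clamp_add are illustrative names, not part of the library; the captured scalar threshold is baked in as a constant):
import numba.cuda
# Scalar captured from the enclosing scope: treated as a compile-time constant.
threshold = 100
# Nested helper functions must themselves be Numba CUDA functions.
@numba.cuda.jit
def clamp(x):
    return x if x < threshold else threshold
def clamp_add(a, b):
    return clamp(a) + clamp(b)
# clamp_add can then be passed as the op argument to an algorithm such as reduce_into.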
Iterators#
Iterators represent sequences whose elements are computed on the fly. They can be used in place of arrays in most algorithms, enabling lazy evaluation, operation fusion, and custom data access patterns.
A CountingIterator, for example,
represents an integer sequence starting from a given value:
it = CountingIterator(np.int32(1)) # represents [1, 2, 3, 4, ...]
To compute the sum of the first 100 integers, we can pass a
CountingIterator directly to
reduce_into. No memory is allocated
to store the input sequence—the values are generated as needed.
"""
Example showing how to use counting_iterator.
"""
import functools
import cupy as cp
import numpy as np
import cuda.compute
from cuda.compute import (
CountingIterator,
OpKind,
)
# Define the problem parameters.
first_item = 1
num_items = 100
# Create the counting iterator.
first_it = CountingIterator(np.int32(first_item))
# Prepare the initial value for the reduction.
h_init = np.array([0], dtype=np.int32)
# Prepare the output array.
d_output = cp.empty(1, dtype=np.int32)
# Perform the reduction.
cuda.compute.reduce_into(first_it, d_output, OpKind.PLUS, num_items, h_init)
# Verify the result.
expected_output = functools.reduce(
lambda a, b: a + b, range(first_item, first_item + num_items)
)
assert (d_output == expected_output).all()
print(f"Counting iterator result: {d_output[0]} (expected: {expected_output})")
Iterators can also be used to fuse operations. In the example below, a
TransformIterator lazily applies
the square operation to each element of the input sequence. The resulting iterator
is then passed to reduce_into to compute
the sum of squares.
Because the square is evaluated on demand during the reduction, there is no need to create or store an intermediate array of squared values. The transform and the reduction are fused into a single pass over the data.
"""
Using ``reduce_into`` with a ``TransformIterator`` to compute the
sum of squares of a sequence of numbers.
"""
import cupy as cp
import numpy as np
from cuda.compute import (
OpKind,
TransformIterator,
reduce_into,
)
# Prepare the input and output arrays.
d_input = cp.arange(10, dtype=np.int32)
d_output = cp.empty(1, dtype=np.int32)
h_init = np.array([0], dtype=np.int32) # Initial value for the reduction
# Create a TransformIterator to (lazily) apply the square
it_input = TransformIterator(d_input, lambda a: a**2)
# Use `reduce_into` to compute the sum of the squares of the input.
reduce_into(it_input, d_output, OpKind.PLUS, len(d_input), h_init)
# Verify the result.
expected_output = cp.sum(d_input**2).get()
assert d_output[0] == expected_output
print(f"Transform iterator result: {d_output[0]} (expected: {expected_output})")
Some iterators can also be used as the output of an algorithm. In the example below,
a TransformOutputIterator
applies the square-root operation to the result of a reduction before writing
it into the underlying array.
"""
TransformOutputIterator example demonstrating reduction with transform output iterator.
"""
import cupy as cp
import numpy as np
import cuda.compute
from cuda.compute import (
OpKind,
TransformOutputIterator,
)
# Create input and output arrays
d_input = cp.array([1, 2, 3, 4, 5.0], dtype=np.float32)
d_output = cp.empty(shape=1, dtype=np.float32)
# Define the transform operation to be applied
# to the result of the sum reduction.
def sqrt(x: np.float32) -> np.float32:
return x**0.5
# Create transform output iterator
d_out_it = TransformOutputIterator(d_output, sqrt)
# Apply a sum reduction into the transform output iterator
cuda.compute.reduce_into(
d_input,
d_out_it,
OpKind.PLUS,
len(d_input),
np.asarray([0], dtype=np.float32),
)
assert cp.allclose(d_output, cp.sqrt(cp.sum(d_input)), atol=1e-6)
As another example, ZipIterator combines multiple
arrays or iterators into a single logical sequence. In the example below, we combine
a counting iterator and an array, creating an iterator that yields (index, value)
pairs. This combined iterator is then used as the input to
reduce_into to compute the index of
the maximum value in the array.
"""
Example showing how to use zip_iterator with counting iterator to
find the index with maximum value in an array.
"""
import cupy as cp
import numpy as np
import cuda.compute
from cuda.compute import (
CountingIterator,
ZipIterator,
)
def max_by_value(p1, p2):
"""Reduction operation that returns the pair with the larger value."""
return p1 if p1[1] > p2[1] else p2
# Create the counting iterator.
counting_it = CountingIterator(np.int32(0))
# Prepare the input array.
arr = cp.asarray([0, 1, 2, 4, 7, 3, 5, 6], dtype=np.int32)
# Create the zip iterator.
zip_it = ZipIterator(counting_it, arr)
num_items = 8
# Note: initial value passed as a numpy struct
dtype = np.dtype([("index", np.int32), ("value", np.int32)], align=True)
h_init = np.asarray([(-1, -1)], dtype=dtype)
d_output = cp.empty(1, dtype=dtype)
# Perform the reduction.
cuda.compute.reduce_into(zip_it, d_output, max_by_value, num_items, h_init)
result = d_output.get()[0]
expected_index = 4
expected_value = 7
assert result["index"] == expected_index
assert result["value"] == expected_value
print(
f"Zip iterator with counting result: index={result['index']} "
f"(expected: {expected_index}), value={result['value']} (expected: {expected_value})"
)
These examples illustrate a few of the patterns enabled by iterators. See the API reference for the full set of available iterators.
Struct Types#
The gpu_struct decorator defines
GPU-compatible struct types. These are useful when you have data laid out
as an “array of structures”, similar to NumPy structured arrays.
"""
Finding the maximum green value in a sequence of pixels using `reduce_into`
with a custom data type.
"""
import cupy as cp
import numpy as np
import cuda.compute
from cuda.compute import gpu_struct
# Define a custom data type to store the pixel values.
@gpu_struct
class Pixel:
r: np.int32
g: np.int32
b: np.int32
# Define a reduction operation that returns the pixel with the maximum green value.
def max_g_value(x, y):
return x if x.g > y.g else y
# Prepare the input and output arrays.
d_rgb = cp.random.randint(0, 256, (10, 3), dtype=np.int32).view(Pixel.dtype)
d_out = cp.empty(1, Pixel.dtype)
# Prepare the initial value for the reduction.
h_init = Pixel(0, 0, 0)
# Perform the reduction.
cuda.compute.reduce_into(d_rgb, d_out, max_g_value, d_rgb.size, h_init)
# Calculate the expected result.
h_rgb = d_rgb.get()
expected = h_rgb[h_rgb.view("int32")[:, 1].argmax()]
# Verify the result.
assert expected["g"] == d_out.get()["g"]
result = d_out.get()
print(f"Pixel reduction result: {result}")
Array of Structures vs Structure of Arrays#
When working with structured data, there are two common memory layouts:
Array of Structures (AoS) — each element is a complete struct, stored contiguously. For example, an array of Point structs where each point's x and y are adjacent in memory.
Structure of Arrays (SoA) — each field is stored in its own array. For example, separate x_coords and y_coords arrays.
cuda.compute supports both layouts:
gpu_struct — defines a true AoS type with named fields
ZipIterator — combines separate arrays into tuples on the fly, letting you work with SoA data as if it were AoS
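A minimal sketch of the two layouts (Point, d_x, and d_y are illustrative names):
import cupy as cp
import numpy as np
from cuda.compute import ZipIterator, gpu_struct
# AoS: one array of structs; each point's x and y are adjacent in memory.
@gpu_struct
class Point:
    x: np.float32
    y: np.float32
d_points = cp.empty(100, dtype=Point.dtype)
# SoA: one array per field; ZipIterator presents them as (x, y) pairs on the fly.
d_x = cp.empty(100, dtype=np.float32)
d_y = cp.empty(100, dtype=np.float32)
d_points_it = ZipIterator(d_x, d_y)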
Caching#
Algorithms in cuda.compute are compiled to GPU code at runtime. To avoid
recompiling on every call, build results are cached in memory. When you invoke
an algorithm with the same configuration—same dtypes, iterator kinds, operator,
and compute capability—the cached build is reused.
What determines the cache key#
Each algorithm computes a cache key from:
Array dtypes — the data types of input and output arrays
Iterator kinds — for iterator inputs/outputs, a descriptor of the iterator type
Operator identity — for user-defined functions, the function’s bytecode, constants, and closure contents (see below)
Compute capability — the GPU architecture of the current device
Algorithm-specific parameters — such as initial value dtype or determinism mode
Note that array contents or pointers are not part of the cache key—only the array’s dtype. This means you can reuse a cached algorithm across different arrays of the same type.
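For example, the two calls below share a single cached build (a sketch; d_out and h_init are an int32 output array and initial value, as in the earlier examples):
d_a = cp.arange(1_000, dtype=np.int32)
d_b = cp.arange(2_000, dtype=np.int32)
# The first call compiles the reduction for int32 inputs with OpKind.PLUS and caches it.
cuda.compute.reduce_into(d_a, d_out, OpKind.PLUS, len(d_a), h_init)
# The second call reuses the cached build: same dtypes, operator, and device.
cuda.compute.reduce_into(d_b, d_out, OpKind.PLUS, len(d_b), h_init)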
How user-defined functions are cached#
User-defined operators and predicates are hashed based on their bytecode, constants, and closure contents. Two functions with identical bytecode and closures produce the same cache key, even if defined at different source locations.
Closure contents are recursively hashed:
Scalars and host arrays — hashed by value
Device arrays — hashed by pointer, shape, and dtype (not contents)
Nested functions — hashed by their own bytecode and closures
Because device arrays captured in closures are hashed by pointer, changing the array’s contents does not invalidate the cache—only reassigning the variable to a different array does.
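As a sketch (d_scale is an illustrative device array captured in the operator's closure; the other arguments are assumed to be set up as in the earlier examples):
d_scale = cp.array([2], dtype=np.int32)
def scaled_add(a, b):
    # d_scale is a device array, so it is captured by reference.
    return a * d_scale[0] + b * d_scale[0]
# The first call compiles and caches the reduction.
cuda.compute.reduce_into(d_in, d_out, scaled_add, num_items, h_init)
d_scale[0] = 3                           # in-place update: the cached build is still reused
d_scale = cp.array([3], dtype=np.int32)  # rebinding the name: new pointer, new cache key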
Memory considerations#
The cache persists for the lifetime of the process and grows with the number of unique algorithm configurations. In long-running applications or exploratory notebooks, this can accumulate significant memory.
To clear all caches and free memory:
import cuda.compute
cuda.compute.clear_all_caches()
This forces recompilation on the next algorithm invocation—useful for benchmarking compilation time or reclaiming memory.
Examples#
For complete runnable examples and additional usage patterns, see the examples directory.