cuda.compute: Parallel Computing Primitives#
The cuda.compute library provides composable primitives for building custom
parallel algorithms on the GPU—without writing CUDA kernels directly.
Algorithms#
Algorithms are the core of cuda.compute. They operate on arrays or
iterators and can be composed to build specialized
GPU operations—reductions, scans, sorts, transforms, and more.
Typical usage of an algorithm looks like this:
cuda.compute.reduce_into(
d_in=..., # input array or iterator
d_out=..., # output array or iterator
op=..., # binary operator (built-in or user-defined)
num_items=..., # number of input elements
h_init=..., # initial value for the reduction
)
API conventions#
Naming — The d_ prefix denotes device memory (e.g., CuPy arrays, PyTorch tensors); h_ denotes host memory (NumPy arrays). Some scalar values must be passed as host arrays.
Output semantics — Algorithms write results into a user-provided array or iterator rather than returning them. This keeps memory ownership explicit and lifetimes under your control.
Operators — Many algorithms accept an op parameter. This can be a built-in OpKind value or a user-defined function. When possible, prefer built-in operators (e.g., OpKind.PLUS) over the equivalent user-defined operation (e.g., lambda a, b: a + b) for better performance; see the sketch after this list.
Iterators — Inputs and outputs can be iterators instead of arrays, enabling lazy evaluation and operation fusion.
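For instance, the same sum reduction can be written with either kind of operator. This is only a sketch: d_in, d_out, num_items, and h_init are assumed to be set up as in the full example below.
# Preferred: the built-in operator avoids compiling a user-defined function.
cuda.compute.reduce_into(d_in, d_out, OpKind.PLUS, num_items, h_init)
# Equivalent, but typically slower to build: a user-defined binary function.
cuda.compute.reduce_into(d_in, d_out, lambda a, b: a + b, num_items, h_init)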
Full Example#
The following example uses reduce_into
to compute the sum of a sequence of integers:
"""
Sum all values in an array using reduction with PLUS operation.
"""
import cupy as cp
import numpy as np
import cuda.compute
from cuda.compute import OpKind
# Prepare the input and output arrays.
dtype = np.int32
h_init = np.array([0], dtype=dtype)
d_input = cp.array([1, 2, 3, 4, 5], dtype=dtype)
d_output = cp.empty(1, dtype=dtype)
# Perform the reduction.
cuda.compute.reduce_into(d_input, d_output, OpKind.PLUS, len(d_input), h_init)
# Verify the result.
expected_output = 15
assert (d_output == expected_output).all()
result = d_output[0]
print(f"Sum reduction result: {result}")
Controlling temporary memory#
Many algorithms allocate temporary device memory for intermediate results. For finer
control over allocation—or to reuse buffers across calls—use the object-based API.
For example, make_reduce_into
returns a reusable reduction object that lets you manage memory explicitly.
# create a reducer object:
reducer = cuda.compute.make_reduce_into(d_in, d_out, op, h_init)
# get the temporary storage size by passing None as the first argument:
temp_storage_bytes = reducer(None, d_in, d_out, num_items, h_init)
# allocate the temporary storage as any array-like object
# (e.g., CuPy array, Torch tensor):
temp_storage = cp.empty(temp_storage_bytes, dtype=np.uint8)
# perform the reduction, passing the temporary storage as the first argument:
reducer(temp_storage, d_in, d_out, num_items, h_init)
User-Defined Operations#
A powerful feature is the ability to use algorithms with user-defined operations.
For example, to compute the sum of only the even values in a sequence,
we can use reduce_into with a custom binary operation:
"""
Sum only even values in an array using reduction with custom operation.
"""
import cupy as cp
import numpy as np
import cuda.compute
# Prepare the input and output arrays.
dtype = np.int32
h_init = np.array([0], dtype=dtype)
d_input = cp.array([1, 2, 3, 4, 5], dtype=dtype)
d_output = cp.empty(1, dtype=dtype)
# Define the binary operation for the reduction.
def add_op(a, b):
return (a if a % 2 == 0 else 0) + (b if b % 2 == 0 else 0)
# Perform the reduction.
cuda.compute.reduce_into(d_input, d_output, add_op, len(d_input), h_init)
# Verify the result.
expected_output = 6
assert (d_output == expected_output).all()
result = d_output[0]
print(f"Custom sum reduction result: {result}")
Features and Restrictions#
User-defined operations are compiled into device code using Numba CUDA, so they inherit many of the same features and restrictions as Numba CUDA functions:
Python features and atomic operations supported by Numba CUDA are also supported within user-defined operations.
Nested functions must be decorated with @numba.cuda.jit.
Variables captured in closures or globals follow Numba CUDA semantics: scalars and host arrays are captured by value (as constants), while device arrays are captured by reference.
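For example, a user-defined operator that calls a helper function and captures a scalar from the enclosing scope might look like the sketch below (clamp and clamp_add are illustrative names, not part of the library; the captured scalar threshold is baked in as a constant):
import numba.cuda
# Scalar captured from the enclosing scope: treated as a compile-time constant.
threshold = 100
# Nested helper functions must themselves be Numba CUDA functions.
@numba.cuda.jit
def clamp(x):
    return x if x < threshold else threshold
def clamp_add(a, b):
    return clamp(a) + clamp(b)
# clamp_add can then be passed as the op argument to an algorithm such as reduce_into.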
Iterators#
Iterators represent sequences whose elements are computed on the fly. They can be used in place of arrays in most algorithms, enabling lazy evaluation, operation fusion, and custom data access patterns.
A CountingIterator, for example,
represents an integer sequence starting from a given value:
it = CountingIterator(np.int32(1)) # represents [1, 2, 3, 4, ...]
To compute the sum of the first 100 integers, we can pass a
CountingIterator directly to
reduce_into. No memory is allocated
to store the input sequence—the values are generated as needed.
"""
Example showing how to use counting_iterator.
"""
import functools
import cupy as cp
import numpy as np
import cuda.compute
from cuda.compute import (
CountingIterator,
OpKind,
)
# Define the problem parameters.
first_item = 1
num_items = 100
# Create the counting iterator.
first_it = CountingIterator(np.int32(first_item))
# Prepare the initial value for the reduction.
h_init = np.array([0], dtype=np.int32)
# Prepare the output array.
d_output = cp.empty(1, dtype=np.int32)
# Perform the reduction.
cuda.compute.reduce_into(first_it, d_output, OpKind.PLUS, num_items, h_init)
# Verify the result.
expected_output = functools.reduce(
lambda a, b: a + b, range(first_item, first_item + num_items)
)
assert (d_output == expected_output).all()
print(f"Counting iterator result: {d_output[0]} (expected: {expected_output})")
Iterators can also be used to fuse operations. In the example below, a
TransformIterator lazily applies
the square operation to each element of the input sequence. The resulting iterator
is then passed to reduce_into to compute
the sum of squares.
Because the square is evaluated on demand during the reduction, there is no need to create or store an intermediate array of squared values. The transform and the reduction are fused into a single pass over the data.
"""
Using ``reduce_into`` with a ``TransformIterator`` to compute the
sum of squares of a sequence of numbers.
"""
import cupy as cp
import numpy as np
from cuda.compute import (
OpKind,
TransformIterator,
reduce_into,
)
# Prepare the input and output arrays.
d_input = cp.arange(10, dtype=np.int32)
d_output = cp.empty(1, dtype=np.int32)
h_init = np.array([0], dtype=np.int32) # Initial value for the reduction
# Create a TransformIterator to (lazily) apply the square
it_input = TransformIterator(d_input, lambda a: a**2)
# Use `reduce_into` to compute the sum of the squares of the input.
reduce_into(it_input, d_output, OpKind.PLUS, len(d_input), h_init)
# Verify the result.
expected_output = cp.sum(d_input**2).get()
assert d_output[0] == expected_output
print(f"Transform iterator result: {d_output[0]} (expected: {expected_output})")
Some iterators can also be used as the output of an algorithm. In the example below,
a TransformOutputIterator
applies the square-root operation to the result of a reduction before writing
it into the underlying array.
"""
TransformOutputIterator example demonstrating reduction with transform output iterator.
"""
import cupy as cp
import numpy as np
import cuda.compute
from cuda.compute import (
OpKind,
TransformOutputIterator,
)
# Create input and output arrays
d_input = cp.array([1, 2, 3, 4, 5.0], dtype=np.float32)
d_output = cp.empty(shape=1, dtype=np.float32)
# Define the transform operation to be applied
# to the result of the sum reduction.
def sqrt(x: np.float32) -> np.float32:
return x**0.5
# Create transform output iterator
d_out_it = TransformOutputIterator(d_output, sqrt)
# Apply a sum reduction into the transform output iterator
cuda.compute.reduce_into(
d_input,
d_out_it,
OpKind.PLUS,
len(d_input),
np.asarray([0], dtype=np.float32),
)
assert cp.allclose(d_output, cp.sqrt(cp.sum(d_input)), atol=1e-6)
As another example, ZipIterator combines multiple
arrays or iterators into a single logical sequence. In the example below, we combine
a counting iterator and an array, creating an iterator that yields (index, value)
pairs. This combined iterator is then used as the input to
reduce_into to compute the index of
the maximum value in the array.
"""
Example showing how to use zip_iterator with counting iterator to
find the index with maximum value in an array.
"""
import cupy as cp
import numpy as np
import cuda.compute
from cuda.compute import (
CountingIterator,
ZipIterator,
)
def max_by_value(p1, p2):
"""Reduction operation that returns the pair with the larger value."""
return p1 if p1[1] > p2[1] else p2
# Create the counting iterator.
counting_it = CountingIterator(np.int32(0))
# Prepare the input array.
arr = cp.asarray([0, 1, 2, 4, 7, 3, 5, 6], dtype=np.int32)
# Create the zip iterator.
zip_it = ZipIterator(counting_it, arr)
num_items = 8
# Note: initial value passed as a numpy struct
dtype = np.dtype([("index", np.int32), ("value", np.int32)], align=True)
h_init = np.asarray([(-1, -1)], dtype=dtype)
d_output = cp.empty(1, dtype=dtype)
# Perform the reduction.
cuda.compute.reduce_into(zip_it, d_output, max_by_value, num_items, h_init)
result = d_output.get()[0]
expected_index = 4
expected_value = 7
assert result["index"] == expected_index
assert result["value"] == expected_value
print(
f"Zip iterator with counting result: index={result['index']} "
f"(expected: {expected_index}), value={result['value']} (expected: {expected_value})"
)
These examples illustrate a few of the patterns enabled by iterators. See the API reference for the full set of available iterators.
Struct Types#
The gpu_struct decorator defines
GPU-compatible struct types. These are useful when you have data laid out
as an “array of structures”, similar to NumPy structured arrays.
"""
Finding the maximum green value in a sequence of pixels using `reduce_into`
with a custom data type.
"""
import cupy as cp
import numpy as np
import cuda.compute
from cuda.compute import gpu_struct
# Define a custom data type to store the pixel values.
@gpu_struct
class Pixel:
r: np.int32
g: np.int32
b: np.int32
# Define a reduction operation that returns the pixel with the maximum green value.
def max_g_value(x, y):
return x if x.g > y.g else y
# Prepare the input and output arrays.
d_rgb = cp.random.randint(0, 256, (10, 3), dtype=np.int32).view(Pixel.dtype)
d_out = cp.empty(1, Pixel.dtype)
# Prepare the initial value for the reduction.
h_init = Pixel(0, 0, 0)
# Perform the reduction.
cuda.compute.reduce_into(d_rgb, d_out, max_g_value, d_rgb.size, h_init)
# Calculate the expected result.
h_rgb = d_rgb.get()
expected = h_rgb[h_rgb.view("int32")[:, 1].argmax()]
# Verify the result.
assert expected["g"] == d_out.get()["g"]
result = d_out.get()
print(f"Pixel reduction result: {result}")
Array of Structures vs Structure of Arrays#
When working with structured data, there are two common memory layouts:
Array of Structures (AoS) — each element is a complete struct, stored contiguously. For example, an array of Point structs where each point's x and y are adjacent in memory.
Structure of Arrays (SoA) — each field is stored in its own array. For example, separate x_coords and y_coords arrays.
cuda.compute supports both layouts:
gpu_struct — defines a true AoS type with named fields
ZipIterator — combines separate arrays into tuples on the fly, letting you work with SoA data as if it were AoS
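A minimal sketch of the two layouts (Point, d_x, and d_y are illustrative names):
import cupy as cp
import numpy as np
from cuda.compute import ZipIterator, gpu_struct
# AoS: one array of structs; each point's x and y are adjacent in memory.
@gpu_struct
class Point:
    x: np.float32
    y: np.float32
d_points = cp.empty(100, dtype=Point.dtype)
# SoA: one array per field; ZipIterator presents them as (x, y) pairs on the fly.
d_x = cp.empty(100, dtype=np.float32)
d_y = cp.empty(100, dtype=np.float32)
d_points_it = ZipIterator(d_x, d_y)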
Caching#
Algorithms in cuda.compute are compiled to GPU code at runtime. To avoid
recompiling on every call, build results are cached in memory. When you invoke
an algorithm with the same configuration—same dtypes, iterator kinds, operator,
and compute capability—the cached build is reused.
What determines the cache key#
Each algorithm computes a cache key from:
Array dtypes — the data types of input and output arrays
Iterator kinds — for iterator inputs/outputs, a descriptor of the iterator type
Operator identity — for user-defined functions, the function’s bytecode, constants, and closure contents (see below)
Compute capability — the GPU architecture of the current device
Algorithm-specific parameters — such as initial value dtype or determinism mode
Note that array contents or pointers are not part of the cache key—only the array’s dtype. This means you can reuse a cached algorithm across different arrays of the same type.
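For example, the two calls below share a single cached build (a sketch; d_out and h_init are an int32 output array and initial value, as in the earlier examples):
d_a = cp.arange(1_000, dtype=np.int32)
d_b = cp.arange(2_000, dtype=np.int32)
# The first call compiles the reduction for int32 inputs with OpKind.PLUS and caches it.
cuda.compute.reduce_into(d_a, d_out, OpKind.PLUS, len(d_a), h_init)
# The second call reuses the cached build: same dtypes, operator, and device.
cuda.compute.reduce_into(d_b, d_out, OpKind.PLUS, len(d_b), h_init)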
How user-defined functions are cached#
User-defined operators and predicates are hashed based on their bytecode, constants, and closure contents. Two functions with identical bytecode and closures produce the same cache key, even if defined at different source locations.
Closure contents are recursively hashed:
Scalars and host arrays — hashed by value
Device arrays — hashed by pointer, shape, and dtype (not contents)
Nested functions — hashed by their own bytecode and closures
Because device arrays captured in closures are hashed by pointer, changing the array’s contents does not invalidate the cache—only reassigning the variable to a different array does.
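As a sketch (d_scale is an illustrative device array captured in the operator's closure; the other arguments are assumed to be set up as in the earlier examples):
d_scale = cp.array([2], dtype=np.int32)
def scaled_add(a, b):
    # d_scale is a device array, so it is captured by reference.
    return a * d_scale[0] + b * d_scale[0]
# The first call compiles and caches the reduction.
cuda.compute.reduce_into(d_in, d_out, scaled_add, num_items, h_init)
d_scale[0] = 3                           # in-place update: the cached build is still reused
d_scale = cp.array([3], dtype=np.int32)  # rebinding the name: new pointer, new cache key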
Memory considerations#
The cache persists for the lifetime of the process and grows with the number of unique algorithm configurations. In long-running applications or exploratory notebooks, this can accumulate significant memory.
To clear all caches and free memory:
import cuda.compute
cuda.compute.clear_all_caches()
This forces recompilation on the next algorithm invocation—useful for benchmarking compilation time or reclaiming memory.
Examples#
For complete runnable examples and additional usage patterns, see the examples directory.