-
Notifications
You must be signed in to change notification settings - Fork 86
Description
Problem
xarray-spatial can't rasterize vector data. If you have polygons in a GeoDataFrame or GeoParquet file and want to run slope, aspect, or zonal stats on them, you have to rasterize outside the library first.
rasterio would be the obvious choice, but it depends on GDAL, which is a heavy native dependency. This issue proposes a standalone rasterize function using scanline fill, with numpy/numba and cupy backends matching the rest of the library.
Approach: scanline fill
Scanline fill converts polygons to rasters one row at a time: compute x-intersections with polygon edges, sort them, fill between pairs. Same algorithm GDAL uses internally.
The pipeline:
- Parse GeoDataFrame into flat vertex/edge arrays (numpy, on host)
- Build a sorted edge table (numpy)
- For GPU: transfer the edge table to device (small relative to the output)
- Scanline fill:
@ngjitrow loop on CPU, or a CUDA kernel with one thread per row on GPU - Output an
xr.DataArray-- numpy on CPU, cupy on GPU (stays on device for downstream ops)
GPU suitability
Without GDAL doing the work in C, the CPU path is numba -- so the bar for GPU to beat is lower than usual:
- Each raster row is independent: one thread per row, no synchronization
- Control flow is uniform (walk edges, compute intersections, fill). Only divergence is how many edges hit a given row
- Adjacent threads write adjacent memory (coalesced)
- The edge table is read-only and fits in L2 cache
- Output stays on device, so downstream xarray-spatial ops skip the host round-trip
preview() gets 100-240x GPU speedup on block reductions for 20K+ rasters on an A6000. Scanline fill has a similar work profile, so large output grids should see comparable gains.
Why not point-in-polygon?
Testing every pixel against every polygon is O(pixels * polygons) and branch-heavy. Scanline fill is O(rows * edges), and edges are typically orders of magnitude fewer than pixels. Control flow is also more uniform across GPU threads.
Function signature
def rasterize(
geometries, # GeoDataFrame or list of (geometry, value) pairs
width: int,
height: int,
bounds: tuple = None, # (xmin, ymin, xmax, ymax); inferred from geometries if omitted
fill: float = np.nan,
dtype: np.dtype = np.float64,
all_touched: bool = False,
name: str = 'rasterize',
) -> xr.DataArrayReturns a 2D xr.DataArray with y/x coordinates derived from bounds and resolution.
Scope
This issue covers polygon and MultiPolygon rasterization with numpy + cupy backends, burning a single attribute column.
Line/point support, dask (spatial partitioning per chunk), and multi-attribute 3D output can follow separately.
Dependencies
geopandas(geometry parsing)shapely(vertex extraction viaget_coordinates,get_rings)- No GDAL, no rasterio