Skip to content

Add polygon rasterization from GeoDataFrame (scanline fill, no GDAL) #989

@brendancol

Description

@brendancol

Problem

xarray-spatial can't rasterize vector data. If you have polygons in a GeoDataFrame or GeoParquet file and want to run slope, aspect, or zonal stats on them, you have to rasterize outside the library first.

rasterio would be the obvious choice, but it depends on GDAL, which is a heavy native dependency. This issue proposes a standalone rasterize function using scanline fill, with numpy/numba and cupy backends matching the rest of the library.

Approach: scanline fill

Scanline fill converts polygons to rasters one row at a time: compute x-intersections with polygon edges, sort them, fill between pairs. Same algorithm GDAL uses internally.

The pipeline:

  1. Parse GeoDataFrame into flat vertex/edge arrays (numpy, on host)
  2. Build a sorted edge table (numpy)
  3. For GPU: transfer the edge table to device (small relative to the output)
  4. Scanline fill: @ngjit row loop on CPU, or a CUDA kernel with one thread per row on GPU
  5. Output an xr.DataArray -- numpy on CPU, cupy on GPU (stays on device for downstream ops)

GPU suitability

Without GDAL doing the work in C, the CPU path is numba -- so the bar for GPU to beat is lower than usual:

  • Each raster row is independent: one thread per row, no synchronization
  • Control flow is uniform (walk edges, compute intersections, fill). Only divergence is how many edges hit a given row
  • Adjacent threads write adjacent memory (coalesced)
  • The edge table is read-only and fits in L2 cache
  • Output stays on device, so downstream xarray-spatial ops skip the host round-trip

preview() gets 100-240x GPU speedup on block reductions for 20K+ rasters on an A6000. Scanline fill has a similar work profile, so large output grids should see comparable gains.

Why not point-in-polygon?

Testing every pixel against every polygon is O(pixels * polygons) and branch-heavy. Scanline fill is O(rows * edges), and edges are typically orders of magnitude fewer than pixels. Control flow is also more uniform across GPU threads.

Function signature

def rasterize(
    geometries,          # GeoDataFrame or list of (geometry, value) pairs
    width: int,
    height: int,
    bounds: tuple = None,        # (xmin, ymin, xmax, ymax); inferred from geometries if omitted
    fill: float = np.nan,
    dtype: np.dtype = np.float64,
    all_touched: bool = False,
    name: str = 'rasterize',
) -> xr.DataArray

Returns a 2D xr.DataArray with y/x coordinates derived from bounds and resolution.

Scope

This issue covers polygon and MultiPolygon rasterization with numpy + cupy backends, burning a single attribute column.

Line/point support, dask (spatial partitioning per chunk), and multi-attribute 3D output can follow separately.

Dependencies

  • geopandas (geometry parsing)
  • shapely (vertex extraction via get_coordinates, get_rings)
  • No GDAL, no rasterio

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions