cuDF & dask-cuDF Implementer's Guide

Compatibility

Release tracked by this skill: 26.04.
Requires NVIDIA Volta or newer on CUDA 12, or Turing or newer on CUDA 13. Release 26.04 supports CUDA 12.2-12.9 with driver 535+ or CUDA 13.0-13.1 with driver 580+, and Python 3.11-3.14. cuDF sweet spot: >100K rows.

Naming

Use NVIDIA library-first wording in user-facing answers. Keep literal RAPIDS/rapidsai URLs, package names, and release metadata when citing sources.

Role

You are a cuDF expert helping an implementer work with GPU DataFrames. The user understands pandas and their data; your job is to get them to correct, fast GPU code with minimal friction. Choose the path from the user's intent: `cudf.pandas` for broad compatibility or minimal-change acceleration, explicit cuDF for named DataFrame migrations, hot ETL paths, and parity-sensitive work. Treat source schema, row counts, null placement, ordering, and numeric tolerances as user-visible behavior.

Critical Rules

Choose the right cuDF path. Use `cudf.pandas` for broad compatibility or minimal-change acceleration. Use explicit cuDF when the user asks to migrate DataFrame code, inspect parity, optimize a visible ETL hot path, or control unsupported operations.
Size gate: 100K rows minimum. Below that, GPU transfer overhead usually outweighs the speedup; use small data for correctness checks and benchmark larger working sets for performance.
Keep conversions at boundaries. Use `.to_pandas()`, `.values`, or `.to_numpy()` for display, plotting, CPU-only libraries, or final output boundaries. Keep intermediate ETL data on GPU.
Float32 is your friend. cuDF operations on float64 are slower; cast early when precision allows.
Validate semantics on representative slices. For null handling, joins, time series, reshape, or grouped logic, keep a small pandas reference path and compare shape, labels, null counts, ordering, and representative values before claiming parity.
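
For the float32 rule above, one way to judge "when precision allows" is to measure how much a representative sample changes when it is stored as float32. The sketch below is an illustration using only the standard library; the helper names (`float32_roundtrip`, `max_abs_error`, `float32_is_safe`) and the tolerance values are assumptions for this example, not part of cuDF.

```python
import struct

def float32_roundtrip(x: float) -> float:
    """Return x after storing it as an IEEE 754 float32 and reading it back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def max_abs_error(values) -> float:
    """Largest absolute error introduced by casting each value to float32."""
    return max((abs(float32_roundtrip(v) - v) for v in values), default=0.0)

def float32_is_safe(values, abs_tol: float) -> bool:
    """True when every sampled value survives a float32 cast within abs_tol."""
    return max_abs_error(values) <= abs_tol

small = [0.5, 250.0, 1024.0]   # exactly representable in float32
big_ids = [16_777_217.0]       # 2**24 + 1 is not representable in float32

print(float32_is_safe(small, abs_tol=0.0))    # True
print(float32_is_safe(big_ids, abs_tol=0.5))  # False
```

Large identifiers, epoch-style timestamps, and monetary totals are typical columns where a check like this argues for staying in float64 or an integer dtype.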

For data > GPU memory, move to dask-cuDF with `enable_cudf_spill=True`. See `references/dask-cudf-patterns.md`.

Three Paths to GPU DataFrames

Path 1: cudf.pandas Accelerator (Compatibility / Minimal Change)

Use when the user needs a small code change, third-party pandas compatibility, or one code path that can keep running while unsupported operations fall back.

Jupyter/IPython:

```python
%load_ext cudf.pandas
import pandas as pd   # now GPU-backed; falls back silently for unsupported ops
```

Script:

```bash
python -m cudf.pandas my_script.py
```

With multiprocessing:

```python
import cudf.pandas
cudf.pandas.install()   # must come BEFORE pandas import, before Pool creation
from multiprocessing import Pool
```

Confirm acceleration with the cudf.pandas profiler before claiming speedup. For notebook, CLI, and stats examples, read `references/cudf-pandas-accelerator.md`. If the profile shows the hot path running on CPU, use Path 2 for explicit cuDF control.

Path 2: Explicit cuDF API

For full control, hot-path optimization, named DataFrame migrations, and parity-sensitive operations:

```python
import cudf

# Read data directly to GPU
df = cudf.read_parquet("data.parquet")
lookup = cudf.read_parquet("lookup.parquet")  # hypothetical lookup table keyed by "id"

# Operations mirror pandas
result = df.groupby("key")["value"].sum()
merged = df.merge(lookup, on="id", how="left")
filtered = df[df["amount"] > 1000]

# String operations
df["clean"] = df["name"].str.strip().str.lower()

# To check API coverage before committing to migration:
# See references/api-patterns.md for known gaps and workarounds
```

Keep data on GPU end-to-end. Only call `.to_pandas()` at the very end, for display or handoff to CPU-only code.

Prefer explicit cuDF for tasks involving `read_csv`, `read_parquet`, joins, groupby, reshape, nullable types, `fillna`, `where`, time buckets, rolling windows, or CPU/GPU parity checks. Add a small CPU/GPU validation path when semantics matter instead of relying on successful execution alone.

For pandas code with null handling, reshape, or time-series behavior, read `references/api-patterns.md` for the relevant semantic checklist before rewriting. A `cudf.pandas` bootstrap is enough for a minimal-change request; an implementation request should make the hot path explicit and observable.

For reshape-heavy pandas code (`pivot_table`, `melt`, `stack`, `unstack`, `crosstab`), keep the source schema as part of the contract: index labels, column labels or levels, `fill_value`, `aggfunc`, margins, and normalization. Use explicit cuDF where the equivalent is supported; use `cudf.pandas` or a narrow compatibility boundary when exact pandas reshape semantics matter more than rewriting every operation. Add a small pandas-reference parity check for shape, labels, and representative values before finalizing. See `references/api-patterns.md`.

Path 3: dask-cuDF (Multi-GPU / Large Data)

Use when the dataset exceeds GPU memory. See `references/dask-cudf-patterns.md` for full patterns.

```python
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import dask_cudf

cluster = LocalCUDACluster(enable_cudf_spill=True)  # one worker per GPU
client = Client(cluster)

ddf = dask_cudf.read_parquet("s3://bucket/data/*.parquet")
result = ddf.groupby("key").agg({"value": "sum"}).compute()
```

Memory Management

Enable spill before OOM happens (not after):

```python
import cudf
cudf.set_option("spill", True)   # spill to host RAM when GPU is full
```

RMM pool allocator (reduces cudaMalloc overhead in pipelines with many allocations):

```python
import rmm
rmm.set_current_device_resource(rmm.mr.CudaAsyncMemoryResource())
# Must be called BEFORE any cuDF operations
```

| Free GPU memory vs. dataset size | Strategy |
| --- | --- |
| Free > 2× dataset | Single GPU cuDF |
| Free 1–2× dataset | cuDF + `cudf.set_option("spill", True)` |
| Dataset > GPU mem | dask-cuDF |
| Dataset > node mem | dask-cuDF + multi-node (see accelerated-computing-mpf) |
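
The table above can be encoded as a small decision helper. This is a minimal sketch, not an official API: the function name `choose_strategy`, its parameters, and the returned labels are hypothetical, the caller is assumed to supply the byte counts, and free GPU memory is used as a stand-in for "GPU mem" in the third row.

```python
def choose_strategy(dataset_bytes: int, free_gpu_bytes: int, node_mem_bytes: int) -> str:
    """Map dataset size and available memory to one of the strategies in the table."""
    if dataset_bytes > node_mem_bytes:
        return "dask-cuDF + multi-node"
    if dataset_bytes > free_gpu_bytes:
        # Assumption: free GPU memory approximates the table's "GPU mem" row.
        return "dask-cuDF"
    if free_gpu_bytes > 2 * dataset_bytes:
        return "single-GPU cuDF"
    # Free memory is between 1x and 2x the dataset size.
    return "cuDF + spill"

GiB = 1024 ** 3
print(choose_strategy(10 * GiB, 40 * GiB, 512 * GiB))    # single-GPU cuDF
print(choose_strategy(30 * GiB, 40 * GiB, 512 * GiB))    # cuDF + spill
print(choose_strategy(100 * GiB, 40 * GiB, 512 * GiB))   # dask-cuDF
print(choose_strategy(1000 * GiB, 40 * GiB, 512 * GiB))  # dask-cuDF + multi-node
```

Intermediate results from joins and groupbys can be larger than the input, so treat the dataset size as a lower bound when picking a row.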

Troubleshooting

No speedup vs pandas:

Data < 100K rows? GPU overhead dominates, so treat the run as correctness validation and measure speedup on a larger working set.
Run `%%cudf.pandas.profile`: a high CPU % means many fallbacks. Identify and fix those ops.
Check `references/api-patterns.md` for known gaps.
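
When measuring speedup on a larger working set, a small timing harness keeps the CPU and GPU numbers comparable. The sketch below is one possible approach using only the standard library; `best_of` is a hypothetical helper, and the warm-up run is an assumption based on the common practice of excluding one-time startup cost from the measurement.

```python
import time

def best_of(fn, repeats: int = 5, warmup: int = 1) -> float:
    """Return the fastest wall-clock time in seconds over `repeats` calls of fn."""
    for _ in range(warmup):
        fn()  # discard warm-up runs so one-time setup cost is not measured
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)

# Usage sketch (cpu_pipeline and gpu_pipeline are placeholders for your own code):
#   cpu_s = best_of(cpu_pipeline)
#   gpu_s = best_of(gpu_pipeline)
#   print(f"speedup: {cpu_s / gpu_s:.1f}x")
```

Time the same logical work on both sides, including any file reads or conversions that belong to the pipeline being compared.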

OOM (CUDA out of memory):

Enable spill: `cudf.set_option("spill", True)`
If allocator fragmentation or repeated allocation overhead is visible, use the `accelerated-computing-rmm` memory-resource setup guidance before GPU allocations.
Still failing: move to dask-cuDF.

AttributeError / NotImplementedError:

Check `references/api-patterns.md` for the specific operation.
Keep that one operation on CPU at a narrow boundary and continue the supported pipeline on GPU.
Use `.to_pandas()` only for the unsupported op, then `cudf.from_pandas()` to move the result back.
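
The narrow CPU boundary described above can be wrapped in one helper so the round trip is easy to spot in review. This is a sketch under stated assumptions: `run_on_cpu` and its `to_gpu` parameter are hypothetical names introduced here, and `cudf.from_pandas` is used as the default way back to the GPU.

```python
def run_on_cpu(gdf, cpu_func, to_gpu=None):
    """Apply cpu_func to a pandas copy of gdf, then move the result back to GPU."""
    if to_gpu is None:
        import cudf  # imported lazily so the helper can be defined without a GPU
        to_gpu = cudf.from_pandas
    pdf = gdf.to_pandas()          # GPU -> CPU, only for the unsupported op
    return to_gpu(cpu_func(pdf))   # CPU -> GPU, continue the pipeline on GPU

# Usage sketch (some_unsupported_op is a placeholder, not a real method):
#   gdf = run_on_cpu(gdf, lambda pdf: pdf.some_unsupported_op())
```

Each call copies the data across the bus twice, so keep the function passed in as small as the unsupported operation itself.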

Wrong results vs pandas:

Null/NaN handling differs: cuDF uses `<NA>` (nullable) by default, pandas uses `NaN`. See `references/api-patterns.md`.
Sort stability: cuDF sort is not guaranteed stable unless `stable=True` is passed.
If the difference comes from floating-point rounding, try casting to higher-precision floats (e.g. `float64` instead of `float32`). If the results still differ, stop. GPU and CPU algorithms can legitimately produce slightly different floating-point results, because floating-point arithmetic is non-associative and the two may combine values in a different order. That cannot be fixed.
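
The non-associativity mentioned above is easy to reproduce on the CPU alone, which also suggests how to compare results: with a tolerance rather than exact equality. The sketch below uses only the standard library; `values_match` and its default tolerances are illustrative assumptions, not values prescribed by cuDF.

```python
import math

# The same three numbers, summed in two different orders.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)  # False

def values_match(cpu_values, gpu_values, rel_tol=1e-9, abs_tol=0.0) -> bool:
    """Compare two equal-length sequences of floats within a tolerance."""
    cpu_values, gpu_values = list(cpu_values), list(gpu_values)
    if len(cpu_values) != len(gpu_values):
        return False
    # NaN never compares close to anything, so check null positions separately.
    return all(
        math.isclose(c, g, rel_tol=rel_tol, abs_tol=abs_tol)
        for c, g in zip(cpu_values, gpu_values)
    )

print(values_match([left], [right]))  # True
```

Agree on the tolerance with the user up front, since it is part of the numeric behavior they will observe.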

Nullable and Fill Semantics

When the user explicitly cares about pandas nullable dtypes, `fillna`, `where`, `mask`, or grouped null behavior, treat parity checks as part of the implementation. See `references/api-patterns.md` for nullable dtype examples.

Preserve nullable integer/string columns instead of filling them with sentinel values unless the source code already did that.
Keep `where`/`mask` semantics when they encode a condition. Use broad `fillna` only when the condition is exactly null-only.
Compare with `to_pandas(nullable=True)` when the pandas reference uses nullable extension dtypes.
Put the parity check in a reusable helper next to the GPU path, so future changes exercise the same nullable conversion and aggregation checks.
Validate row counts, null counts, mask truth tables, grouped aggregates, and representative dtypes before claiming semantic parity.
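
A reusable parity helper along the lines described above might look like the following. It is a minimal sketch: `parity_problems` is a hypothetical name, both arguments are assumed to be pandas DataFrames (the GPU result converted with `to_pandas(nullable=True)`), and it covers only shape, column labels, null counts, and dtypes. Mask truth tables and grouped aggregates would need their own checks.

```python
def parity_problems(cpu_df, gpu_df) -> list:
    """Return a list of human-readable differences between two pandas DataFrames."""
    problems = []
    if cpu_df.shape != gpu_df.shape:
        problems.append(f"shape: {cpu_df.shape} != {gpu_df.shape}")
    if list(cpu_df.columns) != list(gpu_df.columns):
        problems.append("column labels differ")
        return problems  # the per-column checks below need matching labels
    for col in cpu_df.columns:
        cpu_nulls = int(cpu_df[col].isna().sum())
        gpu_nulls = int(gpu_df[col].isna().sum())
        if cpu_nulls != gpu_nulls:
            problems.append(f"{col}: null count {cpu_nulls} != {gpu_nulls}")
        if str(cpu_df[col].dtype) != str(gpu_df[col].dtype):
            problems.append(f"{col}: dtype {cpu_df[col].dtype} != {gpu_df[col].dtype}")
    return problems

# Usage sketch (cpu_result is the pandas reference, gpu_result a cuDF DataFrame):
#   issues = parity_problems(cpu_result, gpu_result.to_pandas(nullable=True))
#   assert not issues, issues
```

Returning a list instead of raising on the first mismatch makes it easier to see every difference from a single run.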

Reference Files

`references/cudf-pandas-accelerator.md`: Profiling, fallback detection, cudf.pandas deep dive
`references/api-patterns.md`: Known API gaps, workarounds, semantic differences
`references/dask-cudf-patterns.md`: Multi-GPU patterns, best practices, partition tuning

External Documentation

Use WebFetch to retrieve detailed API signatures, parameter descriptions, and examples on demand.

cuDF Documentation: https://docs.rapids.ai/api/cudf/stable/
dask-cuDF API Reference: https://docs.rapids.ai/api/dask-cudf/stable/api/
GitHub: https://github.com/rapidsai/cudf
CHANGELOG: https://github.com/rapidsai/cudf/blob/main/CHANGELOG.md
