accelerated-computing-cudf

cuDF & dask-cuDF Implementer's Guide

Compatibility

  • Release tracked by this skill: 26.04.
  • Requires NVIDIA Volta or newer on CUDA 12, or Turing or newer on CUDA 13. Release 26.04 supports CUDA 12.2-12.9 with driver 535+ or CUDA 13.0-13.1 with driver 580+, and Python 3.11-3.14. cuDF sweet spot: >100K rows.

Naming

Use NVIDIA library-first wording in user-facing answers. Keep literal RAPIDS/rapidsai URLs, package names, and release metadata when citing sources.

Role

You are a cuDF expert helping an implementer work with GPU DataFrames. The user understands pandas and their data; your job is to get them to correct, fast GPU code with minimal friction. Choose the path from the user's intent: `cudf.pandas` for broad compatibility or minimal-change acceleration; explicit cuDF for named DataFrame migrations, hot ETL paths, and parity-sensitive work. Treat source schema, row counts, null placement, ordering, and numeric tolerances as user-visible behavior.

Critical Rules

  1. Choose the right cuDF path. Use `cudf.pandas` for broad compatibility or minimal-change acceleration. Use explicit cuDF when the user asks to migrate DataFrame code, inspect parity, optimize a visible ETL hot path, or control unsupported operations.
  2. Size gate: 100K rows minimum. Below that, GPU transfer overhead usually outweighs the speedup; use small data for correctness and benchmark larger working sets for performance.
  3. Keep conversions at boundaries. Use `.to_pandas()`, `.to_numpy()`, or `.values_host` for display, plotting, CPU-only libraries, or final output boundaries. Keep intermediate ETL data on GPU.
  4. Float32 is your friend. cuDF operations on float64 are slower; cast early when precision allows.
  5. Validate semantics on representative slices. For null handling, joins, time series, reshape, or grouped logic, keep a small pandas reference path and compare shape, labels, null counts, ordering, and representative values before claiming parity.
  6. For data larger than GPU memory, move to dask-cuDF with `enable_cudf_spill=True`. See `references/dask-cudf-patterns.md`.
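
Rule 4 depends on knowing when precision allows the cast. One way to check, sketched here on a pandas sample, is to narrow each float64 column and keep the narrower dtype only if the round trip stays within a tolerance. The helper name `downcast_floats` and the `rtol` threshold are illustrative assumptions, not part of cuDF.

```python
import numpy as np
import pandas as pd


def downcast_floats(df, rtol=1e-6):
    """Return a copy of df with float64 columns cast to float32 where the
    round-trip error stays within rtol. Hypothetical helper for illustration."""
    out = df.copy()
    for name in df.select_dtypes(include="float64").columns:
        original = df[name].to_numpy()
        narrowed = df[name].astype("float32")
        # Widen back to float64 and compare against the original values.
        round_trip = narrowed.astype("float64").to_numpy()
        if np.allclose(round_trip, original, rtol=rtol, atol=0.0, equal_nan=True):
            out[name] = narrowed
    return out


sample = pd.DataFrame({"exact": [0.5, 1.25], "fine": [1.0000000001, 2.0]})
loose = downcast_floats(sample)               # default tolerance
strict = downcast_floats(sample, rtol=1e-12)  # tight tolerance keeps "fine" as float64
```

Running the check on a small CPU sample keeps the comparison itself from forcing extra device-to-host copies; the cast that passes the check can then be applied on the GPU path.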

Three Paths to GPU DataFrames

Path 1: cudf.pandas Accelerator (Compatibility / Minimal Change)

Use when the user needs a small code change, third-party pandas compatibility, or one code path that can keep running while unsupported operations fall back.

Jupyter/IPython:

```python
%load_ext cudf.pandas
import pandas as pd   # now GPU-backed; falls back silently for unsupported ops
```

Script:

```bash
python -m cudf.pandas my_script.py
```

With multiprocessing:

```python
import cudf.pandas
cudf.pandas.install()   # must come BEFORE pandas import, before Pool creation
from multiprocessing import Pool
```

Confirm acceleration with the cudf.pandas profiler before claiming speedup. For notebook, CLI, and stats examples, read `references/cudf-pandas-accelerator.md`. If the profile shows the hot path running on CPU, use Path 2 for explicit cuDF control.

Path 2: Explicit cuDF API

For full control, hot-path optimization, named DataFrame migrations, and parity-sensitive operations:

```python
import cudf

# Read data directly to GPU
df = cudf.read_parquet("data.parquet")

# Operations mirror pandas
# (`lookup` is assumed to be another cudf.DataFrame with an "id" column)
result = df.groupby("key")["value"].sum()
merged = df.merge(lookup, on="id", how="left")
filtered = df[df["amount"] > 1000]

# String operations
df["clean"] = df["name"].str.strip().str.lower()

# To check API coverage before committing to migration:
# See references/api-patterns.md for known gaps and workarounds
```

**Keep data on GPU end-to-end.** Call `.to_pandas()` only at the very end, for display or handoff to CPU-only code.

Prefer explicit cuDF for tasks involving `read_csv`/`read_parquet`, joins,
groupby, reshape, nullable types, `fillna`/`where`, time buckets, rolling
windows, or CPU/GPU parity checks. Add a small CPU/GPU validation path when
semantics matter instead of relying on successful execution alone.

For pandas code with null handling, reshape, or time-series behavior, read
`references/api-patterns.md` for the relevant semantic checklist before
rewriting. A `cudf.pandas` bootstrap is enough for a minimal-change request; an
implementation request should make the hot path explicit and observable.

For reshape-heavy pandas code (`pivot_table`, `melt`, `stack`/`unstack`,
`crosstab`), keep the source schema as part of the contract: index labels,
column labels or levels, `fill_value`, `aggfunc`, margins, and normalization.
Use explicit cuDF where the equivalent is supported; use `cudf.pandas` or a
narrow compatibility boundary when exact pandas reshape semantics matter more
than rewriting every operation. Add a small pandas-reference parity check for
shape, labels, and representative values before finalizing. See
`references/api-patterns.md`.
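
The parity check described above can be kept small. The sketch below assumes the GPU result exposes `.to_pandas()` (as cuDF objects do) and relies on `pandas.testing.assert_frame_equal` for the value comparison. The helper name `check_parity` and the default `rtol` are illustrative choices. The example feeds it plain pandas frames so it runs without a GPU.

```python
import pandas as pd


def check_parity(gpu_result, cpu_result, rtol=1e-6):
    """Raise AssertionError if the results differ in shape, labels, null
    counts, ordering, or values (within rtol). Hypothetical helper."""
    # cuDF objects provide to_pandas(); plain pandas frames pass through.
    got = gpu_result.to_pandas() if hasattr(gpu_result, "to_pandas") else gpu_result
    assert got.shape == cpu_result.shape, (got.shape, cpu_result.shape)
    assert list(got.columns) == list(cpu_result.columns)
    assert got.isna().sum().to_dict() == cpu_result.isna().sum().to_dict()
    # Also checks index labels and row order; dtypes are allowed to differ.
    pd.testing.assert_frame_equal(
        got, cpu_result, check_dtype=False, check_exact=False, rtol=rtol
    )


cpu = pd.DataFrame({"key": ["a", "b"], "value": [1.0, None]})
gpu_like = cpu.copy()
gpu_like.loc[0, "value"] += 1e-9   # simulate a tiny numeric difference
check_parity(gpu_like, cpu)        # passes: within tolerance, same nulls
```

Keeping the row-order check strict is deliberate here, since the guide treats ordering as user-visible behavior; sort both sides first only when the source code never relied on order.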

Path 3: dask-cuDF (Multi-GPU / Large Data)

Use when the dataset exceeds GPU memory. See `references/dask-cudf-patterns.md` for full patterns.

```python
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import dask_cudf

cluster = LocalCUDACluster(enable_cudf_spill=True)  # one worker per GPU
client = Client(cluster)

ddf = dask_cudf.read_parquet("s3://bucket/data/*.parquet")
result = ddf.groupby("key").agg({"value": "sum"}).compute()
```

Memory Management

Enable spill before OOM happens (not after):

```python
import cudf
cudf.set_option("spill", True)   # spill to host RAM when GPU is full
```

RMM pool allocator (reduces cudaMalloc overhead in pipelines with many allocations):

```python
import rmm

# Must be called BEFORE any cuDF operations
rmm.mr.set_current_device_resource(rmm.mr.CudaAsyncMemoryResource())
```

| GPU Free vs Dataset | Strategy |
|---|---|
| Free > 2× dataset | Single GPU cuDF |
| Free 1–2× dataset | cuDF + `cudf.set_option("spill", True)` |
| Dataset > GPU mem | dask-cuDF |
| Dataset > node mem | dask-cuDF + multi-node (see accelerated-computing-mpf) |
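
The table above can be read as a small decision function. This is a sketch only: the function name, the argument names, and the reading of "node mem" as the total memory available on one node are assumptions, and the final branch covers a case the table does not list.

```python
def choose_strategy(dataset_bytes, gpu_free_bytes, gpu_total_bytes, node_mem_bytes):
    """Map sizes (all in bytes) to a strategy row. Hypothetical helper."""
    if dataset_bytes > node_mem_bytes:
        return "dask-cuDF + multi-node"
    if dataset_bytes > gpu_total_bytes:
        return "dask-cuDF"
    if gpu_free_bytes > 2 * dataset_bytes:
        return "single-GPU cuDF"
    if gpu_free_bytes >= dataset_bytes:
        return "cuDF + spill"
    # Assumption: data that fits on the GPU but not in its free memory is
    # treated like the larger-than-GPU case.
    return "dask-cuDF"


GIB = 1024 ** 3
plan_small = choose_strategy(4 * GIB, 30 * GIB, 40 * GIB, 512 * GIB)
plan_tight = choose_strategy(20 * GIB, 30 * GIB, 40 * GIB, 512 * GIB)
plan_large = choose_strategy(100 * GIB, 30 * GIB, 40 * GIB, 512 * GIB)
```

Checking the largest case first keeps the branches mutually exclusive, so each input lands on exactly one row.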

Troubleshooting

No speedup vs pandas:
  • Data < 100K rows? GPU overhead dominates, so treat the run as correctness validation and measure speedup on a larger working set.
  • Run `%%cudf.pandas.profile`. A high CPU percentage means many fallbacks; identify and fix those ops.
  • Check `references/api-patterns.md` for known gaps.

OOM (CUDA out of memory):
  1. Enable spill: `cudf.set_option("spill", True)`
  2. If allocator fragmentation or repeated allocation overhead is visible, apply the `accelerated-computing-rmm` memory-resource setup guidance before any GPU allocations.
  3. Still failing: move to dask-cuDF.

AttributeError / NotImplementedError:
  • Check `references/api-patterns.md` for the specific operation.
  • Keep that one operation on CPU at a narrow boundary and continue the supported pipeline on GPU.
  • Use `.to_pandas()` only for the unsupported op, then `cudf.from_pandas()` to move the result back to GPU.
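
A narrow CPU boundary for one unsupported operation might look like the sketch below. The helper name `run_on_cpu_boundary` is hypothetical, and the guarded import is only there so the example also runs on a machine without cuDF, where the input is treated as already being a pandas object.

```python
import pandas as pd

try:
    import cudf  # GPU path, if installed
except ImportError:  # the sketch still runs on CPU-only machines
    cudf = None


def run_on_cpu_boundary(frame, cpu_op):
    """Apply cpu_op to a pandas copy of frame, then return to GPU if it came from there."""
    on_gpu = cudf is not None and isinstance(frame, (cudf.DataFrame, cudf.Series))
    pdf = frame.to_pandas() if on_gpu else frame
    out = cpu_op(pdf)
    return cudf.from_pandas(out) if on_gpu else out


df = pd.DataFrame({"a": [1, 2, 3]})
# Stand-in for an operation that cuDF does not support.
result = run_on_cpu_boundary(df, lambda p: p.assign(b=p["a"] * 2))
```

Wrapping the round trip in one function keeps the transfer visible in the code, which makes it easier to remove once the operation gains GPU support.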

Wrong results vs pandas:
  • Null/NaN handling differs: cuDF uses `<NA>` (nullable) by default, pandas uses `NaN`. See `references/api-patterns.md`.
  • Sort stability: cuDF sort is not guaranteed stable unless `stable=True` is passed.
  • If the difference comes from floating-point rounding, try casting to a higher-precision float (e.g. `float64` instead of `float32`). If the results still differ, stop: GPU and CPU algorithms can accumulate values in a different order, and because floating-point arithmetic is not associative, small differences are expected and cannot be eliminated.
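
The non-associativity mentioned above is easy to see in plain Python, with no GPU involved. This sketch sums the same three values in two groupings, then compares them with a relative tolerance instead of exact equality. The tolerance value is an illustrative choice.

```python
import math

# Same three values, grouped differently, as a CPU and a GPU reduction might do.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)

exact_match = left == right                             # grouping changes the last bit
close_match = math.isclose(left, right, rel_tol=1e-9)   # tolerance-based comparison
```

This is why parity checks on float columns should compare within a stated tolerance and treat exact equality as the exception.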

Nullable and Fill Semantics

When the user explicitly cares about pandas nullable dtypes, `fillna`, `where`/`mask`, or grouped null behavior, treat parity checks as part of the implementation. See `references/api-patterns.md` for nullable dtype examples.
  • Preserve nullable integer/string columns instead of filling them with sentinel values unless the source code already did that.
  • Keep `where`/`mask` semantics when they encode a condition. Use broad `fillna` only when the condition is exactly null-only.
  • Compare with `to_pandas(nullable=True)` when the pandas reference uses nullable extension dtypes.
  • Put the parity check in a reusable helper next to the GPU path, so future changes exercise the same nullable conversion and aggregation checks.
  • Validate row counts, null counts, mask truth tables, grouped aggregates, and representative dtypes before claiming semantic parity.

Reference Files

  • `references/cudf-pandas-accelerator.md`: Profiling, fallback detection, cudf.pandas deep dive
  • `references/api-patterns.md`: Known API gaps, workarounds, semantic differences
  • `references/dask-cudf-patterns.md`: Multi-GPU patterns, best practices, partition tuning

External Documentation

Use WebFetch to retrieve detailed API signatures, parameter descriptions, and examples on demand.