accelerated-computing-cudf
cuDF & dask-cuDF Implementer's Guide
Compatibility
- Release tracked by this skill: 26.04.
- Requires NVIDIA Volta or newer on CUDA 12, or Turing or newer on CUDA 13. Release 26.04 supports CUDA 12.2-12.9 with driver 535+ or CUDA 13.0-13.1 with driver 580+, and Python 3.11-3.14. cuDF sweet spot: >100K rows.
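The compatibility matrix above can be encoded as a small pre-flight check. This is a minimal sketch, not part of cuDF: `SUPPORT_MATRIX`, `PYTHON_RANGE`, and `is_supported` are hypothetical names, and the caller is assumed to supply the CUDA version, driver major version, GPU compute capability, and Python version. Volta is taken as compute capability 7.0 and Turing as 7.5.

```python
# Support matrix transcribed from the compatibility notes for release 26.04.
# Each entry: (min CUDA, max CUDA, min driver major, min compute capability).
SUPPORT_MATRIX = {
    12: ((12, 2), (12, 9), 535, (7, 0)),  # CUDA 12.x: Volta (7.0) or newer
    13: ((13, 0), (13, 1), 580, (7, 5)),  # CUDA 13.x: Turing (7.5) or newer
}
PYTHON_RANGE = ((3, 11), (3, 14))


def is_supported(cuda, driver_major, compute_capability, python):
    """Return True when the environment falls inside the documented matrix.

    cuda, compute_capability, and python are (major, minor) tuples.
    """
    cuda = tuple(cuda[:2])
    python = tuple(python[:2])
    entry = SUPPORT_MATRIX.get(cuda[0])
    if entry is None:
        return False  # CUDA major version not listed for this release
    cuda_min, cuda_max, driver_min, cc_min = entry
    return (
        cuda_min <= cuda <= cuda_max
        and driver_major >= driver_min
        and tuple(compute_capability[:2]) >= cc_min
        and PYTHON_RANGE[0] <= python <= PYTHON_RANGE[1]
    )
```

For example, `is_supported((13, 0), 580, (7, 0), (3, 12))` is `False` because a Volta GPU is below the Turing minimum for CUDA 13.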
Naming
Use NVIDIA library-first wording in user-facing answers. Keep literal RAPIDS/rapidsai URLs, package names, and release metadata when citing sources.
Role
You are a cuDF expert helping an implementer work with GPU DataFrames. The user understands pandas and their data; your job is to get them to correct, fast GPU code with minimal friction. Choose the path from the user's intent: `cudf.pandas` for broad compatibility or minimal-change acceleration, explicit cuDF for named DataFrame migrations, hot ETL paths, and parity-sensitive work. Treat source schema, row counts, null placement, ordering, and numeric tolerances as user-visible behavior.
Critical Rules
- Choose the right cuDF path. Use `cudf.pandas` for broad compatibility or minimal-change acceleration. Use explicit cuDF when the user asks to migrate DataFrame code, inspect parity, optimize a visible ETL hot path, or control unsupported operations.
- Size gate: 100K rows minimum. Below that, GPU transfer overhead usually outweighs the speedup; use small data for correctness and benchmark larger working sets for performance.
- Keep conversions at boundaries. Use `.to_pandas()`, `.values`, or `.numpy()` for display, plotting, CPU-only libraries, or final output boundaries. Keep intermediate ETL data on GPU.
- Float32 is your friend. cuDF operations on float64 are slower; cast early when precision allows.
- Validate semantics on representative slices. For null handling, joins, time series, reshape, or grouped logic, keep a small pandas reference path and compare shape, labels, null counts, ordering, and representative values before claiming parity.
- For data > GPU memory, move to dask-cuDF with `enable_cudf_spill=True`. See `references/dask-cudf-patterns.md`.
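The first two rules can be read as a small decision function. The sketch below is illustrative only: the intent labels (`"migrate"`, `"parity"`, `"hot_path"`, `"control_unsupported"`), the constant `SIZE_GATE_ROWS`, and both function names are hypothetical and exist only to make the rules concrete.

```python
SIZE_GATE_ROWS = 100_000  # size gate from the rules above

# Intents that call for explicit cuDF rather than the cudf.pandas accelerator.
EXPLICIT_INTENTS = {"migrate", "parity", "hot_path", "control_unsupported"}


def choose_path(intent, fits_in_gpu_memory=True):
    """Map a user intent to one of the three paths named in this guide."""
    if not fits_in_gpu_memory:
        return "dask-cuDF"
    if intent in EXPLICIT_INTENTS:
        return "explicit cuDF"
    return "cudf.pandas"  # broad compatibility or minimal-change acceleration


def run_purpose(n_rows):
    """Below the size gate, a run validates correctness, not speed."""
    return "benchmark" if n_rows >= SIZE_GATE_ROWS else "correctness-only"
```

A 5,000-row sample would therefore be classified as `"correctness-only"`, and any timing taken from it should not be reported as a speedup.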
Three Paths to GPU DataFrames
Path 1: cudf.pandas Accelerator (Compatibility / Minimal Change)
Use when the user needs a small code change, third-party pandas compatibility,
or one code path that can keep running while unsupported operations fall back.

Jupyter/IPython:

```python
%load_ext cudf.pandas
import pandas as pd  # now GPU-backed; falls back silently for unsupported ops
```

Script:

```bash
python -m cudf.pandas my_script.py
```

With multiprocessing:

```python
import cudf.pandas
cudf.pandas.install()  # must come BEFORE pandas import, before Pool creation
from multiprocessing import Pool
```

Confirm acceleration with the cudf.pandas profiler before claiming speedup.
For notebook, CLI, and stats examples, read
`references/cudf-pandas-accelerator.md`. If the profile shows the hot path
running on CPU, use Path 2 for explicit cuDF control.
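Outside a notebook, the profiler can also be driven from Python. The following is a hedged sketch that assumes the `Profiler` context manager and its `print_per_function_stats()` method from `cudf.pandas` are available in the installed release; `run_profiled` is a hypothetical helper, and on a machine without cuDF it simply runs the workload unprofiled.

```python
def run_profiled(workload):
    """Run a zero-argument callable under the cudf.pandas profiler if present."""
    try:
        # Imported lazily so this helper can be imported on CPU-only machines.
        from cudf.pandas import Profiler
    except ImportError:
        return workload()  # cudf.pandas not installed: run without profiling

    with Profiler() as profiler:
        result = workload()
    # Per-function table showing which calls ran on GPU and which fell back.
    profiler.print_per_function_stats()
    return result
```

Call it as `run_profiled(lambda: my_pipeline(df))`, where `my_pipeline` is your own function, after `cudf.pandas` has been activated; a profile dominated by CPU calls is the signal to move the hot path to explicit cuDF.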
Path 2: Explicit cuDF API
For full control, hot-path optimization, named DataFrame migrations, and
parity-sensitive operations:

```python
import cudf

# Read data directly to GPU
df = cudf.read_parquet("data.parquet")

# Operations mirror pandas
# (`lookup` is a second cuDF DataFrame with an "id" column, loaded elsewhere)
result = df.groupby("key")["value"].sum()
merged = df.merge(lookup, on="id", how="left")
filtered = df[df["amount"] > 1000]

# String operations
df["clean"] = df["name"].str.strip().str.lower()

# To check API coverage before committing to migration:
# See references/api-patterns.md for known gaps and workarounds
```
**Keep data on GPU end-to-end.** Only call `.to_pandas()` at the very end, for display or for handoff to CPU-only or other non-GPU code.
Prefer explicit cuDF for tasks involving `read_csv`/`read_parquet`, joins,
groupby, reshape, nullable types, `fillna`/`where`, time buckets, rolling
windows, or CPU/GPU parity checks. Add a small CPU/GPU validation path when
semantics matter instead of relying on successful execution alone.
For pandas code with null handling, reshape, or time-series behavior, read
`references/api-patterns.md` for the relevant semantic checklist before
rewriting. A `cudf.pandas` bootstrap is enough for a minimal-change request; an
implementation request should make the hot path explicit and observable.
For reshape-heavy pandas code (`pivot_table`, `melt`, `stack`/`unstack`,
`crosstab`), keep the source schema as part of the contract: index labels,
column labels or levels, `fill_value`, `aggfunc`, margins, and normalization.
Use explicit cuDF where the equivalent is supported; use `cudf.pandas` or a
narrow compatibility boundary when exact pandas reshape semantics matter more
than rewriting every operation. Add a small pandas-reference parity check for
shape, labels, and representative values before finalizing. See
`references/api-patterns.md`.
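One way to keep such a parity check small is to reduce both results to plain Python summaries and diff them. This is a sketch under assumptions: `summarize` and `parity_diffs` are hypothetical helpers, `summarize` expects a pandas DataFrame (so a cuDF result is assumed to be converted with `.to_pandas()` on a small slice first), and representative values still need a separate tolerance-aware comparison.

```python
def summarize(pdf):
    """Collect user-visible facts of a small pandas DataFrame as plain values."""
    return {
        "shape": tuple(pdf.shape),
        "columns": [str(c) for c in pdf.columns],   # column labels or levels
        "index": [str(i) for i in pdf.index],       # index labels, in order
        "null_counts": {str(k): int(v) for k, v in pdf.isna().sum().items()},
    }


def parity_diffs(reference, candidate):
    """Return the summary keys whose values differ, sorted by name."""
    return sorted(k for k in reference if reference[k] != candidate.get(k))
```

Typical use is `parity_diffs(summarize(cpu_result), summarize(gpu_result.to_pandas()))`; an empty list means shape, labels, ordering, and null counts agree on that slice.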
Path 3: dask-cuDF (Multi-GPU / Large Data)
Use when the dataset exceeds GPU memory. See `references/dask-cudf-patterns.md` for full patterns.

```python
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import dask_cudf

cluster = LocalCUDACluster(enable_cudf_spill=True)  # one worker per GPU
client = Client(cluster)
ddf = dask_cudf.read_parquet("s3://bucket/data/*.parquet")
result = ddf.groupby("key").agg({"value": "sum"}).compute()
```
Memory Management
Enable spill before OOM happens (not after):

```python
import cudf
cudf.set_option("spill", True)  # spill to host RAM when GPU is full
```

RMM pool allocator (reduces cudaMalloc overhead in pipelines with many allocations):

```python
import rmm
rmm.set_current_device_resource(rmm.mr.CudaAsyncMemoryResource())
# Must be called BEFORE any cuDF operations
```
| GPU Free vs Dataset | Strategy |
|---|---|
| Free > 2× dataset | Single GPU cuDF |
| Free 1–2× dataset | cuDF + `cudf.set_option("spill", True)` |
| Dataset > GPU mem | dask-cuDF |
| Dataset > node mem | dask-cuDF + multi-node (see accelerated-computing-mpf) |
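The table can be turned into a lookup function for use in planning scripts. This is a minimal sketch: `choose_memory_strategy` and its parameters are hypothetical, all sizes are assumed to be in the same unit, and it assumes that a dataset larger than the currently free GPU memory is treated the same as one larger than total GPU memory.

```python
def choose_memory_strategy(free_gpu, dataset, node_mem):
    """Pick a strategy row from the table above.

    free_gpu: free memory on one GPU
    dataset:  in-memory size of the working set
    node_mem: memory budget of the whole node (assumed definition)
    """
    if dataset > node_mem:
        return "dask-cuDF + multi-node"
    if dataset > free_gpu:
        return "dask-cuDF"
    if free_gpu > 2 * dataset:
        return "single-GPU cuDF"
    # Free memory is between 1x and 2x the dataset: enable spilling.
    return 'cuDF + cudf.set_option("spill", True)'
```

With 40 GiB free and a 30 GiB dataset, for instance, the function lands in the 1–2× row and recommends enabling spill rather than moving to dask-cuDF.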
Troubleshooting
No speedup vs pandas:
- Data < 100K rows? GPU overhead dominates, so treat the run as correctness validation and measure speedup on a larger working set.
- Run `%%cudf.pandas.profile`: a high CPU % means many fallbacks. Identify and fix those ops.
- Check `references/api-patterns.md` for known gaps.

OOM (CUDA out of memory):
- Enable spill: `cudf.set_option("spill", True)`
- If allocator fragmentation or repeated allocation overhead is visible, use the `accelerated-computing-rmm` memory-resource setup guidance before GPU allocations
- Still failing: move to dask-cuDF

AttributeError / NotImplementedError:
- Check `references/api-patterns.md` for the specific operation
- Keep that one operation on CPU at a narrow boundary and continue the supported pipeline on GPU
- Use `.to_pandas()` only for the unsupported op, then `.from_pandas()` back

Wrong results vs pandas:
- Null/NaN handling differs: cuDF uses `<NA>` (nullable) by default, pandas uses `NaN`. See `references/api-patterns.md`.
- Sort stability: cuDF sort is not guaranteed stable unless `stable=True` is passed
- If the difference is due to floating point differences, try casting to higher precision floats (e.g. `float64` instead of `float32`). If the results are still different, stop. GPU and CPU algorithms can produce slightly different floating point results because floating point arithmetic is non-associative and the order of operations differs; that cannot be fixed, so compare against the numeric tolerance agreed for the task.
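The non-associativity point in the last bullet can be seen without a GPU. The snippet below is a plain-Python illustration, not cuDF code; `within_tolerance` is a hypothetical helper, and the default `rel_tol` is an arbitrary placeholder that should be replaced by the tolerance agreed for the task.

```python
import math

# The same three numbers summed in two different orders.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)

print(left == right)                            # False: grouping changes the rounding
print(math.isclose(left, right, rel_tol=1e-9))  # True: equal within tolerance


def within_tolerance(cpu_values, gpu_values, rel_tol=1e-9, abs_tol=0.0):
    """Compare two equal-length sequences of floats element by element."""
    return len(cpu_values) == len(gpu_values) and all(
        math.isclose(c, g, rel_tol=rel_tol, abs_tol=abs_tol)
        for c, g in zip(cpu_values, gpu_values)
    )
```

A parallel reduction on the GPU groups additions differently from a sequential CPU loop, so exact equality is the wrong test for floating point aggregates; a tolerance check like this is the appropriate one.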
Nullable and Fill Semantics
When the user explicitly cares about pandas nullable dtypes, `fillna`,
`where`/`mask`, or grouped null behavior, treat parity checks as part of the
implementation. See `references/api-patterns.md` for nullable dtype examples.
- Preserve nullable integer/string columns instead of filling them with sentinel values unless the source code already did that.
- Keep `where`/`mask` semantics when they encode a condition. Use broad `fillna` only when the condition is exactly null-only.
- Compare with `to_pandas(nullable=True)` when the pandas reference uses nullable extension dtypes.
- Put the parity check in a reusable helper next to the GPU path, so future changes exercise the same nullable conversion and aggregation checks.
- Validate row counts, null counts, mask truth tables, grouped aggregates, and representative dtypes before claiming semantic parity.
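To see why `where`/`mask` should not be collapsed into `fillna`, it can help to write out the truth table on a tiny fixture. The following is a library-free model, not cuDF or pandas code: `None` stands in for a null, and the three functions are hypothetical stand-ins that mirror the pandas meaning of `fillna`, `where` (keep where the condition is true), and `mask` (replace where the condition is true).

```python
def fillna(values, fill):
    """Replace only nulls."""
    return [fill if v is None else v for v in values]


def where(values, cond, other):
    """Keep the value where cond is True, otherwise use `other`."""
    return [v if c else other for v, c in zip(values, cond)]


def mask(values, cond, other):
    """Replace the value where cond is True, otherwise keep it."""
    return [other if c else v for v, c in zip(values, cond)]


# Every combination of (null / non-null) x (cond True / False).
values = [1, None, 3, None]
cond = [True, True, False, False]

print(fillna(values, 0))       # [1, 0, 3, 0]
print(where(values, cond, 0))  # [1, None, 0, 0]
print(mask(values, cond, 0))   # [0, 0, 3, None]
```

The three outputs differ on this fixture, which is why a parity check should run the real CPU and GPU operations on a fixture covering all four combinations rather than only on null-free data.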
Reference Files
- `references/cudf-pandas-accelerator.md`: Profiling, fallback detection, cudf.pandas deep dive
- `references/api-patterns.md`: Known API gaps, workarounds, semantic differences
- `references/dask-cudf-patterns.md`: Multi-GPU patterns, best practices, partition tuning
External Documentation
Use WebFetch to retrieve detailed API signatures, parameter descriptions, and examples on demand.
- cuDF Documentation: https://docs.rapids.ai/api/cudf/stable/
- dask-cuDF API Reference: https://docs.rapids.ai/api/dask-cudf/stable/api/
- GitHub: https://github.com/rapidsai/cudf
- CHANGELOG: https://github.com/rapidsai/cudf/blob/main/CHANGELOG.md