cpu-gpu-performance

CPU/GPU Performance Discipline

When To Use

  • At the beginning of every session (auto-load alongside token-conservation).
  • Whenever you plan to build, train, or test anything that could pin CPU cores or GPUs for more than a minute.
  • Before retrying a failing command that previously consumed significant resources.

When NOT To Use

  • Simple operations with no resource impact
  • Quick single-file operations

Required TodoWrite Items

  1. cpu-gpu-performance:baseline
  2. cpu-gpu-performance:scope
  3. cpu-gpu-performance:instrument
  4. cpu-gpu-performance:throttle
  5. cpu-gpu-performance:log

Step 1: Establish Current Baseline

  • Capture current utilization (a combined sketch follows this list):
    • uptime
    • ps -eo pcpu,cmd | head
    • nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv
    Note which hosts/GPUs are already busy.
  • Record any CI/cluster budgets (time quotas, GPU hours) before launching work.
  • Set a per-task CPU-minute / GPU-minute budget that respects those limits.
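
A minimal baseline-capture sketch in shell, assuming a Linux host with NVIDIA drivers; baseline.log is a hypothetical path for your session notes:

    # Snapshot load, top CPU consumers, and GPU state before launching work.
    {
      date
      uptime
      ps -eo pcpu,cmd --sort=-pcpu | head     # highest-CPU processes first
      nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv
    } >> baseline.log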

Step 2: Narrow the Scope

  • Avoid running "whole world" jobs after a small fix. Prefer diff-based or tag-based selective testing (a diff-based sketch follows this list):
    • pytest -k
    • Bazel target patterns
    • cargo test <module>
  • Batch low-level fixes so you can validate multiple changes with a single targeted command.
  • For GPU jobs, favor unit-scale smoke inputs or lower epoch counts before scheduling the full training/eval sweep.
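
A minimal diff-based selection sketch, assuming a git checkout and pytest; the origin/main base and tests/ layout are assumptions to adapt to your repo:

    # Run only test files touched since the base branch instead of the
    # whole suite; skip the run entirely if nothing relevant changed.
    changed=$(git diff --name-only origin/main... -- tests/ | grep '\.py$')
    if [ -n "$changed" ]; then
      pytest $changed
    else
      echo "no test files changed; skipping test run"
    fi

The same idea carries over to Bazel target patterns or cargo test <module>: let the diff choose the scope rather than defaulting to everything.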

Step 3: Instrument Before You Optimize

  • Pick the right profiler/monitor:
    • CPU work:
      • perf
      • Intel VTune
      • cargo flamegraph
      • language-specific profilers
    • GPU work:
      • nvidia-smi dmon
      • nsys
      • nvprof
      • DLProf
      • framework timeline tracers
  • Capture kernel/ops timelines, memory footprints, and data pipeline latency so you have evidence when throttling or parallelizing (a dmon sketch follows this list).
  • Record hot paths and I/O bottlenecks in notes so future reruns can jump straight to the culprit.
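
A minimal GPU-monitoring sketch built on nvidia-smi dmon; train_smoke.py and its --epochs flag are hypothetical placeholders for your smoke-scale job:

    # Sample GPU utilization (u) and memory (m) every 5 seconds with
    # timestamps while the smoke run executes, then stop the monitor.
    nvidia-smi dmon -s um -d 5 -o T > gpu_timeline.log &
    MON_PID=$!
    python train_smoke.py --epochs 1    # hypothetical smoke-scale run
    kill "$MON_PID"

Keeping gpu_timeline.log alongside your notes provides the evidence this step calls for when later deciding to throttle or parallelize.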

Step 4: Throttle and Sequence Work

  • Use nice, ionice, or Kubernetes/Slurm quotas to prevent starvation of shared nodes (a sketch follows this list).
  • Chain heavy tasks with guardrails:
    • Rerun only the failed test/module
    • Then (optionally) escalate to the next-wider shard
    • Reserve the full suite for the final gate
  • Stagger GPU kernels (smaller batch sizes or gradient accumulation) when memory pressure risks eviction; prefer checkpoint/restore over restarts.
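
A minimal throttling-and-staging sketch, assuming Linux nice/ionice and a pytest suite; the test paths are illustrative:

    # Lowest CPU priority plus idle-class I/O so the heavy build cannot
    # starve interactive work on a shared node.
    nice -n 19 ionice -c 3 make -j4

    # Staged escalation: failed test first, wider shard only on success,
    # full suite reserved for the final gate.
    pytest tests/test_orders.py -k test_refund && \
      pytest tests/test_orders.py && \
      pytest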

Step 5: Log Decisions and Next Steps

Conclude by documenting the commands that were run and their resource cost (duration, CPU%, GPU%), confirming whether they remained within the per-task budget. If a full suite or long training run was necessary, justify why selective or staged approaches were not feasible. Capture any follow-up tasks, such as adding a new test marker or profiling documentation, to simplify future sessions.
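
One way to capture that resource cost is GNU time; a sketch assuming /usr/bin/time is the GNU binary (Linux; BSD/macOS flags differ) and resource.log is a placeholder:

    # Append wall-clock, CPU%, and peak-memory figures for the exact
    # command that ran, so the log entry can cite real numbers.
    /usr/bin/time -v pytest tests/test_orders.py -k test_refund 2>> resource.log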

Output Expectations

  • Brief summary covering:
    • baseline metrics
    • scope chosen
    • instrumentation captured
    • throttling tactics
    • follow-up items
  • Concrete example(s) of what ran (e.g.):
    • "reran pytest tests/test_orders.py -k test_refund instead of pytest -m slow"
    • "profiled nvidia-smi dmon output to prove GPU idle time before scaling"