cpu-gpu-performance

CPU/GPU Performance Discipline

When To Use

  • At the beginning of every session (auto-load alongside token-conservation).
  • Whenever you plan to build, train, or test anything that could pin CPU cores or GPUs for more than a minute.
  • Before retrying a failing command that previously consumed significant resources.

When NOT To Use

  • Simple operations with no resource impact
  • Quick single-file operations

Required TodoWrite Items

  1. cpu-gpu-performance:baseline
  2. cpu-gpu-performance:scope
  3. cpu-gpu-performance:instrument
  4. cpu-gpu-performance:throttle
  5. cpu-gpu-performance:log

Step 1: Establish Current Baseline

  • Capture current utilization (a combined sketch follows this list):
    • uptime
    • ps -eo pcpu,cmd | head
    • nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv
    Note which hosts/GPUs are already busy.
  • Record any CI/cluster budgets (time quotas, GPU hours) before launching work.
  • Set a per-task CPU-minute / GPU-minute budget that respects those limits.
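
A minimal baseline-capture sketch in shell, assuming a Linux host with NVIDIA drivers; baseline.log is a hypothetical path for your session notes:

    # Snapshot load, top CPU consumers, and GPU state before launching work.
    {
      date
      uptime
      ps -eo pcpu,cmd --sort=-pcpu | head     # highest-CPU processes first
      nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv
    } >> baseline.log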

Step 2: Narrow the Scope

  • Avoid running "whole world" jobs after a small fix. Prefer diff-based or tag-based selective testing (a diff-based sketch follows this list):
    • pytest -k
    • Bazel target patterns
    • cargo test <module>
  • Batch low-level fixes so you can validate multiple changes with a single targeted command.
  • For GPU jobs, favor unit-scale smoke inputs or lower epoch counts before scheduling the full training/eval sweep.
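
A minimal diff-based selection sketch, assuming a git checkout and pytest; the origin/main base and tests/ layout are assumptions to adapt to your repo:

    # Run only test files touched since the base branch instead of the
    # whole suite; skip the run entirely if nothing relevant changed.
    changed=$(git diff --name-only origin/main... -- tests/ | grep '\.py$')
    if [ -n "$changed" ]; then
      pytest $changed
    else
      echo "no test files changed; skipping test run"
    fi

The same idea carries over to Bazel target patterns or cargo test <module>: let the diff choose the scope rather than defaulting to everything.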

Step 3: Instrument Before You Optimize

  • Pick the right profiler/monitor:
    • CPU work:
      • perf
      • Intel VTune
      • cargo flamegraph
      • language-specific profilers
    • GPU work:
      • nvidia-smi dmon
      • nsys
      • nvprof
      • DLProf
      • framework timeline tracers
  • Capture kernel/ops timelines, memory footprints, and data pipeline latency so you have evidence when throttling or parallelizing (a dmon sketch follows this list).
  • Record hot paths and I/O bottlenecks in notes so future reruns can jump straight to the culprit.
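
A minimal GPU-monitoring sketch built on nvidia-smi dmon; train_smoke.py and its --epochs flag are hypothetical placeholders for your smoke-scale job:

    # Sample GPU utilization (u) and memory (m) every 5 seconds with
    # timestamps while the smoke run executes, then stop the monitor.
    nvidia-smi dmon -s um -d 5 -o T > gpu_timeline.log &
    MON_PID=$!
    python train_smoke.py --epochs 1    # hypothetical smoke-scale run
    kill "$MON_PID"

Keeping gpu_timeline.log alongside your notes provides the evidence this step calls for when later deciding to throttle or parallelize.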

Step 4: Throttle and Sequence Work

  • Use nice, ionice, or Kubernetes/Slurm quotas to prevent starvation of shared nodes (a sketch follows this list).
  • Chain heavy tasks with guardrails:
    • Rerun only the failed test/module
    • Then (optionally) escalate to the next-wider shard
    • Reserve the full suite for the final gate
  • Stagger GPU kernels (smaller batch sizes or gradient accumulation) when memory pressure risks eviction; prefer checkpoint/restore over restarts.
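
A minimal throttling-and-staging sketch, assuming Linux nice/ionice and a pytest suite; the test paths are illustrative:

    # Lowest CPU priority plus idle-class I/O so the heavy build cannot
    # starve interactive work on a shared node.
    nice -n 19 ionice -c 3 make -j4

    # Staged escalation: failed test first, wider shard only on success,
    # full suite reserved for the final gate.
    pytest tests/test_orders.py -k test_refund && \
      pytest tests/test_orders.py && \
      pytest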

Step 5: Log Decisions and Next Steps

Conclude by documenting the commands that were run and their resource cost (duration, CPU%, GPU%), confirming whether they remained within the per-task budget. If a full suite or long training run was necessary, justify why selective or staged approaches were not feasible. Capture any follow-up tasks, such as adding a new test marker or profiling documentation, to simplify future sessions.
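
One way to capture that resource cost is GNU time; a sketch assuming /usr/bin/time is the GNU binary (Linux; BSD/macOS flags differ) and resource.log is a placeholder:

    # Append wall-clock, CPU%, and peak-memory figures for the exact
    # command that ran, so the log entry can cite real numbers.
    /usr/bin/time -v pytest tests/test_orders.py -k test_refund 2>> resource.log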

Output Expectations

  • Brief summary covering:
    • baseline metrics
    • scope chosen
    • instrumentation captured
    • throttling tactics
    • follow-up items
  • Concrete example(s) of what ran (e.g.):
    • "reran pytest tests/test_orders.py -k test_refund instead of pytest -m slow"
    • "profiled nvidia-smi dmon output to prove GPU idle time before scaling"