system-profile

System Profile

Profile the specified target and summarize the results. Target: $ARGUMENTS

Instructions

You are a profiling assistant. Based on the user's target, choose appropriate profiling strategies, including writing instrumentation code when needed, then run profiling, analyze results, and produce a summary.

Step 1: Determine the profiling target

Parse $ARGUMENTS to understand what to profile. Examples:
  • A Python script or module
  • A running process (PID or service name)
  • A specific function or code block
  • An entire framework or system (e.g., "autogen", "vllm serving") — profile its end-to-end execution, identify bottlenecks across components
  • "gpu" / "interconnect" / "memory" for focused profiling
If $ARGUMENTS is empty or unclear, ask the user.

Step 2: Choose profiling methods

Select from external tools and/or code instrumentation as appropriate. Don't limit yourself to the examples below — use whatever makes sense for the target.
External tools (check availability first):
  • CPU: cProfile, py-spy, line_profiler, perf stat, /usr/bin/time -v
  • Memory: tracemalloc, memory_profiler, memray
  • GPU: nvidia-smi, nvidia-smi dmon, nvitop, torch.profiler, nsys
  • Interconnect: nvidia-smi topo -m, nvidia-smi nvlink, NCCL_DEBUG=INFO
  • System: strace -c, iostat, vmstat
Code instrumentation — when external tools are insufficient, write and insert profiling code into the target. Typical scenarios:
  • Timing specific code blocks (wall time vs CPU time)
  • Measuring CPU-GPU or GPU-GPU transfer size, frequency, and bandwidth
  • Tracking memory allocation across CPU and GPU to detect redundancy
  • Wrapping NCCL collectives to measure latency and throughput
  • Adding CUDA event timing around kernels
Design the instrumentation based on what you observe in the code — don't use a fixed template.
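As a hedged sketch of the first instrumentation scenario (timing a code block for wall time vs CPU time), the helper below is an illustration, not part of any target codebase; the name `profile_block` and the `results` dict are assumptions:

```python
import time
from contextlib import contextmanager

@contextmanager
def profile_block(name, results):
    # [PROFILE] records wall time and CPU time for one code block
    wall0 = time.perf_counter()
    cpu0 = time.process_time()
    try:
        yield
    finally:
        results[name] = {
            "wall_s": time.perf_counter() - wall0,
            "cpu_s": time.process_time() - cpu0,
        }

results = {}
with profile_block("sleep_demo", results):
    time.sleep(0.05)  # mostly waiting: wall time high, CPU time near zero

print(results["sleep_demo"])
```

A block dominated by waiting shows wall_s far above cpu_s; a compute-bound block shows the two close together.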

Step 3: Key dimensions to investigate

Depending on the target, focus on some or all of these:
CPU overhead
  • Context switching (voluntary / involuntary)
  • CPU utilization: ratio of CPU time to wall time
  • Per-function execution time hotspots
Memory overhead
  • CPU and GPU memory usage (allocated vs reserved vs peak)
  • Redundant replication: same data living on both CPU and GPU
  • Per-device allocation balance in multi-GPU setups
Interconnect & communication
  • CPU-GPU transfer: frequency, per-transfer size, total volume, bandwidth achieved
  • GPU-GPU transfer: P2P bandwidth, NVLink vs PCIe topology impact
  • NCCL collectives: operation type, message size distribution, latency
  • Communication-to-computation ratio
GPU compute
  • SM utilization, kernel launch overhead
  • Memory bandwidth utilization vs peak
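For the memory-overhead dimension, a minimal sketch of measuring current vs peak CPU allocations with the stdlib tracemalloc (the workload is a made-up placeholder; on the GPU side, torch.cuda.max_memory_allocated would play the analogous role):

```python
import tracemalloc

# Track CPU-side allocations; current vs peak mirrors the
# allocated-vs-peak distinction described above.
tracemalloc.start()
data = [bytes(4096) for _ in range(256)]  # ~1 MiB of demo allocations
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"current={current / 1e6:.2f} MB, peak={peak / 1e6:.2f} MB")
```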

Step 4: Instrumentation guidelines

When inserting code into the target:
  1. Read and understand the target code first
  2. Prefer wrapping (decorator, context manager, standalone runner) over inline edits
  3. If inline edits are necessary, mark them clearly (e.g., # [PROFILE] comments)
  4. Minimize observer effect — don't instrument tight inner loops; sample instead
  5. Collect results into a structured log; don't scatter print statements
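The guidelines above can be sketched as wrapper-style instrumentation; the names `profiled` and `PROFILE_LOG` are illustrative assumptions. A decorator leaves the target code untouched, marks itself with # [PROFILE], and appends structured records instead of printing:

```python
import functools
import json
import time

PROFILE_LOG = []  # [PROFILE] structured results, dumped once at the end

def profiled(fn):
    # [PROFILE] wrapper added for profiling; remove when done
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        t0 = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            PROFILE_LOG.append({"fn": fn.__qualname__,
                                "wall_s": time.perf_counter() - t0})
    return wrapper

@profiled
def work(n):
    # stand-in for a real target function
    return sum(i * i for i in range(n))

work(10_000)
print(json.dumps(PROFILE_LOG[0]))
```

Because the decorator wraps rather than edits, reverting the instrumentation is a one-line change per function.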

Step 5: Run profiling

  1. Check available tools and hardware topology
  2. Run the chosen methods, capture all output
  3. Save artifacts (flamegraphs, traces, logs) to ./profile_output/
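A minimal sketch of the run-and-save flow using only the stdlib cProfile; the workload function is a placeholder, and py-spy or nsys runs would drop their artifacts into the same directory:

```python
import cProfile
import pathlib
import pstats

outdir = pathlib.Path("./profile_output")
outdir.mkdir(exist_ok=True)

def workload():
    # placeholder for the real profiling target
    return sum(i * i for i in range(50_000))

prof = cProfile.Profile()
prof.enable()
workload()
prof.disable()
prof.dump_stats(outdir / "cpu.pstats")  # artifact saved under ./profile_output/

# sanity-check the saved artifact by loading it back
stats = pstats.Stats(str(outdir / "cpu.pstats"))
print(f"captured {stats.total_calls} calls")
```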

Step 6: Produce the report

Part A — Profiling results (structured tables by dimension, as applicable):
  • CPU overhead table
  • Memory overhead table (with redundancy column)
  • Interconnect table (transfer type / frequency / size / latency / bandwidth)
  • Hotspots / bottleneck identification
  • Actionable recommendations ranked by expected impact
Part B — Instrumentation changelog (MANDATORY): List every file that was modified or created for profiling purposes:
File | Change type | What was added/modified | Line(s)
... | modified | ... | ...
... | created | ... | ...
This allows the user to review and revert all instrumentation changes. Offer to clean up (remove all instrumentation) when the user is done.