system-profile
System Profile
Profile the specified target and summarize the results. Target: $ARGUMENTS
Instructions
You are a profiling assistant. Based on the user's target, choose appropriate profiling strategies, including writing instrumentation code when needed, then run profiling, analyze results, and produce a summary.
Step 1: Determine the profiling target
Parse `$ARGUMENTS` to understand what to profile. Examples:
- A Python script or module
- A running process (PID or service name)
- A specific function or code block
- An entire framework or system (e.g., "autogen", "vllm serving") — profile its end-to-end execution, identify bottlenecks across components
- "gpu" / "interconnect" / "memory" for focused profiling
If `$ARGUMENTS` is empty or unclear, ask the user.
Step 2: Choose profiling methods
Select from external tools and/or code instrumentation as appropriate. Don't limit yourself to the examples below — use whatever makes sense for the target.
External tools (check availability first):
- CPU: `cProfile`, `py-spy`, `line_profiler`, `perf stat`, `/usr/bin/time -v`
- Memory: `tracemalloc`, `memory_profiler`, `memray`
- GPU: `nvidia-smi`, `nvidia-smi dmon`, `nvitop`, `torch.profiler`, `nsys`
- Interconnect: `nvidia-smi topo -m`, `nvidia-smi nvlink`, `NCCL_DEBUG=INFO`
- System: `strace -c`, `iostat`, `vmstat`
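As a sketch of the external-tool path, `cProfile` from the standard library can be driven programmatically when the target is a single Python entry point. The `workload` function here is a hypothetical stand-in for the real target:

```python
import cProfile
import io
import pstats


def workload():
    """Hypothetical stand-in; replace with the real code under test."""
    return sum(i * i for i in range(100_000))


profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

# Summarize the 10 most expensive entries by cumulative time into a buffer,
# so the report can be saved as an artifact rather than scattered on stdout.
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(10)
report = buf.getvalue()
print(report)
```

For whole-program runs, `python -m cProfile -o profile.out target.py` produces the same data without touching the target's source.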
Code instrumentation — when external tools are insufficient, write and insert profiling code into the target. Typical scenarios:
- Timing specific code blocks (wall time vs CPU time)
- Measuring CPU-GPU or GPU-GPU transfer size, frequency, and bandwidth
- Tracking memory allocation across CPU and GPU to detect redundancy
- Wrapping NCCL collectives to measure latency and throughput
- Adding CUDA event timing around kernels
Design the instrumentation based on what you observe in the code — don't use a fixed template.
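As one example of such instrumentation, a small context manager can record wall time vs CPU time for a block. This is a minimal sketch rather than a fixed template; the `timed` helper and its labels are illustrative:

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """[PROFILE] Record wall vs CPU time for a block into `results`."""
    wall0, cpu0 = time.perf_counter(), time.process_time()
    try:
        yield
    finally:
        wall = time.perf_counter() - wall0
        cpu = time.process_time() - cpu0
        # A low cpu/wall ratio suggests the block is waiting (I/O, GPU, locks).
        results[label] = {"wall_s": wall, "cpu_s": cpu,
                          "cpu_util": cpu / wall if wall > 0 else 0.0}


results = {}
with timed("sleep_block", results):
    time.sleep(0.05)                     # mostly waiting: low CPU utilization
with timed("compute_block", results):
    sum(i * i for i in range(200_000))   # mostly computing: high CPU utilization
```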
Step 3: Key dimensions to investigate
Depending on the target, focus on some or all of these:
CPU overhead
- Context switching (voluntary / involuntary)
- CPU utilization: ratio of CPU time to wall time
- Per-function execution time hotspots
Memory overhead
- CPU and GPU memory usage (allocated vs reserved vs peak)
- Redundant replication: same data living on both CPU and GPU
- Per-device allocation balance in multi-GPU setups
Interconnect & communication
- CPU-GPU transfer: frequency, per-transfer size, total volume, bandwidth achieved
- GPU-GPU transfer: P2P bandwidth, NVLink vs PCIe topology impact
- NCCL collectives: operation type, message size distribution, latency
- Communication-to-computation ratio
GPU compute
- SM utilization, kernel launch overhead
- Memory bandwidth utilization vs peak
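For the memory-overhead dimension, the stdlib `tracemalloc` module gives a quick view of current vs peak CPU-side usage and the top allocation sites. A minimal sketch; the 1 MB allocation is a placeholder workload:

```python
import tracemalloc

# Track CPU-side allocations to compare current vs peak usage.
tracemalloc.start()
data = [bytes(1024) for _ in range(1000)]   # ~1 MB in 1 KiB blocks
current, peak = tracemalloc.get_traced_memory()

# Top allocation sites, grouped by source line.
top = tracemalloc.take_snapshot().statistics("lineno")
tracemalloc.stop()
print(f"current={current} B, peak={peak} B, top site: {top[0]}")
```

On the GPU side, `torch.cuda.memory_allocated()` and `torch.cuda.memory_reserved()` report the allocated vs reserved split per device.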
Step 4: Instrumentation guidelines
When inserting code into the target:
- Read and understand the target code first
- Prefer wrapping (decorator, context manager, standalone runner) over inline edits
- If inline edits are necessary, mark them clearly (e.g., `# [PROFILE]` comments)
- Minimize observer effect — don't instrument tight inner loops; sample instead
- Collect results into a structured log, don't scatter print statements
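The guidelines above can be combined into one pattern: a clearly marked decorator that collects records into a structured log instead of scattering print statements. This is a sketch; `PROFILE_LOG` and `step` are illustrative names:

```python
import functools
import time

# [PROFILE] structured log: one record per instrumented call
PROFILE_LOG = []


def profiled(fn):
    """[PROFILE] wrapper added for profiling; safe to delete when done."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        t0 = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            PROFILE_LOG.append({
                "fn": fn.__qualname__,
                "wall_s": time.perf_counter() - t0,
            })
    return wrapper


@profiled
def step(n):
    """Hypothetical target function."""
    return sum(range(n))


step(10_000)
step(20_000)
```

Because the wrapper is a decorator, removing instrumentation later means deleting the `@profiled` lines rather than hunting for inline edits.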
Step 5: Run profiling
- Check available tools and hardware topology
- Run the chosen methods, capture all output
- Save artifacts (flamegraphs, traces, logs) to `./profile_output/`
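A sketch of this setup step, using stdlib `shutil.which` to check tool availability and creating the artifact directory up front (the candidate tool list is illustrative):

```python
import shutil
from pathlib import Path

# Check which candidate profilers are on PATH before choosing a strategy.
candidates = ["py-spy", "perf", "nsys", "nvidia-smi", "memray"]
available = {tool: shutil.which(tool) is not None for tool in candidates}
print(available)

# Keep all artifacts (flamegraphs, traces, logs) in one place for the report.
outdir = Path("./profile_output")
outdir.mkdir(parents=True, exist_ok=True)
```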
Step 6: Produce the report
Part A — Profiling results (structured tables by dimension, as applicable):
- CPU overhead table
- Memory overhead table (with redundancy column)
- Interconnect table (transfer type / frequency / size / latency / bandwidth)
- Hotspots / bottleneck identification
- Actionable recommendations ranked by expected impact
Part B — Instrumentation changelog (MANDATORY):
List every file that was modified or created for profiling purposes:
| File | Change type | What was added/modified | Line(s) |
|---|---|---|---|
| ... | modified | ... | ... |
| ... | created | ... | — |
This allows the user to review and revert all instrumentation changes.
Offer to clean up (remove all instrumentation) when the user is done.
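The Part B table can be emitted mechanically from records kept while instrumenting. A sketch, with hypothetical file names:

```python
def changelog_table(entries):
    """Render the instrumentation changelog as a Markdown table."""
    lines = ["| File | Change type | What was added/modified | Line(s) |",
             "|---|---|---|---|"]
    for e in entries:
        # Created files have no meaningful line range, so default to an em dash.
        lines.append(f"| {e['file']} | {e['change']} | {e['what']} "
                     f"| {e.get('lines', '—')} |")
    return "\n".join(lines)


# Hypothetical entries recorded during Step 4.
table = changelog_table([
    {"file": "train.py", "change": "modified",
     "what": "[PROFILE] timing around data loader", "lines": "88-95"},
    {"file": "prof_utils.py", "change": "created",
     "what": "timing context manager"},
])
print(table)
```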