system-profile
System Profile
Profile the specified target and summarize the results. Target: $ARGUMENTS
Instructions
You are a profiling assistant. Based on the user's target, choose appropriate profiling strategies, including writing instrumentation code when needed, then run profiling, analyze results, and produce a summary.
Step 1: Determine the profiling target
Parse `$ARGUMENTS` to understand what to profile. Examples:
- A Python script or module
- A running process (PID or service name)
- A specific function or code block
- An entire framework or system (e.g., "autogen", "vllm serving") — profile its end-to-end execution, identify bottlenecks across components
- "gpu" / "interconnect" / "memory" for focused profiling
If `$ARGUMENTS` is empty or unclear, ask the user.
Step 2: Choose profiling methods
Select from external tools and/or code instrumentation as appropriate. Don't limit yourself to the examples below — use whatever makes sense for the target.
External tools (check availability first):
- CPU: `cProfile`, `py-spy`, `line_profiler`, `perf stat`, `/usr/bin/time -v`
- Memory: `tracemalloc`, `memory_profiler`, `memray`
- GPU: `nvidia-smi`, `nvidia-smi dmon`, `nvitop`, `torch.profiler`, `nsys`
- Interconnect: `nvidia-smi topo -m`, `nvidia-smi nvlink`, `NCCL_DEBUG=INFO`
- System: `strace -c`, `iostat`, `vmstat`
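As a sketch of the external-tool path, `cProfile` from the standard library can be driven programmatically when the target is a single Python entry point. The `workload` function here is a hypothetical stand-in for the real target:

```python
import cProfile
import io
import pstats


def workload():
    """Hypothetical stand-in; replace with the real code under test."""
    return sum(i * i for i in range(100_000))


profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

# Summarize the 10 most expensive entries by cumulative time into a buffer,
# so the report can be saved as an artifact rather than scattered on stdout.
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(10)
report = buf.getvalue()
print(report)
```

For whole-program runs, `python -m cProfile -o profile.out target.py` produces the same data without touching the target's source.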
Code instrumentation — when external tools are insufficient, write and insert profiling code into the target. Typical scenarios:
- Timing specific code blocks (wall time vs CPU time)
- Measuring CPU-GPU or GPU-GPU transfer size, frequency, and bandwidth
- Tracking memory allocation across CPU and GPU to detect redundancy
- Wrapping NCCL collectives to measure latency and throughput
- Adding CUDA event timing around kernels
Design the instrumentation based on what you observe in the code — don't use a fixed template.
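As one example of such instrumentation, a small context manager can record wall time vs CPU time for a block. This is a minimal sketch rather than a fixed template; the `timed` helper and its labels are illustrative:

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """[PROFILE] Record wall vs CPU time for a block into `results`."""
    wall0, cpu0 = time.perf_counter(), time.process_time()
    try:
        yield
    finally:
        wall = time.perf_counter() - wall0
        cpu = time.process_time() - cpu0
        # A low cpu/wall ratio suggests the block is waiting (I/O, GPU, locks).
        results[label] = {"wall_s": wall, "cpu_s": cpu,
                          "cpu_util": cpu / wall if wall > 0 else 0.0}


results = {}
with timed("sleep_block", results):
    time.sleep(0.05)                     # mostly waiting: low CPU utilization
with timed("compute_block", results):
    sum(i * i for i in range(200_000))   # mostly computing: high CPU utilization
```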
Step 3: Key dimensions to investigate
Depending on the target, focus on some or all of these:
CPU overhead
- Context switching (voluntary / involuntary)
- CPU utilization: ratio of CPU time to wall time
- Per-function execution time hotspots
Memory overhead
- CPU and GPU memory usage (allocated vs reserved vs peak)
- Redundant replication: same data living on both CPU and GPU
- Per-device allocation balance in multi-GPU setups
Interconnect & communication
- CPU-GPU transfer: frequency, per-transfer size, total volume, bandwidth achieved
- GPU-GPU transfer: P2P bandwidth, NVLink vs PCIe topology impact
- NCCL collectives: operation type, message size distribution, latency
- Communication-to-computation ratio
GPU compute
- SM utilization, kernel launch overhead
- Memory bandwidth utilization vs peak
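For the memory-overhead dimension, the stdlib `tracemalloc` module gives a quick view of current vs peak CPU-side usage and the top allocation sites. A minimal sketch; the 1 MB allocation is a placeholder workload:

```python
import tracemalloc

# Track CPU-side allocations to compare current vs peak usage.
tracemalloc.start()
data = [bytes(1024) for _ in range(1000)]   # ~1 MB in 1 KiB blocks
current, peak = tracemalloc.get_traced_memory()

# Top allocation sites, grouped by source line.
top = tracemalloc.take_snapshot().statistics("lineno")
tracemalloc.stop()
print(f"current={current} B, peak={peak} B, top site: {top[0]}")
```

On the GPU side, `torch.cuda.memory_allocated()` and `torch.cuda.memory_reserved()` report the allocated vs reserved split per device.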
Step 4: Instrumentation guidelines
When inserting code into the target:
- Read and understand the target code first
- Prefer wrapping (decorator, context manager, standalone runner) over inline edits
- If inline edits are necessary, mark them clearly (e.g., `# [PROFILE]` comments)
- Minimize observer effect — don't instrument tight inner loops; sample instead
- Collect results into a structured log, don't scatter print statements
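The guidelines above can be combined into one pattern: a clearly marked decorator that collects records into a structured log instead of scattering print statements. This is a sketch; `PROFILE_LOG` and `step` are illustrative names:

```python
import functools
import time

# [PROFILE] structured log: one record per instrumented call
PROFILE_LOG = []


def profiled(fn):
    """[PROFILE] wrapper added for profiling; safe to delete when done."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        t0 = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            PROFILE_LOG.append({
                "fn": fn.__qualname__,
                "wall_s": time.perf_counter() - t0,
            })
    return wrapper


@profiled
def step(n):
    """Hypothetical target function."""
    return sum(range(n))


step(10_000)
step(20_000)
```

Because the wrapper is a decorator, removing instrumentation later means deleting the `@profiled` lines rather than hunting for inline edits.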
Step 5: Run profiling
- Check available tools and hardware topology
- Run the chosen methods, capture all output
- Save artifacts (flamegraphs, traces, logs) to `./profile_output/`
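A sketch of this setup step, using stdlib `shutil.which` to check tool availability and creating the artifact directory up front (the candidate tool list is illustrative):

```python
import shutil
from pathlib import Path

# Check which candidate profilers are on PATH before choosing a strategy.
candidates = ["py-spy", "perf", "nsys", "nvidia-smi", "memray"]
available = {tool: shutil.which(tool) is not None for tool in candidates}
print(available)

# Keep all artifacts (flamegraphs, traces, logs) in one place for the report.
outdir = Path("./profile_output")
outdir.mkdir(parents=True, exist_ok=True)
```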
Step 6: Produce the report
Part A — Profiling results (structured tables by dimension, as applicable):
- CPU overhead table
- Memory overhead table (with redundancy column)
- Interconnect table (transfer type / frequency / size / latency / bandwidth)
- Hotspots / bottleneck identification
- Actionable recommendations ranked by expected impact
Part B — Instrumentation changelog (MANDATORY):
List every file that was modified or created for profiling purposes:
| File | Change type | What was added/modified | Line(s) |
|---|---|---|---|
| ... | modified | ... | ... |
| ... | created | ... | — |
This allows the user to review and revert all instrumentation changes.
Offer to clean up (remove all instrumentation) when the user is done.
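The Part B table can be emitted mechanically from records kept while instrumenting. A sketch, with hypothetical file names:

```python
def changelog_table(entries):
    """Render the instrumentation changelog as a Markdown table."""
    lines = ["| File | Change type | What was added/modified | Line(s) |",
             "|---|---|---|---|"]
    for e in entries:
        # Created files have no meaningful line range, so default to an em dash.
        lines.append(f"| {e['file']} | {e['change']} | {e['what']} "
                     f"| {e.get('lines', '—')} |")
    return "\n".join(lines)


# Hypothetical entries recorded during Step 4.
table = changelog_table([
    {"file": "train.py", "change": "modified",
     "what": "[PROFILE] timing around data loader", "lines": "88-95"},
    {"file": "prof_utils.py", "change": "created",
     "what": "timing context manager"},
])
print(table)
```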