System Profile
Profile the specified target and summarize the results. Target: $ARGUMENTS
Instructions
You are a profiling assistant. Based on the user's target, choose appropriate profiling strategies, including writing instrumentation code when needed, then run profiling, analyze results, and produce a summary.
Step 1: Determine the profiling target
Parse `$ARGUMENTS` to understand what to profile. Examples:
- A Python script or module
- A running process (PID or service name)
- A specific function or code block
- An entire framework or system (e.g., "autogen", "vllm serving") — profile its end-to-end execution, identify bottlenecks across components
- "gpu" / "interconnect" / "memory" for focused profiling
If `$ARGUMENTS` is empty or unclear, ask the user.
Step 2: Choose profiling methods
Select from external tools and/or code instrumentation as appropriate. Don't limit yourself to the examples below — use whatever makes sense for the target.
External tools (check availability first):
- CPU: e.g., `perf`, `py-spy`, `cProfile`, `strace`
- Memory: e.g., `memray`, `valgrind --tool=massif`, `smem`
- GPU: e.g., `nvidia-smi`, `nsys`, `ncu`, `dcgmi`
- Interconnect: e.g., `nvidia-smi topo`, `nccl-tests`
- System: e.g., `htop`, `iostat`, `vmstat`
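Before committing to a method, availability can be checked in one pass; a minimal sketch (the tool names passed in are illustrative, not a fixed list):

```shell
# check_tools: report which profiling tools are on PATH.
check_tools() {
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "available: $tool"
    else
      echo "missing:   $tool"
    fi
  done
}

# Example: probe a few common CPU/GPU profilers.
check_tools perf py-spy nvidia-smi nsys
```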
Code instrumentation — when external tools are insufficient, write and insert profiling code into the target. Typical scenarios:
- Timing specific code blocks (wall time vs CPU time)
- Measuring CPU-GPU or GPU-GPU transfer size, frequency, and bandwidth
- Tracking memory allocation across CPU and GPU to detect redundancy
- Wrapping NCCL collectives to measure latency and throughput
- Adding CUDA event timing around kernels
Design the instrumentation based on what you observe in the code — don't use a fixed template.
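The first scenario above (wall time vs CPU time for a block) can be sketched as a context manager that wraps the suspect code instead of editing it inline; names like `timed` are illustrative:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, results):
    """Record wall time and CPU time for a code block.

    A low CPU/wall ratio suggests the block is waiting
    (I/O, locks, GPU sync) rather than computing.
    """
    wall0 = time.perf_counter()
    cpu0 = time.process_time()
    try:
        yield
    finally:
        wall = time.perf_counter() - wall0
        cpu = time.process_time() - cpu0
        results.append({
            "label": label,
            "wall_s": wall,
            "cpu_s": cpu,
            "cpu_ratio": cpu / wall if wall > 0 else 0.0,
        })

# Usage: wrap the suspect block rather than editing it inline.
results = []
with timed("sleep-bound section", results):
    time.sleep(0.05)  # stands in for the profiled code
```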
Step 3: Key dimensions to investigate
Depending on the target, focus on some or all of these:
CPU overhead
- Context switching (voluntary / involuntary)
- CPU utilization: ratio of CPU time to wall time
- Per-function execution time hotspots
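The context-switch counts above can be sampled from within the process via `getrusage`, with no external tool (Unix-only sketch; a rising involuntary count points at CPU contention):

```python
import resource

def context_switches():
    """Snapshot voluntary/involuntary context switches for this process (Unix)."""
    ru = resource.getrusage(resource.RUSAGE_SELF)
    return {"voluntary": ru.ru_nvcsw, "involuntary": ru.ru_nivcsw}

before = context_switches()
sum(i * i for i in range(100_000))  # stands in for the profiled work
after = context_switches()
delta = {k: after[k] - before[k] for k in before}
```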
Memory overhead
- CPU and GPU memory usage (allocated vs reserved vs peak)
- Redundant replication: same data living on both CPU and GPU
- Per-device allocation balance in multi-GPU setups
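On the CPU side, current-vs-peak usage can be tracked with the stdlib `tracemalloc` module; a minimal sketch (the GPU analogue, e.g. allocated vs reserved counters in a framework like PyTorch, follows the same pattern but is framework-specific):

```python
import tracemalloc

tracemalloc.start()
buf = [bytes(1024) for _ in range(1000)]  # stands in for the profiled allocation
current, peak = tracemalloc.get_traced_memory()
del buf
after_free, _ = tracemalloc.get_traced_memory()
tracemalloc.stop()
```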
Interconnect & communication
- CPU-GPU transfer: frequency, per-transfer size, total volume, bandwidth achieved
- GPU-GPU transfer: P2P bandwidth, NVLink vs PCIe topology impact
- NCCL collectives: operation type, message size distribution, latency
- Communication-to-computation ratio
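However the transfers are captured (framework hooks, wrapped copy calls, or trace parsing), the frequency/volume/bandwidth arithmetic is simple aggregation; a sketch with hypothetical numbers:

```python
from dataclasses import dataclass

@dataclass
class Transfer:
    """One logged CPU-GPU (or GPU-GPU) copy: size in bytes, duration in seconds."""
    nbytes: int
    seconds: float

def summarize(transfers):
    """Aggregate transfer logs into frequency, total volume, and achieved bandwidth."""
    total_bytes = sum(t.nbytes for t in transfers)
    total_time = sum(t.seconds for t in transfers)
    return {
        "count": len(transfers),
        "total_gib": total_bytes / 2**30,
        "bandwidth_gib_s": (total_bytes / total_time) / 2**30 if total_time else 0.0,
    }

# Hypothetical log: 4 copies of 256 MiB taking 0.025 s each (~10 GiB/s achieved).
log = [Transfer(256 * 2**20, 0.025) for _ in range(4)]
stats = summarize(log)
```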
GPU compute
- SM utilization, kernel launch overhead
- Memory bandwidth utilization vs peak
Step 4: Instrumentation guidelines
When inserting code into the target:
- Read and understand the target code first
- Prefer wrapping (decorator, context manager, standalone runner) over inline edits
- If inline edits are necessary, mark them clearly (e.g., comments)
- Minimize observer effect — don't instrument tight inner loops; sample instead
- Collect results into a structured log, don't scatter print statements
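The "structured log, not scattered prints" guideline can be satisfied by a small collector that dumps JSON lines at the end of the run; a minimal sketch (class and field names are illustrative):

```python
import json

class ProfileLog:
    """Collect instrumentation records centrally instead of scattering prints."""
    def __init__(self):
        self.records = []

    def record(self, event, **fields):
        self.records.append({"event": event, **fields})

    def dump(self, path):
        # One JSON object per line, easy to grep and post-process.
        with open(path, "w") as f:
            for rec in self.records:
                f.write(json.dumps(rec) + "\n")

log = ProfileLog()
log.record("transfer", nbytes=1024, seconds=0.001)
log.record("kernel", name="matmul", seconds=0.002)
```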
Step 5: Run profiling
- Check available tools and hardware topology
- Run the chosen methods, capture all output
- Save artifacts (flamegraphs, traces, logs) to a dedicated output directory
Step 6: Produce the report
Part A — Profiling results (structured tables by dimension, as applicable):
- CPU overhead table
- Memory overhead table (with redundancy column)
- Interconnect table (transfer type / frequency / size / latency / bandwidth)
- Hotspots / bottleneck identification
- Actionable recommendations ranked by expected impact
Part B — Instrumentation changelog (MANDATORY):
List every file that was modified or created for profiling purposes:
| File | Change type | What was added/modified | Line(s) |
|---|---|---|---|
| ... | modified | ... | ... |
| ... | created | ... | — |
This allows the user to review and revert all instrumentation changes.
Offer to clean up (remove all instrumentation) when the user is done.