perf-nsight-systems

Nsight Systems Profiling

NVIDIA Nsight Systems (`nsys`) is a system-level performance analysis tool that captures CPU/GPU activity timelines, API traces, and OS-level events. Unlike Nsight Compute (kernel-level), `nsys` shows the big picture: how kernels, memory transfers, communication, and CPU work overlap in time.

When to Use

Reach for this skill when you encounter:
  • Triggers: User wants to profile a training script end-to-end, analyze GPU utilization, find pipeline bottlenecks, check communication/compute overlap, or interpret `.nsys-rep` reports
  • Symptoms: Training slower than expected, GPU idle between iterations, need to understand where time is spent across CPU and GPU, poor scaling in distributed training
  • Keywords: "nsys", "nsight systems", "GPU timeline", "GPU utilization", "kernel launch overhead", "training profiling", "NCCL overlap", "nsys-rep", "cuda trace", "GPU idle", "pipeline stall", "data loading bottleneck"

Do NOT use this skill for:
  • Kernel-level optimization (use Nsight Compute / `ncu` instead)
  • GPU hardware metrics like SM throughput, cache hit rates (use `ncu`)
  • GPU monitoring without profiling (use `nvidia-smi`)
  • Non-CLI usage (GUI workflows, IDE integration): consult official docs

Requirements

| Dependency | Version | Notes |
|---|---|---|
| CUDA Toolkit | >=11.0 | Includes `nsys` |
| `nsys` binary | Match CUDA version | Verify with `nsys -v` |
| NVIDIA GPU | Any supported | |

Permissions: `nsys` may require `sudo` or `CAP_SYS_ADMIN` for system-wide tracing and GPU metrics. In containers, use `--privileged` or `--cap-add=SYS_ADMIN`.

Reporting Principles

**Every number must have an authoritative source.** When presenting timing data, kernel counts, API call durations, or any quantitative metric, always cite the source: `nsys stats` report output, `nsys analyze` rule output, exported SQLite query result, recipe CSV, or raw command output. Show the actual command and its output before interpreting. Never synthesize, estimate, or extrapolate numbers that did not come from a tool output.

**Use `nsys stats` for structured analysis, not raw trace data.** Always extract metrics via targeted `nsys stats -r <report>` commands rather than trying to read or interpret `.nsys-rep` files directly. Stats reports produce compact, tabular summaries; raw trace data can be enormous (especially with backtraces or verbose API tracing). Run the smallest set of reports needed for the task, then request additional reports only if the initial results raise questions.
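
One way to follow both principles from a script is sketched below in Python: build the `nsys stats` command explicitly, show it, and parse only the CSV table from its output. The helper names `stats_command` and `parse_stats_csv` are hypothetical, and the parser assumes that any status lines printed before the table contain no comma; check that against the output of your nsys version.

```python
import csv
import io
import shlex


def stats_command(report, rep_path):
    """Build an `nsys stats` invocation that emits one report as CSV."""
    return ["nsys", "stats", "-r", report, "--format", "csv", rep_path]


def parse_stats_csv(raw_output):
    """Parse the CSV table out of raw `nsys stats` output.

    Assumption: status lines printed before the table contain no comma,
    so the first line with a comma is treated as the CSV header.
    """
    lines = raw_output.splitlines()
    start = next((i for i, line in enumerate(lines) if "," in line), None)
    if start is None:
        return []
    table = "\n".join(line for line in lines[start:] if line.strip())
    return list(csv.DictReader(io.StringIO(table)))


# Show the exact command first, so every number can be traced back to it.
cmd = stats_command("cuda_gpu_kern_sum", "train_profile.nsys-rep")
print(shlex.join(cmd))

# To run it for real (requires nsys and an existing report):
#   import subprocess
#   raw = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
#   rows = parse_stats_csv(raw)
```

Keeping the raw output next to the parsed rows lets every quoted number be checked against the command that produced it.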

Workflows

Workflow 1: Profile a DL Training Script

**Goal**: Capture a clean, focused profile of steady-state training iterations.

**Step 1: Add profiler markers to your training script to skip warmup**

```python
# In training script
for i, batch in enumerate(train_loader):
    if i == warmup_iters:
        torch.cuda.cudart().cudaProfilerStart()  # open the capture range
    train_step(model, batch)
    if i == warmup_iters + profile_iters:
        torch.cuda.cudart().cudaProfilerStop()   # close the capture range
        break
```
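
The marker placement in the loop above can be checked without a GPU. The sketch below replays the same two conditions in plain Python and lists which iteration indices fall inside the capture range. `captured_iterations` and its `total_iters` bound are hypothetical helpers for illustration, not part of PyTorch or nsys.

```python
def captured_iterations(warmup_iters, profile_iters, total_iters):
    """Replay the loop's start/stop conditions and list captured iterations."""
    captured = []
    active = False
    for i in range(total_iters):
        if i == warmup_iters:
            active = True        # where cudaProfilerStart() is called
        if active:
            captured.append(i)   # train_step runs inside the capture range
        if i == warmup_iters + profile_iters:
            break                # where cudaProfilerStop() is called
    return captured


print(captured_iterations(warmup_iters=3, profile_iters=2, total_iters=100))
```

With `warmup_iters=3` and `profile_iters=2` this prints `[3, 4, 5]`: the stop check runs after `train_step`, so the range covers `profile_iters + 1` iterations. If exactly `profile_iters` iterations are wanted, compare against `warmup_iters + profile_iters - 1` instead.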

**Step 2: Profile with `cudaProfilerApi` capture range**

```bash
nsys profile -c cudaProfilerApi \
    -t cuda,nvtx,cudnn,cublas \
    --pytorch=autograd-nvtx \
    -o train_profile -- python train.py
```

This captures only steady-state iterations: no warmup, no initialization noise.

Note: `-t cuda,nvtx,cudnn,cublas` enables API-specific tracing. By default, `-t cuda` only traces the CUDA runtime/driver layer: you see kernel names and launch times but cannot attribute them to higher-level libraries. Adding `cudnn` and `cublas` traces the library-level API calls, letting you distinguish convolution time (cuDNN) from GEMM time (cuBLAS) and measure library overhead separately from raw kernel execution.

**Step 3: Quick summary**

```bash
nsys stats -r cuda_gpu_kern_sum,cuda_api_sum,cuda_gpu_mem_time_sum \
    train_profile.nsys-rep
```

When you traced library APIs (`cudnn`, `cublas` in `-t`), also run the library-specific reports to see API-level overhead (workspace allocation, algorithm selection) separately from raw kernel execution:

```bash
nsys stats -r cudnn_api_sum,cublas_api_sum train_profile.nsys-rep
```

**Step 4: Detect anti-patterns**

```bash
nsys analyze -r all train_profile.nsys-rep
```

**Step 5: Dig deeper based on findings.** See Tier 2 references.

Workflow 2: Diagnose GPU Idle Time

**Goal**: Find why the GPU is idle between training iterations.

**Step 1: Profile with OS runtime tracing**

```bash
nsys profile -t cuda,nvtx,osrt \
    --pytorch=autograd-nvtx \
    -o idle_debug -- python train.py
```

**Step 2: Check GPU gaps and utilization**

```bash
nsys analyze -r gpu_gaps,gpu_time_util idle_debug.nsys-rep
```

**Step 3: Check kernel launch phases**

```bash
nsys stats -r cuda_kern_exec_sum idle_debug.nsys-rep
```

High queue time = GPU was busy (not the issue). Near-zero queue time for all kernels = GPU was starved (host not submitting work fast enough).

**Step 4: Common causes and fixes**

| GPU idle cause | Evidence | Fix |
|---|---|---|
| Slow data loading | CPU busy in DataLoader during gaps | Increase `num_workers`, use `pin_memory=True` |
| Synchronous memcpy | `cuda_memcpy_sync` rule fires | Use `non_blocking=True` transfers |
| Over-synchronization | Frequent `cudaDeviceSynchronize` in trace | Remove unnecessary sync calls |
| Host-side computation | CPU sampling shows compute during gaps | Move to GPU or overlap with async ops |
| Python GIL contention | GIL trace shows contention | Use multiprocessing, reduce Python overhead |
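
The queue-time rule of thumb from Step 3 can be written as a small check. This is a sketch, not an nsys feature: `classify_queue_times` and its 1000 ns cutoff for "near-zero" are hypothetical and should be tuned. The input is a list of per-kernel queue times in nanoseconds that you have already extracted from `cuda_kern_exec_sum` output.

```python
def classify_queue_times(queue_times_ns, near_zero_ns=1_000):
    """Apply the Step 3 heuristic to per-kernel queue times (nanoseconds).

    near_zero_ns is an assumed cutoff for "near-zero"; it is not an nsys default.
    """
    if not queue_times_ns:
        return "no-data"
    if all(q <= near_zero_ns for q in queue_times_ns):
        # No kernel waited in the queue: the host is not submitting work fast enough.
        return "gpu-starved"
    # Some kernels queued behind earlier work: the GPU had work pending.
    return "gpu-busy"


# Hypothetical values for illustration only, not measured data.
print(classify_queue_times([200, 450, 90]))
```

A `gpu-starved` result points at the host-side causes in the table above; `gpu-busy` means the idle time has to be explained some other way.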

Workflow 3: Profile Distributed Training

**Goal**: Profile multi-GPU/multi-node training with communication analysis.

**Step 1: Collect per-rank profiles**

```bash
nsys profile -t cuda,nvtx,mpi,ucx \
    --pytorch=autograd-nvtx \
    -o profile_%q{RANK} \
    -- torchrun --nproc_per_node=8 train.py
```

**Step 2: Analyze NCCL communication/compute overlap**

```bash
nsys recipe nccl_gpu_overlap_trace -- profile_*.nsys-rep
nsys recipe nccl_gpu_time_util_map -- profile_*.nsys-rep
```

**Step 3: Check per-rank utilization**

```bash
nsys recipe cuda_gpu_time_util_map -- profile_*.nsys-rep
```

**Step 4: Check for stragglers**

Compare `cuda_gpu_kern_sum` across ranks. If one rank is slower, check its network and data loading patterns.
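
The cross-rank comparison can be sketched as a median check. The helper below is hypothetical (`find_outlier_ranks`, with an assumed 10% tolerance) and takes per-rank totals that you have already summed from each rank's `cuda_gpu_kern_sum` output. It flags deviation in either direction, since a rank that waits on its peers can show more or less kernel time than they do.

```python
import statistics


def find_outlier_ranks(total_kernel_ns_by_rank, tolerance=0.10):
    """Return ranks whose total kernel time deviates from the median.

    total_kernel_ns_by_rank maps rank -> total GPU kernel time (ns).
    tolerance is an assumed relative threshold, not an nsys setting.
    """
    if not total_kernel_ns_by_rank:
        return []
    median = statistics.median(total_kernel_ns_by_rank.values())
    allowed = abs(median) * tolerance
    return sorted(
        rank
        for rank, total in total_kernel_ns_by_rank.items()
        if abs(total - median) > allowed
    )
```

Ranks returned by this check are the ones whose timelines, network activity, and data loading deserve a closer look first.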

Workflow 4: Analyze Iteration Time Consistency

**Goal**: Check whether training iterations are stable or have outliers.

```bash
# Profile with NVTX iteration markers
nsys profile --pytorch=autograd-nvtx -t cuda,nvtx \
    -o iter_check -- python train.py

# Check iteration timing distribution
nsys stats -r nvtx_pushpop_sum iter_check.nsys-rep

# Check GPU projection per NVTX range
nsys stats -r nvtx_gpu_proj_sum iter_check.nsys-rep

# Visual pace analysis
nsys recipe nvtx_pace -- iter_check.nsys-rep
```

High StdDev in iteration duration indicates inconsistency; investigate outlier iterations on the timeline.
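
To make "high StdDev" concrete, the spread can be expressed relative to the mean and individual outliers can be listed. The sketch below assumes you already have per-iteration durations in nanoseconds taken from tool output. `iteration_consistency` and the 2-sigma outlier cutoff are hypothetical choices, not nsys behavior.

```python
import statistics


def iteration_consistency(durations_ns, outlier_sigma=2.0):
    """Summarize iteration timing spread and list outlier iteration indices."""
    if len(durations_ns) < 2:
        raise ValueError("need at least two iterations to compute a StdDev")
    mean = statistics.mean(durations_ns)
    stdev = statistics.stdev(durations_ns)  # sample standard deviation
    # Coefficient of variation: StdDev as a fraction of the mean duration.
    cv = stdev / mean if mean else float("inf")
    outliers = [
        i for i, d in enumerate(durations_ns)
        if abs(d - mean) > outlier_sigma * stdev
    ]
    return {"mean_ns": mean, "stdev_ns": stdev, "cv": cv, "outliers": outliers}
```

The returned `outliers` indices are the iterations to locate on the timeline.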

Workflow 5: Attribute Kernels to Source Code via Stack Traces

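**Goal**: Identify which Python function or code path triggers expensive GPU kernels.

**Step 1: Profile with backtrace collection**

```bash
nsys profile -t cuda,nvtx \
    --cudabacktrace=all \
    --python-backtrace=cuda \
    --pytorch=autograd-nvtx \
    -o stacktrace_profile -- python train.py
```

  • `--cudabacktrace=all`: Captures CUDA API call stacks (C/C++ frames) so each `cudaLaunchKernel` shows the host-side call chain that triggered it.
  • `--python-backtrace=cuda`: Captures Python-level call stacks on CUDA API calls, correlating GPU work back to specific Python functions (e.g., `compute_attention` vs `compute_ffn`).

Backtrace option names and accepted values differ between nsys releases; confirm them with `nsys profile --help`. In recent releases, `--backtrace` selects the CPU-sampling unwind method (such as `lbr`) rather than enabling CUDA API backtraces.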
**Step 2: Get kernel summary and NVTX attribution**

Use targeted stats reports to identify top kernels and their NVTX context:

```bash
# Top kernels by total GPU time
nsys stats -r cuda_gpu_kern_sum stacktrace_profile.nsys-rep

# Kernels attributed to NVTX ranges (maps kernels to annotated code regions)
nsys stats -r nvtx_kern_sum stacktrace_profile.nsys-rep
```

The `nvtx_kern_sum` report (requires `--pytorch=autograd-nvtx` or manual NVTX annotations) maps each kernel to its enclosing NVTX range, directly showing which Python function or autograd op launched it. This is more efficient than manually cross-referencing raw backtrace data.

**Step 3: For PyTorch models**, `--pytorch=autograd-nvtx` automatically wraps each autograd op in an NVTX range. Combined with backtrace, this maps: GPU kernel → CUDA API call → Python function → PyTorch autograd op.

**When to use**: Workloads with multiple code paths launching similar kernels (e.g., attention vs FFN both calling GEMM). Stack traces disambiguate which caller is responsible for the dominant kernel time.
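
Once kernels are attributed to NVTX ranges, deciding which caller dominates a shared kernel is a group-by. The sketch below assumes you have extracted `(nvtx_range, kernel_name, total_time_ns)` tuples from `nvtx_kern_sum` output. `kernel_time_by_range` is a hypothetical helper, and the tuple layout is an assumption rather than the report's native format.

```python
from collections import defaultdict


def kernel_time_by_range(rows, kernel_substring):
    """Sum GPU time per NVTX range for kernels whose name contains a substring.

    rows: iterable of (nvtx_range, kernel_name, total_time_ns) tuples.
    Returns (nvtx_range, total_ns) pairs, largest total first.
    """
    totals = defaultdict(int)
    for nvtx_range, kernel_name, total_ns in rows:
        if kernel_substring in kernel_name:
            totals[nvtx_range] += total_ns
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

Calling it with a substring such as `"gemm"` ranks the annotated regions by how much time they spend in matching kernels, which answers the attention-vs-FFN question directly.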

Output Formats

Report files (`.nsys-rep`): Binary format, viewable in GUI or processed with `nsys stats`, `nsys analyze`, `nsys export`, `nsys recipe`.

Stats output formats: `column` (terminal), `csv`, `json`, `table`, `tsv`, `hdoc`, `htable`.

Export formats: `sqlite` (SQL queries), `arrow` / `parquetdir` (Pandas/Dask), `hdf`, `jsonlines`, `text`.

Recipe output: Directory with CSV/Parquet data, Plotly HTML visualizations, and `.nsys-analysis` (Jupyter notebook).

Key stats report columns:

| Report | Key columns |
|---|---|
| `cuda_gpu_kern_sum` | Time%, Total Time, Instances, Kernel Name |
| `cuda_api_sum` | Time%, Total Time, Num Calls, API Name |
| `cuda_kern_exec_sum` | API Time, Queue Time, Kernel Time |
| `cuda_gpu_mem_time_sum` | Time%, Total Time, Operations, Direction |
| `nvtx_gpu_proj_sum` | Projected Duration, Original Duration, GPU Op Count |
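
For the `sqlite` export format, a query can be run with Python's standard `sqlite3` module. The sketch below is illustrative only: the table and column names (`CUPTI_ACTIVITY_KIND_KERNEL` with `start`, `end`, `shortName`, and `StringIds` with `id`, `value`) are assumptions about the export schema, which can change between nsys versions. Inspect the exported file's schema before relying on them. `top_kernels` is a hypothetical helper.

```python
import sqlite3

# Assumed schema: kernel rows reference their name through StringIds.
TOP_KERNELS_SQL = """
SELECT s.value AS kernel_name,
       COUNT(*) AS instances,
       SUM(k."end" - k.start) AS total_ns
FROM CUPTI_ACTIVITY_KIND_KERNEL AS k
JOIN StringIds AS s ON s.id = k.shortName
GROUP BY s.value
ORDER BY total_ns DESC
LIMIT ?
"""


def top_kernels(conn, limit=10):
    """Return (kernel_name, instances, total_ns) rows, largest total first."""
    return conn.execute(TOP_KERNELS_SQL, (limit,)).fetchall()


# Usage (assumes train_profile.sqlite came from `nsys export --type sqlite`):
#   conn = sqlite3.connect("train_profile.sqlite")
#   for name, instances, total_ns in top_kernels(conn, limit=5):
#       print(name, instances, total_ns)
```

When reporting results from a query like this, cite the SQL and the database file alongside the numbers.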

Examples

Example 1: Quick DL Profile and Summary

```bash
# Profile
nsys profile -t cuda,nvtx,cudnn,cublas \
    --pytorch=autograd-nvtx --stats=true \
    -o quick_profile -- python train.py

# Auto-generates stats at the end of profiling
```

Example 2: Detect Sync Memcpy in DataLoader

```bash
nsys profile -t cuda,nvtx -o dataloader_check -- python train.py
nsys analyze -r cuda_memcpy_sync,cuda_memcpy_async dataloader_check.nsys-rep
```

If flagged, fix with:

```python
from torch.utils.data import DataLoader

loader = DataLoader(dataset, pin_memory=True, num_workers=4)  # pinned host memory enables async copies
tensor_gpu = tensor_cpu.to(device, non_blocking=True)         # copy without blocking the host
```
Example 3: Multi-Node NCCL Analysis

```bash
# Collect
nsys profile -t cuda,nvtx,mpi -o rank_%q{RANK} \
    -- torchrun --nproc_per_node=8 train.py

# Analyze overlap
nsys recipe nccl_gpu_overlap_trace -- rank_*.nsys-rep

# Visualize
nsys recipe nccl_gpu_time_util_map -- rank_*.nsys-rep
```

Example 4: API-Level Breakdown (cuDNN vs cuBLAS)

```bash
# Profile with library-level tracing
nsys profile -t cuda,nvtx,cudnn,cublas \
    -o api_breakdown -- python model.py

# cuDNN API summary (convolution calls)
nsys stats -r cudnn_api_sum api_breakdown.nsys-rep

# cuBLAS API summary (GEMM calls)
nsys stats -r cublas_api_sum api_breakdown.nsys-rep

# Compare with kernel-level view
nsys stats -r cuda_gpu_kern_sum api_breakdown.nsys-rep
```

The API-level reports (`cudnn_api_sum`, `cublas_api_sum`) show time spent in library calls including overhead (workspace allocation, algorithm selection), while `cuda_gpu_kern_sum` shows only raw GPU kernel execution. The difference reveals library-side overhead.

Error Handling

| Error | Cause | Fix |
|---|---|---|
| `nsys: command not found` | Not in PATH | `export PATH=$PATH:/usr/local/cuda/bin` |
| `Permission denied` or `requires root` | Needs elevated privileges | `sudo nsys ...` or `--cap-add=SYS_ADMIN` in containers |
| No CUDA activity captured | App didn't use GPU during collection window | Adjust `--delay` / `--duration`, or use `cudaProfilerApi` capture range |
| Report file very large | Long profile with many APIs traced | Use focused capture (`-c cudaProfilerApi`), reduce `--duration` |
| `--pytorch` has no effect | Wrong nsys version or Python env | Verify nsys version supports `--pytorch`; check Python is in PATH |
| `nsys stats` shows empty reports | No matching activity in report | Check `--trace` flags included the right APIs |
| MPI rank profiles out of sync | Clock skew between nodes | Use NTP sync; analyze per-rank independently |
| `cudaProfilerStart` not captured | Missing `-c cudaProfilerApi` flag | Add `--capture-range=cudaProfilerApi` |
| Recipe fails with import error | Missing Python dependencies | Install recipe dependencies: `pip install pandas plotly` |
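
The first row of the table can be turned into a preflight check before launching a long profiling run. This is a minimal sketch: `find_tool` is a hypothetical helper, and `/usr/local/cuda/bin` is only the fallback directory named in the table, which may not match your installation.

```python
import os
import shutil


def find_tool(name, extra_dirs=("/usr/local/cuda/bin",)):
    """Locate an executable on PATH, then in assumed fallback directories."""
    search_path = os.pathsep.join([os.environ.get("PATH", ""), *extra_dirs])
    return shutil.which(name, path=search_path)


nsys_path = find_tool("nsys")
if nsys_path is None:
    print("nsys not found: add the CUDA bin directory to PATH")
else:
    print(f"using {nsys_path}")
```

Failing early with a clear message is cheaper than discovering a missing binary after a job has been scheduled.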

Finding More Information

Tier 1: This File (SKILL.md)

You are reading it now. The workflows and error table above cover the most common DL profiling tasks. Search this file first.

Tier 2: references/ Directory

Grep for keywords across `references/` (headers are grep-friendly):
  • `references/cli-profiling.md`: Complete `nsys profile` flags for DL
  • `references/cli-post-collection.md`: `nsys stats`, `analyze`, `export`, `recipe` commands
  • `references/app-preparation.md`: Focused profiling, NVTX markers, PyTorch patterns
  • `references/stats-reports.md`: CUDA statistical report columns and meanings
  • `references/expert-systems.md`: Expert system rules, anti-pattern detection
  • `references/recipes-dl.md`: DL-relevant advanced recipes with examples
  • `references/nvtx-analysis.md`: NVTX statistical reports for annotated code

How to search:
  1. **Grep** for your keyword across `references/`
  2. **Read** only the file that Grep points to
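
Where a grep tool is not available, the same two-step search can be approximated in Python. The sketch below assumes the reference files are markdown and that their headers start with `#`; `grep_headers` is a hypothetical helper, not part of this skill.

```python
from pathlib import Path


def grep_headers(root, keyword):
    """Return (file_name, line_no, header) for headers containing keyword."""
    hits = []
    needle = keyword.lower()
    for md_file in sorted(Path(root).glob("*.md")):
        text = md_file.read_text(encoding="utf-8")
        for line_no, line in enumerate(text.splitlines(), start=1):
            # Only header lines are matched, mirroring the grep-friendly headers.
            if line.startswith("#") and needle in line.lower():
                hits.append((md_file.name, line_no, line.strip()))
    return hits
```

Open only the files named in the result, which keeps the amount of reference text read to a minimum.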

Tier 3: Official Documentation

If Tiers 1-2 don't answer:
  • User Guide: Complete CLI reference, all trace options
  • Analysis Guide: Stats reports, expert systems, recipes

WebFetch or WebSearch these URLs for the latest content. Consider distilling new findings back into `references/`.