perf-nsight-systems


Nsight Systems Profiling


NVIDIA Nsight Systems (`nsys`) is a system-level performance analysis tool that captures CPU/GPU activity timelines, API traces, and OS-level events. Unlike Nsight Compute (kernel-level), `nsys` shows the big picture: how kernels, memory transfers, communication, and CPU work overlap in time.


When to Use


Reach for this skill when you encounter:

- Triggers: User wants to profile a training script end-to-end, analyze GPU utilization, find pipeline bottlenecks, check communication/compute overlap, or interpret `.nsys-rep` reports
- Symptoms: Training slower than expected, GPU idle between iterations, need to understand where time is spent across CPU and GPU, poor scaling in distributed training
- Keywords: "nsys", "nsight systems", "GPU timeline", "GPU utilization", "kernel launch overhead", "training profiling", "NCCL overlap", "nsys-rep", "cuda trace", "GPU idle", "pipeline stall", "data loading bottleneck"

Do NOT use this skill for:

- Kernel-level optimization (use Nsight Compute / `ncu` instead)
- GPU hardware metrics like SM throughput, cache hit rates (use `ncu`)
- GPU monitoring without profiling (use `nvidia-smi`)
- Non-CLI usage (GUI workflows, IDE integration): consult official docs


Requirements


Dependency	Version	Notes
CUDA Toolkit	>=11.0	Includes `nsys`
`nsys` binary	Match CUDA version	Verify with `nsys -v`
NVIDIA GPU	Any supported

Permissions: `nsys` may require `sudo` or `CAP_SYS_ADMIN` for system-wide tracing and GPU metrics. In containers, use `--privileged` or `--cap-add=SYS_ADMIN`.


Reporting Principles


Every number must have an authoritative source. When presenting timing data, kernel counts, API call durations, or any quantitative metric, always cite the source: `nsys stats` report output, `nsys analyze` rule output, exported SQLite query result, recipe CSV, or raw command output. Show the actual command and its output before interpreting. Never synthesize, estimate, or extrapolate numbers that did not come from a tool output.

Use `nsys stats` for structured analysis, not raw trace data. Always extract metrics via targeted `nsys stats -r <report>` commands rather than trying to read or interpret `.nsys-rep` files directly. Stats reports produce compact, tabular summaries; raw trace data can be enormous (especially with backtraces or verbose API tracing). Run the smallest set of reports needed for the task, then request additional reports only if the initial results raise questions.
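One way to keep every number tied to its source is to run each targeted report from a small wrapper that returns the exact command next to the parsed rows. The sketch below is a minimal illustration, not part of the skill itself: the helper names are hypothetical, and it assumes that `nsys stats` writes CSV to stdout when `--format csv` is given and that any status lines printed before the table contain no commas.

```python
import csv
import io
import subprocess


def stats_command(report, rep_path):
    """Build the argv for one targeted `nsys stats` report in CSV form."""
    return ["nsys", "stats", "-r", report, "--format", "csv", rep_path]


def parse_stats_csv(text):
    """Parse CSV text into a list of dicts, skipping any preamble lines.

    Assumption: the header row is the first line that contains a comma.
    """
    lines = text.splitlines()
    start = next((i for i, line in enumerate(lines) if "," in line), len(lines))
    return list(csv.DictReader(io.StringIO("\n".join(lines[start:]))))


def run_report(report, rep_path):
    """Run one report; return (command string, rows) so the source can be cited."""
    cmd = stats_command(report, rep_path)
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return " ".join(cmd), parse_stats_csv(out)
```

Calling `run_report("cuda_gpu_kern_sum", "train_profile.nsys-rep")` would return the command string to show before interpretation, plus one dict per table row keyed by the column headers that `nsys` actually printed.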


Workflows


Workflow 1: Profile a DL Training Script


Goal: Capture a clean, focused profile of steady-state training iterations.

Step 1 — Add profiler markers to your training script to skip warmup:

```python
# In training script
import torch

for i, batch in enumerate(train_loader):
    if i == warmup_iters:
        # Warmup is done: open the capture range
        torch.cuda.cudart().cudaProfilerStart()
    train_step(model, batch)
    if i == warmup_iters + profile_iters - 1:
        # Close the capture range after exactly profile_iters iterations
        torch.cuda.cudart().cudaProfilerStop()
        break
```
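To check which iterations a start/stop pair actually covers without touching a GPU, the loop can be dry-run with stand-ins for the profiler calls. This is a minimal sketch under the assumption that the capture should span exactly `profile_iters` iterations after warmup; `capture_window` is a hypothetical helper, not part of PyTorch or `nsys`.

```python
def capture_window(total_iters, warmup_iters, profile_iters):
    """Return the iteration indices that fall between the start and stop markers."""
    profiled = []
    capturing = False
    for i in range(total_iters):
        if i == warmup_iters:
            capturing = True      # stands in for cudaProfilerStart()
        if capturing:
            profiled.append(i)    # stands in for train_step(model, batch)
        if i == warmup_iters + profile_iters - 1:
            capturing = False     # stands in for cudaProfilerStop()
            break
    return profiled


print(capture_window(total_iters=100, warmup_iters=10, profile_iters=3))
```

If the stop condition compared against `warmup_iters + profile_iters` after the training step instead, the window would contain one extra iteration.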


**Step 2 — Profile with `cudaProfilerApi` capture range:**

```bash
nsys profile -c cudaProfilerApi \
    -t cuda,nvtx,cudnn,cublas \
    --pytorch=autograd-nvtx \
    -o train_profile -- python train.py
```

This captures only steady-state iterations, with no warmup or initialization noise.

Note: `-t cuda,nvtx,cudnn,cublas` enables API-specific tracing. On its own, `-t cuda` traces only the CUDA runtime/driver layer: you see kernel names and launch times but cannot attribute them to higher-level libraries. Adding `cudnn` and `cublas` traces the library-level API calls, letting you distinguish convolution time (cuDNN) from GEMM time (cuBLAS) and measure library overhead separately from raw kernel execution.

Step 3 — Quick summary:

```bash
nsys stats -r cuda_gpu_kern_sum,cuda_api_sum,cuda_gpu_mem_time_sum \
    train_profile.nsys-rep
```

If `-t` included `cudnn` and `cublas`, also run the library-specific reports to see API-level overhead (workspace allocation, algorithm selection) separately from raw kernel execution:

```bash
nsys stats -r cudnn_api_sum,cublas_api_sum train_profile.nsys-rep
```

Step 4 — Detect anti-patterns:

```bash
nsys analyze -r all train_profile.nsys-rep
```

Step 5 — Dig deeper based on findings. See Tier 2 references.



Workflow 2: Diagnose GPU Idle Time


Goal: Find why the GPU is idle between training iterations.

Step 1 — Profile with OS runtime tracing:

```bash
nsys profile -t cuda,nvtx,osrt \
    --pytorch=autograd-nvtx \
    -o idle_debug -- python train.py
```

Step 2 — Check GPU gaps and utilization:

```bash
nsys analyze -r gpu_gaps,gpu_time_util idle_debug.nsys-rep
```

Step 3 — Check kernel launch phases:

```bash
nsys stats -r cuda_kern_exec_sum idle_debug.nsys-rep
```

High queue time = GPU was busy (not the issue). Near-zero queue time for all kernels = GPU was starved (host not submitting work fast enough).
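The rule of thumb above can be written down as a small classifier over per-kernel queue times. The sketch below is only an illustration: the function name and both thresholds (`near_zero_ns`, `busy_ns`) are hypothetical values chosen for the example, not figures from `nsys`, and the queue times are assumed to have been extracted from the `cuda_kern_exec_sum` output already.

```python
from statistics import median


def diagnose_queue_times(queue_times_ns, near_zero_ns=1_000, busy_ns=10_000):
    """Classify queue times (nanoseconds) as 'starved', 'busy', or 'inconclusive'."""
    if not queue_times_ns:
        return "no data"
    if all(q <= near_zero_ns for q in queue_times_ns):
        # Every kernel started almost immediately: the host is the limiter
        return "starved"
    if median(queue_times_ns) >= busy_ns:
        # Kernels typically waited behind other GPU work
        return "busy"
    return "inconclusive"
```

An "inconclusive" result is a cue to look at the gap analysis from Step 2 instead of reading more into the queue times.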

Step 4 — Common causes and fixes:

GPU idle cause	Evidence	Fix
Slow data loading	CPU busy in DataLoader during gaps	Increase `num_workers`, use `pin_memory=True`
Synchronous memcpy	`cuda_memcpy_sync` rule fires	Use `non_blocking=True` transfers
Over-synchronization	Frequent `cudaDeviceSynchronize` in trace	Remove unnecessary sync calls
Host-side computation	CPU sampling shows compute during gaps	Move to GPU or overlap with async ops
Python GIL contention	GIL trace shows contention	Use multiprocessing, reduce Python overhead


Workflow 3: Profile Distributed Training


Goal: Profile multi-GPU/multi-node training with communication analysis.

Step 1 — Collect per-rank profiles:

```bash
nsys profile -t cuda,nvtx,mpi,ucx \
    --pytorch=autograd-nvtx \
    -o profile_%q{RANK} \
    -- torchrun --nproc_per_node=8 train.py
```

Step 2 — Analyze NCCL communication/compute overlap:

```bash
nsys recipe nccl_gpu_overlap_trace -- profile_*.nsys-rep
nsys recipe nccl_gpu_time_util_map -- profile_*.nsys-rep
```

Step 3 — Check per-rank utilization:

```bash
nsys recipe cuda_gpu_time_util_map -- profile_*.nsys-rep
```

Step 4 — Check for stragglers:

Compare `cuda_gpu_kern_sum` across ranks. If one rank is slower, check its network and data loading patterns.
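Once the total kernel time per rank has been read from each rank's `cuda_gpu_kern_sum` output, the comparison can be automated. The sketch below is a rough illustration: `find_outlier_ranks` and its 10% `tolerance` are hypothetical, and it flags deviation in either direction because a slow rank can also show up as longer waits on the other ranks.

```python
from statistics import median


def find_outlier_ranks(total_kernel_ns_by_rank, tolerance=0.10):
    """Return ranks whose total kernel time deviates from the median by more than `tolerance`."""
    if not total_kernel_ns_by_rank:
        return []
    mid = median(total_kernel_ns_by_rank.values())
    return sorted(
        rank
        for rank, total_ns in total_kernel_ns_by_rank.items()
        if abs(total_ns - mid) > tolerance * mid
    )
```

The returned ranks are only candidates for a closer look at their individual profiles; the totals themselves must still come from the tool output.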


Workflow 4: Analyze Iteration Time Consistency


Goal: Check whether training iterations are stable or have outliers.

```bash
# Profile with NVTX iteration markers
nsys profile --pytorch=autograd-nvtx -t cuda,nvtx \
    -o iter_check -- python train.py

# Check iteration timing distribution
nsys stats -r nvtx_pushpop_sum iter_check.nsys-rep

# Check GPU projection per NVTX range
nsys stats -r nvtx_gpu_proj_sum iter_check.nsys-rep

# Visual pace analysis
nsys recipe nvtx_pace -- iter_check.nsys-rep
```


High StdDev in iteration duration indicates inconsistency — investigate
outlier iterations on the timeline.
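If per-iteration durations have been exported (for example from a recipe CSV), the same consistency check can be made explicit. The sketch below is a minimal illustration using only the standard library; `iteration_report` and the `z_cutoff` of 2.0 are hypothetical choices, and the durations are assumed to share one unit.

```python
from statistics import mean, stdev


def iteration_report(durations, z_cutoff=2.0):
    """Summarize iteration durations and list indices that sit far from the mean."""
    avg = mean(durations)
    sd = stdev(durations) if len(durations) > 1 else 0.0
    # Coefficient of variation: spread relative to the mean
    cv = sd / avg if avg else 0.0
    outliers = [
        i for i, d in enumerate(durations)
        if sd and abs(d - avg) > z_cutoff * sd
    ]
    return {"mean": avg, "stdev": sd, "cv": cv, "outliers": outliers}
```

The outlier indices indicate which iterations to inspect on the timeline; the coefficient of variation gives a unit-free way to compare runs.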


Workflow 5: Attribute Kernels to Source Code via Stack Traces


Goal: Identify which Python function or code path triggers expensive GPU kernels.

Step 1 — Profile with backtrace collection:

```bash
nsys profile -t cuda,nvtx \
    --cudabacktrace=all \
    --python-backtrace=cuda \
    --pytorch=autograd-nvtx \
    -o stacktrace_profile -- python train.py
```

- `--cudabacktrace=all`: Captures CUDA API call stacks (C/C++ frames) so each `cudaLaunchKernel` shows the host-side call chain that triggered it.
- `--python-backtrace=cuda`: Captures Python-level call stacks on traced CUDA API calls, correlating GPU work back to specific Python functions (e.g., `compute_attention` vs `compute_ffn`).

Step 2 — Get kernel summary and NVTX attribution:

Use targeted stats reports to identify top kernels and their NVTX context:

```bash
# Top kernels by total GPU time
nsys stats -r cuda_gpu_kern_sum stacktrace_profile.nsys-rep

# Kernels attributed to NVTX ranges (maps kernels to annotated code regions)
nsys stats -r nvtx_kern_sum stacktrace_profile.nsys-rep
```


The `nvtx_kern_sum` report (requires `--pytorch=autograd-nvtx` or manual NVTX
annotations) maps each kernel to its enclosing NVTX range, directly showing
which Python function or autograd op launched it. This is more efficient than
manually cross-referencing raw backtrace data.

**Step 3 — For PyTorch models**, `--pytorch=autograd-nvtx` automatically wraps
each autograd op in an NVTX range. Combined with backtrace, this maps:
GPU kernel → CUDA API call → Python function → PyTorch autograd op.

**When to use**: Workloads with multiple code paths launching similar kernels
(e.g., attention vs FFN both calling GEMM). Stack traces disambiguate which
caller is responsible for the dominant kernel time.
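The disambiguation described here amounts to grouping kernel time by the NVTX range that launched it. The sketch below assumes the `nvtx_kern_sum` rows have already been reduced to `(nvtx_range, kernel_name, total_ns)` tuples; that tuple shape and the helper name are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def kernel_time_by_range(rows):
    """Group (nvtx_range, kernel_name, total_ns) rows by kernel.

    Returns {kernel_name: [(nvtx_range, total_ns), ...]} with the most
    expensive range first, so the dominant caller of each kernel is visible.
    """
    by_kernel = defaultdict(lambda: defaultdict(int))
    for nvtx_range, kernel, total_ns in rows:
        by_kernel[kernel][nvtx_range] += total_ns
    return {
        kernel: sorted(ranges.items(), key=lambda item: item[1], reverse=True)
        for kernel, ranges in by_kernel.items()
    }
```

For a GEMM kernel shared by attention and FFN code paths, the first entry in its list names the range responsible for most of its time.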


Output Formats


- Report files (`.nsys-rep`): Binary format, viewable in GUI or processed with `nsys stats`, `nsys analyze`, `nsys export`, `nsys recipe`
- Stats output formats: `column` (terminal), `csv`, `json`, `table`, `tsv`, `hdoc`, `htable`
- Export formats: `sqlite` (SQL queries), `arrow`, `parquetdir` (Pandas/Dask), `hdf`, `jsonlines`, `text`
- Recipe output: Directory with CSV/Parquet data + Plotly HTML visualizations + `.nsys-analysis` (Jupyter notebook)

Key stats report columns:

Report	Key columns
`cuda_gpu_kern_sum`	Time%, Total Time, Instances, Kernel Name
`cuda_api_sum`	Time%, Total Time, Num Calls, API Name
`cuda_kern_exec_sum`	API Time, Queue Time, Kernel Time
`cuda_gpu_mem_time_sum`	Time%, Total Time, Operations, Direction
`nvtx_gpu_proj_sum`	Projected Duration, Original Duration, GPU Op Count
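For questions the built-in reports do not answer, the `sqlite` export format can be queried directly. The sketch below is an illustration under stated assumptions: it presumes a database produced by something like `nsys export --type sqlite -o train_profile.sqlite train_profile.nsys-rep`, and that the schema has a `CUPTI_ACTIVITY_KIND_KERNEL` table (with `start`, `end`, `shortName`) plus a `StringIds` lookup table (`id`, `value`). Table and column names can differ between `nsys` versions, so verify them against the actual export first.

```python
import sqlite3

# Assumed schema: kernel rows store the name as an id into StringIds
TOP_KERNELS_SQL = """
SELECT s.value AS kernel_name,
       COUNT(*) AS instances,
       SUM(k."end" - k.start) AS total_ns
FROM CUPTI_ACTIVITY_KIND_KERNEL AS k
JOIN StringIds AS s ON s.id = k.shortName
GROUP BY s.value
ORDER BY total_ns DESC
LIMIT ?
"""


def top_kernels(conn, limit=10):
    """Return (kernel_name, instances, total_ns) rows, most expensive first."""
    return conn.execute(TOP_KERNELS_SQL, (limit,)).fetchall()
```

A caller would open the export with `sqlite3.connect("train_profile.sqlite")` and show both the SQL and the returned rows, in line with the reporting principles.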


Examples


Example 1: Quick DL Profile and Summary

```bash
# Profile
nsys profile -t cuda,nvtx,cudnn,cublas \
    --pytorch=autograd-nvtx --stats=true \
    -o quick_profile -- python train.py

# Auto-generates stats at the end of profiling
```

Example 2: Detect Sync Memcpy in DataLoader


```bash
nsys profile -t cuda,nvtx -o dataloader_check -- python train.py
nsys analyze -r cuda_memcpy_sync,cuda_memcpy_async dataloader_check.nsys-rep
```

If flagged, fix with:

```python
from torch.utils.data import DataLoader

# Pinned host memory is what lets the non_blocking copy run asynchronously
loader = DataLoader(dataset, pin_memory=True, num_workers=4)
tensor_gpu = tensor_cpu.to(device, non_blocking=True)
```


Example 3: Multi-Node NCCL Analysis

```bash
# Collect
nsys profile -t cuda,nvtx,mpi -o rank_%q{RANK} \
    -- torchrun --nproc_per_node=8 train.py

# Analyze overlap
nsys recipe nccl_gpu_overlap_trace -- rank_*.nsys-rep

# Visualize
nsys recipe nccl_gpu_time_util_map -- rank_*.nsys-rep
```

Example 4: API-Level Breakdown (cuDNN vs cuBLAS)

```bash
# Profile with library-level tracing
nsys profile -t cuda,nvtx,cudnn,cublas \
    -o api_breakdown -- python model.py

# cuDNN API summary (convolution calls)
nsys stats -r cudnn_api_sum api_breakdown.nsys-rep

# cuBLAS API summary (GEMM calls)
nsys stats -r cublas_api_sum api_breakdown.nsys-rep

# Compare with kernel-level view
nsys stats -r cuda_gpu_kern_sum api_breakdown.nsys-rep
```


The API-level reports (`cudnn_api_sum`, `cublas_api_sum`) show time spent in
library calls including overhead (workspace allocation, algorithm selection),
while `cuda_gpu_kern_sum` shows only raw GPU kernel execution. The difference
reveals library-side overhead.


Error Handling


Error	Cause	Fix
`nsys: command not found`	Not in PATH	`export PATH=$PATH:/usr/local/cuda/bin`
`Permission denied` or `requires root`	Needs elevated privileges	`sudo nsys ...` or `--cap-add=SYS_ADMIN` in containers
No CUDA activity captured	App didn't use GPU during collection window	Adjust `--delay` / `--duration`, or use `cudaProfilerApi` capture range
Report file very large	Long profile with many APIs traced	Use focused capture (`-c cudaProfilerApi`), reduce `--duration`
`--pytorch` has no effect	Wrong nsys version or Python env	Verify nsys version supports `--pytorch`; check Python is in PATH
`nsys stats` shows empty reports	No matching activity in report	Check `--trace` flags included the right APIs
MPI rank profiles out of sync	Clock skew between nodes	Use NTP sync; analyze per-rank independently
`cudaProfilerStart` not captured	Missing `-c cudaProfilerApi` flag	Add `--capture-range=cudaProfilerApi`
Recipe fails with import error	Missing Python dependencies	Install recipe dependencies: `pip install pandas plotly`
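The first row of the table can be checked up front from a launcher script. The sketch below is a minimal illustration: `find_nsys` is a hypothetical helper, and the fallback directory is simply the one named in the table, which may not match every CUDA installation.

```python
import os
import shutil


def find_nsys(extra_dirs=("/usr/local/cuda/bin",), path=None):
    """Locate the nsys binary on PATH, then in fallback directories.

    Returns the full path, or None when nsys cannot be found.
    """
    base = os.environ.get("PATH", "") if path is None else path
    dirs = [d for d in base.split(os.pathsep) if d] + list(extra_dirs)
    return shutil.which("nsys", path=os.pathsep.join(dirs))
```

A `None` result corresponds to the `nsys: command not found` row; any other result is the binary to pass to `nsys -v` for the version check.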


Finding More Information


Tier 1: This File (SKILL.md)


You are reading it now. The workflows and error table above cover the most common DL profiling tasks. Search this file first.


Tier 2: references/ Directory


Grep for keywords across `references/` (headers are grep-friendly):

- `references/cli-profiling.md`: Complete `nsys profile` flags for DL
- `references/cli-post-collection.md`: `nsys stats`, `analyze`, `export`, `recipe` commands
- `references/app-preparation.md`: Focused profiling, NVTX markers, PyTorch patterns
- `references/stats-reports.md`: CUDA statistical report columns and meanings
- `references/expert-systems.md`: Expert system rules, anti-pattern detection
- `references/recipes-dl.md`: DL-relevant advanced recipes with examples
- `references/nvtx-analysis.md`: NVTX statistical reports for annotated code

How to search:

1. `Grep` for your keyword across `references/`
2. `Read` only the file that Grep points to
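Where a Grep tool is not available, the same two-step search can be approximated in a few lines. The sketch below is an illustrative stand-in, not the skill's own tooling: `grep_references` is a hypothetical helper that does a case-insensitive substring match over the Markdown files in a directory.

```python
from pathlib import Path


def grep_references(keyword, root="references"):
    """Return (file name, line number, line) for every line containing `keyword`."""
    hits = []
    needle = keyword.lower()
    for md in sorted(Path(root).glob("*.md")):
        lines = md.read_text(encoding="utf-8").splitlines()
        for lineno, line in enumerate(lines, start=1):
            if needle in line.lower():
                hits.append((md.name, lineno, line.strip()))
    return hits
```

The file names in the result indicate which single reference to open next, which keeps the amount of text read small.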


Tier 3: Official Documentation


If Tiers 1-2 don't answer:

User Guide — Full CLI reference, all tracing options
Analysis Guide — Stats reports, expert systems, recipes

WebFetch or WebSearch these URLs for the latest content. Consider distilling new findings back into `references/`.
