perf-nsight-systems
Nsight Systems Profiling
NVIDIA Nsight Systems (`nsys`) is a system-level performance analysis tool
that captures CPU/GPU activity timelines, API traces, and OS-level events.
Unlike Nsight Compute (kernel-level), `nsys` shows the big picture: how
kernels, memory transfers, communication, and CPU work overlap in time.

When to Use
Reach for this skill when you encounter:
- Triggers: User wants to profile a training script end-to-end, analyze GPU utilization, find pipeline bottlenecks, check communication/compute overlap, or interpret `.nsys-rep` reports
- Symptoms: Training slower than expected, GPU idle between iterations, need to understand where time is spent across CPU and GPU, poor scaling in distributed training
- Keywords: "nsys", "nsight systems", "GPU timeline", "GPU utilization", "kernel launch overhead", "training profiling", "NCCL overlap", "nsys-rep", "cuda trace", "GPU idle", "pipeline stall", "data loading bottleneck"

Do NOT use this skill for:
- Kernel-level optimization (use Nsight Compute / `ncu` instead)
- GPU hardware metrics like SM throughput, cache hit rates (use `ncu`)
- GPU monitoring without profiling (use `nvidia-smi`)
- Non-CLI usage (GUI workflows, IDE integration): consult official docs
Requirements
| Dependency | Version | Notes |
|---|---|---|
| CUDA Toolkit | >=11.0 | Includes `nsys` |
| | Match CUDA version | Verify with |
| NVIDIA GPU | Any supported | |

Permissions: `nsys` may require `sudo` or `CAP_SYS_ADMIN` for system-wide
tracing and GPU metrics. In containers, use `--privileged` or
`--cap-add=SYS_ADMIN`.

Reporting Principles
Every number must have an authoritative source. When presenting timing data,
kernel counts, API call durations, or any quantitative metric, always cite the
source: `nsys stats` report output, `nsys analyze` rule output, exported SQLite
query result, recipe CSV, or raw command output. Show the actual command and its
output before interpreting. Never synthesize, estimate, or extrapolate numbers
that did not come from a tool output.

Use `nsys stats` for structured analysis, not raw trace data. Always extract
metrics via targeted `nsys stats -r <report>` commands rather than trying to
read or interpret `.nsys-rep` files directly. Stats reports produce compact,
tabular summaries; raw trace data can be enormous (especially with backtraces
or verbose API tracing). Run the smallest set of reports needed for the task,
then request additional reports only if the initial results raise questions.
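One way to follow these principles from a script is sketched below: it builds a targeted `nsys stats` command and prints both the exact command and its raw output before anything is interpreted. The helper names (`build_stats_cmd`, `run_and_show`) are hypothetical, and the `--format csv` and `--output -` options are assumptions about the installed `nsys` version; confirm them with `nsys stats --help`.

```python
import shlex
import subprocess


def build_stats_cmd(report_file, reports):
    """Build an `nsys stats` command for a small, explicit set of reports."""
    if not reports:
        raise ValueError("request at least one report")
    # Assumption: this nsys version accepts --format csv and --output - (stdout).
    return ["nsys", "stats", "-r", ",".join(reports),
            "--format", "csv", "--output", "-", report_file]


def run_and_show(cmd):
    """Print the exact command and its raw output so every number is traceable."""
    print("$ " + shlex.join(cmd))
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    print(result.stdout or result.stderr)
    return result.stdout


# Only the command is built here; call run_and_show(cmd) where nsys is installed.
cmd = build_stats_cmd("train_profile.nsys-rep", ["cuda_gpu_kern_sum"])
```

Keeping the command as a list avoids shell quoting problems, and printing it verbatim gives a reader the source for every number that follows.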
Workflows
Workflow 1: Profile a DL Training Script
Goal: Capture a clean, focused profile of steady-state training iterations.

**Step 1: Add profiler markers to your training script to skip warmup:**

```python
import torch

# In training script: open the capture range after warmup, close it
# after exactly profile_iters profiled iterations.
for i, batch in enumerate(train_loader):
    if i == warmup_iters:
        torch.cuda.cudart().cudaProfilerStart()
    train_step(model, batch)
    if i == warmup_iters + profile_iters - 1:
        torch.cuda.cudart().cudaProfilerStop()
        break
```

**Step 2: Profile with `cudaProfilerApi` capture range:**

```bash
nsys profile -c cudaProfilerApi \
    -t cuda,nvtx,cudnn,cublas \
    --pytorch=autograd-nvtx \
    -o train_profile -- python train.py
```

This captures only steady-state iterations: no warmup, no initialization noise.

Note: `-t cuda,nvtx,cudnn,cublas` enables API-specific tracing. By default,
`-t cuda` only traces the CUDA runtime/driver layer: you see kernel names and
launch times but cannot attribute them to higher-level libraries. Adding
`cudnn` and `cublas` traces the library-level API calls, letting you distinguish
convolution time (cuDNN) from GEMM time (cuBLAS) and measure library overhead
separately from raw kernel execution.

**Step 3: Quick summary:**

```bash
nsys stats -r cuda_gpu_kern_sum,cuda_api_sum,cuda_gpu_mem_time_sum \
    train_profile.nsys-rep
```

When you traced library APIs (`cudnn`, `cublas` in `-t`), also run the
library-specific reports to see API-level overhead (workspace allocation,
algorithm selection) separately from raw kernel execution:

```bash
nsys stats -r cudnn_api_sum,cublas_api_sum train_profile.nsys-rep
```

**Step 4: Detect anti-patterns:**

```bash
nsys analyze -r all train_profile.nsys-rep
```

**Step 5: Dig deeper based on findings.** See Tier 2 references.
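The capture window from Step 1 can be sanity-checked without a GPU. The sketch below is a minimal simulation, assuming the intent is to profile exactly `profile_iters` iterations after `warmup_iters`; `profiled_iterations` is a hypothetical helper, and the comments mark where the real `cudaProfilerStart()` / `cudaProfilerStop()` calls would sit.

```python
def profiled_iterations(total_iters, warmup_iters, profile_iters):
    """Simulate the training loop and return the indices inside the capture range."""
    captured = []
    active = False
    for i in range(total_iters):
        if i == warmup_iters:
            active = True            # cudaProfilerStart() would be called here
        if active:
            captured.append(i)       # train_step(...) runs inside the range
        if active and i == warmup_iters + profile_iters - 1:
            break                    # cudaProfilerStop() would be called here
    return captured


print(profiled_iterations(total_iters=100, warmup_iters=10, profile_iters=5))
# prints [10, 11, 12, 13, 14]
```

Placing the stop check after the training step, on the last iteration of the window, is what keeps the captured count equal to `profile_iters`.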
Workflow 2: Diagnose GPU Idle Time
Goal: Find why the GPU is idle between training iterations.

**Step 1: Profile with OS runtime tracing:**

```bash
nsys profile -t cuda,nvtx,osrt \
    --pytorch=autograd-nvtx \
    -o idle_debug -- python train.py
```

**Step 2: Check GPU gaps and utilization:**

```bash
nsys analyze -r gpu_gaps,gpu_time_util idle_debug.nsys-rep
```

**Step 3: Check kernel launch phases:**

```bash
nsys stats -r cuda_kern_exec_sum idle_debug.nsys-rep
```

High queue time means the GPU was busy (not the issue). Near-zero queue time
for all kernels means the GPU was starved (the host is not submitting work
fast enough).

**Step 4: Common causes and fixes:**

| GPU idle cause | Evidence | Fix |
|---|---|---|
| Slow data loading | CPU busy in DataLoader during gaps | Increase |
| Synchronous memcpy | | Use |
| Over-synchronization | Frequent | Remove unnecessary sync calls |
| Host-side computation | CPU sampling shows compute during gaps | Move to GPU or overlap with async ops |
| Python GIL contention | GIL trace shows contention | Use multiprocessing, reduce Python overhead |
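
The queue-time rule of thumb from Step 3 can be written down as a small classifier. This is only a sketch: `classify_queue_times` is a hypothetical helper, the 1 µs `near_zero_ns` threshold is an assumed cut-off that is not taken from `nsys`, and the input is expected to be the per-kernel average queue times copied from real `cuda_kern_exec_sum` output.

```python
from statistics import median


def classify_queue_times(queue_avg_ns, near_zero_ns=1_000.0):
    """Classify per-kernel average queue times (ns) as starved, gpu-busy, or mixed."""
    if not queue_avg_ns:
        raise ValueError("no kernels to classify")
    if all(q <= near_zero_ns for q in queue_avg_ns):
        return "starved"     # host is not submitting work fast enough
    if median(queue_avg_ns) > near_zero_ns:
        return "gpu-busy"    # most kernels wait behind earlier GPU work
    return "mixed"           # inspect the timeline before concluding
```

A "mixed" result is deliberately non-committal: a few long queue times among many short ones usually need a look at the timeline rather than a summary number.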
Workflow 3: Profile Distributed Training
Goal: Profile multi-GPU/multi-node training with communication analysis.

**Step 1: Collect per-rank profiles:**

```bash
nsys profile -t cuda,nvtx,mpi,ucx \
    --pytorch=autograd-nvtx \
    -o profile_%q{RANK} \
    -- torchrun --nproc_per_node=8 train.py
```

**Step 2: Analyze NCCL communication/compute overlap:**

```bash
nsys recipe nccl_gpu_overlap_trace -- profile_*.nsys-rep
nsys recipe nccl_gpu_time_util_map -- profile_*.nsys-rep
```

**Step 3: Check per-rank utilization:**

```bash
nsys recipe cuda_gpu_time_util_map -- profile_*.nsys-rep
```

**Step 4: Check for stragglers:**

Compare `cuda_gpu_kern_sum` across ranks. If one rank is slower, check
its network and data loading patterns.
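
The cross-rank comparison in Step 4 can be made mechanical once the per-rank totals have been read from real `cuda_gpu_kern_sum` output. The sketch below is illustrative: `rank_deviations` is a hypothetical helper, and the 10% `tolerance` is an assumed threshold. It flags deviation in either direction, since a straggler can show up as a rank with unusually high or unusually low kernel time.

```python
from statistics import median


def rank_deviations(total_ns_by_rank, tolerance=0.10):
    """Return {rank: relative deviation from the median} for ranks beyond tolerance."""
    if not total_ns_by_rank:
        return {}
    baseline = median(total_ns_by_rank.values())
    if baseline == 0:
        raise ValueError("median kernel time is zero; nothing to compare against")
    flagged = {}
    for rank, total in sorted(total_ns_by_rank.items()):
        rel = (total - baseline) / baseline
        if abs(rel) > tolerance:
            flagged[rank] = rel
    return flagged
```

Using the median as the baseline keeps one slow rank from shifting the reference point the way a mean would.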
Workflow 4: Analyze Iteration Time Consistency
Goal: Check whether training iterations are stable or have outliers.

```bash
# Profile with NVTX iteration markers
nsys profile --pytorch=autograd-nvtx -t cuda,nvtx \
    -o iter_check -- python train.py

# Check iteration timing distribution
nsys stats -r nvtx_pushpop_sum iter_check.nsys-rep

# Check GPU projection per NVTX range
nsys stats -r nvtx_gpu_proj_sum iter_check.nsys-rep

# Visual pace analysis
nsys recipe nvtx_pace -- iter_check.nsys-rep
```

High StdDev in iteration duration indicates inconsistency; investigate
outlier iterations on the timeline.
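
To decide which iterations to inspect on the timeline, per-iteration durations taken from real report output can be screened as in the sketch below. `iteration_outliers` is a hypothetical helper; the median-absolute-deviation rule and the `k=3.0` cut-off are assumptions chosen for robustness, not something `nsys` computes.

```python
from statistics import mean, median, pstdev


def iteration_outliers(durations_ms, k=3.0):
    """Return (coefficient_of_variation, outlier_indices) for iteration durations."""
    if len(durations_ms) < 2:
        raise ValueError("need at least two iterations")
    cv = pstdev(durations_ms) / mean(durations_ms)
    med = median(durations_ms)
    mad = median(abs(d - med) for d in durations_ms)
    scale = 1.4826 * mad  # MAD scaled to be comparable to a standard deviation
    if scale == 0:
        # All typical iterations are identical: anything different is an outlier.
        outliers = [i for i, d in enumerate(durations_ms) if d != med]
    else:
        outliers = [i for i, d in enumerate(durations_ms) if abs(d - med) > k * scale]
    return cv, outliers
```

The coefficient of variation summarizes overall inconsistency, while the median-based test stays usable when a single very slow iteration inflates the standard deviation.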
Workflow 5: Attribute Kernels to Source Code via Stack Traces
Goal: Identify which Python function or code path triggers expensive GPU kernels.

**Step 1: Profile with backtrace collection:**

```bash
nsys profile -t cuda,nvtx \
    --backtrace=cuda \
    --python-backtrace=lbr \
    --pytorch=autograd-nvtx \
    -o stacktrace_profile -- python train.py
```

- `--backtrace=cuda`: Captures CUDA API call stacks (C/C++ frames) so each `cudaLaunchKernel` shows the host-side call chain that triggered it.
- `--python-backtrace=lbr`: Captures Python-level call stacks, correlating GPU work back to specific Python functions (e.g., `compute_attention` vs `compute_ffn`).

**Step 2: Get kernel summary and NVTX attribution:**

Use targeted stats reports to identify top kernels and their NVTX context:

```bash
# Top kernels by total GPU time
nsys stats -r cuda_gpu_kern_sum stacktrace_profile.nsys-rep

# Kernels attributed to NVTX ranges (maps kernels to annotated code regions)
nsys stats -r nvtx_kern_sum stacktrace_profile.nsys-rep
```

The `nvtx_kern_sum` report (requires `--pytorch=autograd-nvtx` or manual NVTX
annotations) maps each kernel to its enclosing NVTX range, directly showing
which Python function or autograd op launched it. This is more efficient than
manually cross-referencing raw backtrace data.

**Step 3: For PyTorch models**, `--pytorch=autograd-nvtx` automatically wraps
each autograd op in an NVTX range. Combined with backtrace, this maps:
GPU kernel → CUDA API call → Python function → PyTorch autograd op.

**When to use**: Workloads with multiple code paths launching similar kernels
(e.g., attention vs FFN both calling GEMM). Stack traces disambiguate which
caller is responsible for the dominant kernel time.
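
The disambiguation described above amounts to grouping kernel time by the enclosing range instead of by kernel name alone. The sketch below illustrates that grouping on toy data; `kernel_time_by_range` is a hypothetical helper, and the records are made-up placeholders, not profiler output.

```python
from collections import defaultdict


def kernel_time_by_range(records):
    """Sum GPU time per (nvtx_range, kernel) pair, largest total first."""
    totals = defaultdict(int)
    for nvtx_range, kernel, duration_ns in records:
        totals[(nvtx_range, kernel)] += duration_ns
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical (range, kernel, duration_ns) records for illustration only.
records = [
    ("attention", "gemm", 400),
    ("ffn", "gemm", 900),
    ("attention", "softmax", 100),
    ("ffn", "gemm", 300),
]
for (nvtx_range, kernel), total_ns in kernel_time_by_range(records):
    print(f"{nvtx_range:10s} {kernel:8s} {total_ns}")
```

Grouped by kernel name alone, both GEMM callers would collapse into one row; keeping the range in the key shows which caller dominates.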
Output Formats
Report files (`.nsys-rep`): Binary format, viewable in GUI or processed
with `nsys stats`, `nsys analyze`, `nsys export`, `nsys recipe`.

Stats output formats: `column` (terminal), `csv`, `json`, `table`, `tsv`,
`hdoc`, `htable`.

Export formats: `sqlite` (SQL queries), `arrow`/`parquetdir` (Pandas/Dask),
`hdf`, `jsonlines`, `text`.

Recipe output: Directory with CSV/Parquet data + Plotly HTML visualizations
+ `.nsys-analysis` (Jupyter notebook).

Key stats report columns:

| Report | Key columns |
|---|---|
| `cuda_gpu_kern_sum` | Time%, Total Time, Instances, Kernel Name |
| `cuda_api_sum` | Time%, Total Time, Num Calls, API Name |
| `cuda_kern_exec_sum` | API Time, Queue Time, Kernel Time |
| `cuda_gpu_mem_time_sum` | Time%, Total Time, Operations, Direction |
| `nvtx_gpu_proj_sum` | Projected Duration, Original Duration, GPU Op Count |
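
For the `sqlite` export format, a query along the following lines can produce a per-kernel summary. This is a sketch under assumptions: the table and column names (`CUPTI_ACTIVITY_KIND_KERNEL`, `StringIds`, `shortName`, `start`, `end`) reflect the export schema as commonly documented and may differ between `nsys` versions, and `top_kernels` is a hypothetical helper.

```python
import sqlite3

# Assumed schema: kernel rows reference their name through StringIds.
TOP_KERNELS_SQL = """
SELECT s.value AS kernel,
       COUNT(*) AS instances,
       SUM(k."end" - k."start") AS total_ns
FROM CUPTI_ACTIVITY_KIND_KERNEL AS k
JOIN StringIds AS s ON s.id = k.shortName
GROUP BY s.value
ORDER BY total_ns DESC
LIMIT ?
"""


def top_kernels(conn, limit=10):
    """Return (kernel, instances, total_ns) rows, largest total first."""
    return conn.execute(TOP_KERNELS_SQL, (limit,)).fetchall()
```

Typical use would be `top_kernels(sqlite3.connect("train_profile.sqlite"))` on a file produced by `nsys export`; check the schema in your export first (for example with `SELECT name FROM sqlite_master`) before relying on the names above.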
Examples
Example 1: Quick DL Profile and Summary

```bash
# Profile
nsys profile -t cuda,nvtx,cudnn,cublas \
    --pytorch=autograd-nvtx --stats=true \
    -o quick_profile -- python train.py
# Auto-generates stats at the end of profiling
```

Example 2: Detect Sync Memcpy in DataLoader
```bash
nsys profile -t cuda,nvtx -o dataloader_check -- python train.py
nsys analyze -r cuda_memcpy_sync,cuda_memcpy_async dataloader_check.nsys-rep
```

If flagged, fix with:

```python
from torch.utils.data import DataLoader

# Pinned host memory lets host-to-device copies run asynchronously.
loader = DataLoader(dataset, pin_memory=True, num_workers=4)
tensor_gpu = tensor_cpu.to(device, non_blocking=True)
```

Example 3: Multi-Node NCCL Analysis
```bash
# Collect
nsys profile -t cuda,nvtx,mpi -o rank_%q{RANK} \
    -- torchrun --nproc_per_node=8 train.py

# Analyze overlap
nsys recipe nccl_gpu_overlap_trace -- rank_*.nsys-rep

# Visualize
nsys recipe nccl_gpu_time_util_map -- rank_*.nsys-rep
```

Example 4: API-Level Breakdown (cuDNN vs cuBLAS)
```bash
# Profile with library-level tracing
nsys profile -t cuda,nvtx,cudnn,cublas \
    -o api_breakdown -- python model.py

# cuDNN API summary (convolution calls)
nsys stats -r cudnn_api_sum api_breakdown.nsys-rep

# cuBLAS API summary (GEMM calls)
nsys stats -r cublas_api_sum api_breakdown.nsys-rep

# Compare with kernel-level view
nsys stats -r cuda_gpu_kern_sum api_breakdown.nsys-rep
```

The API-level reports (`cudnn_api_sum`, `cublas_api_sum`) show time spent in
library calls including overhead (workspace allocation, algorithm selection),
while `cuda_gpu_kern_sum` shows only raw GPU kernel execution. The difference
reveals library-side overhead.

Error Handling
| Error | Cause | Fix |
|---|---|---|
| | Not in PATH | |
| | Needs elevated privileges | |
| No CUDA activity captured | App didn't use GPU during collection window | Adjust |
| Report file very large | Long profile with many APIs traced | Use focused capture ( |
| | Wrong nsys version or Python env | Verify nsys version supports |
| | No matching activity in report | Check |
| MPI rank profiles out of sync | Clock skew between nodes | Use NTP sync; analyze per-rank independently |
| | Missing | Add |
| Recipe fails with import error | Missing Python dependencies | Install recipe dependencies: |
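
The "Not in PATH" case in the table can be checked up front before launching a long profiling run. A minimal sketch using only the standard library; `find_nsys` is a hypothetical helper name.

```python
import shutil


def find_nsys(executable="nsys"):
    """Return the absolute path of the executable, or None if it is not on PATH."""
    return shutil.which(executable)


path = find_nsys()
if path is None:
    print("nsys not found on PATH; add the Nsight Systems bin directory to PATH")
else:
    print(f"using {path}")
```

Failing fast here avoids discovering the problem only after the training job has been queued or launched.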
Finding More Information
Tier 1: This File (SKILL.md)
You are reading it now. The workflows and error table above cover the most
common DL profiling tasks. Search this file first.
Tier 2: references/ Directory
Grep for keywords across `references/`; headers are grep-friendly:
- `references/cli-profiling.md`: Complete `nsys profile` flags for DL
- `references/cli-post-collection.md`: `nsys stats`, `analyze`, `export`, `recipe` commands
- `references/app-preparation.md`: Focused profiling, NVTX markers, PyTorch patterns
- `references/stats-reports.md`: CUDA statistical report columns and meanings
- `references/expert-systems.md`: Expert system rules, anti-pattern detection
- `references/recipes-dl.md`: DL-relevant advanced recipes with examples
- `references/nvtx-analysis.md`: NVTX statistical reports for annotated code

How to search:
- `Grep` for your keyword across `references/`
- `Read` only the file that Grep points to
Tier 3: Official Documentation
If Tiers 1-2 don't answer:
- User Guide — Full CLI reference, all tracing options
- Analysis Guide — Stats reports, expert systems, recipes
WebFetch or WebSearch these URLs for the latest content. Consider distilling
new findings back into `references/`.