ascend-profiling-anomaly
Ascend Profiling Anomaly Discovery Skill
Purpose
Analyze Ascend NPU profiling data through three parallel pipelines:
- Structure breakdown: step → structure (layer) → block / side → op → PMU judgement. Answers where the time goes.
- Anomaly discovery: step → device busy union → bubble detection → anomaly tags → soft attribution. Answers what looks unnatural and where hidden issues may lurk.
- Model architecture analysis: FIA timeline → pass boundaries → layer classification → per-layer sub-structure → communication pipeline → architecture summary. Answers what this model is and how each component executes. Produces a separate Markdown report file.
The core philosophy is separation of concerns: "anomaly exists" is a hard fact derived from device intervals; "why it exists" is a soft attribution that may require additional evidence. Even under weak profiling configurations (no stacks, no shapes, sparse host events), the skill must still reliably surface device idle bubbles and risk labels.
Reference Files — When to Read
Read these before starting analysis:
| File | When to read | What it contains |
|---|---|---|
| `references/kernel_data_guide.md` | Always, read first | Raw data column schemas for kernel_details.csv, op_summary, trace_view.json; the step → structure → block/side → op hierarchy; how to parse, filter, assign kernels at each level; multi-stream handling; per-level timing aggregation |
| `references/rulebook.md` | Always | Anomaly thresholds, tagging rules, decision tables, soft attribution rules, AICPU classification, wait-anchor rules |
| `references/architecture_report_template.md` | Always, read before producing the architecture report | Full template for the standalone Markdown architecture report: required sections, formatting rules, analysis techniques for layer classification, communication overlap measurement, per-layer timing breakdowns |
| `references/schema.json` | When producing structured JSON output | Full JSON schema for the anomaly discovery output |
| `scripts/reference_host_gap_branch.py` | When writing analysis scripts | Reference Python implementation for interval merging, bubble metrics, soft attribution, wait-anchor scoring |
Pipeline Overview
The full state machine:

    INGEST → INVENTORY → FACT_EXTRACTION → CANDIDATE_STEP_DETECTION →
    STEP_GROUPING → MACRO_STEP_RESOLUTION → SEGMENTATION →
    CLOCK_ACCOUNTING → ANOMALY_DISCOVERY → PERF_JUDGEMENT →
    RECOMMENDATION → RENDER → DONE
    ↘
    ARCHITECTURE_ANALYSIS → ARCH_REPORT_RENDER → ARCH_REPORT_SAVE

`ANOMALY_DISCOVERY` runs after `CLOCK_ACCOUNTING` and before `PERF_JUDGEMENT`. `ARCHITECTURE_ANALYSIS` branches off the main path and ends in `ARCH_REPORT_SAVE`.
Data Hierarchy: Step → Structure → Block/Side → Op
Understanding how raw kernel data maps to each level is essential. Read `references/kernel_data_guide.md` for the full column schemas and parsing details. Here is the conceptual overview:
Level 0: Raw Kernels
Each row in `kernel_details.csv` represents a single device kernel execution: one invocation of an AI Core task, an AI CPU task, or an HCCL communication task. Key fields: `Name`, `Task Type`, `Start Time(us)`, `Duration(us)`, `Wait Time(us)`, `Accelerator Core`, `Stream ID`, `Input Shapes`, `Output Shapes`.
Level 1: Step
A step is one training/inference iteration. Steps are identified by `ProfilerStep#N` user annotations or `Iteration#N` markers in `trace_view.json`. Each step defines a service window `[S_i, S_{i+1})`. All kernels whose start time falls within this window belong to step `i`.
At step level, compute:
- Total service time, device busy union, underfeed ratio
- Prelaunch gap, tail gap, internal bubbles
- Per-step anomaly tags and soft root-cause labels
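The window rule above can be sketched as a binary search over the step start times. This is a minimal illustration, assuming the step boundaries have already been extracted from `trace_view.json` into a sorted list. `assign_steps`, `step_starts_us`, and `capture_end_us` (standing in for the end of the last window) are hypothetical names, not part of the skill's reference scripts.

```python
from bisect import bisect_right


def assign_steps(step_starts_us, capture_end_us, kernel_starts_us):
    """Map each kernel start time to the index i of its window [S_i, S_{i+1}).

    step_starts_us must be sorted ascending. capture_end_us closes the last
    window (an assumption of this sketch). Kernels outside every window map
    to None.
    """
    assignments = []
    for t in kernel_starts_us:
        i = bisect_right(step_starts_us, t) - 1  # last S_i with S_i <= t
        inside = i >= 0 and t < capture_end_us
        assignments.append(i if inside else None)
    return assignments


# Hypothetical boundaries S_0, S_1, S_2 in microseconds.
step_starts = [100.0, 200.0, 300.0]
kernel_starts = [50.0, 100.0, 199.9, 200.0, 399.0, 400.0]
step_of_kernel = assign_steps(step_starts, 400.0, kernel_starts)
print(step_of_kernel)  # [None, 0, 0, 1, 2, None]
```

Because the window is half-open, a kernel starting exactly at `S_{i+1}` belongs to step `i+1`, not step `i`.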
Level 2: Structure (Layer)
Within a step, kernels form repeating structures — typically corresponding to model layers (e.g., transformer blocks, attention layers, MLP blocks). Segmentation identifies these by:
- Repeating name-pattern sequences in the kernel timeline
- Significant time gaps between kernel groups
- User annotations marking layer boundaries (if present)
Each structure contains a contiguous span of kernels within the step window.
At structure level, compute:
- Structure wall time, device busy union within structure span
- Structure share of total step time
- Per-structure bubble metrics (is the bubble inside this structure or between structures?)
Level 3: Block / Side
Within each structure, kernels split into:
- Block (main compute path): The dominant chain of AI Core kernels forming the forward or backward pass of this layer. These execute on the main compute stream.
- Side (auxiliary ops): Everything else — small element-wise ops, communication (HCCL), AI CPU fallback ops, memory copies, synchronization events. These may execute on separate streams.
At block/side level, maintain four timing perspectives simultaneously:
- `wall_ms`: wall-clock time from first kernel start to last kernel end
- `busy_union_ms`: merged device-busy time (accounts for multi-stream overlap)
- `kernel_sum_ms`: arithmetic sum of all kernel durations (ignores overlap)
- `total_cost_ms`: sum of `duration + wait` for all kernels
Conclusions based on only one metric are incomplete. A block that looks heavy in `kernel_sum` but light in `wall` indicates high stream parallelism. A side that looks heavy in `total_cost` but light in `duration` carries wait-anchor false-hotspot risk.
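To make the difference between the four perspectives concrete, here is a small sketch that computes all of them for one block. The `Kernel` dataclass and `timing_perspectives` function are illustrative names assumed for this example, and the input list must be non-empty.

```python
from dataclasses import dataclass


@dataclass
class Kernel:
    start_us: float
    duration_us: float
    wait_us: float


def timing_perspectives(kernels):
    """Return the four block/side timing views in milliseconds."""
    spans = sorted((k.start_us, k.start_us + k.duration_us) for k in kernels)
    wall_us = max(end for _, end in spans) - spans[0][0]

    # Busy union: sweep the sorted spans and merge overlapping ones.
    busy_us = 0.0
    cur_start, cur_end = spans[0]
    for start, end in spans[1:]:
        if start <= cur_end:
            cur_end = max(cur_end, end)
        else:
            busy_us += cur_end - cur_start
            cur_start, cur_end = start, end
    busy_us += cur_end - cur_start

    kernel_sum_us = sum(k.duration_us for k in kernels)
    total_cost_us = sum(k.duration_us + k.wait_us for k in kernels)
    return {
        "wall_ms": wall_us / 1000.0,
        "busy_union_ms": busy_us / 1000.0,
        "kernel_sum_ms": kernel_sum_us / 1000.0,
        "total_cost_ms": total_cost_us / 1000.0,
    }


# Two overlapping kernels on different streams, then a tiny kernel that
# waited a long time before running (hypothetical numbers).
block = [
    Kernel(start_us=0.0, duration_us=100.0, wait_us=0.0),
    Kernel(start_us=50.0, duration_us=100.0, wait_us=0.0),
    Kernel(start_us=300.0, duration_us=5.0, wait_us=195.0),
]
views = timing_perspectives(block)
print(views)
```

In this example `kernel_sum_ms` is larger than `busy_union_ms` because two kernels overlap, and `total_cost_ms` is dominated by the wait time of the last, tiny kernel.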
Level 4: Op (individual operator)
The finest grain. Each op may produce one or many device kernels. Op-level analysis handles:
- Top ops by total cost vs. top ops by kernel duration (these rankings often differ)
- Wait-anchor detection: ops with `wait_ratio > 0.95` and tiny `duration` but high `total_cost`
- AICPU classification: ops running on AI CPU instead of AI Core, classified by `masked_ratio`
- Small-op initial judgement: whether small individual ops are real inefficiencies or noise
Anomaly Discovery Pipeline (Detail)
Phase 1: BUILD_DEVICE_INTERVALS
For each step, collect all device kernel intervals from `kernel_details.csv`:

    # kernel_rows: parsed rows of kernel_details.csv (name assumed here)
    device_intervals = []
    for row in kernel_rows:
        # keep only kernels whose start time falls inside the step window
        if not (step_start_us <= row.Start_Time_us < step_end_us):
            continue
        # clip the kernel interval to the step window
        s = max(step_start_us, row.Start_Time_us)
        e = min(step_end_us, row.Start_Time_us + row.Duration_us)
        if e > s:
            device_intervals.append(Interval(s, e))

Key rules:
- Clip kernel intervals to the step window boundary
- Apply communication dedup rules BEFORE interval statistics (see the section on comm dedup in `references/kernel_data_guide.md`)
- Include AI_CORE, AI_CPU, and HCCL tasks; all count as "device busy"
- When multiple streams exist, intervals from all streams are collected into the same set
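As a runnable illustration of the collection step, the sketch below reads rows with the `Start Time(us)` and `Duration(us)` columns named earlier and clips them to one step window. It assumes both columns parse directly as floats and it does not apply the communication dedup rules, which live in the reference guide. `build_device_intervals` and the sample CSV content are hypothetical.

```python
import csv
import io
from typing import NamedTuple


class Interval(NamedTuple):
    start_us: float
    end_us: float


def build_device_intervals(rows, step_start_us, step_end_us):
    """Collect clipped busy intervals for kernels starting inside the step."""
    intervals = []
    for row in rows:
        start = float(row["Start Time(us)"])
        if not (step_start_us <= start < step_end_us):
            continue  # kernel belongs to another step
        s = max(step_start_us, start)
        e = min(step_end_us, start + float(row["Duration(us)"]))
        if e > s:
            intervals.append(Interval(s, e))
    return intervals


# Hypothetical sample data; real files carry more columns.
SAMPLE_CSV = """Name,Task Type,Start Time(us),Duration(us),Stream ID
MatMul,AI_CORE,90.0,20.0,1
Add,AI_CORE,120.0,10.0,1
hcom_allReduce,HCCL,190.0,30.0,2
Cast,AI_CPU,250.0,5.0,1
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
intervals = build_device_intervals(rows, 100.0, 200.0)
print(intervals)
```

With a step window of [100, 200), the kernel starting at 90 is left to the previous step, and the HCCL task that runs past the window end is clipped to 200.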
Phase 2: MERGE_INTERVALS
Sort intervals by start time, merge overlapping ones:

    merged = merge(device_intervals)  # see scripts/reference_host_gap_branch.py
    busy_union = sum(e - s for s, e in merged)  # total duration of merged segments

This produces the merged busy segments from which all bubble metrics derive.
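A minimal sketch of the merge step follows. It is a standard sort-and-sweep interval union written for illustration, not a copy of the implementation in `scripts/reference_host_gap_branch.py`. Treating touching intervals (end equal to next start) as one segment is a choice made here so that zero-length gaps are not counted as bubbles.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous segment: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


device_intervals = [(120.0, 130.0), (125.0, 140.0), (190.0, 200.0), (140.0, 150.0)]
merged = merge_intervals(device_intervals)
busy_union_us = sum(end - start for start, end in merged)
print(merged, busy_union_us)  # [(120.0, 150.0), (190.0, 200.0)] 40.0
```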
Phase 3: GAP_CLASSIFICATION
From the merged segments, compute per-step metrics:
- `service_ms` = step window duration
- `device_busy_union_ms` = sum of merged segment durations
- `underfeed_ms` = service − busy_union
- `underfeed_ratio` = underfeed / service
- `prelaunch_gap_ms` = start(first_merged_segment) − step_start
- `tail_gap_ms` = step_end − end(last_merged_segment)
- `internal_bubble_total_ms` = sum of gaps between consecutive merged segments
- `largest_internal_bubble_ms` = max gap between consecutive merged segments
- `bubble_count` = number of inter-segment gaps
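The metric definitions above translate almost line by line into code. The sketch below assumes the merged segments are sorted, non-overlapping `(start, end)` pairs in microseconds. The function name and the handling of a step with no kernels are assumptions of this example.

```python
def step_bubble_metrics(merged, step_start_us, step_end_us):
    """Per-step bubble metrics from sorted, non-overlapping busy segments.

    Inputs are in microseconds; time outputs are in milliseconds.
    """
    service = step_end_us - step_start_us
    busy = sum(end - start for start, end in merged)
    gaps = [nxt[0] - prev[1] for prev, nxt in zip(merged, merged[1:])]
    if merged:
        prelaunch = merged[0][0] - step_start_us
        tail = step_end_us - merged[-1][1]
    else:
        # Assumption of this sketch: an empty step is all prelaunch gap.
        prelaunch, tail = service, 0.0
    underfeed = service - busy
    return {
        "service_ms": service / 1000.0,
        "device_busy_union_ms": busy / 1000.0,
        "underfeed_ms": underfeed / 1000.0,
        "underfeed_ratio": underfeed / service if service > 0 else 0.0,
        "prelaunch_gap_ms": prelaunch / 1000.0,
        "tail_gap_ms": tail / 1000.0,
        "internal_bubble_total_ms": sum(gaps) / 1000.0,
        "largest_internal_bubble_ms": max(gaps, default=0.0) / 1000.0,
        "bubble_count": len(gaps),
    }


metrics = step_bubble_metrics([(120.0, 150.0), (190.0, 200.0)], 100.0, 220.0)
print(metrics)
```

A useful sanity check is that prelaunch gap, tail gap, and internal bubble total add up to `underfeed_ms`.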
Phase 4: HOST_EVIDENCE_COLLECTION
For each bubble window (gap between merged segments, or prelaunch/tail gap), scan the same time range in host events from `trace_view.json`.
Host event categories to collect:
- `cpu_op` / `python_function` / `user_annotation`: general host activity
- `AscendCL@*`: ACL runtime events
- `HostToDevice` / `torch_to_npu` / `aclrtMemcpy*` / `aclrtSynchronize*`: sync/copy markers
- `c10d` / `Hccl` / `hcom` / `StreamWaitEvent` / `Notify_Wait`: communication markers
Compute overlap ratios:
- `host_visible_coverage_ratio` = fraction of bubble covered by any host event
- `sync_marker_overlap_ratio` = fraction covered by sync/copy markers
- `comm_marker_overlap_ratio` = fraction covered by communication markers
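Each of the three ratios is the same computation applied to a different subset of host events: the length of the union of events inside the bubble, divided by the bubble length. The sketch below shows one way to compute it, assuming events have already been sorted into categories by name. `coverage_ratio` and the sample event lists are hypothetical.

```python
def coverage_ratio(bubble, events):
    """Fraction of the bubble window covered by the union of the events.

    bubble is (start, end); events is a list of (start, end) host events.
    Overlapping events are only counted once.
    """
    b_start, b_end = bubble
    if b_end <= b_start:
        return 0.0
    clipped = sorted(
        (max(b_start, s), min(b_end, e))
        for s, e in events
        if e > b_start and s < b_end
    )
    covered = 0.0
    cursor = b_start  # everything before the cursor is already counted
    for s, e in clipped:
        if e > cursor:
            covered += e - max(s, cursor)
            cursor = e
    return covered / (b_end - b_start)


bubble = (150.0, 190.0)
sync_events = [(140.0, 160.0), (155.0, 170.0)]  # hypothetical sync/copy markers
comm_events = [(185.0, 300.0)]                  # hypothetical comm markers

sync_marker_overlap_ratio = coverage_ratio(bubble, sync_events)
comm_marker_overlap_ratio = coverage_ratio(bubble, comm_events)
host_visible_coverage_ratio = coverage_ratio(bubble, sync_events + comm_events)
print(sync_marker_overlap_ratio, comm_marker_overlap_ratio, host_visible_coverage_ratio)
```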
Phase 5: SOFT_ATTRIBUTION
For each significant bubble, assign probability-level labels based on overlap ratios:
| Condition | Label |
|---|---|
| sync_overlap ≥ 0.20 | |
| comm_overlap ≥ 0.20 | |
| host_coverage < 0.05 | |
| host_coverage ≥ 0.10 but no sync/comm dominance | |
| host_parallelism < 1.2 and none of above | |
| nothing applies | |
Multiple labels can co-exist. These are explicitly NOT unique root causes.
Phase 6: ANOMALY_TAGGING
Apply the anomaly tags from the `references/rulebook.md` decision tables. Core tags:
Bubble severity: `DEVICE_IDLE_GAP_HEAVY`, `PRELAUNCH_GAP_HEAVY`, `TAIL_GAP_HEAVY`, `INTERNAL_BUBBLE_HEAVY`
Risk tags: `HOST_ORIGINATED_RISK`, `COMM_SYNC_RISK`, `WAIT_POLLUTION_RISK`, `WAIT_ANCHOR_FALSE_HOTSPOT`, `AICPU_EXPOSED_RISK`, `UNTRACED_HOST_BLOCKING_RISK`, `PARTIAL_CAPTURE_BOUNDARY`, `VARIABLE_SHAPE_SAME_TEMPLATE`
Phase 7: WAIT_ANCHOR_SCAN
At op level, scan for false hotspots:

    wait_ratio = wait_us / (duration_us + wait_us)
    if wait_ratio > 0.95 and duration_us < 10.0 and total_cost_rank <= 10:
        # tags: the op's tag set (name assumed here)
        tags.add("WAIT_ANCHOR_FALSE_HOTSPOT")

These ops absorb idle wait time and appear expensive, but their kernel execution is negligible. Demote them in root-cause ranking.
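The scan can be sketched end to end as follows, using the thresholds stated above (wait ratio above 0.95, duration below 10 us, total-cost rank within the top 10). The op records, their names, and the `find_wait_anchors` function are hypothetical. The ranking here is computed over whatever list is passed in.

```python
def find_wait_anchors(ops, top_n=10, wait_ratio_min=0.95, duration_max_us=10.0):
    """Return names of ops that look expensive only because of wait time."""
    ranked = sorted(
        ops, key=lambda op: op["duration_us"] + op["wait_us"], reverse=True
    )
    flagged = []
    for rank, op in enumerate(ranked, start=1):
        total_cost = op["duration_us"] + op["wait_us"]
        if total_cost <= 0:
            continue  # avoid dividing by zero on empty records
        wait_ratio = op["wait_us"] / total_cost
        if (
            wait_ratio > wait_ratio_min
            and op["duration_us"] < duration_max_us
            and rank <= top_n
        ):
            flagged.append(op["name"])
    return flagged


ops = [
    {"name": "hypothetical_sync_op", "duration_us": 2.0, "wait_us": 5000.0},
    {"name": "MatMul", "duration_us": 800.0, "wait_us": 10.0},
    {"name": "Cast", "duration_us": 3.0, "wait_us": 1.0},
]
wait_anchor_ops = find_wait_anchors(ops)
print(wait_anchor_ops)  # ['hypothetical_sync_op']
```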
Phase 8: GROUP_AGGREGATION
Aggregate step-level metrics by `step_group_id`:
- Compute avg, median, P90, P95 for each bubble metric
- `recurring_bubble_pattern` = true if ≥60% of steps in group have `bubble_count > 0`
- `dominant_idle_pattern` = whichever of prelaunch/internal_bubble/tail contributes most
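A possible shape for the aggregation is sketched below for a single step group. The nearest-rank percentile is one of several common definitions and is an assumption here, as are the `aggregate_group` name, the choice to compare summed gap time when picking the dominant pattern, and the sample step values.

```python
import math
from statistics import mean, median

GAP_KEYS = {
    "prelaunch": "prelaunch_gap_ms",
    "internal_bubble": "internal_bubble_total_ms",
    "tail": "tail_gap_ms",
}


def percentile(values, p):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def aggregate_group(steps):
    """Aggregate per-step bubble metrics for one step group."""
    summary = {}
    for key in GAP_KEYS.values():
        vals = [s[key] for s in steps]
        summary[key] = {
            "avg": mean(vals),
            "median": median(vals),
            "p90": percentile(vals, 90),
            "p95": percentile(vals, 95),
        }
    with_bubbles = sum(1 for s in steps if s["bubble_count"] > 0)
    summary["recurring_bubble_pattern"] = with_bubbles / len(steps) >= 0.60
    totals = {name: sum(s[key] for s in steps) for name, key in GAP_KEYS.items()}
    summary["dominant_idle_pattern"] = max(totals, key=totals.get)
    return summary


# Hypothetical per-step metrics for five steps in one group.
internal = [4.0, 0.0, 6.0, 5.0, 5.0]
counts = [2, 0, 3, 1, 1]
steps = [
    {"prelaunch_gap_ms": 1.0, "internal_bubble_total_ms": ib,
     "tail_gap_ms": 0.5, "bubble_count": c}
    for ib, c in zip(internal, counts)
]
group = aggregate_group(steps)
print(group["recurring_bubble_pattern"], group["dominant_idle_pattern"])
```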
Phase 9: PRODUCE_OUTPUT
Merge anomaly results with structure breakdown. The report MUST include:
Hidden Issue Discovery section
- Dominant step/group bubble metrics (service, busy_union, underfeed_ratio)
- Raw kernel evidence table: for each top bubble window, list the kernel(s) immediately before and after the gap — their names, task types, durations, streams — so the human expert can locate the exact spot in the timeline
- Top 5 bubble windows with timestamps, scope, and host evidence
- Bubble periodicity statistics across the step group
- Host evidence coverage assessment
- Soft root-cause labels with evidence chains
- Follow-up sampling recommendations
Bubble-first Summary (fixed 5-question template)
- Are there significant device idle bubbles?
- Which step type/group do they concentrate in?
- Are they primarily prelaunch / tail / internal / inter-step?
- Is there significant host-originated risk?
- Is evidence sufficient for root cause? If not, say so explicitly.
Structure-Level Bubble Drill-Down
For the dominant step, break down bubble contributions per structure:
- Which structure(s) contain the largest internal bubbles?
- Are bubbles concentrated at structure boundaries (between layers) or within structures?
- For structures with large bubbles, what are the surrounding kernel names and task types?
Model Architecture Analysis Pipeline (Separate Markdown Report)
This pipeline produces a standalone Markdown report file that documents the model architecture as reverse-engineered from profiling data. Read `references/architecture_report_template.md` for the full template, formatting rules, and analysis techniques.
The report is saved as `model_architecture_report_<profiling_dir_name>.md` in the profiling or output directory.
When to produce the architecture report
Always. Every profiling analysis MUST produce this report alongside the anomaly discovery output. The architecture report provides essential context that makes the anomaly findings interpretable.
Architecture Analysis Phases
Phase A1: FIA_TIMELINE_ANALYSIS
Use FusedInferAttentionScore (FIA) invocations as the primary structural marker:
- Extract all FIA kernels from `kernel_details.csv` (match by name containing `FusedInferAttentionScore`)
- Sort by `Start Time(us)`
- Classify each FIA as prefill (duration > 10ms) or decode (duration < 1ms)
- Determine pass count: `num_passes = total_FIA / FIA_per_pass`
- Identify phase transitions by timestamp gaps between prefill and decode FIA clusters
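The first four steps can be sketched as below. The thresholds come from the list above. The text does not say how to label an FIA between 1 ms and 10 ms, so this sketch marks it `unclassified`, which is an assumption. `summarize_fia`, the kernel dictionaries, and the known `fia_per_pass` value are hypothetical.

```python
def classify_fia(duration_us):
    """Label one FIA kernel by duration (thresholds in microseconds)."""
    if duration_us > 10_000:
        return "prefill"
    if duration_us < 1_000:
        return "decode"
    return "unclassified"  # assumption: the 1-10 ms band is left undecided


def summarize_fia(kernels, fia_per_pass):
    """Extract, sort, and classify FIA kernels, then derive the pass count."""
    fia = sorted(
        (k for k in kernels if "FusedInferAttentionScore" in k["name"]),
        key=lambda k: k["start_us"],
    )
    labels = [classify_fia(k["duration_us"]) for k in fia]
    num_passes, remainder = divmod(len(fia), fia_per_pass)
    return labels, num_passes, remainder


# Hypothetical kernel rows, deliberately out of time order.
kernels = [
    {"name": "FusedInferAttentionScore", "start_us": 90_000.0, "duration_us": 200.0},
    {"name": "MatMul", "start_us": 100.0, "duration_us": 50.0},
    {"name": "FusedInferAttentionScore", "start_us": 0.0, "duration_us": 28_000.0},
    {"name": "FusedInferAttentionScore", "start_us": 95_000.0, "duration_us": 5_000.0},
    {"name": "FusedInferAttentionScore", "start_us": 40_000.0, "duration_us": 27_500.0},
]
labels, num_passes, remainder = summarize_fia(kernels, fia_per_pass=2)
print(labels, num_passes, remainder)
```

A non-zero `remainder` suggests that the capture was truncated or that `fia_per_pass` is wrong, which is worth noting before the pass count is trusted.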
Phase A2: PASS_BOUNDARY_DETECTION
For each forward pass, determine:
- FIA index range (e.g., #0–#94)
- Time span (absolute timestamps)
- Wall time from first kernel to last kernel
- Average prefill FIA duration
- Total kernel count
Cross-pass variation (FIA duration, wall time) should be noted — it may reveal KV cache growth or memory pressure effects.
Phase A3: LAYER_CLASSIFICATION
For each layer (delimited by consecutive FIA invocations), extract the kernel sequence and classify:
| Classifier kernel | Layer type |
|---|---|
| No MoE markers (no MoeGatingTopK, no DFC, no GroupedMatmul) | Dense |
| DispatchFFNCombine present | MoE+DFC |
| GroupedMatmul + alltoallv present (no DFC) | MoE+GMM |
| Sampling ops (ArgMax, rejection_*) present | includes decode/sampling logic |
Build a summary table: Layer Type | Layer Range | Count | Characteristics
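The classification table can be expressed as a small rule function over the kernel names of one layer. Case-insensitive substring matching is an assumption of this sketch, since the actual kernel names in a capture may carry prefixes or suffixes. `classify_layer`, the `unclassified` fallback, and the sample layers are hypothetical.

```python
def classify_layer(kernel_names):
    """Return (layer_type, has_sampling) for one layer's kernel names."""

    def has(marker):
        return any(marker.lower() in name.lower() for name in kernel_names)

    if has("DispatchFFNCombine"):
        layer_type = "MoE+DFC"
    elif has("GroupedMatmul") and has("alltoallv"):
        layer_type = "MoE+GMM"
    elif not (has("MoeGatingTopK") or has("GroupedMatmul")):
        layer_type = "Dense"
    else:
        layer_type = "unclassified"  # MoE markers present, but no rule matched
    has_sampling = has("ArgMax") or has("rejection_")
    return layer_type, has_sampling


# Hypothetical kernel name sequences for three layers.
dense_layer = ["FusedInferAttentionScore", "MatMulV2", "RmsNorm"]
dfc_layer = ["FusedInferAttentionScore", "MoeGatingTopK", "DispatchFFNCombine"]
gmm_layer = ["FusedInferAttentionScore", "MoeGatingTopK", "hcom_alltoallv",
             "GroupedMatmul", "ArgMax"]

layer_types = [classify_layer(layer) for layer in (dense_layer, dfc_layer, gmm_layer)]
print(layer_types)
```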
Phase A4: CROSS_VERIFICATION
Count key ops per pass and verify they match the layer classification:
- FIA count should equal layer count
- DispatchFFNCombine count should match MoE+DFC layer count
- GroupedMatmul count should match MoE+GMM layers + decode layers
- MoeGatingTopK count should match all MoE layers (DFC + GMM + decode)
Discrepancies indicate classification errors — resolve before proceeding.
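The four checks above amount to comparing observed op counts against counts implied by the layer classification. The sketch below assumes one invocation of each key op per layer, as the checks state, and treats decode layers as their own entry in the layer counts. The `cross_verify` function and all numbers are hypothetical.

```python
def cross_verify(op_counts, layer_counts):
    """Return {op_name: (observed, expected)} for every mismatching op."""
    dfc = layer_counts.get("MoE+DFC", 0)
    gmm = layer_counts.get("MoE+GMM", 0)
    decode = layer_counts.get("decode", 0)
    expected = {
        "FusedInferAttentionScore": sum(layer_counts.values()),
        "DispatchFFNCombine": dfc,
        "GroupedMatmul": gmm + decode,
        "MoeGatingTopK": dfc + gmm + decode,
    }
    return {
        op: (op_counts.get(op, 0), want)
        for op, want in expected.items()
        if op_counts.get(op, 0) != want
    }


# Hypothetical per-pass counts with one deliberate mismatch.
layer_counts = {"Dense": 3, "MoE+DFC": 40, "MoE+GMM": 2}
op_counts = {
    "FusedInferAttentionScore": 45,
    "DispatchFFNCombine": 40,
    "GroupedMatmul": 2,
    "MoeGatingTopK": 41,
}
discrepancies = cross_verify(op_counts, layer_counts)
print(discrepancies)  # {'MoeGatingTopK': (41, 42)}
```

An empty result means the classification is consistent with the op counts. Any entry points at the op whose layers should be re-examined.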
Phase A5: PER_LAYER_SUBSTRUCTURE
For EACH distinct layer type, analyze the kernel execution sequence:
- Group kernels by functional role (attention, projection, TP comm, MoE routing, expert dispatch, expert FFN, post-MoE norm, next-layer prep, EP comm, sampling)
- Measure wall time per functional group
- Identify which stream each group runs on
- Compute timing breakdown: Component | Wall time | Share of layer
- Note anomalies specific to the layer type (warm-up overhead in layer 0, extra kernels in transition layer, AICPU ops that should be on AI_CORE)
Present as kernel sequence trees with timing annotations and stream labels. See the template for the exact tree notation format.
Phase A6: DECODE_ANALYSIS
Decode layers require separate analysis because they have fundamentally different cost profiles:
- FIA duration drops dramatically (prefill 28ms → decode 0.2ms)
- Communication (especially all-to-all for EP) often dominates
- Expert compute shifts from fused (DFC) to explicit (GroupedMatmul)
Produce a dominant costs table and explain why the cost profile differs from prefill.
Phase A7: COMMUNICATION_PIPELINE
Document the multi-stream overlap strategy:
- Map each stream to its purpose (main compute, AI_CPU comm, HCCL, alltoall)
- Measure overlap ratios between streams
- Draw an ASCII pipeline diagram showing how communication hides behind compute
- Compute what fraction of total communication is hidden (overlapped)
- Explain the kernel_sum >> wall relationship
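One way to quantify how much communication is hidden is to intersect the merged communication intervals with the merged compute intervals. The sketch below assumes kernels have already been split into compute and communication sets by stream or task type. The function names and sample intervals are hypothetical.

```python
def merge(intervals):
    """Merge overlapping or touching (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def overlap_length(a, b):
    """Total overlap between two sorted, non-overlapping interval lists."""
    i = j = 0
    total = 0.0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if end > start:
            total += end - start
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


def hidden_comm_fraction(compute, comm):
    """Fraction of communication time that overlaps with compute."""
    compute_m, comm_m = merge(compute), merge(comm)
    comm_total = sum(end - start for start, end in comm_m)
    if comm_total == 0:
        return 0.0
    return overlap_length(compute_m, comm_m) / comm_total


compute = [(0.0, 100.0), (150.0, 200.0)]  # hypothetical main-stream kernels
comm = [(50.0, 120.0), (180.0, 220.0)]    # hypothetical HCCL kernels
hidden = hidden_comm_fraction(compute, comm)
print(round(hidden, 3))
```

In this example the kernel durations sum to 260 us while the wall span is only 220 us, which is the `kernel_sum >> wall` effect in miniature: overlapped streams make the summed durations exceed elapsed time.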
Phase A8: ARCH_REPORT_RENDER
Assemble all findings into the Markdown report following the template in `references/architecture_report_template.md`. All 10 required sections must be present:
- Configuration Context
- Model Architecture Determination (evidence chain table)
- Forward Pass Boundaries (per-pass table)
- Layer Classification (type table)
- Cross-Verification Table
- Per-Layer Sub-Structure (kernel sequence trees + timing breakdowns for EACH layer type)
- Decode Phase Analysis (dominant costs + prefill vs decode comparison)
- Communication Pipeline Structure (stream table + ASCII pipeline diagram)
- Layer-to-Layer Variation (comparative table)
- Model Architecture Summary (ASCII model diagram + execution timeline)
Phase A9: ARCH_REPORT_SAVE
Save the report as `model_architecture_report_<profiling_dir_name>.md`. Inform the user of the file location.
Critical Constraints
- Never skip anomaly discovery because root cause is unclear. Bubble facts are hard conclusions; root cause labels are soft conclusions. Both must always be reported.
- Never call a high-wait tiny-duration op a real hotspot without checking wait-anchor risk.
- Never ignore local bubbles just because the step is device-bound overall.
- Never use only one timing metric. Always maintain four-way clock accounting (wall_ms + busy_union_ms + kernel_sum_ms + total_cost_ms) at block/side level.
- Always output `insufficient_evidence` or `possible_untraced_host_blocking` when host evidence is sparse; never silently omit the anomaly section.
- Always include raw kernel context around reported bubbles: the kernel names, task types, durations, and stream IDs immediately before and after each bubble gap.
- Use layered certainty language: declarative for facts, `possible / probable / insufficient evidence` for root causes.
- Always produce the model architecture Markdown report as a separate file; never fold it into the anomaly output alone.
- Never skip per-layer sub-structure analysis. Every distinct layer type must have its own kernel sequence tree with timing breakdown and stream annotations.
- Always include the evidence chain table in the architecture report — the reader must see how layer count and pass structure were determined from raw FIA data.
- Always cross-verify op counts against layer classification before finalizing the architecture report. Discrepancies must be resolved or explicitly noted.
Graceful Degradation
| Missing data | Impact | Action |
|---|---|---|
| | Cannot detect shape variation | Bubble detection continues; tag |
| | Soft attribution specificity degrades | Lower confidence; bubble detection unaffected |
| Sparse host events | Cannot narrow root-cause family | |
| Capture boundary truncation | Edge gaps may be artifacts | |
| No | Cannot assess comm wait | Skip comm overlap, note in evidence gaps. Architecture report omits comm pipeline bandwidth stats but still documents stream roles |
| No step markers | Cannot define step windows | Fall back to global capture span as single pseudo-step |
| | Coarser granularity | Use op-level intervals instead; note in limitations. Architecture report uses op counts for layer classification but cannot produce per-layer kernel sequence trees |
| No FIA kernels detected | Cannot determine layer boundaries via FIA | Architecture report falls back to alternative structural markers (e.g., repeating kernel patterns, communication boundaries). Note reduced confidence in layer count |
| Single forward pass captured | Cannot cross-validate pass consistency | Architecture report documents single pass; notes that cross-pass variation analysis is unavailable |
| No decode FIA detected | Inference-only or prefill-only capture | Architecture report omits decode phase analysis section; notes capture scope limitation |
Output Contract Summary
Every analysis must produce TWO outputs:
1. Anomaly Discovery Output
The top-level object `anomaly_discovery` contains: `enabled`, `dominant_group_id`, `global_device_gap_analysis`, `step_group_anomalies`, `bubble_windows`, `wait_anchor_ops`, `soft_root_cause_summary`, `requires_host_followup`, `confidence`.
Each step result must include bubble metrics, anomaly tags, and soft root-cause labels.
For the full JSON schema, read `references/schema.json`.
2. Model Architecture Report (Markdown file)
A standalone Markdown file saved as `model_architecture_report_<profiling_dir_name>.md` containing all 10 required sections from the architecture template. Read `references/architecture_report_template.md` for the full specification.
The report must include at minimum:
- Evidence chain proving layer count and pass structure
- Layer classification table with all distinct layer types
- Per-layer kernel sequence trees with timing breakdowns for EACH layer type
- Communication pipeline structure with stream overlap diagram
- Model architecture ASCII summary diagram and per-pass execution timeline
The architecture report is the primary deliverable for understanding model structure. It must be self-contained — a reader should be able to understand the full model execution without referring to the anomaly output.
Recommendations
Each recommendation must include `scope` (global/step_group/structure/side/op), `followup_required`, `evidence_gap`, and `priority` (P0–P3).
Common follow-up patterns:
- High bubble + missing stacks → re-profile with `with_stack=true`
- Missing shapes + unstable grouping → `record_shapes=true`
- High host risk + low evidence → host-side sampling / thread view
- High wait pollution → check communication and sync paths
- Large inter-structure bubble → check host-side layer dispatch latency