ascend-profiling-anomaly


Ascend Profiling Anomaly Discovery Skill


Purpose


Analyze Ascend NPU profiling data through three parallel pipelines:
  1. Structure breakdown: step → structure (layer) → block / side → op → PMU judgement — answers where the time goes.
  2. Anomaly discovery: step → device busy union → bubble detection → anomaly tags → soft attribution — answers what looks unnatural and where hidden issues may lurk.
  3. Model architecture analysis: FIA timeline → pass boundaries → layer classification → per-layer sub-structure → communication pipeline → architecture summary — answers what is this model and how does each component execute. Produces a separate Markdown report file.
The core philosophy is separation of concerns: "anomaly exists" is a hard fact derived from device intervals; "why it exists" is a soft attribution that may require additional evidence. Even under weak profiling configurations (no stacks, no shapes, sparse host events), the skill must still reliably surface device idle bubbles and risk labels.

Reference Files — When to Read


Read these before starting analysis:
| File | When to read | What it contains |
|---|---|---|
| `references/kernel_data_guide.md` | Always — read first | Raw data column schemas for `kernel_details.csv`, `op_summary`, `trace_view.json`; the step → structure → block/side → op hierarchy; how to parse, filter, and assign kernels at each level; multi-stream handling; per-level timing aggregation |
| `references/rulebook.md` | Always | Anomaly thresholds, tagging rules, decision tables, soft attribution rules, AICPU classification, wait-anchor rules |
| `references/architecture_report_template.md` | Always — read before producing the architecture report | Full template for the standalone Markdown architecture report: required sections, formatting rules, analysis techniques for layer classification, communication overlap measurement, per-layer timing breakdowns |
| `references/schema.json` | When producing structured JSON output | Full JSON schema for the `anomaly_discovery` output object |
| `scripts/reference_host_gap_branch.py` | When writing analysis scripts | Reference Python implementation for interval merging, bubble metrics, soft attribution, wait-anchor scoring |

Pipeline Overview


The full state machine:
INGEST → INVENTORY → FACT_EXTRACTION → CANDIDATE_STEP_DETECTION →
STEP_GROUPING → MACRO_STEP_RESOLUTION → SEGMENTATION →
CLOCK_ACCOUNTING → ANOMALY_DISCOVERY → PERF_JUDGEMENT →
RECOMMENDATION → RENDER → DONE
                          ARCHITECTURE_ANALYSIS → ARCH_REPORT_RENDER → ARCH_REPORT_SAVE
`ANOMALY_DISCOVERY` sits after `CLOCK_ACCOUNTING` and before `PERF_JUDGEMENT`. It receives already-segmented steps and structures, and runs the bubble detection pipeline on top of them.
`ARCHITECTURE_ANALYSIS` runs in parallel with `PERF_JUDGEMENT`, using the same segmented data from `CLOCK_ACCOUNTING` plus FIA timeline analysis. It produces a separate Markdown report file saved alongside the profiling data.


Data Hierarchy: Step → Structure → Block/Side → Op


Understanding how raw kernel data maps to each level is essential. Read `references/kernel_data_guide.md` for the full column schemas and parsing details. Here is the conceptual overview:

Level 0: Raw Kernels


Each row in `kernel_details.csv` represents a single device kernel execution — one invocation of an AI Core task, an AI CPU task, or an HCCL communication task. Key fields: `Name`, `Task Type`, `Start Time(us)`, `Duration(us)`, `Wait Time(us)`, `Accelerator Core`, `Stream ID`, `Input Shapes`, `Output Shapes`.

Level 1: Step


A step is one training/inference iteration. Steps are identified by `ProfilerStep#N` user annotations or `Iteration#N` markers in `trace_view.json`. Each step defines a service window `[S_i, S_{i+1})`. All kernels whose start time falls within this window belong to step `i`.
At step level, compute:
  • Total service time, device busy union, underfeed ratio
  • Prelaunch gap, tail gap, internal bubbles
  • Per-step anomaly tags and soft root-cause labels
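The window rule above amounts to a binary search over step boundaries. A minimal sketch, assuming the boundaries have already been extracted from `trace_view.json` as a sorted timestamp list (the function name and signature are illustrative, not part of the skill):

```python
import bisect

def assign_step(start_us, boundaries):
    """boundaries: sorted step edges [S_0, S_1, ..., S_n].
    A kernel starting at start_us belongs to step i iff S_i <= start_us < S_{i+1};
    kernels outside the captured windows map to None."""
    i = bisect.bisect_right(boundaries, start_us) - 1
    return i if 0 <= i < len(boundaries) - 1 else None
```

Because the windows are half-open, a kernel starting exactly on a boundary belongs to the following step.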

Level 2: Structure (Layer)


Within a step, kernels form repeating structures — typically corresponding to model layers (e.g., transformer blocks, attention layers, MLP blocks). Segmentation identifies these by:
  1. Repeating name-pattern sequences in the kernel timeline
  2. Significant time gaps between kernel groups
  3. User annotations marking layer boundaries (if present)
Each structure contains a contiguous span of kernels within the step window.
At structure level, compute:
  • Structure wall time, device busy union within structure span
  • Structure share of total step time
  • Per-structure bubble metrics (is the bubble inside this structure or between structures?)

Level 3: Block / Side


Within each structure, kernels split into:
  • Block (main compute path): The dominant chain of AI Core kernels forming the forward or backward pass of this layer. These execute on the main compute stream.
  • Side (auxiliary ops): Everything else — small element-wise ops, communication (HCCL), AI CPU fallback ops, memory copies, synchronization events. These may execute on separate streams.
At block/side level, maintain four timing perspectives simultaneously:
  • `wall_ms` — wall-clock time from first kernel start to last kernel end
  • `busy_union_ms` — merged device-busy time (accounts for multi-stream overlap)
  • `kernel_sum_ms` — arithmetic sum of all kernel durations (ignores overlap)
  • `total_cost_ms` — sum of `duration + wait` for all kernels
Conclusions based on only one metric are incomplete. A block appearing heavy in `kernel_sum` but light in `wall` means high stream parallelism. A side appearing heavy in `total_cost` but light in `duration` means wait-anchor false-hotspot risk.
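The four perspectives can all be derived from the same kernel list. A minimal sketch, assuming kernels are `(start_us, duration_us, wait_us)` tuples (field names are illustrative; the real columns live in `kernel_details.csv`):

```python
def merge_intervals(intervals):
    """Merge overlapping [start, end) intervals; input need not be sorted."""
    merged = []
    for s, e in sorted(intervals):
        if merged and s <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], e)
        else:
            merged.append([s, e])
    return merged

def timing_perspectives(kernels):
    """kernels: list of (start_us, duration_us, wait_us) tuples."""
    ivs = [(s, s + d) for s, d, _ in kernels]
    wall = max(e for _, e in ivs) - min(s for s, _ in ivs)
    busy_union = sum(e - s for s, e in merge_intervals(ivs))
    kernel_sum = sum(d for _, d, _ in kernels)
    total_cost = sum(d + w for _, d, w in kernels)
    return {"wall_ms": wall / 1e3, "busy_union_ms": busy_union / 1e3,
            "kernel_sum_ms": kernel_sum / 1e3, "total_cost_ms": total_cost / 1e3}
```

When streams overlap, `busy_union_ms` falls below `kernel_sum_ms`; when waits dominate, `total_cost_ms` rises above both.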

Level 4: Op (individual operator)


The finest grain. Each op may produce one or many device kernels. Op-level analysis handles:
  • Top ops by total cost vs. top ops by kernel duration (these rankings often differ)
  • Wait-anchor detection: ops with `wait_ratio > 0.95` and tiny `duration` but high `total_cost`
  • AICPU classification: ops running on AI CPU instead of AI Core, classified by `masked_ratio`
  • Small-op initial judgement: whether small individual ops are real inefficiencies or noise


Anomaly Discovery Pipeline (Detail)


Phase 1: BUILD_DEVICE_INTERVALS


For each step, collect all device kernel intervals from `kernel_details.csv`:
device_intervals = []
for each kernel row where Start_Time_us is within step window:
    s = max(step_start_us, row.Start_Time_us)
    e = min(step_end_us, row.Start_Time_us + row.Duration_us)
    if e > s:
        device_intervals.append(Interval(s, e))
Key rules:
  • Clip kernel intervals to the step window boundary
  • Apply communication dedup rules BEFORE interval statistics (see the comm dedup section of `references/kernel_data_guide.md`)
  • Include AI_CORE, AI_CPU, and HCCL tasks — all count as "device busy"
  • When multiple streams exist, intervals from all streams are collected into the same set

Phase 2: MERGE_INTERVALS


Sort intervals by start time, merge overlapping ones:
merged = merge(device_intervals)  # see scripts/reference_host_gap_branch.py
busy_union = sum of merged segment durations
This produces the merged busy segments from which all bubble metrics derive.

Phase 3: GAP_CLASSIFICATION


From the merged segments, compute per-step metrics:
  • `service_ms` = step window duration
  • `device_busy_union_ms` = sum of merged segment durations
  • `underfeed_ms` = service − busy_union
  • `underfeed_ratio` = underfeed / service
  • `prelaunch_gap_ms` = start(first_merged_segment) − step_start
  • `tail_gap_ms` = step_end − end(last_merged_segment)
  • `internal_bubble_total_ms` = sum of gaps between consecutive merged segments
  • `largest_internal_bubble_ms` = max gap between consecutive merged segments
  • `bubble_count` = number of inter-segment gaps
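These formulas transcribe directly into code. A sketch, assuming the merged segments from Phase 2 are sorted, non-overlapping `(start, end)` pairs already expressed in ms (the authoritative implementation is `scripts/reference_host_gap_branch.py`):

```python
def gap_classification(step_start, step_end, merged):
    """merged: sorted, non-overlapping busy segments inside [step_start, step_end)."""
    service = step_end - step_start
    busy = sum(e - s for s, e in merged)
    # internal bubbles are the idle gaps between consecutive busy segments
    gaps = [merged[i + 1][0] - merged[i][1] for i in range(len(merged) - 1)]
    return {
        "service_ms": service,
        "device_busy_union_ms": busy,
        "underfeed_ms": service - busy,
        "underfeed_ratio": (service - busy) / service,
        "prelaunch_gap_ms": merged[0][0] - step_start,
        "tail_gap_ms": step_end - merged[-1][1],
        "internal_bubble_total_ms": sum(gaps),
        "largest_internal_bubble_ms": max(gaps, default=0.0),
        "bubble_count": len(gaps),
    }
```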

Phase 4: HOST_EVIDENCE_COLLECTION


For each bubble window (a gap between merged segments, or the prelaunch/tail gap), scan the same time range in host events from `trace_view.json`:
Host event categories to collect:
  • `cpu_op` / `python_function` / `user_annotation` — general host activity
  • `AscendCL@*` — ACL runtime events
  • `HostToDevice` / `torch_to_npu` / `aclrtMemcpy*` / `aclrtSynchronize*` — sync/copy markers
  • `c10d` / `Hccl` / `hcom` / `StreamWaitEvent` / `Notify_Wait` — communication markers
Compute overlap ratios:
  • `host_visible_coverage_ratio` = fraction of the bubble covered by any host event
  • `sync_marker_overlap_ratio` = fraction covered by sync/copy markers
  • `comm_marker_overlap_ratio` = fraction covered by communication markers
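Each ratio is the same interval-union intersection applied to a different event subset. A sketch of the shared helper, assuming the bubble and host events are `(start, end)` pairs on the same clock (names are illustrative):

```python
def covered_fraction(bubble, events):
    """Fraction of the bubble [s, e) covered by the union of event intervals."""
    s, e = bubble
    # clip events to the bubble window, dropping those fully outside it
    clipped = sorted((max(s, a), min(e, b)) for a, b in events if min(e, b) > max(s, a))
    merged = []
    for a, b in clipped:  # merge so overlapping events are not double-counted
        if merged and a <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], b)
        else:
            merged.append([a, b])
    return sum(b - a for a, b in merged) / (e - s)
```

Calling it with the full event set, the sync/copy subset, and the communication subset yields the three ratios respectively.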

Phase 5: SOFT_ATTRIBUTION


For each significant bubble, assign probability-level labels based on overlap ratios:
| Condition | Label |
|---|---|
| `sync_overlap ≥ 0.20` | `possible_sync_or_h2d` |
| `comm_overlap ≥ 0.20` | `possible_comm_wait` |
| `host_coverage < 0.05` | `possible_untraced_host_blocking` |
| `host_coverage ≥ 0.10` but no sync/comm dominance | `possible_host_launch_lag` |
| `host_parallelism < 1.2` and none of the above | `possible_python_serialization_or_lock` |
| nothing applies | `insufficient_evidence` |
Multiple labels can co-exist. These are explicitly NOT unique root causes.
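One possible transcription of this decision table into code, under the reading that "no sync/comm dominance" means neither of the first two thresholds fired (an assumption; `references/rulebook.md` remains authoritative):

```python
def soft_attribution(sync_overlap, comm_overlap, host_coverage, host_parallelism):
    """Return the list of probability-level labels for one bubble."""
    labels = []
    if sync_overlap >= 0.20:
        labels.append("possible_sync_or_h2d")
    if comm_overlap >= 0.20:
        labels.append("possible_comm_wait")
    if host_coverage < 0.05:
        labels.append("possible_untraced_host_blocking")
    if host_coverage >= 0.10 and sync_overlap < 0.20 and comm_overlap < 0.20:
        labels.append("possible_host_launch_lag")
    if host_parallelism < 1.2 and not labels:
        labels.append("possible_python_serialization_or_lock")
    return labels or ["insufficient_evidence"]
```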

Phase 6: ANOMALY_TAGGING


Apply the anomaly tags from the `references/rulebook.md` decision tables. Core tags:
Bubble severity: `DEVICE_IDLE_GAP_HEAVY`, `PRELAUNCH_GAP_HEAVY`, `TAIL_GAP_HEAVY`, `INTERNAL_BUBBLE_HEAVY`
Risk tags: `HOST_ORIGINATED_RISK`, `COMM_SYNC_RISK`, `WAIT_POLLUTION_RISK`, `WAIT_ANCHOR_FALSE_HOTSPOT`, `AICPU_EXPOSED_RISK`, `UNTRACED_HOST_BLOCKING_RISK`, `PARTIAL_CAPTURE_BOUNDARY`, `VARIABLE_SHAPE_SAME_TEMPLATE`

Phase 7: WAIT_ANCHOR_SCAN


At op level, scan for false hotspots:
wait_ratio = wait_us / (duration_us + wait_us)
if wait_ratio > 0.95 and duration_us < 10.0 and total_cost_rank <= 10:
    tag WAIT_ANCHOR_FALSE_HOTSPOT
These ops absorb idle wait time and appear expensive, but their kernel execution is negligible. Demote them in root-cause ranking.
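A sketch of the scan, assuming ops carry `name`, `duration_us`, and `wait_us`, and that `total_cost_rank` is the op's 1-based rank by `duration + wait` (the exact ranking key is an assumption):

```python
def wait_anchor_ops(ops, top_k=10):
    """ops: list of dicts with keys name, duration_us, wait_us.
    Returns names tagged WAIT_ANCHOR_FALSE_HOTSPOT."""
    ranked = sorted(ops, key=lambda o: o["duration_us"] + o["wait_us"], reverse=True)
    tagged = []
    for rank, op in enumerate(ranked, start=1):
        if rank > top_k:
            break
        wait_ratio = op["wait_us"] / (op["duration_us"] + op["wait_us"])
        if wait_ratio > 0.95 and op["duration_us"] < 10.0:
            tagged.append(op["name"])
    return tagged
```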

Phase 8: GROUP_AGGREGATION


Aggregate step-level metrics by `step_group_id`:
  • Compute avg, median, P90, P95 for each bubble metric
  • `recurring_bubble_pattern` = true if ≥60% of steps in the group have `bubble_count > 0`
  • `dominant_idle_pattern` = whichever of prelaunch / internal bubble / tail contributes most
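A minimal aggregation sketch using nearest-rank percentiles (the percentile method and field names are assumptions; only `internal_bubble_total_ms` is shown, but the same applies to every bubble metric):

```python
import math

def nearest_rank(values, p):
    """Nearest-rank percentile (p in 0..100) over a non-empty list."""
    vs = sorted(values)
    return vs[max(0, math.ceil(p / 100 * len(vs)) - 1)]

def aggregate_group(steps):
    """steps: per-step dicts sharing one step_group_id, with
    'internal_bubble_total_ms' and 'bubble_count'."""
    bubbles = [s["internal_bubble_total_ms"] for s in steps]
    with_bubbles = sum(1 for s in steps if s["bubble_count"] > 0)
    return {
        "avg_ms": sum(bubbles) / len(bubbles),
        "p50_ms": nearest_rank(bubbles, 50),
        "p90_ms": nearest_rank(bubbles, 90),
        "p95_ms": nearest_rank(bubbles, 95),
        "recurring_bubble_pattern": with_bubbles / len(steps) >= 0.60,
    }
```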

Phase 9: PRODUCE_OUTPUT


Merge anomaly results with structure breakdown. The report MUST include:

Hidden Issue Discovery section


  • Dominant step/group bubble metrics (service, busy_union, underfeed_ratio)
  • Raw kernel evidence table: for each top bubble window, list the kernel(s) immediately before and after the gap — their names, task types, durations, streams — so the human expert can locate the exact spot in the timeline
  • Top 5 bubble windows with timestamps, scope, and host evidence
  • Bubble periodicity statistics across the step group
  • Host evidence coverage assessment
  • Soft root-cause labels with evidence chains
  • Follow-up sampling recommendations

Bubble-first Summary (fixed 5-question template)


  1. Are there significant device idle bubbles?
  2. Which step type/group do they concentrate in?
  3. Are they primarily prelaunch / tail / internal / inter-step?
  4. Is there significant host-originated risk?
  5. Is evidence sufficient for root cause? If not, say so explicitly.

Structure-Level Bubble Drill-Down


For the dominant step, break down bubble contributions per structure:
  • Which structure(s) contain the largest internal bubbles?
  • Are bubbles concentrated at structure boundaries (between layers) or within structures?
  • For structures with large bubbles, what are the surrounding kernel names and task types?


Model Architecture Analysis Pipeline (Separate Markdown Report)


This pipeline produces a standalone Markdown report file that documents the model architecture as reverse-engineered from profiling data. Read `references/architecture_report_template.md` for the full template, formatting rules, and analysis techniques.
The report is saved as `model_architecture_report_<profiling_dir_name>.md` in the profiling or output directory.

When to produce the architecture report


Always. Every profiling analysis MUST produce this report alongside the anomaly discovery output. The architecture report provides essential context that makes the anomaly findings interpretable.

Architecture Analysis Phases


Phase A1: FIA_TIMELINE_ANALYSIS


Use FusedInferAttentionScore (FIA) invocations as the primary structural marker:
  1. Extract all FIA kernels from `kernel_details.csv` (match names containing `FusedInferAttentionScore`)
  2. Sort by `Start Time(us)`
  3. Classify each FIA as prefill (duration > 10 ms) or decode (duration < 1 ms)
  4. Determine pass count: `num_passes = total_FIA / FIA_per_pass`
  5. Identify phase transitions by timestamp gaps between prefill and decode FIA clusters
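Steps 2–3 can be sketched as follows, with durations in microseconds and an explicit "ambiguous" bucket for FIA durations between the two thresholds (the thresholds come from the text; the bucket name is an assumption):

```python
def classify_fia(fia_kernels, prefill_ms=10.0, decode_ms=1.0):
    """fia_kernels: (start_us, duration_us) tuples for FusedInferAttentionScore.
    Returns one phase label per FIA, in timeline order."""
    phases = []
    for _, dur_us in sorted(fia_kernels):  # sort by start time
        if dur_us > prefill_ms * 1000:
            phases.append("prefill")
        elif dur_us < decode_ms * 1000:
            phases.append("decode")
        else:
            phases.append("ambiguous")
    return phases
```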

Phase A2: PASS_BOUNDARY_DETECTION


For each forward pass, determine:
  • FIA index range (e.g., #0–#94)
  • Time span (absolute timestamps)
  • Wall time from first kernel to last kernel
  • Average prefill FIA duration
  • Total kernel count
Cross-pass variation (FIA duration, wall time) should be noted — it may reveal KV cache growth or memory pressure effects.

Phase A3: LAYER_CLASSIFICATION


For each layer (delimited by consecutive FIA invocations), extract the kernel sequence and classify:
| Classifier kernel | Layer type |
|---|---|
| No MoE markers (no MoeGatingTopK, no DFC, no GroupedMatmul) | Dense |
| DispatchFFNCombine present | MoE+DFC |
| GroupedMatmul + alltoallv present (no DFC) | MoE+GMM |
| Sampling ops (ArgMax, rejection_*) present | Includes decode/sampling logic |
Build a summary table: Layer Type | Layer Range | Count | Characteristics

Phase A4: CROSS_VERIFICATION


Count key ops per pass and verify they match the layer classification:
  • FIA count should equal layer count
  • DispatchFFNCombine count should match MoE+DFC layer count
  • GroupedMatmul count should match MoE+GMM layers + decode layers
  • MoeGatingTopK count should match all MoE layers (DFC + GMM + decode)
Discrepancies indicate classification errors — resolve before proceeding.
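The four checks can be written as one small consistency function; the layer-type keys and return shape here are illustrative, not a fixed schema:

```python
def cross_verify(counts, layers):
    """counts: op name -> per-pass count; layers: layer type -> layer count.
    Returns the (check, expected, actual) rows that failed."""
    moe_layers = layers.get("moe_dfc", 0) + layers.get("moe_gmm", 0) + layers.get("decode", 0)
    checks = [
        ("FIA == total layers", sum(layers.values()),
         counts.get("FusedInferAttentionScore", 0)),
        ("DFC == MoE+DFC layers", layers.get("moe_dfc", 0),
         counts.get("DispatchFFNCombine", 0)),
        ("GMM == MoE+GMM + decode layers",
         layers.get("moe_gmm", 0) + layers.get("decode", 0),
         counts.get("GroupedMatmul", 0)),
        ("GatingTopK == all MoE layers", moe_layers,
         counts.get("MoeGatingTopK", 0)),
    ]
    return [(name, exp, act) for name, exp, act in checks if exp != act]
```

An empty result means the classification is self-consistent; any row pinpoints which marker count disagrees.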

Phase A5: PER_LAYER_SUBSTRUCTURE


For EACH distinct layer type, analyze the kernel execution sequence:
  1. Group kernels by functional role (attention, projection, TP comm, MoE routing, expert dispatch, expert FFN, post-MoE norm, next-layer prep, EP comm, sampling)
  2. Measure wall time per functional group
  3. Identify which stream each group runs on
  4. Compute timing breakdown: Component | Wall time | Share of layer
  5. Note anomalies specific to the layer type (warm-up overhead in layer 0, extra kernels in transition layer, AICPU ops that should be on AI_CORE)
Present as kernel sequence trees with timing annotations and stream labels. See the template for the exact tree notation format.

Phase A6: DECODE_ANALYSIS


Decode layers require separate analysis because they have fundamentally different cost profiles:
  • FIA duration drops dramatically (prefill 28ms → decode 0.2ms)
  • Communication (especially all-to-all for EP) often dominates
  • Expert compute shifts from fused (DFC) to explicit (GroupedMatmul)
Produce a dominant costs table and explain why the cost profile differs from prefill.

Phase A7: COMMUNICATION_PIPELINE


Document the multi-stream overlap strategy:
  1. Map each stream to its purpose (main compute, AI_CPU comm, HCCL, alltoall)
  2. Measure overlap ratios between streams
  3. Draw an ASCII pipeline diagram showing how communication hides behind compute
  4. Compute what fraction of total communication is hidden (overlapped)
  5. Explain the kernel_sum >> wall relationship
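Step 4 reduces to an interval intersection between the merged communication busy time and the merged compute busy time. A sketch under that assumption (the stream-to-role mapping itself must come from the trace):

```python
def hidden_comm_fraction(comm_intervals, compute_intervals):
    """Fraction of total communication time overlapped by compute intervals."""
    def merge(ivs):
        out = []
        for s, e in sorted(ivs):
            if out and s <= out[-1][1]:
                out[-1][1] = max(out[-1][1], e)
            else:
                out.append([s, e])
        return out

    compute = merge(compute_intervals)
    total = hidden = 0.0
    for cs, ce in merge(comm_intervals):
        total += ce - cs
        for s, e in compute:  # accumulate the overlapped portion of this comm segment
            lo, hi = max(cs, s), min(ce, e)
            if hi > lo:
                hidden += hi - lo
    return hidden / total if total else 0.0
```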

Phase A8: ARCH_REPORT_RENDER


Assemble all findings into the Markdown report following the template in `references/architecture_report_template.md`. All 10 required sections must be present:
  1. Configuration Context
  2. Model Architecture Determination (evidence chain table)
  3. Forward Pass Boundaries (per-pass table)
  4. Layer Classification (type table)
  5. Cross-Verification Table
  6. Per-Layer Sub-Structure (kernel sequence trees + timing breakdowns for EACH layer type)
  7. Decode Phase Analysis (dominant costs + prefill vs decode comparison)
  8. Communication Pipeline Structure (stream table + ASCII pipeline diagram)
  9. Layer-to-Layer Variation (comparative table)
  10. Model Architecture Summary (ASCII model diagram + execution timeline)

Phase A9: ARCH_REPORT_SAVE


Save the report as `model_architecture_report_<profiling_dir_name>.md`. Inform the user of the file location.


Critical Constraints


  • Never skip anomaly discovery because root cause is unclear. Bubble facts are hard conclusions; root cause labels are soft conclusions. Both must always be reported.
  • Never call a high-wait tiny-duration op a real hotspot without checking wait-anchor risk.
  • Never ignore local bubbles just because the step is device-bound overall.
  • Never rely on a single timing metric — always maintain four-way clock accounting (wall_ms + busy_union_ms + kernel_sum_ms + total_cost_ms) at block/side level.
  • Always output `insufficient_evidence` or `possible_untraced_host_blocking` when host evidence is sparse — never silently omit the anomaly section.
  • Always include raw kernel context around reported bubbles — the kernel names, task types, durations, and stream IDs immediately before and after each bubble gap.
  • Use layered certainty language: declarative for facts; `possible` / `probable` / `insufficient evidence` for root causes.
  • Always produce the model architecture Markdown report as a separate file — never fold it into the anomaly output alone.
  • Never skip per-layer sub-structure analysis. Every distinct layer type must have its own kernel sequence tree with timing breakdown and stream annotations.
  • Always include the evidence chain table in the architecture report — the reader must see how layer count and pass structure were determined from raw FIA data.
  • Always cross-verify op counts against layer classification before finalizing the architecture report. Discrepancies must be resolved or explicitly noted.

Graceful Degradation


| Missing data | Impact | Action |
|---|---|---|
| `record_shapes=false` | Cannot detect shape variation | Bubble detection continues; `VARIABLE_SHAPE_SAME_TEMPLATE` tag skipped |
| `with_stack=false` | Soft attribution specificity degrades | Lower confidence; bubble detection unaffected |
| Sparse host events | Cannot narrow root-cause family | `UNTRACED_HOST_BLOCKING_RISK`, `requires_host_followup=true` |
| Capture boundary truncation | Edge gaps may be artifacts | `PARTIAL_CAPTURE_BOUNDARY` on boundary-adjacent gaps |
| No `communication.json` | Cannot assess comm wait | Skip comm overlap, note in evidence gaps. Architecture report omits comm pipeline bandwidth stats but still documents stream roles |
| No step markers | Cannot define step windows | Fall back to global capture span as a single pseudo-step |
| `op_summary` only (no `kernel_details`) | Coarser granularity | Use op-level intervals instead; note in limitations. Architecture report uses op counts for layer classification but cannot produce per-layer kernel sequence trees |
| No FIA kernels detected | Cannot determine layer boundaries via FIA | Architecture report falls back to alternative structural markers (e.g., repeating kernel patterns, communication boundaries); note reduced confidence in layer count |
| Single forward pass captured | Cannot cross-validate pass consistency | Architecture report documents the single pass; notes that cross-pass variation analysis is unavailable |
| No decode FIA detected | Inference-only or prefill-only capture | Architecture report omits the decode phase analysis section; notes the capture scope limitation |

Output Contract Summary


Every analysis must produce TWO outputs:

1. Anomaly Discovery Output


The `anomaly_discovery` top-level object containing: `enabled`, `dominant_group_id`, `global_device_gap_analysis`, `step_group_anomalies`, `bubble_windows`, `wait_anchor_ops`, `soft_root_cause_summary`, `requires_host_followup`, `confidence`.
Each step result must include bubble metrics, anomaly tags, and soft root-cause labels.
For the full JSON schema, read `references/schema.json`.

2. Model Architecture Report (Markdown file)


A standalone Markdown file saved as `model_architecture_report_<profiling_dir_name>.md` containing all 10 required sections from the architecture template. Read `references/architecture_report_template.md` for the full specification.
The report must include at minimum:
  • Evidence chain proving layer count and pass structure
  • Layer classification table with all distinct layer types
  • Per-layer kernel sequence trees with timing breakdowns for EACH layer type
  • Communication pipeline structure with stream overlap diagram
  • Model architecture ASCII summary diagram and per-pass execution timeline
The architecture report is the primary deliverable for understanding model structure. It must be self-contained — a reader should be able to understand the full model execution without referring to the anomaly output.

Recommendations


Each recommendation must include `scope` (global / step_group / structure / side / op), `followup_required`, `evidence_gap`, and `priority` (P0–P3).
Common follow-up patterns:
  • High bubble + missing stacks → re-profile with `with_stack=true`
  • Missing shapes + unstable grouping → `record_shapes=true`
  • High host risk + low evidence → host-side sampling / thread view
  • High wait pollution → check communication and sync paths
  • Large inter-structure bubble → check host-side layer dispatch latency