ascend-profiling-anomaly


Ascend Profiling Anomaly Discovery Skill


Purpose


Analyze Ascend NPU profiling data through three parallel pipelines:
  1. Structure breakdown: step → structure (layer) → block / side → op → PMU judgement — answers where the time goes.
  2. Anomaly discovery: step → device busy union → bubble detection → anomaly tags → soft attribution — answers what looks unnatural and where hidden issues may lurk.
  3. Model architecture analysis: FIA timeline → pass boundaries → layer classification → per-layer sub-structure → communication pipeline → architecture summary — answers what is this model and how does each component execute. Produces a separate Markdown report file.
The core philosophy is separation of concerns: "anomaly exists" is a hard fact derived from device intervals; "why it exists" is a soft attribution that may require additional evidence. Even under weak profiling configurations (no stacks, no shapes, sparse host events), the skill must still reliably surface device idle bubbles and risk labels.

Reference Files — When to Read


Read these before starting analysis:
| File | When to read | What it contains |
|---|---|---|
| `references/kernel_data_guide.md` | Always — read first | Raw data column schemas for `kernel_details.csv`, `op_summary`, `trace_view.json`; the step → structure → block/side → op hierarchy; how to parse, filter, and assign kernels at each level; multi-stream handling; per-level timing aggregation |
| `references/rulebook.md` | Always | Anomaly thresholds, tagging rules, decision tables, soft attribution rules, AICPU classification, wait-anchor rules |
| `references/architecture_report_template.md` | Always — read before producing the architecture report | Full template for the standalone Markdown architecture report: required sections, formatting rules, analysis techniques for layer classification, communication overlap measurement, per-layer timing breakdowns |
| `references/schema.json` | When producing structured JSON output | Full JSON schema for the `anomaly_discovery` output object |
| `scripts/reference_host_gap_branch.py` | When writing analysis scripts | Reference Python implementation for interval merging, bubble metrics, soft attribution, wait-anchor scoring |

Pipeline Overview


The full state machine:
INGEST → INVENTORY → FACT_EXTRACTION → CANDIDATE_STEP_DETECTION →
STEP_GROUPING → MACRO_STEP_RESOLUTION → SEGMENTATION →
CLOCK_ACCOUNTING → ANOMALY_DISCOVERY → PERF_JUDGEMENT →
RECOMMENDATION → RENDER → DONE
                          ARCHITECTURE_ANALYSIS → ARCH_REPORT_RENDER → ARCH_REPORT_SAVE
`ANOMALY_DISCOVERY` sits after `CLOCK_ACCOUNTING` and before `PERF_JUDGEMENT`. It receives already-segmented steps and structures, and runs the bubble detection pipeline on top of them.
`ARCHITECTURE_ANALYSIS` runs in parallel with `PERF_JUDGEMENT`, using the same segmented data from `CLOCK_ACCOUNTING` plus FIA timeline analysis. It produces a separate Markdown report file saved alongside the profiling data.


Data Hierarchy: Step → Structure → Block/Side → Op


Understanding how raw kernel data maps to each level is essential. Read `references/kernel_data_guide.md` for the full column schemas and parsing details. Here is the conceptual overview:

Level 0: Raw Kernels


Each row in `kernel_details.csv` represents a single device kernel execution — one invocation of an AI Core task, an AI CPU task, or an HCCL communication task. Key fields: `Name`, `Task Type`, `Start Time(us)`, `Duration(us)`, `Wait Time(us)`, `Accelerator Core`, `Stream ID`, `Input Shapes`, `Output Shapes`.

Level 1: Step


A step is one training/inference iteration. Steps are identified by `ProfilerStep#N` user annotations or `Iteration#N` markers in `trace_view.json`. Each step defines a service window `[S_i, S_{i+1})`. All kernels whose start time falls within this window belong to step `i`.
At step level, compute:
  • Total service time, device busy union, underfeed ratio
  • Prelaunch gap, tail gap, internal bubbles
  • Per-step anomaly tags and soft root-cause labels
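The window rule above amounts to a binary search over step boundaries. A minimal sketch, assuming the boundaries have already been extracted from `trace_view.json` as a sorted timestamp list (the function name and signature are illustrative, not part of the skill):

```python
import bisect

def assign_step(start_us, boundaries):
    """boundaries: sorted step edges [S_0, S_1, ..., S_n].
    A kernel starting at start_us belongs to step i iff S_i <= start_us < S_{i+1};
    kernels outside the captured windows map to None."""
    i = bisect.bisect_right(boundaries, start_us) - 1
    return i if 0 <= i < len(boundaries) - 1 else None
```

Because the windows are half-open, a kernel starting exactly on a boundary belongs to the following step.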

Level 2: Structure (Layer)


Within a step, kernels form repeating structures — typically corresponding to model layers (e.g., transformer blocks, attention layers, MLP blocks). Segmentation identifies these by:
  1. Repeating name-pattern sequences in the kernel timeline
  2. Significant time gaps between kernel groups
  3. User annotations marking layer boundaries (if present)
Each structure contains a contiguous span of kernels within the step window.
At structure level, compute:
  • Structure wall time, device busy union within structure span
  • Structure share of total step time
  • Per-structure bubble metrics (is the bubble inside this structure or between structures?)

Level 3: Block / Side


Within each structure, kernels split into:
  • Block (main compute path): The dominant chain of AI Core kernels forming the forward or backward pass of this layer. These execute on the main compute stream.
  • Side (auxiliary ops): Everything else — small element-wise ops, communication (HCCL), AI CPU fallback ops, memory copies, synchronization events. These may execute on separate streams.
At block/side level, maintain four timing perspectives simultaneously:
  • `wall_ms` — wall-clock time from first kernel start to last kernel end
  • `busy_union_ms` — merged device-busy time (accounts for multi-stream overlap)
  • `kernel_sum_ms` — arithmetic sum of all kernel durations (ignores overlap)
  • `total_cost_ms` — sum of `duration + wait` for all kernels
Conclusions based on only one metric are incomplete. A block appearing heavy in `kernel_sum` but light in `wall` means high stream parallelism. A side appearing heavy in `total_cost` but light in `duration` means wait-anchor false-hotspot risk.
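The four perspectives can all be derived from the same kernel list. A minimal sketch, assuming kernels are `(start_us, duration_us, wait_us)` tuples (field names are illustrative; the real columns live in `kernel_details.csv`):

```python
def merge_intervals(intervals):
    """Merge overlapping [start, end) intervals; input need not be sorted."""
    merged = []
    for s, e in sorted(intervals):
        if merged and s <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], e)
        else:
            merged.append([s, e])
    return merged

def timing_perspectives(kernels):
    """kernels: list of (start_us, duration_us, wait_us) tuples."""
    ivs = [(s, s + d) for s, d, _ in kernels]
    wall = max(e for _, e in ivs) - min(s for s, _ in ivs)
    busy_union = sum(e - s for s, e in merge_intervals(ivs))
    kernel_sum = sum(d for _, d, _ in kernels)
    total_cost = sum(d + w for _, d, w in kernels)
    return {"wall_ms": wall / 1e3, "busy_union_ms": busy_union / 1e3,
            "kernel_sum_ms": kernel_sum / 1e3, "total_cost_ms": total_cost / 1e3}
```

When streams overlap, `busy_union_ms` falls below `kernel_sum_ms`; when waits dominate, `total_cost_ms` rises above both.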

Level 4: Op (individual operator)


The finest grain. Each op may produce one or many device kernels. Op-level analysis handles:
  • Top ops by total cost vs. top ops by kernel duration (these rankings often differ)
  • Wait-anchor detection: ops with `wait_ratio > 0.95` and tiny `duration` but high `total_cost`
  • AICPU classification: ops running on AI CPU instead of AI Core, classified by `masked_ratio`
  • Small-op initial judgement: whether small individual ops are real inefficiencies or noise


Anomaly Discovery Pipeline (Detail)


Phase 1: BUILD_DEVICE_INTERVALS


For each step, collect all device kernel intervals from `kernel_details.csv`:
device_intervals = []
for each kernel row where Start_Time_us is within step window:
    s = max(step_start_us, row.Start_Time_us)
    e = min(step_end_us, row.Start_Time_us + row.Duration_us)
    if e > s:
        device_intervals.append(Interval(s, e))
Key rules:
  • Clip kernel intervals to the step window boundary
  • Apply communication dedup rules BEFORE interval statistics (see the comm dedup section of `references/kernel_data_guide.md`)
  • Include AI_CORE, AI_CPU, and HCCL tasks — all count as "device busy"
  • When multiple streams exist, intervals from all streams are collected into the same set

Phase 2: MERGE_INTERVALS


Sort intervals by start time, merge overlapping ones:
merged = merge(device_intervals)  # see scripts/reference_host_gap_branch.py
busy_union = sum of merged segment durations
This produces the merged busy segments from which all bubble metrics derive.

Phase 3: GAP_CLASSIFICATION


From the merged segments, compute per-step metrics:
  • `service_ms` = step window duration
  • `device_busy_union_ms` = sum of merged segment durations
  • `underfeed_ms` = service − busy_union
  • `underfeed_ratio` = underfeed / service
  • `prelaunch_gap_ms` = start(first_merged_segment) − step_start
  • `tail_gap_ms` = step_end − end(last_merged_segment)
  • `internal_bubble_total_ms` = sum of gaps between consecutive merged segments
  • `largest_internal_bubble_ms` = max gap between consecutive merged segments
  • `bubble_count` = number of inter-segment gaps
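These formulas transcribe directly into code. A sketch, assuming the merged segments from Phase 2 are sorted, non-overlapping `(start, end)` pairs already expressed in ms (the authoritative implementation is `scripts/reference_host_gap_branch.py`):

```python
def gap_classification(step_start, step_end, merged):
    """merged: sorted, non-overlapping busy segments inside [step_start, step_end)."""
    service = step_end - step_start
    busy = sum(e - s for s, e in merged)
    # internal bubbles are the idle gaps between consecutive busy segments
    gaps = [merged[i + 1][0] - merged[i][1] for i in range(len(merged) - 1)]
    return {
        "service_ms": service,
        "device_busy_union_ms": busy,
        "underfeed_ms": service - busy,
        "underfeed_ratio": (service - busy) / service,
        "prelaunch_gap_ms": merged[0][0] - step_start,
        "tail_gap_ms": step_end - merged[-1][1],
        "internal_bubble_total_ms": sum(gaps),
        "largest_internal_bubble_ms": max(gaps, default=0.0),
        "bubble_count": len(gaps),
    }
```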

Phase 4: HOST_EVIDENCE_COLLECTION


For each bubble window (a gap between merged segments, or the prelaunch/tail gap), scan the same time range in host events from `trace_view.json`:
Host event categories to collect:
  • `cpu_op` / `python_function` / `user_annotation` — general host activity
  • `AscendCL@*` — ACL runtime events
  • `HostToDevice` / `torch_to_npu` / `aclrtMemcpy*` / `aclrtSynchronize*` — sync/copy markers
  • `c10d` / `Hccl` / `hcom` / `StreamWaitEvent` / `Notify_Wait` — communication markers
Compute overlap ratios:
  • `host_visible_coverage_ratio` = fraction of the bubble covered by any host event
  • `sync_marker_overlap_ratio` = fraction covered by sync/copy markers
  • `comm_marker_overlap_ratio` = fraction covered by communication markers
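Each ratio is the same interval-union intersection applied to a different event subset. A sketch of the shared helper, assuming the bubble and host events are `(start, end)` pairs on the same clock (names are illustrative):

```python
def covered_fraction(bubble, events):
    """Fraction of the bubble [s, e) covered by the union of event intervals."""
    s, e = bubble
    # clip events to the bubble window, dropping those fully outside it
    clipped = sorted((max(s, a), min(e, b)) for a, b in events if min(e, b) > max(s, a))
    merged = []
    for a, b in clipped:  # merge so overlapping events are not double-counted
        if merged and a <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], b)
        else:
            merged.append([a, b])
    return sum(b - a for a, b in merged) / (e - s)
```

Calling it with the full event set, the sync/copy subset, and the communication subset yields the three ratios respectively.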

Phase 5: SOFT_ATTRIBUTION


For each significant bubble, assign probability-level labels based on overlap ratios:
| Condition | Label |
|---|---|
| `sync_overlap ≥ 0.20` | `possible_sync_or_h2d` |
| `comm_overlap ≥ 0.20` | `possible_comm_wait` |
| `host_coverage < 0.05` | `possible_untraced_host_blocking` |
| `host_coverage ≥ 0.10` but no sync/comm dominance | `possible_host_launch_lag` |
| `host_parallelism < 1.2` and none of the above | `possible_python_serialization_or_lock` |
| nothing applies | `insufficient_evidence` |
Multiple labels can co-exist. These are explicitly NOT unique root causes.
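One possible transcription of this decision table into code, under the reading that "no sync/comm dominance" means neither of the first two thresholds fired (an assumption; `references/rulebook.md` remains authoritative):

```python
def soft_attribution(sync_overlap, comm_overlap, host_coverage, host_parallelism):
    """Return the list of probability-level labels for one bubble."""
    labels = []
    if sync_overlap >= 0.20:
        labels.append("possible_sync_or_h2d")
    if comm_overlap >= 0.20:
        labels.append("possible_comm_wait")
    if host_coverage < 0.05:
        labels.append("possible_untraced_host_blocking")
    if host_coverage >= 0.10 and sync_overlap < 0.20 and comm_overlap < 0.20:
        labels.append("possible_host_launch_lag")
    if host_parallelism < 1.2 and not labels:
        labels.append("possible_python_serialization_or_lock")
    return labels or ["insufficient_evidence"]
```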

Phase 6: ANOMALY_TAGGING


Apply the anomaly tags from the `references/rulebook.md` decision tables. Core tags:
Bubble severity: `DEVICE_IDLE_GAP_HEAVY`, `PRELAUNCH_GAP_HEAVY`, `TAIL_GAP_HEAVY`, `INTERNAL_BUBBLE_HEAVY`
Risk tags: `HOST_ORIGINATED_RISK`, `COMM_SYNC_RISK`, `WAIT_POLLUTION_RISK`, `WAIT_ANCHOR_FALSE_HOTSPOT`, `AICPU_EXPOSED_RISK`, `UNTRACED_HOST_BLOCKING_RISK`, `PARTIAL_CAPTURE_BOUNDARY`, `VARIABLE_SHAPE_SAME_TEMPLATE`

Phase 7: WAIT_ANCHOR_SCAN


At op level, scan for false hotspots:
wait_ratio = wait_us / (duration_us + wait_us)
if wait_ratio > 0.95 and duration_us < 10.0 and total_cost_rank <= 10:
    tag WAIT_ANCHOR_FALSE_HOTSPOT
These ops absorb idle wait time and appear expensive, but their kernel execution is negligible. Demote them in root-cause ranking.
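A sketch of the scan, assuming ops carry `name`, `duration_us`, and `wait_us`, and that `total_cost_rank` is the op's 1-based rank by `duration + wait` (the exact ranking key is an assumption):

```python
def wait_anchor_ops(ops, top_k=10):
    """ops: list of dicts with keys name, duration_us, wait_us.
    Returns names tagged WAIT_ANCHOR_FALSE_HOTSPOT."""
    ranked = sorted(ops, key=lambda o: o["duration_us"] + o["wait_us"], reverse=True)
    tagged = []
    for rank, op in enumerate(ranked, start=1):
        if rank > top_k:
            break
        wait_ratio = op["wait_us"] / (op["duration_us"] + op["wait_us"])
        if wait_ratio > 0.95 and op["duration_us"] < 10.0:
            tagged.append(op["name"])
    return tagged
```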

Phase 8: GROUP_AGGREGATION


Aggregate step-level metrics by `step_group_id`:
  • Compute avg, median, P90, P95 for each bubble metric
  • `recurring_bubble_pattern` = true if ≥60% of steps in the group have `bubble_count > 0`
  • `dominant_idle_pattern` = whichever of prelaunch / internal bubble / tail contributes most
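A minimal aggregation sketch using nearest-rank percentiles (the percentile method and field names are assumptions; only `internal_bubble_total_ms` is shown, but the same applies to every bubble metric):

```python
import math

def nearest_rank(values, p):
    """Nearest-rank percentile (p in 0..100) over a non-empty list."""
    vs = sorted(values)
    return vs[max(0, math.ceil(p / 100 * len(vs)) - 1)]

def aggregate_group(steps):
    """steps: per-step dicts sharing one step_group_id, with
    'internal_bubble_total_ms' and 'bubble_count'."""
    bubbles = [s["internal_bubble_total_ms"] for s in steps]
    with_bubbles = sum(1 for s in steps if s["bubble_count"] > 0)
    return {
        "avg_ms": sum(bubbles) / len(bubbles),
        "p50_ms": nearest_rank(bubbles, 50),
        "p90_ms": nearest_rank(bubbles, 90),
        "p95_ms": nearest_rank(bubbles, 95),
        "recurring_bubble_pattern": with_bubbles / len(steps) >= 0.60,
    }
```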

Phase 9: PRODUCE_OUTPUT


Merge anomaly results with structure breakdown. The report MUST include:

Hidden Issue Discovery section


  • Dominant step/group bubble metrics (service, busy_union, underfeed_ratio)
  • Raw kernel evidence table: for each top bubble window, list the kernel(s) immediately before and after the gap — their names, task types, durations, streams — so the human expert can locate the exact spot in the timeline
  • Top 5 bubble windows with timestamps, scope, and host evidence
  • Bubble periodicity statistics across the step group
  • Host evidence coverage assessment
  • Soft root-cause labels with evidence chains
  • Follow-up sampling recommendations

Bubble-first Summary (fixed 5-question template)


  1. Are there significant device idle bubbles?
  2. Which step type/group do they concentrate in?
  3. Are they primarily prelaunch / tail / internal / inter-step?
  4. Is there significant host-originated risk?
  5. Is evidence sufficient for root cause? If not, say so explicitly.

Structure-Level Bubble Drill-Down


For the dominant step, break down bubble contributions per structure:
  • Which structure(s) contain the largest internal bubbles?
  • Are bubbles concentrated at structure boundaries (between layers) or within structures?
  • For structures with large bubbles, what are the surrounding kernel names and task types?


Model Architecture Analysis Pipeline (Separate Markdown Report)


This pipeline produces a standalone Markdown report file that documents the model architecture as reverse-engineered from profiling data. Read `references/architecture_report_template.md` for the full template, formatting rules, and analysis techniques.
The report is saved as `model_architecture_report_<profiling_dir_name>.md` in the profiling or output directory.

When to produce the architecture report


Always. Every profiling analysis MUST produce this report alongside the anomaly discovery output. The architecture report provides essential context that makes the anomaly findings interpretable.

Architecture Analysis Phases


Phase A1: FIA_TIMELINE_ANALYSIS


Use FusedInferAttentionScore (FIA) invocations as the primary structural marker:
  1. Extract all FIA kernels from `kernel_details.csv` (match names containing `FusedInferAttentionScore`)
  2. Sort by `Start Time(us)`
  3. Classify each FIA as prefill (duration > 10 ms) or decode (duration < 1 ms)
  4. Determine pass count: `num_passes = total_FIA / FIA_per_pass`
  5. Identify phase transitions by timestamp gaps between prefill and decode FIA clusters
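Steps 2–3 can be sketched as follows, with durations in microseconds and an explicit "ambiguous" bucket for FIA durations between the two thresholds (the thresholds come from the text; the bucket name is an assumption):

```python
def classify_fia(fia_kernels, prefill_ms=10.0, decode_ms=1.0):
    """fia_kernels: (start_us, duration_us) tuples for FusedInferAttentionScore.
    Returns one phase label per FIA, in timeline order."""
    phases = []
    for _, dur_us in sorted(fia_kernels):  # sort by start time
        if dur_us > prefill_ms * 1000:
            phases.append("prefill")
        elif dur_us < decode_ms * 1000:
            phases.append("decode")
        else:
            phases.append("ambiguous")
    return phases
```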

Phase A2: PASS_BOUNDARY_DETECTION


For each forward pass, determine:
  • FIA index range (e.g., #0–#94)
  • Time span (absolute timestamps)
  • Wall time from first kernel to last kernel
  • Average prefill FIA duration
  • Total kernel count
Cross-pass variation (FIA duration, wall time) should be noted — it may reveal KV cache growth or memory pressure effects.

Phase A3: LAYER_CLASSIFICATION


For each layer (delimited by consecutive FIA invocations), extract the kernel sequence and classify:
| Classifier kernel | Layer type |
|---|---|
| No MoE markers (no MoeGatingTopK, no DFC, no GroupedMatmul) | Dense |
| DispatchFFNCombine present | MoE+DFC |
| GroupedMatmul + alltoallv present (no DFC) | MoE+GMM |
| Sampling ops (ArgMax, rejection_*) present | Includes decode/sampling logic |
Build a summary table: Layer Type | Layer Range | Count | Characteristics

Phase A4: CROSS_VERIFICATION


Count key ops per pass and verify they match the layer classification:
  • FIA count should equal layer count
  • DispatchFFNCombine count should match MoE+DFC layer count
  • GroupedMatmul count should match MoE+GMM layers + decode layers
  • MoeGatingTopK count should match all MoE layers (DFC + GMM + decode)
Discrepancies indicate classification errors — resolve before proceeding.
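The four checks can be written as one small consistency function; the layer-type keys and return shape here are illustrative, not a fixed schema:

```python
def cross_verify(counts, layers):
    """counts: op name -> per-pass count; layers: layer type -> layer count.
    Returns the (check, expected, actual) rows that failed."""
    moe_layers = layers.get("moe_dfc", 0) + layers.get("moe_gmm", 0) + layers.get("decode", 0)
    checks = [
        ("FIA == total layers", sum(layers.values()),
         counts.get("FusedInferAttentionScore", 0)),
        ("DFC == MoE+DFC layers", layers.get("moe_dfc", 0),
         counts.get("DispatchFFNCombine", 0)),
        ("GMM == MoE+GMM + decode layers",
         layers.get("moe_gmm", 0) + layers.get("decode", 0),
         counts.get("GroupedMatmul", 0)),
        ("GatingTopK == all MoE layers", moe_layers,
         counts.get("MoeGatingTopK", 0)),
    ]
    return [(name, exp, act) for name, exp, act in checks if exp != act]
```

An empty result means the classification is self-consistent; any row pinpoints which marker count disagrees.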

Phase A5: PER_LAYER_SUBSTRUCTURE


For EACH distinct layer type, analyze the kernel execution sequence:
  1. Group kernels by functional role (attention, projection, TP comm, MoE routing, expert dispatch, expert FFN, post-MoE norm, next-layer prep, EP comm, sampling)
  2. Measure wall time per functional group
  3. Identify which stream each group runs on
  4. Compute timing breakdown: Component | Wall time | Share of layer
  5. Note anomalies specific to the layer type (warm-up overhead in layer 0, extra kernels in transition layer, AICPU ops that should be on AI_CORE)
Present as kernel sequence trees with timing annotations and stream labels. See the template for the exact tree notation format.

Phase A6: DECODE_ANALYSIS


Decode layers require separate analysis because they have fundamentally different cost profiles:
  • FIA duration drops dramatically (prefill 28ms → decode 0.2ms)
  • Communication (especially all-to-all for EP) often dominates
  • Expert compute shifts from fused (DFC) to explicit (GroupedMatmul)
Produce a dominant costs table and explain why the cost profile differs from prefill.

Phase A7: COMMUNICATION_PIPELINE


Document the multi-stream overlap strategy:
  1. Map each stream to its purpose (main compute, AI_CPU comm, HCCL, alltoall)
  2. Measure overlap ratios between streams
  3. Draw an ASCII pipeline diagram showing how communication hides behind compute
  4. Compute what fraction of total communication is hidden (overlapped)
  5. Explain the kernel_sum >> wall relationship
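Step 4 reduces to an interval intersection between the merged communication busy time and the merged compute busy time. A sketch under that assumption (the stream-to-role mapping itself must come from the trace):

```python
def hidden_comm_fraction(comm_intervals, compute_intervals):
    """Fraction of total communication time overlapped by compute intervals."""
    def merge(ivs):
        out = []
        for s, e in sorted(ivs):
            if out and s <= out[-1][1]:
                out[-1][1] = max(out[-1][1], e)
            else:
                out.append([s, e])
        return out

    compute = merge(compute_intervals)
    total = hidden = 0.0
    for cs, ce in merge(comm_intervals):
        total += ce - cs
        for s, e in compute:  # accumulate the overlapped portion of this comm segment
            lo, hi = max(cs, s), min(ce, e)
            if hi > lo:
                hidden += hi - lo
    return hidden / total if total else 0.0
```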

Phase A8: ARCH_REPORT_RENDER


Assemble all findings into the Markdown report following the template in `references/architecture_report_template.md`. All 10 required sections must be present:
  1. Configuration Context
  2. Model Architecture Determination (evidence chain table)
  3. Forward Pass Boundaries (per-pass table)
  4. Layer Classification (type table)
  5. Cross-Verification Table
  6. Per-Layer Sub-Structure (kernel sequence trees + timing breakdowns for EACH layer type)
  7. Decode Phase Analysis (dominant costs + prefill vs decode comparison)
  8. Communication Pipeline Structure (stream table + ASCII pipeline diagram)
  9. Layer-to-Layer Variation (comparative table)
  10. Model Architecture Summary (ASCII model diagram + execution timeline)

Phase A9: ARCH_REPORT_SAVE


Save the report as `model_architecture_report_<profiling_dir_name>.md`. Inform the user of the file location.


Critical Constraints


  • Never skip anomaly discovery because root cause is unclear. Bubble facts are hard conclusions; root cause labels are soft conclusions. Both must always be reported.
  • Never call a high-wait tiny-duration op a real hotspot without checking wait-anchor risk.
  • Never ignore local bubbles just because the step is device-bound overall.
  • Never rely on a single timing metric — always maintain four-way clock accounting (wall_ms + busy_union_ms + kernel_sum_ms + total_cost_ms) at block/side level.
  • Always output `insufficient_evidence` or `possible_untraced_host_blocking` when host evidence is sparse — never silently omit the anomaly section.
  • Always include raw kernel context around reported bubbles — the kernel names, task types, durations, and stream IDs immediately before and after each bubble gap.
  • Use layered certainty language: declarative for facts; `possible` / `probable` / `insufficient evidence` for root causes.
  • Always produce the model architecture Markdown report as a separate file — never fold it into the anomaly output alone.
  • Never skip per-layer sub-structure analysis. Every distinct layer type must have its own kernel sequence tree with timing breakdown and stream annotations.
  • Always include the evidence chain table in the architecture report — the reader must see how layer count and pass structure were determined from raw FIA data.
  • Always cross-verify op counts against layer classification before finalizing the architecture report. Discrepancies must be resolved or explicitly noted.

Graceful Degradation


| Missing data | Impact | Action |
|---|---|---|
| `record_shapes=false` | Cannot detect shape variation | Bubble detection continues; `VARIABLE_SHAPE_SAME_TEMPLATE` tag skipped |
| `with_stack=false` | Soft attribution specificity degrades | Lower confidence; bubble detection unaffected |
| Sparse host events | Cannot narrow root-cause family | `UNTRACED_HOST_BLOCKING_RISK`, `requires_host_followup=true` |
| Capture boundary truncation | Edge gaps may be artifacts | `PARTIAL_CAPTURE_BOUNDARY` on boundary-adjacent gaps |
| No `communication.json` | Cannot assess comm wait | Skip comm overlap, note in evidence gaps. Architecture report omits comm pipeline bandwidth stats but still documents stream roles |
| No step markers | Cannot define step windows | Fall back to global capture span as a single pseudo-step |
| `op_summary` only (no `kernel_details`) | Coarser granularity | Use op-level intervals instead; note in limitations. Architecture report uses op counts for layer classification but cannot produce per-layer kernel sequence trees |
| No FIA kernels detected | Cannot determine layer boundaries via FIA | Architecture report falls back to alternative structural markers (e.g., repeating kernel patterns, communication boundaries); note reduced confidence in layer count |
| Single forward pass captured | Cannot cross-validate pass consistency | Architecture report documents the single pass; notes that cross-pass variation analysis is unavailable |
| No decode FIA detected | Inference-only or prefill-only capture | Architecture report omits the decode phase analysis section; notes the capture scope limitation |

Output Contract Summary


Every analysis must produce TWO outputs:

1. Anomaly Discovery Output


The `anomaly_discovery` top-level object containing: `enabled`, `dominant_group_id`, `global_device_gap_analysis`, `step_group_anomalies`, `bubble_windows`, `wait_anchor_ops`, `soft_root_cause_summary`, `requires_host_followup`, `confidence`.
Each step result must include bubble metrics, anomaly tags, and soft root-cause labels.
For the full JSON schema, read `references/schema.json`.

2. Model Architecture Report (Markdown file)


A standalone Markdown file saved as `model_architecture_report_<profiling_dir_name>.md` containing all 10 required sections from the architecture template. Read `references/architecture_report_template.md` for the full specification.
The report must include at minimum:
  • Evidence chain proving layer count and pass structure
  • Layer classification table with all distinct layer types
  • Per-layer kernel sequence trees with timing breakdowns for EACH layer type
  • Communication pipeline structure with stream overlap diagram
  • Model architecture ASCII summary diagram and per-pass execution timeline
The architecture report is the primary deliverable for understanding model structure. It must be self-contained — a reader should be able to understand the full model execution without referring to the anomaly output.

Recommendations


Each recommendation must include `scope` (global / step_group / structure / side / op), `followup_required`, `evidence_gap`, and `priority` (P0–P3).
Common follow-up patterns:
  • High bubble + missing stacks → re-profile with `with_stack=true`
  • Missing shapes + unstable grouping → `record_shapes=true`
  • High host risk + low evidence → host-side sampling / thread view
  • High wait pollution → check communication and sync paths
  • Large inter-structure bubble → check host-side layer dispatch latency