Ascend Profiling Anomaly Discovery Skill
Purpose
Analyze Ascend NPU profiling data through three parallel pipelines:
- Structure breakdown: step → structure (layer) → block / side → op → PMU judgement — answers where the time goes.
- Anomaly discovery: step → device busy union → bubble detection → anomaly tags → soft attribution — answers what looks unnatural and where hidden issues may lurk.
- Model architecture analysis: FIA timeline → pass boundaries → layer classification → per-layer sub-structure → communication pipeline → architecture summary — answers what is this model and how does each component execute. Produces a separate Markdown report file.
The core philosophy is separation of concerns: "anomaly exists" is a hard fact derived from device intervals; "why it exists" is a soft attribution that may require additional evidence. Even under weak profiling configurations (no stacks, no shapes, sparse host events), the skill must still reliably surface device idle bubbles and risk labels.
Reference Files — When to Read
Read these before starting analysis:
| File | When to read | What it contains |
|---|---|---|
| references/kernel_data_guide.md | Always — read first | Raw data column schemas for kernel_details.csv, op_summary, trace_view.json; the step → structure → block/side → op hierarchy; how to parse, filter, and assign kernels at each level; multi-stream handling; per-level timing aggregation |
| (anomaly rules reference) | Always | Anomaly thresholds, tagging rules, decision tables, soft attribution rules, AICPU classification, wait-anchor rules |
| references/architecture_report_template.md | Always — read before producing the architecture report | Full template for the standalone Markdown architecture report: required sections, formatting rules, analysis techniques for layer classification, communication overlap measurement, per-layer timing breakdowns |
| (output schema reference) | When producing structured JSON output | Full JSON schema for the output object |
| scripts/reference_host_gap_branch.py | When writing analysis scripts | Reference Python implementation for interval merging, bubble metrics, soft attribution, wait-anchor scoring |
Pipeline Overview
The full state machine:

```
INGEST → INVENTORY → FACT_EXTRACTION → CANDIDATE_STEP_DETECTION →
STEP_GROUPING → MACRO_STEP_RESOLUTION → SEGMENTATION →
CLOCK_ACCOUNTING → ANOMALY_DISCOVERY → PERF_JUDGEMENT →
RECOMMENDATION → RENDER → DONE
                ↘
                  ARCHITECTURE_ANALYSIS → ARCH_REPORT_RENDER → ARCH_REPORT_SAVE
```
ANOMALY_DISCOVERY sits after CLOCK_ACCOUNTING and before PERF_JUDGEMENT. It receives already-segmented steps and structures, and runs the bubble detection pipeline on top of them.
ARCHITECTURE_ANALYSIS runs in parallel with the anomaly pipeline, using the same segmented data from SEGMENTATION plus FIA timeline analysis. It produces a separate Markdown report file saved alongside the profiling data.
Data Hierarchy: Step → Structure → Block/Side → Op
Understanding how raw kernel data maps to each level is essential. Read
references/kernel_data_guide.md
for the full column schemas and parsing details. Here is the conceptual overview:
Level 0: Raw Kernels
Each row in kernel_details.csv represents a single device kernel execution — one invocation of an AI Core task, an AI CPU task, or an HCCL communication task. Key fields include the kernel name, task type, start timestamp, duration, stream ID, and accelerator type; see references/kernel_data_guide.md for the full column schema.
Level 1: Step
A step is one training/inference iteration. Steps are identified by user annotations or step markers in trace_view.json. Each step defines a service window. All kernels whose start time falls within this window belong to that step.
At step level, compute:
- Total service time, device busy union, underfeed ratio
- Prelaunch gap, tail gap, internal bubbles
- Per-step anomaly tags and soft root-cause labels
Level 2: Structure (Layer)
Within a step, kernels form repeating structures — typically corresponding to model layers (e.g., transformer blocks, attention layers, MLP blocks). Segmentation identifies these by:
- Repeating name-pattern sequences in the kernel timeline
- Significant time gaps between kernel groups
- User annotations marking layer boundaries (if present)
Each structure contains a contiguous span of kernels within the step window.
At structure level, compute:
- Structure wall time, device busy union within structure span
- Structure share of total step time
- Per-structure bubble metrics (is the bubble inside this structure or between structures?)
Level 3: Block / Side
Within each structure, kernels split into:
- Block (main compute path): The dominant chain of AI Core kernels forming the forward or backward pass of this layer. These execute on the main compute stream.
- Side (auxiliary ops): Everything else — small element-wise ops, communication (HCCL), AI CPU fallback ops, memory copies, synchronization events. These may execute on separate streams.
At block/side level, maintain four timing perspectives simultaneously:
- wall_ms — wall-clock time from first kernel start to last kernel end
- busy_union_ms — merged device-busy time (accounts for multi-stream overlap)
- kernel_sum_ms — arithmetic sum of all kernel durations (ignores overlap)
- total_cost_ms — sum of each kernel's total cost (duration plus attributed wait) across all kernels

Conclusions based on only one metric are incomplete. A block appearing heavy in kernel_sum_ms but light in busy_union_ms means high stream parallelism. A side appearing heavy in total_cost_ms but light in kernel_sum_ms means wait-anchor false hotspot risk.
Level 4: Op (individual operator)
The finest grain. Each op may produce one or many device kernels. Op-level analysis handles:
- Top ops by total cost vs. top ops by kernel duration (these rankings often differ)
- Wait-anchor detection: ops with high wait_us and tiny duration_us but a high total-cost rank
- AICPU classification: ops running on AI CPU instead of AI Core, classified per the AICPU classification rules in the anomaly reference
- Small-op initial judgement: whether small individual ops are real inefficiencies or noise
Anomaly Discovery Pipeline (Detail)
Phase 1: BUILD_DEVICE_INTERVALS
For each step, collect all device kernel intervals from kernel_details.csv:

```
device_intervals = []
for each kernel row where Start_Time_us is within step window:
    s = max(step_start_us, row.Start_Time_us)
    e = min(step_end_us, row.Start_Time_us + row.Duration_us)
    if e > s:
        device_intervals.append(Interval(s, e))
```
Key rules:
- Clip kernel intervals to the step window boundary
- Apply communication dedup rules BEFORE interval statistics (see the comm dedup section of references/kernel_data_guide.md)
- Include AI_CORE, AI_CPU, and HCCL tasks — all count as "device busy"
- When multiple streams exist, intervals from all streams are collected into the same set
Phase 2: MERGE_INTERVALS
Sort intervals by start time, then merge overlapping ones:

```
merged = merge(device_intervals)   # see scripts/reference_host_gap_branch.py
busy_union = sum of merged segment durations
```

This produces the merged busy segments from which all bubble metrics derive.
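The build-and-merge logic of Phases 1–2 can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the reference implementation in scripts/reference_host_gap_branch.py; the tuple layout (start_us, duration_us) is chosen here for brevity.

```python
def build_intervals(kernels, step_start_us, step_end_us):
    """Clip each (start_us, duration_us) kernel to the step window; drop empties."""
    out = []
    for start_us, dur_us in kernels:
        s = max(step_start_us, start_us)
        e = min(step_end_us, start_us + dur_us)
        if e > s:
            out.append((s, e))
    return out

def merge_intervals(intervals):
    """Sort by start time and merge overlapping/touching intervals."""
    merged = []
    for s, e in sorted(intervals):
        if merged and s <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged

# Two overlapping kernels plus one isolated one, all inside the step window.
kernels = [(0.0, 5.0), (3.0, 4.0), (10.0, 2.0)]   # (start_us, duration_us)
merged = merge_intervals(build_intervals(kernels, 0.0, 20.0))
busy_union = sum(e - s for s, e in merged)
```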
Phase 3: GAP_CLASSIFICATION
From the merged segments, compute per-step metrics:
- service = step window duration
- busy_union = sum of merged segment durations
- underfeed = service − busy_union
- underfeed_ratio = underfeed / service
- prelaunch_gap = start(first_merged_segment) − step_start
- tail_gap = step_end − end(last_merged_segment)
- internal bubble total = sum of gaps between consecutive merged segments
- largest_internal_bubble_ms = max gap between consecutive merged segments
- internal bubble count = number of inter-segment gaps
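The Phase 3 metrics follow mechanically from the merged segments. A compact sketch (metric names mirror the list above; segments are assumed sorted, non-overlapping, and non-empty):

```python
def gap_metrics(merged, step_start, step_end):
    """Per-step bubble metrics from sorted, non-overlapping busy segments."""
    service = step_end - step_start
    busy_union = sum(e - s for s, e in merged)
    # Gaps between consecutive busy segments are the internal bubbles.
    gaps = [merged[i + 1][0] - merged[i][1] for i in range(len(merged) - 1)]
    return {
        "service": service,
        "busy_union": busy_union,
        "underfeed": service - busy_union,
        "underfeed_ratio": (service - busy_union) / service,
        "prelaunch_gap": merged[0][0] - step_start,
        "tail_gap": step_end - merged[-1][1],
        "internal_bubble": sum(gaps),
        "largest_internal_bubble": max(gaps, default=0.0),
        "internal_bubble_count": len(gaps),
    }

m = gap_metrics([(2.0, 7.0), (10.0, 12.0)], 0.0, 20.0)
```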
Phase 4: HOST_EVIDENCE_COLLECTION
For each bubble window (gap between merged segments, or prelaunch/tail gap), scan the same time range in host events from
:
Host event categories to collect:
- General host activity events
- ACL runtime events
- Sync/copy markers
- Communication markers
Compute overlap ratios:

```
host_visible_coverage_ratio = fraction of bubble covered by any host event
sync_marker_overlap_ratio   = fraction covered by sync/copy markers
comm_marker_overlap_ratio   = fraction covered by communication markers
```
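Each ratio reduces to clipping the relevant host-event spans to the bubble window, merging the clipped spans, and dividing covered length by bubble length. A sketch, with host events as illustrative (start, end) pairs:

```python
def coverage_ratio(bubble, events):
    """Fraction of the bubble window covered by the union of event spans."""
    bs, be = bubble
    clipped = []
    for s, e in events:
        s, e = max(s, bs), min(e, be)   # clip to the bubble window
        if e > s:
            clipped.append((s, e))
    merged = []
    for s, e in sorted(clipped):        # merge overlapping clipped spans
        if merged and s <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    covered = sum(e - s for s, e in merged)
    return covered / (be - bs) if be > bs else 0.0

# Two host events jointly covering 10..16 of a 10..20 bubble.
r = coverage_ratio((10.0, 20.0), [(8.0, 14.0), (13.0, 16.0)])
```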
Phase 5: SOFT_ATTRIBUTION
For each significant bubble, assign probability-level labels based on overlap ratios:

| Condition | Label |
|---|---|
| sync_overlap ≥ 0.20 | |
| comm_overlap ≥ 0.20 | |
| host_coverage < 0.05 | possible_untraced_host_blocking |
| host_coverage ≥ 0.10 but no sync/comm dominance | |
| host_parallelism < 1.2 and none of above | possible_python_serialization_or_lock |
| nothing applies | |

Multiple labels can co-exist. These are explicitly NOT unique root causes.
Phase 6: ANOMALY_TAGGING
Apply the anomaly tags from the decision tables in the anomaly rules reference. Core risk tags include WAIT_ANCHOR_FALSE_HOTSPOT, UNTRACED_HOST_BLOCKING_RISK, and VARIABLE_SHAPE_SAME_TEMPLATE; see the decision tables for the full tag list.
Phase 7: WAIT_ANCHOR_SCAN
At op level, scan for false hotspots:

```
wait_ratio = wait_us / (duration_us + wait_us)
if wait_ratio > 0.95 and duration_us < 10.0 and total_cost_rank <= 10:
    tag WAIT_ANCHOR_FALSE_HOTSPOT
```
These ops absorb idle wait time and appear expensive, but their kernel execution is negligible. Demote them in root-cause ranking.
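The wait-anchor rule is a direct predicate over three op-level fields. A sketch using the thresholds above:

```python
def is_wait_anchor(wait_us, duration_us, total_cost_rank):
    """True when an op is almost pure wait, tiny in execution, yet ranks as a top cost."""
    wait_ratio = wait_us / (duration_us + wait_us)
    return wait_ratio > 0.95 and duration_us < 10.0 and total_cost_rank <= 10

# An op that waited 5 ms but executed for only 2 us, ranked #3 by total cost.
flagged = is_wait_anchor(wait_us=5000.0, duration_us=2.0, total_cost_rank=3)
```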
Phase 8: GROUP_AGGREGATION
Aggregate step-level metrics by step group:
- Compute avg, median, P90, P95 for each bubble metric
- Periodicity flag = true if ≥60% of steps in the group exhibit a significant bubble
- Dominant bubble scope = whichever of prelaunch / internal bubble / tail contributes most
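The aggregation can be sketched as follows, assuming underfeed_ratio as the bubble metric and a hypothetical 0.10 significance threshold (the real threshold lives in the anomaly rules reference); the nearest-rank percentile is a deliberate simplification:

```python
from statistics import mean, median

def percentile(sorted_vals, p):
    """Nearest-rank percentile over an already-sorted list (simplified convention)."""
    k = max(0, min(len(sorted_vals) - 1, round(p / 100 * (len(sorted_vals) - 1))))
    return sorted_vals[k]

def aggregate(underfeed_ratios, bubble_threshold=0.10):
    """Group statistics plus the >=60%-of-steps periodicity flag."""
    vals = sorted(underfeed_ratios)
    share = sum(1 for v in vals if v >= bubble_threshold) / len(vals)
    return {
        "avg": mean(vals),
        "median": median(vals),
        "p90": percentile(vals, 90),
        "p95": percentile(vals, 95),
        "periodic_bubble": share >= 0.60,
    }

g = aggregate([0.05, 0.12, 0.15, 0.20, 0.18])
```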
Phase 9: PRODUCE_OUTPUT
Merge anomaly results with structure breakdown. The report MUST include:
Hidden Issue Discovery section
- Dominant step/group bubble metrics (service, busy_union, underfeed_ratio)
- Raw kernel evidence table: for each top bubble window, list the kernel(s) immediately before and after the gap — their names, task types, durations, streams — so the human expert can locate the exact spot in the timeline
- Top 5 bubble windows with timestamps, scope, and host evidence
- Bubble periodicity statistics across the step group
- Host evidence coverage assessment
- Soft root-cause labels with evidence chains
- Follow-up sampling recommendations
Bubble-first Summary (fixed 5-question template)
- Are there significant device idle bubbles?
- Which step type/group do they concentrate in?
- Are they primarily prelaunch / tail / internal / inter-step?
- Is there significant host-originated risk?
- Is evidence sufficient for root cause? If not, say so explicitly.
Structure-Level Bubble Drill-Down
For the dominant step, break down bubble contributions per structure:
- Which structure(s) contain the largest internal bubbles?
- Are bubbles concentrated at structure boundaries (between layers) or within structures?
- For structures with large bubbles, what are the surrounding kernel names and task types?
Model Architecture Analysis Pipeline (Separate Markdown Report)
This pipeline produces a
standalone Markdown report file that documents the model architecture as reverse-engineered from profiling data. Read
references/architecture_report_template.md
for the full template, formatting rules, and analysis techniques.
The report is saved as
model_architecture_report_<profiling_dir_name>.md
in the profiling or output directory.
When to produce the architecture report
Always. Every profiling analysis MUST produce this report alongside the anomaly discovery output. The architecture report provides essential context that makes the anomaly findings interpretable.
Architecture Analysis Phases
Phase A1: FIA_TIMELINE_ANALYSIS
Use FusedInferAttentionScore (FIA) invocations as the primary structural marker:
- Extract all FIA kernels from kernel_details.csv (match by kernel name containing the FusedInferAttentionScore identifier)
- Sort by start timestamp
- Classify each FIA as prefill (duration > 10 ms) or decode (duration < 1 ms)
- Determine pass count: num_passes = total_FIA / FIA_per_pass
- Identify phase transitions by timestamp gaps between prefill and decode FIA clusters
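The classification and pass-count steps can be sketched as below; the 10 ms / 1 ms thresholds come from the rules above, while the sample durations are invented:

```python
def classify_fia(fia_durations_ms):
    """Label each FIA invocation by its duration (thresholds from the rules above)."""
    return ["prefill" if d > 10.0 else "decode" if d < 1.0 else "ambiguous"
            for d in fia_durations_ms]

def num_passes(total_fia, fia_per_pass):
    """Derive pass count; a remainder suggests a truncated capture."""
    assert total_fia % fia_per_pass == 0, "partial pass captured?"
    return total_fia // fia_per_pass

labels = classify_fia([28.0, 27.5, 0.2, 0.3])      # invented sample durations
passes = num_passes(total_fia=190, fia_per_pass=95)
```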
Phase A2: PASS_BOUNDARY_DETECTION
For each forward pass, determine:
- FIA index range (e.g., #0–#94)
- Time span (absolute timestamps)
- Wall time from first kernel to last kernel
- Average prefill FIA duration
- Total kernel count
Cross-pass variation (FIA duration, wall time) should be noted — it may reveal KV cache growth or memory pressure effects.
Phase A3: LAYER_CLASSIFICATION
For each layer (delimited by consecutive FIA invocations), extract the kernel sequence and classify:
| Classifier kernel | Layer type |
|---|---|
| No MoE markers (no MoeGatingTopK, no DFC, no GroupedMatmul) | Dense |
| DispatchFFNCombine present | MoE+DFC |
| GroupedMatmul + alltoallv present (no DFC) | MoE+GMM |
| Sampling ops (ArgMax, rejection_*) present | includes decode/sampling logic |
Build a summary table: Layer Type | Layer Range | Count | Characteristics
Phase A4: CROSS_VERIFICATION
Count key ops per pass and verify they match the layer classification:
- FIA count should equal layer count
- DispatchFFNCombine count should match MoE+DFC layer count
- GroupedMatmul count should match MoE+GMM layers + decode layers
- MoeGatingTopK count should match all MoE layers (DFC + GMM + decode)
Discrepancies indicate classification errors — resolve before proceeding.
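The count checks can be expressed as a small consistency function. The layer-type keys (moe_dfc, moe_gmm, decode) are illustrative names, not schema fields:

```python
def cross_verify(op_counts, layer_counts):
    """Return human-readable discrepancies; an empty list means consistent."""
    checks = [
        # FIA count should equal the total layer count.
        ("FusedInferAttentionScore", sum(layer_counts.values())),
        # DispatchFFNCombine count should match MoE+DFC layers.
        ("DispatchFFNCombine", layer_counts.get("moe_dfc", 0)),
        # MoeGatingTopK count should match all MoE layers (DFC + GMM + decode).
        ("MoeGatingTopK",
         layer_counts.get("moe_dfc", 0) + layer_counts.get("moe_gmm", 0)
         + layer_counts.get("decode", 0)),
    ]
    issues = []
    for op, expected in checks:
        actual = op_counts.get(op, 0)
        if actual != expected:
            issues.append(f"{op}: counted {actual}, classification implies {expected}")
    return issues

issues = cross_verify(
    {"FusedInferAttentionScore": 4, "DispatchFFNCombine": 2, "MoeGatingTopK": 3},
    {"dense": 1, "moe_dfc": 2, "moe_gmm": 1},
)
```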
Phase A5: PER_LAYER_SUBSTRUCTURE
For EACH distinct layer type, analyze the kernel execution sequence:
- Group kernels by functional role (attention, projection, TP comm, MoE routing, expert dispatch, expert FFN, post-MoE norm, next-layer prep, EP comm, sampling)
- Measure wall time per functional group
- Identify which stream each group runs on
- Compute timing breakdown: Component | Wall time | Share of layer
- Note anomalies specific to the layer type (warm-up overhead in layer 0, extra kernels in transition layer, AICPU ops that should be on AI_CORE)
Present as kernel sequence trees with timing annotations and stream labels. See the template for the exact tree notation format.
Phase A6: DECODE_ANALYSIS
Decode layers require separate analysis because they have fundamentally different cost profiles:
- FIA duration drops dramatically (prefill 28ms → decode 0.2ms)
- Communication (especially all-to-all for EP) often dominates
- Expert compute shifts from fused (DFC) to explicit (GroupedMatmul)
Produce a dominant costs table and explain why the cost profile differs from prefill.
Phase A7: COMMUNICATION_PIPELINE
Document the multi-stream overlap strategy:
- Map each stream to its purpose (main compute, AI_CPU comm, HCCL, alltoall)
- Measure overlap ratios between streams
- Draw an ASCII pipeline diagram showing how communication hides behind compute
- Compute what fraction of total communication is hidden (overlapped)
- Explain the kernel_sum >> wall relationship
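The hidden-communication fraction reduces to pairwise interval overlap between comm spans and compute spans. A sketch, assuming compute intervals have already been merged so overlap is not double-counted:

```python
def overlap_len(a, b):
    """Length of the intersection of two (start, end) intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def hidden_comm_fraction(comm_intervals, compute_intervals):
    """Fraction of total comm time that overlaps (is hidden behind) compute.

    compute_intervals must be pre-merged (non-overlapping), otherwise
    overlapping compute spans would double-count hidden time.
    """
    total = sum(e - s for s, e in comm_intervals)
    hidden = sum(overlap_len(c, k)
                 for c in comm_intervals for k in compute_intervals)
    return hidden / total if total else 0.0

# 6 units of comm, of which 3 overlap the single compute span.
f = hidden_comm_fraction([(0.0, 4.0), (10.0, 12.0)], [(2.0, 11.0)])
```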
Phase A8: ARCH_REPORT_RENDER
Assemble all findings into the Markdown report following the template in
references/architecture_report_template.md
. All 10 required sections must be present:
- Configuration Context
- Model Architecture Determination (evidence chain table)
- Forward Pass Boundaries (per-pass table)
- Layer Classification (type table)
- Cross-Verification Table
- Per-Layer Sub-Structure (kernel sequence trees + timing breakdowns for EACH layer type)
- Decode Phase Analysis (dominant costs + prefill vs decode comparison)
- Communication Pipeline Structure (stream table + ASCII pipeline diagram)
- Layer-to-Layer Variation (comparative table)
- Model Architecture Summary (ASCII model diagram + execution timeline)
Phase A9: ARCH_REPORT_SAVE
Save the report as
model_architecture_report_<profiling_dir_name>.md
. Inform the user of the file location.
Critical Constraints
- Never skip anomaly discovery because root cause is unclear. Bubble facts are hard conclusions; root cause labels are soft conclusions. Both must always be reported.
- Never call a high-wait tiny-duration op a real hotspot without checking wait-anchor risk.
- Never ignore local bubbles just because the step is device-bound overall.
- Never use only one timing metric — always maintain dual clock accounting (wall_ms + busy_union_ms + kernel_sum_ms + total_cost_ms) at block/side level.
- Always output UNTRACED_HOST_BLOCKING_RISK or possible_untraced_host_blocking when host evidence is sparse — never silently omit the anomaly section.
- Always include raw kernel context around reported bubbles — the kernel names, task types, durations, and stream IDs immediately before and after each bubble gap.
- Use layered certainty language: declarative for facts; possible / probable / insufficient evidence for root causes.
- Always produce the model architecture Markdown report as a separate file — never fold it into the anomaly output alone.
- Never skip per-layer sub-structure analysis. Every distinct layer type must have its own kernel sequence tree with timing breakdown and stream annotations.
- Always include the evidence chain table in the architecture report — the reader must see how layer count and pass structure were determined from raw FIA data.
- Always cross-verify op counts against layer classification before finalizing the architecture report. Discrepancies must be resolved or explicitly noted.
Graceful Degradation
| Missing data | Impact | Action |
|---|---|---|
| No shape information | Cannot detect shape variation | Bubble detection continues; tag VARIABLE_SHAPE_SAME_TEMPLATE skipped |
| No call stacks | Soft attribution specificity degrades | Lower confidence; bubble detection unaffected |
| Sparse host events | Cannot narrow root-cause family | Tag UNTRACED_HOST_BLOCKING_RISK, set requires_host_followup=true |
| Capture boundary truncation | Edge gaps may be artifacts | Flag boundary-adjacent gaps as possible capture artifacts |
| No communication wait data | Cannot assess comm wait | Skip comm overlap, note in evidence gaps. Architecture report omits comm pipeline bandwidth stats but still documents stream roles |
| op_summary only (no kernel_details.csv) | Coarser granularity | Use op-level intervals instead; note in limitations. Architecture report uses op counts for layer classification but cannot produce per-layer kernel sequence trees |
| No FIA kernels detected | Cannot determine layer boundaries via FIA | Architecture report falls back to alternative structural markers (e.g., repeating kernel patterns, communication boundaries). Note reduced confidence in layer count |
| Single forward pass captured | Cannot cross-validate pass consistency | Architecture report documents single pass; notes that cross-pass variation analysis is unavailable |
| No decode FIA detected | Inference-only or prefill-only capture | Architecture report omits decode phase analysis section; notes capture scope limitation |
Output Contract Summary
Every analysis must produce TWO outputs:
1. Anomaly Discovery Output
The top-level JSON object contains global_device_gap_analysis alongside per-step results, step-group aggregates, anomaly tags, soft root-cause labels, and recommendations; the complete key list is defined in the JSON output schema reference.
Each step result must include bubble metrics, anomaly tags, and soft root-cause labels.
For the full JSON schema, read the output schema reference listed in the Reference Files table.
2. Model Architecture Report (Markdown file)
A standalone Markdown file saved as
model_architecture_report_<profiling_dir_name>.md
containing all 10 required sections from the architecture template. Read
references/architecture_report_template.md
for the full specification.
The report must include at minimum:
- Evidence chain proving layer count and pass structure
- Layer classification table with all distinct layer types
- Per-layer kernel sequence trees with timing breakdowns for EACH layer type
- Communication pipeline structure with stream overlap diagram
- Model architecture ASCII summary diagram and per-pass execution timeline
The architecture report is the primary deliverable for understanding model structure. It must be self-contained — a reader should be able to understand the full model execution without referring to the anomaly output.
Recommendations
Each recommendation must include a scope (global/step_group/structure/side/op), an action, its supporting evidence, and a priority (P0–P3).
Common follow-up patterns:
- High bubble + missing stacks → re-profile with call-stack collection enabled
- Missing shapes + unstable grouping → re-profile with input-shape recording enabled
- High host risk + low evidence → host-side sampling / thread view
- High wait pollution → check communication and sync paths
- Large inter-structure bubble → check host-side layer dispatch latency