AscendC Operator Precision Debugging
Locate root causes in five phases, from shallow to deep: analyze the error distribution, review error-prone code points, isolate the cause through controlled experiments, localize it with instrumentation, and verify the fix.
Phase 1: Error Analysis → Phase 2: Code Review → Phase 3: Experimental Isolation → Phase 4: Instrumentation Localization → Phase 5: Fix Verification
Phase 1: Error Analysis
Principle: Check data first, then code. Clarify "where the error is, how severe it is, and what the error looks like" first.
Collect the shape, dtype, and MaxAbsErr/MeanAbsErr/CosineSim of the failed cases, then create csrc/ops/<op_name>/test/debug_<op_name>_precision.py based on scripts/debug_precision_template.py (replace the placeholders and run it) for automatic analysis:
- Error Statistics: MaxAbsErr, MeanAbsErr, MaxRelErr
- First Error Element: Multi-dimensional coordinates + linear index + NPU value vs reference value
- Error Distribution: Number/proportion of error elements, whether error intervals are periodic
- Special Values: Whether output is all-zero, contains NaN/Inf
- Automatic Comparison: Fixed input vs random input, binary search with reduced shape
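The statistics listed above can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not the contents of scripts/debug_precision_template.py: the function name `analyze`, the tolerance defaults, and the flat-list input format are all assumptions made for the example.

```python
import math

def analyze(npu, ref, shape, atol=1e-3, rtol=1e-3):
    """Compare a flat NPU output against a flat reference (lists of float).

    Hypothetical helper: name, tolerances, and return format are assumptions.
    """
    n = math.prod(shape)
    assert len(npu) == len(ref) == n
    abs_err = [abs(a - b) for a, b in zip(npu, ref)]
    # "not (e <= tol)" is also True when e is NaN, so NaN outputs count as errors.
    bad = [i for i in range(n) if not abs_err[i] <= atol + rtol * abs(ref[i])]
    finite = [i for i in range(n) if math.isfinite(abs_err[i])]
    dot = sum(npu[i] * ref[i] for i in finite)
    norm = (math.sqrt(sum(npu[i] ** 2 for i in finite))
            * math.sqrt(sum(ref[i] ** 2 for i in finite)))
    first = None
    if bad:
        idx, coords = bad[0], []
        for dim in reversed(shape):  # unravel the linear index (row-major layout)
            coords.append(idx % dim)
            idx //= dim
        first = {"linear": bad[0], "coords": tuple(reversed(coords)),
                 "npu": npu[bad[0]], "ref": ref[bad[0]]}
    return {
        "max_abs_err": max((abs_err[i] for i in finite), default=0.0),
        "mean_abs_err": sum(abs_err[i] for i in finite) / max(len(finite), 1),
        "max_rel_err": max((abs_err[i] / max(abs(ref[i]), 1e-12) for i in finite),
                           default=0.0),
        "cosine_sim": dot / norm if norm > 0 else float("nan"),
        "error_count": len(bad),
        "error_ratio": len(bad) / n,
        "first_error": first,
        "all_zero": all(v == 0 for v in npu),
        "has_nan_inf": any(not math.isfinite(v) for v in npu),
    }
```

Non-finite elements are excluded from the numeric statistics and reported through `has_nan_inf` instead, so a single NaN does not hide the rest of the distribution.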
Error Characteristics → Preliminary Judgment
| Phenomenon | Most Likely Cause | Next Step |
|---|---|---|
| FP16 fails, FP32 passes | Failure to upcast to FP32 for computation | Check Cast in Phase 2 |
| All-zero output | CopyOut not executed / GM offset error | Check CopyOut in Phase 2 |
| Output contains NaN/Inf | Division by zero / log of negative number / overflow | Check Compute in Phase 2 |
| All elements deviate, CosineSim≈1 | Systematic precision loss | Check upcasting in Phase 2 |
| Periodic/striped errors | Tile boundary / data transfer offset | Conduct experiments in Phase 3 |
| Only tail elements are wrong | Tail tile length / alignment | Check tail tile in Phase 2 |
| Results differ between runs | Insufficient synchronization of asynchronous operations | Conduct Experiment B in Phase 3 |
| Small shape passes, large shape fails | Multi-core/tiling boundary | Conduct Experiment A in Phase 3 |
| Fixed input passes, random input fails | Address/stride/offset error | Conduct Experiment C in Phase 3 |
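To see why the "FP16 fails, FP32 passes" row points at missing upcasting, the effect can be reproduced on the host. The sketch below emulates FP16 rounding with the standard `struct` module ("e" is the half-precision format code) and uses a Python float as a stand-in for the wider accumulator. The helper names are hypothetical, and this models only rounding behavior, not any specific AscendC API.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def sum_fp16_accumulator(values):
    """Accumulate with every partial sum rounded back to FP16 (no upcast)."""
    acc = 0.0
    for v in values:
        acc = to_fp16(acc + to_fp16(v))
    return acc

def sum_wide_accumulator(values):
    """FP16 inputs, wider accumulator, one rounding to FP16 at the end."""
    acc = 0.0
    for v in values:
        acc += to_fp16(v)
    return to_fp16(acc)

values = [0.1] * 4096
naive = sum_fp16_accumulator(values)   # stalls once the FP16 spacing exceeds the addend
upcast = sum_wide_accumulator(values)  # stays close to the true sum
print(naive, upcast)
```

The FP16 accumulator stops growing once the spacing between representable values becomes larger than twice the addend, which is the kind of systematic loss that upcasting before computation avoids.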
Phase 2: Code Review
MANDATORY: Read
,
, and
(if exists), then troubleshoot from shallow to deep according to the following checklist.
Layer 1: Basic Correctness (Highest Frequency)
Layer 2: Data Transfer and Alignment
Layer 3: Tiling and Multi-core
Layer 4: API Traps
Layer 5: Boundary Cases
Checkpoint: Output a review report listing the suspected issues, sorted by likelihood. If the root cause is identified, jump to Phase 5; otherwise enter Phase 3.
Phase 3: Experimental Isolation
When Phase 2 cannot pin down the root cause directly, narrow the scope through controlled-variable experiments. Change only one variable at a time.
Experiment A: block_dim → 1 (Multi-core Isolation)
Temporarily hardcode block_dim to 1 in op_host, then recompile and test. This can be combined with a reduced shape.
| Result | Conclusion |
|---|---|
| Single-core passes, multi-core fails | Inter-core issue: GM interval overlap / tiling mapping / inter-core synchronization |
| Single-core also fails | Non-multi-core issue → Experiment B |
Experiment B: PipeBarrier<PIPE_ALL> (Synchronization Isolation)
Temporarily replace all synchronization in the kernel's Process with AscendC::PipeBarrier<PIPE_ALL>() (add one between CopyIn / Compute / CopyOut).
| Result | Conclusion |
|---|---|
| Passes after full barrier | Insufficient intra-core synchronization → Gradually restore fine-grained synchronization for localization |
| Still fails | Non-synchronization issue → Experiment C |
PipeBarrier<PIPE_ALL> is only for experimental isolation; never use it as the final solution.
Experiment C: Fixed/Regular Input (Address Isolation)
Test with all-ones, arithmetic-sequence, and random inputs in turn.
| Result | Conclusion |
|---|---|
| All-1 passes, arithmetic/random fails | Address/offset/stride error (constant input masks offset issue) |
| All fail | Calculation logic or global tiling error |
| All pass | Precision issue triggered by specific value range → Check boundary/extreme values |
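A minimal sketch of Experiment C follows: it builds the three input patterns and maps the pass/fail outcome to the conclusions in the table above. `make_inputs` and `interpret` are hypothetical names, and running the operator and deciding pass/fail is left to your own test harness.

```python
import random

def make_inputs(n, seed=0):
    """Build the three input patterns as flat lists of n floats."""
    rng = random.Random(seed)  # fixed seed keeps the random case reproducible
    return {
        "all_ones": [1.0] * n,
        "arithmetic": [float(i) for i in range(n)],
        "random": [rng.uniform(-1.0, 1.0) for _ in range(n)],
    }

def interpret(passed):
    """Map {'all_ones': bool, 'arithmetic': bool, 'random': bool} to a conclusion."""
    ones, arith, rand = passed["all_ones"], passed["arithmetic"], passed["random"]
    if ones and arith and rand:
        return "value-range-specific precision issue"
    if not (ones or arith or rand):
        return "calculation logic or global tiling error"
    if ones:
        # Constant input passes, position-dependent input fails.
        return "address/offset/stride error"
    return "inconclusive"
```

The arithmetic sequence is useful because every element encodes its own index, so a misplaced element is visible directly in the output. For FP16, integers above 2048 are no longer exactly representable, so a shorter or wrapped sequence may be preferable there.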
Experiment D: Reduce Shape (Boundary Isolation)
Grow the shape step by step from a small passing case up to the original shape, locate the exact boundary where failure starts, then reverse-engineer the tile/core boundary from it.
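The boundary search can be automated with a bisection over the element count. The sketch assumes a hypothetical `run_case(n)` callable that runs the operator on `n` elements and returns True on pass, and it assumes the behavior is monotonic (every size below the boundary passes, every size at or above it fails). If that assumption does not hold, fall back to a linear scan.

```python
def first_failing_size(run_case, lo, hi):
    """Return the smallest failing size in (lo, hi].

    run_case is a hypothetical callable: run_case(n) -> True if the case passes.
    lo must be a known passing size and hi a known failing size.
    """
    assert run_case(lo), "lo must pass"
    assert not run_case(hi), "hi must fail"
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if run_case(mid):
            lo = mid   # boundary is above mid
        else:
            hi = mid   # boundary is at or below mid
    return hi
```

Comparing the returned size with multiples of tileLength and of the per-core block length usually shows which boundary was crossed.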
Reverse-engineering from First Error Index + Tiling
First error linear index → Which tile → Which core → GM start offset of the core → Expected byte count for transfer
Period = tileLength → Transfer/offset issue; Period = vector width → Calculation process issue; Aligned with core boundary → Multi-core/offset issue.
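The index-to-core mapping and the period check can be sketched as below. The tiling model is an assumption for illustration: the data is split evenly across `block_dim` cores and each core processes its block in tiles of `tile_length` elements. Adapt the arithmetic to the operator's actual tiling; `locate` and `error_period` are hypothetical helper names.

```python
from functools import reduce
from math import gcd

def locate(index, total_length, block_dim, tile_length, dtype_bytes):
    """Map a linear error index to core, tile, and GM offsets.

    Assumes an even split across cores (total_length % block_dim == 0).
    """
    block_length = total_length // block_dim
    core = index // block_length
    in_core = index % block_length
    return {
        "core": core,
        "tile": in_core // tile_length,
        "offset_in_tile": in_core % tile_length,
        "core_gm_start_elem": core * block_length,
        "core_gm_start_byte": core * block_length * dtype_bytes,
        "tile_bytes": tile_length * dtype_bytes,  # expected bytes per full-tile transfer
    }

def error_period(bad_indices):
    """GCD of the gaps between the starts of consecutive error runs.

    Returns None when there are fewer than two runs.
    """
    starts = [b for i, b in enumerate(bad_indices)
              if i == 0 or b != bad_indices[i - 1] + 1]
    gaps = [b - a for a, b in zip(starts, starts[1:])]
    return reduce(gcd, gaps) if gaps else None
```

For example, errors starting at indices 120, 248, and 376 give a period of 128, which can then be compared with tileLength, the vector width, and the per-core block length.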
Phase 4: Instrumentation Localization
After narrowing the issue down to a specific phase or tile, use AscendC::printf and AscendC::DumpTensor for precise localization.
Core Rules
- Print only on core 0: When every core runs the same calculation logic, add if (AscendC::GetBlockIdx() == 0) to reduce output volume.
- Read after synchronization: A LocalTensor can only be read after DeQue / PipeBarrier; otherwise dirty data from an unfinished transfer will be read.
- Convert FP16 to float first: AscendC::printf("v=%.6f\n", static_cast<float>(tensor.GetValue(idx))); printing a half-precision value directly produces garbage output.
- Use desc to distinguish phases: Give the desc parameter of DumpTensor a distinct value per phase (0=after CopyIn, 1=middle of Compute, 2=before CopyOut).
- Start small: Begin with a small dumpSize for DumpTensor; a large size can overflow the dump buffer or truncate the output.
printf vs DumpTensor Selection
| Scenario | Tool |
|---|---|
| Scalar, branch judgment, single index | AscendC::printf |
| Quick scan of a continuous tensor segment | AscendC::DumpTensor(tensor, desc, dumpSize) |
| Full element-by-element comparison | Do not do this inside the kernel; read GM on the Host and compare with a Python script |
Instrumentation Strategy
Add instrumentation step by step in the Compute function after DeQue, and compare each value with intermediate results computed manually on the Python side from the same input. The first step where a deviation appears points to the root cause.
```cpp
// Example: core 0, tile 0
if (AscendC::GetBlockIdx() == 0 && progress == 0) {
    // Cast to float first: printing half-precision values directly gives garbage.
    AscendC::printf("[step1] tmp[0]=%.6f\n", static_cast<float>(tmp.GetValue(0)));
}
```
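On the Python side, the printed values can be matched against the manually computed intermediates to find the first deviating step. The sketch below assumes log lines in the `[stepN] name=value` form used in the example above. The function names and tolerances are hypothetical, and the reference values are whatever you compute by hand for the same input.

```python
import re

LINE = re.compile(r"\[(\w+)\] \S+=(\S+)")

def parse_log(text):
    """Extract {step: value} from lines such as '[step1] tmp[0]=0.500000'."""
    return {m.group(1): float(m.group(2)) for m in LINE.finditer(text)}

def first_deviating_step(kernel, reference, atol=1e-3, rtol=1e-3):
    """Return the first step (in reference order) whose kernel value deviates.

    Returns None if every step matches within tolerance.
    """
    for step, ref in reference.items():  # dicts preserve insertion order
        got = kernel.get(step)
        if got is None or not abs(got - ref) <= atol + rtol * abs(ref):
            return step
    return None

log = "[step1] tmp[0]=0.500000\n[step2] tmp[0]=0.250000\n[step3] tmp[0]=0.900000\n"
reference = {"step1": 0.5, "step2": 0.25, "step3": 0.75}  # hand-computed values
print(first_deviating_step(parse_log(log), reference))
```

A step missing from the log is reported as deviating, which also catches instrumentation that never executed.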
Phase 5: Fix Verification
Common Fix Patterns
| Root Cause | Fix |
|---|---|
| FP16 not upcast | Add Cast(fp16→fp32) + computation + Cast(fp32→fp16) |
| GM offset error | Correct offset formula (element vs byte) |
| Wrong tail tile length | Use curTileLength for computation/transfer, use tileLength for offset |
| Incorrect tiling parameter | Correct tiling calculation on Host side |
| Missing synchronization | Add correct EnQue/DeQue or PipeBarrier |
| ReduceSum overwrites source | Backup with Adds first then ReduceSum |
| Wrong transfer length | Correct copyLen in DataCopyExtParams |
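The tail-tile rule in the table (curTileLength for computation and transfer, tileLength for the offset) can be checked on the host with a small model of the tile loop. This is a sketch of the index arithmetic only, with a hypothetical helper name; it does not reproduce any AscendC call.

```python
def tile_plan(block_length, tile_length):
    """Yield (offset, cur_tile_length) for each tile of one core's block."""
    num_tiles = (block_length + tile_length - 1) // tile_length  # ceiling division
    for i in range(num_tiles):
        offset = i * tile_length                       # offset uses the full tileLength
        cur = min(tile_length, block_length - offset)  # only the last tile can be shorter
        yield offset, cur

# 1000 elements in tiles of 256: three full tiles and a 232-element tail.
print(list(tile_plan(1000, 256)))
```

Using the shortened length for the offset as well would shift every tile after a short one, and using the full length for the tail transfer would read or write past the end of the block.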
After Fix
- Remove all debugging instrumentation (printf/DumpTensor), or wrap it in a compile-time debug guard
- Recompile and install
- Run original failed case + full precision test
- Still fails → Return to Phase 1 (max 3 rounds), report to user if still fails after 3 rounds
Output Requirements (MANDATORY)
After debugging, you MUST present the following in the conversation: Issue Summary, Root Cause Analysis, Fix Content, Verification Result, and at least 2 key lessons. NEVER reply with only "Fixed".
Typical Cases (Load On Demand)
After locating a suspected root cause, load the corresponding case to understand the complete troubleshooting process:
| Error Phenomenon | Case File | When to Load |
|---|---|---|
| FP16 fails, FP32 passes, all elements deviate | examples/fp16-no-upcast.md | Suspect missing upcasting |
| First error at tile boundary, period = tileLength | examples/gm-offset-error.md | Suspect GM offset error |
| Only a few tail elements are wrong | examples/tail-tile-misalign.md | Suspect tail tile handling |
| block_dim=1 passes, multi-core fails | examples/multicore-tiling-overlap.md | Suspect inter-core tiling |
| Results differ between runs | examples/async-sync-missing.md | Suspect missing synchronization |
Do not load all cases at once. Only load corresponding case when error characteristics match.
Anti-Patterns (NEVER)
- NEVER modify code directly without analyzing error distribution
- NEVER printf a full tensor in a loop inside the kernel; use DumpTensor or Host-side comparison
- NEVER print massively on multiple cores at the same time; add if (AscendC::GetBlockIdx() == 0) to print only on core 0
- NEVER read a LocalTensor at an unsynchronized position; the read must come after DeQue/PipeBarrier
- NEVER use PipeBarrier<PIPE_ALL> as the final fix; it is only for experimental isolation
- NEVER leave debugging code after fix
- NEVER only fix known failed cases without running full precision test
- NEVER keep trying after 3 failed rounds; report to the user instead