ascendc-operator-precision-debug

Debugging and root-cause localization for AscendC operator precision issues. Use it when operator precision tests fail (allclose failure, result deviation, all-zero or NaN output, and so on). Process: Error Distribution Analysis → Code Error-Prone Point Review → Experimental Isolation → printf/DumpTensor Instrumentation → Fix Verification. Keywords: precision debugging, precision issue, result inconsistency, error localization, allclose failure, output deviation, NaN, all-zero, precision debug.

Source: ascend/agent-skills

Added on: 2026-04-15

NPX Install

npx skill4agent add ascend/agent-skills ascendc-operator-precision-debug

SKILL.md Content (translated from Chinese)

AscendC Operator Precision Debugging

Locate the root cause in five phases, from shallow to deep: analyze the error distribution, review error-prone code points, isolate the problem experimentally, localize it with instrumentation, and verify the fix.

Phase 1: Error Analysis → Phase 2: Code Review → Phase 3: Experimental Isolation → Phase 4: Instrumentation Localization → Phase 5: Fix Verification

Phase 1: Error Analysis

Principle: Check data first, then code. Clarify "where the error is, how severe it is, and what the error looks like" first.

Collect the shape, dtype, and MaxAbsErr/MeanAbsErr/CosineSim of the failed cases, then create `csrc/ops/<op_name>/test/debug_<op_name>_precision.py` from `scripts/debug_precision_template.py` (replace the placeholders and run it) for automatic analysis:

Error Statistics: MaxAbsErr, MeanAbsErr, MaxRelErr
First Error Element: Multi-dimensional coordinates + linear index + NPU value vs reference value
Error Distribution: Number/proportion of error elements, whether error intervals are periodic
Special Values: Whether output is all-zero, contains NaN/Inf
Automatic Comparison: Fixed input vs random input, binary search with reduced shape
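
The statistics listed above can be sketched in a few lines of plain Python. This is not the content of `scripts/debug_precision_template.py`, which is not shown here; the function names `unravel` and `analyze_errors`, the tolerance defaults, and the "equal gaps between bad indices" periodicity test are all assumptions made for illustration.

```python
import math

def unravel(index, shape):
    """Convert a linear (row-major) index into multi-dimensional coordinates."""
    coords = []
    for dim in reversed(shape):
        index, rem = divmod(index, dim)
        coords.append(rem)
    return tuple(reversed(coords))

def analyze_errors(npu, ref, shape, atol=1e-3, rtol=1e-3):
    """Summarize element-wise errors between two flat lists of floats."""
    assert len(npu) == len(ref) == math.prod(shape)
    abs_err = [abs(a - b) for a, b in zip(npu, ref)]
    # NaN/Inf errors are excluded from the statistics but always counted as bad.
    finite = [e for e in abs_err if math.isfinite(e)]
    rel_err = [e / max(abs(b), 1e-12)
               for e, b in zip(abs_err, ref) if math.isfinite(e)]
    bad = [i for i, (e, b) in enumerate(zip(abs_err, ref))
           if not math.isfinite(e) or e > atol + rtol * abs(b)]
    # A single distinct gap between consecutive bad indices suggests a period.
    gaps = {j - i for i, j in zip(bad, bad[1:])}
    first = bad[0] if bad else None
    return {
        "max_abs_err": max(finite, default=0.0),
        "mean_abs_err": sum(finite) / len(finite) if finite else 0.0,
        "max_rel_err": max(rel_err, default=0.0),
        "first_error_index": first,
        "first_error_coords": unravel(first, shape) if bad else None,
        "num_bad": len(bad),
        "bad_ratio": len(bad) / len(ref),
        "period": gaps.pop() if len(gaps) == 1 else None,
        "all_zero": all(v == 0 for v in npu),
        "has_nan": any(math.isnan(v) for v in npu),
        "has_inf": any(math.isinf(v) for v in npu),
    }
```

A detected period can then be compared against `tileLength` or the per-core length when reading the table below.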

Error Characteristics → Preliminary Judgment

| Phenomenon | Most Likely Cause | Next Step |
| --- | --- | --- |
| FP16 fails, FP32 passes | Failure to upcast to FP32 for computation | Check Cast in Phase 2 |
| All-zero output | CopyOut not executed / GM offset error | Check CopyOut in Phase 2 |
| Output contains NaN/Inf | Division by zero / log of negative number / overflow | Check Compute in Phase 2 |
| All elements deviate, CosineSim ≈ 1 | Systematic precision loss | Check upcasting in Phase 2 |
| Periodic/striped errors | Tile boundary / data transfer offset | Conduct experiments in Phase 3 |
| Only tail elements are wrong | Tail tile length / alignment | Check tail tile in Phase 2 |
| Different results across runs | Insufficient asynchronous synchronization | Conduct Experiment B in Phase 3 |
| Small shape passes, large shape fails | Multi-core/tiling boundary | Conduct Experiment A in Phase 3 |
| Fixed input passes, random input fails | Address/stride/offset error | Conduct Experiment C in Phase 3 |
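
The "FP16 fails, FP32 passes" and "systematic precision loss" rows both point at missing upcasting. As a rough host-side illustration of why this matters, the sketch below simulates half-precision rounding with Python's `struct` format `e`. It is not AscendC code, the helper names are made up for this example, and it models only rounding of an accumulator, not the behaviour of any specific NPU instruction.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_fp32(x):
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def sum_in_fp16(values):
    """Accumulate with every partial sum rounded to FP16 (no upcast)."""
    acc = 0.0
    for v in values:
        acc = to_fp16(acc + v)
    return acc

def sum_with_upcast(values):
    """Accumulate in FP32 and cast back to FP16 once at the end."""
    acc = 0.0
    for v in values:
        acc = to_fp32(acc + to_fp32(v))
    return to_fp16(acc)

values = [to_fp16(0.1)] * 4096          # FP16 input data
exact = sum(values)                     # double-precision reference
err_fp16 = abs(sum_in_fp16(values) - exact)
err_upcast = abs(sum_with_upcast(values) - exact)
print(f"fp16 accumulate error: {err_fp16:.4f}, upcast error: {err_upcast:.4f}")
```

The pure-FP16 accumulator stops growing once the spacing between representable values exceeds the addend, which is the kind of shape-dependent deviation that disappears when the same test runs in FP32.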

Phase 2: Code Review

MANDATORY: Read `op_host/<op_name>.cpp`, `op_kernel/<op_name>.cpp`, and `design.md` (if it exists), then work through the following checklist from shallow to deep.

Layer 1: Basic Correctness (Highest Frequency)

FP16/BF16 not upcast: In Compute, is half-precision data first `Cast` to FP32, computed, and then `Cast` back? This is the most frequent precision bug.
Incorrect calculation formula: Compare the API call sequence with the design document/PyTorch step by step: operation order, scalar sign, missing steps.
GM offset unit confusion: `xGm[progress * tileLength]` is an element offset; do not additionally multiply by `sizeof(T)`.
tileLength vs curTileLength: Use `tileLength` for offsets and `curTileLength` for computation/transfer (the tail tile may be smaller).
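
The last two checklist items can be modelled on the host with a few lines of Python. This is only a sketch of the indexing logic, not kernel code; `tile_plan` and `run_copy_kernel` are hypothetical names, and Python list slices stand in for GM and UB.

```python
def tile_plan(block_length, tile_length):
    """Yield (element_offset, cur_tile_length) for each tile handled by one core."""
    tile_num = (block_length + tile_length - 1) // tile_length
    for progress in range(tile_num):
        # Offsets are in elements and always use the full tile_length.
        offset = progress * tile_length
        # The tail tile may be shorter than tile_length.
        cur_tile_length = min(tile_length, block_length - offset)
        yield offset, cur_tile_length

def run_copy_kernel(src, tile_length):
    """Copy src tile by tile, the way a CopyIn/CopyOut loop would."""
    dst = [0] * len(src)
    for offset, cur in tile_plan(len(src), tile_length):
        dst[offset:offset + cur] = src[offset:offset + cur]
    return dst

print(list(tile_plan(10, 4)))
```

Swapping the two lengths (using `cur_tile_length` for the offset, or `tile_length` for the tail transfer) reproduces the "only tail elements are wrong" symptom from Phase 1.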

Layer 2: Data Transfer and Alignment

DataCopyPad copyLen: The `copyLen` in `DataCopyExtParams` is a byte count, i.e. `curTileLength * sizeof(T)`.
Tail tile alignment: When the tail tile does not meet 32B alignment, is `alignedTailLen` calculated and used correctly?
Inconsistent offsets for multiple inputs: When the input tensors have different shapes (such as x vs cos/sin in RoPE), is the offset calculation correct for each one?
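
The byte count and the aligned tail length are easy to confuse, so it can help to compute both by hand for the failing shape. The sketch below assumes a 32-byte block and rounds the tail length up to a whole number of blocks; the function name `copy_params` is hypothetical, and how a particular kernel actually uses `alignedTailLen` has to be checked against its own code.

```python
def copy_params(cur_tile_length, dtype_size, block_bytes=32):
    """Return (copy length in bytes, tile length rounded up to 32B in elements)."""
    copy_len_bytes = cur_tile_length * dtype_size          # what the transfer needs
    elems_per_block = block_bytes // dtype_size            # e.g. 16 for FP16, 8 for FP32
    blocks = -(-cur_tile_length // elems_per_block)        # ceiling division
    aligned_tail_len = blocks * elems_per_block            # element count, not bytes
    return copy_len_bytes, aligned_tail_len

# A 10-element FP16 tail: 20 bytes to copy, 16 elements after alignment.
print(copy_params(10, 2))
```

Comparing these two numbers with the values printed from the kernel is a quick way to spot an element/byte mix-up.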

Layer 3: Tiling and Multi-core

Host/Kernel tiling inconsistency: Does the same symbol (e.g., `tileLength`) have the same meaning in host and kernel?
Inter-core boundary overlap/omission: Does formerNum × formerLength + tailNum × tailLength exactly cover all data?
bufferCoefficient error: Check against the UB allocation table in the design document; an incorrect coefficient skews tileLength.
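
The coverage question can be checked mechanically on the host. The sketch below assumes one common splitting scheme (the first `formerNum` cores take one extra element); the operator under test may split differently, so `split_across_cores` is only a stand-in for the real host tiling function, and all three function names are hypothetical.

```python
def split_across_cores(total_length, block_dim):
    """One possible former/tail split: former cores get one extra element."""
    former_num = total_length % block_dim
    tail_num = block_dim - former_num
    tail_length = total_length // block_dim
    former_length = tail_length + 1
    return former_num, former_length, tail_num, tail_length

def core_ranges(former_num, former_length, tail_num, tail_length):
    """Return the half-open [start, end) element range of each core."""
    ranges, start = [], 0
    for core in range(former_num + tail_num):
        length = former_length if core < former_num else tail_length
        ranges.append((start, start + length))
        start += length
    return ranges

def covers_exactly(ranges, total_length):
    """True if the ranges tile [0, total_length) with no gap and no overlap."""
    pos = 0
    for start, end in sorted(ranges):
        if start != pos:
            return False
        pos = end
    return pos == total_length

plan = split_across_cores(1003, 8)
print(plan, covers_exactly(core_ranges(*plan), 1003))
```

Feeding `covers_exactly` the ranges the kernel actually derives (for example from printed start offsets) makes overlaps and omissions visible without reading every tiling formula.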

Layer 4: API Traps

ReduceSum/Max modifies source data: Reduction may overwrite the source tensor; if the source is reused later, back it up first with `Adds(backup, src, 0.0f, len)`.
AllocTensor/FreeTensor not paired: They must be strictly paired with EnQue/DeQue, otherwise buffers leak.
Vector length parameter: The length parameter of AscendC vector APIs is a number of elements, not a byte count.

Layer 5: Boundary Cases

Division by zero / domain out of bounds: Prevent zero for Div, Reciprocal; Ln requires positive numbers; Sqrt requires non-negative numbers.
Tiling integer overflow: Is multiplication likely to overflow int32? int64_t is recommended.
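
To see how a tiling product can go wrong, the sketch below emulates two's-complement wraparound of a 32-bit signed value in Python. In C++ signed overflow is undefined behaviour, so wraparound is only the commonly observed outcome, not a guarantee; `as_int32` and the example dimensions are made up for this illustration.

```python
INT32_MAX = (1 << 31) - 1

def as_int32(value):
    """Wrap an integer the way a 32-bit two's-complement register would."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value > INT32_MAX else value

rows, cols = 65536, 40000
total_elements = rows * cols            # exact value, as int64_t would hold it
wrapped = as_int32(total_elements)      # what an int32 product may end up as

print(total_elements > INT32_MAX, wrapped < 0)
```

A negative or implausibly small total length on the host side is a strong hint that a tiling variable should be widened to `int64_t`.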

Checkpoint: Output a review report listing the suspected issues, sorted by likelihood. If the root cause is identified, jump to Phase 5; otherwise enter Phase 3.

Phase 3: Experimental Isolation

When root cause cannot be directly locked in Phase 2, narrow down the scope through controlled variable experiments. Change only one variable each time.

Experiment A: block_dim → 1 (Multi-core Isolation)

Temporarily hardcode `blockDim = 1` in op_host, then recompile and test. This can be combined with a reduced shape.

| Result | Conclusion |
| --- | --- |
| Single-core passes, multi-core fails | Inter-core issue: GM interval overlap / tiling mapping / inter-core synchronization |
| Single-core also fails | Non-multi-core issue → Experiment B |

Experiment B: PipeBarrier<PIPE_ALL> (Synchronization Isolation)

Temporarily replace all synchronization in the kernel's Process with `AscendC::PipeBarrier<PIPE_ALL>()` (add one between CopyIn / Compute / CopyOut).

| Result | Conclusion |
| --- | --- |
| Passes after full barrier | Insufficient intra-core synchronization → Gradually restore fine-grained synchronization for localization |
| Still fails | Non-synchronization issue → Experiment C |

`PIPE_ALL` is only for experimental isolation; never use it as the final solution.

Experiment C: Fixed/Regular Input (Address Isolation)

Test with all-ones, an arithmetic sequence (`torch.arange`), and random input in turn.

| Result | Conclusion |
| --- | --- |
| All-ones passes, arithmetic/random fails | Address/offset/stride error (constant input masks the offset issue) |
| All fail | Calculation logic or global tiling error |
| All pass | Precision issue triggered by a specific value range → Check boundary/extreme values |
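
Why a constant input hides offset bugs can be shown with a toy model. The sketch below is a deliberately broken tile-wise copy in plain Python, not AscendC code; `buggy_copy` and the sizes are invented for the illustration.

```python
def buggy_copy(src, tile_length):
    """Tile-wise copy with a deliberate bug: every tile reads from offset 0."""
    dst = []
    for offset in range(0, len(src), tile_length):
        cur = min(tile_length, len(src) - offset)
        dst.extend(src[0:cur])          # BUG: should be src[offset:offset + cur]
    return dst

def first_mismatch(out, ref):
    """Return the first index where out and ref differ, or None."""
    return next((i for i, (a, b) in enumerate(zip(out, ref)) if a != b), None)

n, tile_length = 16, 4
all_ones = [1.0] * n
arange = [float(i) for i in range(n)]

print(first_mismatch(buggy_copy(all_ones, tile_length), all_ones))  # constant input
print(first_mismatch(buggy_copy(arange, tile_length), arange))      # arithmetic input
```

With the arithmetic sequence, the wrong value itself tells which source index was actually read, which makes it the most informative of the three inputs for address problems.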

Experiment D: Reduce Shape (Boundary Isolation)

`shape=(32,)` → `(tileLength,)` → `(tileLength*2,)` → original shape: find the exact boundary where the failure starts, then work back to the tile/core boundary.

Reverse-engineering from First Error Index + Tiling

First error linear index → Which tile → Which core → GM start offset of the core → Expected byte count for transfer

Period = tileLength → Transfer/offset issue; Period = vector width → Calculation process issue; Aligned with core boundary → Multi-core/offset issue.
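
The index-to-tile-to-core chain above can be written down as a small helper. This sketch assumes each core owns one contiguous element range and tiles it from its own start offset; `locate` and the example numbers are hypothetical and must be adapted to the operator's real tiling.

```python
def locate(first_err_index, core_ranges, tile_length, dtype_size):
    """Map a linear error index to core, tile, GM offsets and expected copy size."""
    for core, (start, end) in enumerate(core_ranges):
        if start <= first_err_index < end:
            local = first_err_index - start            # index within the core
            tile = local // tile_length
            tile_start = start + tile * tile_length    # element offset in GM
            cur_tile_length = min(tile_length, end - tile_start)
            return {
                "core": core,
                "tile": tile,
                "core_gm_offset": start,
                "tile_gm_offset": tile_start,
                "offset_in_tile": local % tile_length,
                "expected_copy_bytes": cur_tile_length * dtype_size,
            }
    raise ValueError("index is not covered by any core range")

# Two cores with 100 elements each, tileLength = 32, FP16 data.
print(locate(196, [(0, 100), (100, 200)], tile_length=32, dtype_size=2))
```

An `offset_in_tile` of 0 means the first error sits exactly on a tile start, which matches the "period = tileLength" pattern described above.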

Phase 4: Instrumentation Localization

After narrowing the issue down to a specific phase/tile, use `AscendC::printf` and `AscendC::DumpTensor` for precise localization.

Core Rules

Print only on core 0: When every core runs the same logic, add `if (AscendC::GetBlockIdx() == 0)` to reduce output volume.
Read after synchronization: A `LocalTensor` can only be read after `DeQue`/`PipeBarrier`; otherwise stale data from an unfinished transfer is read.
Convert FP16 to float first: `AscendC::printf("v=%.6f\n", static_cast<float>(tensor.GetValue(idx)));` Printing half-precision values directly produces garbage output.
Use desc to distinguish phases: Set the desc parameter of DumpTensor per phase (0 = after CopyIn, 1 = middle of Compute, 2 = before CopyOut).
Start small: Start with a small dumpSize for DumpTensor; a large size can cause buffer overflow or truncation.

printf vs DumpTensor Selection

| Scenario | Tool |
| --- | --- |
| Scalar, branch judgment, single index | `AscendC::printf` |
| Quick scan of a continuous tensor segment | `AscendC::DumpTensor(tensor, desc, dumpSize)` |
| Full element-by-element comparison | Do not do this inside the kernel: read GM on the Host and compare with a Python script |

Instrumentation Strategy

Add instrumentation step by step in the Compute function after DeQue, and compare each value with intermediate results computed by hand on the Python side from the same input. The first step where a deviation appears is where the root cause lies.

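```cpp
// Example: print only on core 0 and only for the 0th tile
if (AscendC::GetBlockIdx() == 0 && progress == 0) {
    // Cast to float before printing so half-precision values are readable
    AscendC::printf("[step1] tmp[0]=%.6f\n", static_cast<float>(tmp.GetValue(0)));
}
```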

Phase 5: Fix Verification

Common Fix Patterns

| Root Cause | Fix |
| --- | --- |
| FP16 not upcast | Add Cast(fp16→fp32) + computation + Cast(fp32→fp16) |
| GM offset error | Correct the offset formula (element vs byte) |
| Wrong tail tile length | Use curTileLength for computation/transfer, tileLength for offsets |
| Incorrect tiling parameter | Correct the tiling calculation on the Host side |
| Missing synchronization | Add the correct EnQue/DeQue or PipeBarrier |
| ReduceSum overwrites source | Back up with Adds first, then call ReduceSum |
| Wrong transfer length | Correct copyLen in DataCopyExtParams |

After Fix

Remove all debugging instrumentation (printf/DumpTensor), or wrap it in `#ifdef DEBUG_PRECISION`
Recompile and install
Run the original failed case + the full precision test
Still fails → Return to Phase 1 (max 3 rounds); report to the user if it still fails after 3 rounds

Output Requirements (MANDATORY)

After debugging, you MUST show in the conversation: Issue Summary, Root Cause Analysis, Fix Content, Verification Result, and ≥2 key lessons. NEVER reply with only "Fixed".

Typical Cases (Load On Demand)

After locating the suspected root cause, load the corresponding case to understand the complete troubleshooting process:

| Error Phenomenon | Case File | When to Load |
| --- | --- | --- |
| FP16 fails, FP32 passes, all elements deviate | `examples/fp16-no-upcast.md` | Suspect missing upcasting |
| First error at tile boundary, period = tileLength | `examples/gm-offset-error.md` | Suspect GM offset error |
| Only a few tail elements are wrong | `examples/tail-tile-misalign.md` | Suspect tail tile handling |
| block_dim=1 passes, multi-core fails | `examples/multicore-tiling-overlap.md` | Suspect inter-core tiling |
| Different results across runs | `examples/async-sync-missing.md` | Suspect missing synchronization |

Do not load all cases at once. Load a case only when the error characteristics match it.

Anti-Patterns (NEVER)

NEVER modify code directly without analyzing the error distribution
NEVER printf a full tensor in a loop inside the kernel: use DumpTensor or Host-side comparison
NEVER print massively on multiple cores at the same time: add `GetBlockIdx() == 0` to print only on core 0
NEVER read a LocalTensor at an unsynchronized position: it must be after DeQue/PipeBarrier
NEVER use `PIPE_ALL` as the final fix: it is only for experimental isolation
NEVER leave debugging code in place after the fix
NEVER fix only the known failed cases without running the full precision test
NEVER keep trying after more than 3 failed rounds: report to the user instead