Phase 1: Error Analysis → Phase 2: Code Review → Phase 3: Experimental Isolation → Phase 4: Instrumentation Localization → Phase 5: Fix Verification

`scripts/debug_precision_template.py`, `csrc/ops/<op_name>/test/debug_<op_name>_precision.py`

| Symptom | Most Likely Cause | Next Step |
|---|---|---|
| FP16 fails, FP32 passes | Computation not upcast to FP32 | Phase 2: check Cast |
| Output is all zeros | CopyOut not executed / wrong GM offset | Phase 2: check CopyOut |
| Output contains NaN/Inf | Division by zero / log of a negative number / overflow | Phase 2: check Compute |
| All elements deviate, CosineSim≈1 | Systematic precision loss | Phase 2: check upcasting |
| Periodic/striped errors | Tile boundary / transfer offset | Phase 3 experiments |
| Only tail elements are wrong | Tail tile length / alignment | Phase 2: check tail tile |
| Results differ between runs | Insufficient asynchronous synchronization | Phase 3: Experiment B |
| Small shapes pass, large shapes fail | Multi-core / tiling boundary | Phase 3: Experiment A |
| Fixed input passes, random input fails | Address/stride/offset error | Phase 3: Experiment C |
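
The symptoms in the table can be read off a few summary statistics of the mismatch. The following is a minimal host-side sketch, not the project's actual template script: the function name `analyze_errors`, its tolerances, and the returned keys are all assumptions made for illustration, and it works on flat Python lists rather than tensors.

```python
import math

def analyze_errors(actual, expected, atol=1e-3, rtol=1e-3):
    """Summarize mismatches between two equal-length flat sequences of floats."""
    assert len(actual) == len(expected)
    # An element is "bad" if it is NaN/Inf or outside atol + rtol * |expected|.
    bad = [
        i for i, (a, e) in enumerate(zip(actual, expected))
        if not math.isfinite(a) or abs(a - e) > atol + rtol * abs(e)
    ]
    # Cosine similarity is computed over finite outputs only.
    finite = [(a, e) for a, e in zip(actual, expected) if math.isfinite(a)]
    dot = sum(a * e for a, e in finite)
    norm_a = math.sqrt(sum(a * a for a, _ in finite))
    norm_e = math.sqrt(sum(e * e for _, e in finite))
    cosine = dot / (norm_a * norm_e) if norm_a and norm_e else 0.0
    return {
        "num_bad": len(bad),
        "first_bad": bad[0] if bad else None,   # first error linear index
        "has_nan_inf": any(not math.isfinite(a) for a in actual),
        "all_zero": all(a == 0.0 for a in actual),
        "cosine_sim": cosine,
    }
```

A report with `all_zero` set points at the CopyOut row, `has_nan_inf` at the Compute row, and a large `num_bad` with `cosine_sim` close to 1 at the systematic precision loss row.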
`op_host/<op_name>.cpp`, `op_kernel/<op_name>.cpp`, `design.md`

`Cast`, `xGm[progress * tileLength]`, `sizeof(T)`, `tileLength`, `curTileLength`

`DataCopyExtParams`, `curTileLength * sizeof(T)`, `alignedTailLen`, `copyLen`

`tileLength`

`Adds(backup, src, 0.0f, len)`

Experiment A: `blockDim = 1`

| Result | Conclusion |
|---|---|
| Single-core passes, multi-core fails | Inter-core issue: overlapping GM ranges / tiling mapping / inter-core synchronization |
| Single-core also fails | Not a multi-core issue → Experiment B |
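
When the single-core run passes and the multi-core run fails, one cheap host-side check is whether the per-core GM ranges tile the whole tensor without overlap or gaps. The sketch below assumes a simple even split in which the first cores take one extra element; `core_ranges` and `check_coverage` are hypothetical helpers, and the real mapping should be taken from the operator's host-side tiling code.

```python
def core_ranges(total_len, block_dim):
    """Hypothetical even split: the first (total_len % block_dim) cores get one extra element."""
    base, extra = divmod(total_len, block_dim)
    ranges, start = [], 0
    for core in range(block_dim):
        length = base + (1 if core < extra else 0)
        ranges.append((start, start + length))  # half-open [begin, end) in elements
        start += length
    return ranges

def check_coverage(ranges, total_len):
    """Return a list of problems: overlaps, gaps, or a wrong end position."""
    problems = []
    cursor = 0
    # Walk the ranges in address order while remembering which core owns each one.
    for core, (begin, end) in sorted(enumerate(ranges), key=lambda item: item[1]):
        if begin < cursor:
            problems.append(f"core {core} overlaps previous range at [{begin}, {cursor})")
        elif begin > cursor:
            problems.append(f"gap [{cursor}, {begin}) before core {core}")
        cursor = max(cursor, end)
    if cursor != total_len:
        problems.append(f"ranges end at {cursor}, expected {total_len}")
    return problems
```

Feeding `check_coverage` the ranges actually computed by the tiling function, instead of `core_ranges`, turns this into a quick test for the "overlapping GM ranges" conclusion.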
Experiment B: `AscendC::PipeBarrier<PIPE_ALL>()`

| Result | Conclusion |
|---|---|
| Passes with a full barrier | Insufficient intra-core synchronization → restore fine-grained synchronization step by step to locate the missing one |
| Still fails | Not a synchronization issue → Experiment C |

`PIPE_ALL` is for experimental isolation only; never use it as the final solution.
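
Before and after inserting the full barrier, it helps to confirm whether the output is actually nondeterministic. A minimal sketch, assuming a hypothetical zero-argument callable `run_op` that launches the operator on fixed input and returns the output as a flat sequence:

```python
def first_nondeterministic_index(run_op, runs=5):
    """Call run_op() several times; return the first index that differs from the first run, or None."""
    reference = list(run_op())
    for _ in range(runs - 1):
        current = list(run_op())
        for i, (ref, cur) in enumerate(zip(reference, current)):
            # Note: NaN compares unequal to itself, so NaN outputs are reported as differing.
            if ref != cur:
                return i
    return None
```

If the returned index is `None` without the barrier as well, the failure is likely deterministic and the synchronization hypothesis is weaker.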
Experiment C: `torch.arange`

| Result | Conclusion |
|---|---|
| All-ones input passes, arithmetic-sequence/random input fails | Address/offset/stride error (a constant input masks the offset problem) |
| All fail | Calculation logic or global tiling error |
| All pass | Precision issue triggered by a specific value range → check boundary/extreme values |
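
The reason a constant input masks offset bugs can be shown with a small simulation. In the sketch below everything is illustrative: `buggy_identity` is a made-up identity operator whose second tile re-reads tile 0, `make_inputs` stands in for the three input patterns (plain lists instead of `torch` tensors), and `classify` encodes the table above.

```python
import random

def make_inputs(n, seed=0):
    """Three input patterns: constant, arithmetic sequence, random."""
    rng = random.Random(seed)
    return {
        "ones": [1.0] * n,
        "arange": [float(i) for i in range(n)],
        "random": [rng.uniform(-1.0, 1.0) for _ in range(n)],
    }

def buggy_identity(x, tile_len):
    """Simulated identity op whose second tile wrongly re-reads tile 0 (an offset bug)."""
    out = list(x)
    out[tile_len:2 * tile_len] = x[0:tile_len]
    return out[:len(x)]

def classify(passed):
    """passed maps pattern name -> bool (did the output match the reference?)."""
    if all(passed.values()):
        return "value-range dependent precision issue"
    if not any(passed.values()):
        return "calculation logic or global tiling error"
    if passed["ones"]:
        return "address/offset/stride error"
    return "inconclusive"

inputs = make_inputs(8)
passed = {name: buggy_identity(x, 4) == x for name, x in inputs.items()}
verdict = classify(passed)
```

With the all-ones input every tile holds the same values, so reading from the wrong offset still produces the right output; the arithmetic and random inputs expose it.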
`shape=(32,)`, `(tileLength,)`, `(tileLength*2,)`

First error linear index → which tile → which core → GM start offset of that core → expected byte count of the transfer

`AscendC::printf`, `AscendC::DumpTensor`

`if (AscendC::GetBlockIdx() == 0)`, `LocalTensor`, `DeQue`, `PipeBarrier`, `AscendC::printf("v=%.6f\n", static_cast<float>(tensor.GetValue(idx)));`

| Scenario | Tool |
|---|---|
| Scalar, branch condition, single index | |
| Quick scan of a contiguous tensor segment | |
| Full element-by-element comparison | Not done inside the kernel: read GM on the host and compare with a Python script |
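
The localization chain (first error linear index → tile → core → GM start offset → expected byte count) is plain integer arithmetic and can be worked through on the host before adding any instrumentation. This sketch assumes each core owns a contiguous run of `per_core_len` elements processed in tiles of `tile_len`, with a possibly shorter last tile; the function name and parameters are hypothetical, and operators with a different tiling scheme need a different mapping.

```python
def locate_first_error(first_bad, per_core_len, tile_len, elem_size):
    """Map a linear element index to core, tile, GM offsets and expected copy size."""
    core = first_bad // per_core_len
    index_in_core = first_bad % per_core_len
    tile = index_in_core // tile_len
    core_gm_start = core * per_core_len               # offset in elements, not bytes
    tile_gm_start = core_gm_start + tile * tile_len   # offsets use the full tileLength
    # Lengths use curTileLength: the last tile of a core may be shorter.
    cur_tile_len = min(tile_len, per_core_len - tile * tile_len)
    return {
        "core": core,
        "tile": tile,
        "core_gm_start_elems": core_gm_start,
        "tile_gm_start_elems": tile_gm_start,
        "cur_tile_len": cur_tile_len,
        "expected_copy_bytes": cur_tile_len * elem_size,
        "on_tile_boundary": index_in_core % tile_len == 0,
    }
```

For example, with `per_core_len=1000`, `tile_len=256` and FP16 (`elem_size=2`), a first error at index 1768 lands on core 1, tile 3, exactly on a tile boundary, and that tail tile should move 232 elements, i.e. 464 bytes.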
    // Example: core 0, tile 0
    if (AscendC::GetBlockIdx() == 0 && progress == 0) {
        AscendC::printf("[step1] tmp[0]=%.6f\n", static_cast<float>(tmp.GetValue(0)));
    }

| Root Cause | Fix |
|---|---|
| FP16 not upcast | Add Cast(fp16→fp32) + computation + Cast(fp32→fp16) |
| GM offset error | Correct the offset formula (elements vs. bytes) |
| Wrong tail tile length | Use curTileLength for computation/transfer and tileLength for offsets |
| Incorrect tiling parameter | Correct the tiling calculation on the host side |
| Missing synchronization | Add the correct EnQue/DeQue or PipeBarrier |
| ReduceSum overwrites its source | Back up the source with Adds first, then call ReduceSum |
| Wrong transfer length | Correct copyLen in DataCopyExtParams |
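
The first row of the fix table (cast up, compute, cast back) can be illustrated on the host without any device code. This is only a numerical sketch: it uses Python's `struct` half-precision format `'e'` to emulate FP16 rounding, and the helper names `round_to` and `accumulate` are made up for the example.

```python
import struct

def round_to(fmt, value):
    """Round a Python float to the given struct format ('e' = fp16, 'f' = fp32)."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]

def accumulate(values, fmt):
    """Sum values, rounding the running total to fmt after every addition."""
    total = 0.0
    for v in values:
        total = round_to(fmt, total + round_to(fmt, v))
    return total

values = [1.0] * 4096
fp16_sum = accumulate(values, "e")                  # accumulate entirely in fp16
fp32_sum = round_to("e", accumulate(values, "f"))   # upcast, accumulate in fp32, cast back
```

Here the FP16 running sum stalls at 2048 because 2049 is not representable in FP16, while accumulating in FP32 and casting back yields 4096. This is the "FP16 fails, FP32 passes" pattern in miniature.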
`#ifdef DEBUG_PRECISION`

| Error Symptom | Case File | When to Load |
|---|---|---|
| FP16 fails, FP32 passes, all elements deviate | | Missing upcast suspected |
| First error at a tile boundary, period = tileLength | | GM offset error suspected |
| Only a few tail elements are wrong | | Tail tile handling suspected |
| block_dim=1 passes, multi-core fails | | Inter-core tiling suspected |
| Results differ between runs | | Missing synchronization suspected |

Do not load all cases at once. Load a case only when the error characteristics match.
`GetBlockIdx() == 0`, `PIPE_ALL`