motoko-performance-optimizations
Compare original and translation side by side
🇺🇸
Original
English🇨🇳
Translation
ChineseMotoko Performance Optimizations
Motoko 性能优化
What This Is
内容简介
An extensible guide for speeding up Motoko code safely and predictably. It focuses on mechanical, behavior-preserving improvements: allocation reduction, fixed-width arithmetic, block processing, efficient building, and clear loop shapes. Use this skill when you want to improve throughput/latency without changing semantics.
Text- Benchmarking details and harnesses live in: skills/benchmarks-generation/SKILL.md
- Style and safe refactors that often precede perf work: skills/code-improvements/SKILL.md
- Dot-notation improvements that reduce verbosity/overhead: skills/dot-notation-migration/SKILL.md
这是一份可扩展的指南,用于安全且可预测地提升Motoko代码的运行速度。它专注于机械性、不改变行为的优化:减少内存分配、固定宽度算术、块处理、高效构建,以及清晰的循环结构。当你希望在不改变语义的前提下提升吞吐量或降低延迟时,可以使用本技巧。
Text- 基准测试细节和测试工具位于:skills/benchmarks-generation/SKILL.md
- 通常在性能优化前进行的代码风格与安全重构:skills/code-improvements/SKILL.md
- 减少冗余与开销的点符号优化:skills/dot-notation-migration/SKILL.md
Quick Wins (General)
通用快速优化方案
-
Minimize allocations in hot paths
- Avoid materializing entire buffers just to iterate (e.g., prefer indexing directly).
Blob - Reuse lengths and capacities; cache calls to a local variable.
size()
- Avoid materializing entire buffers just to iterate (e.g., prefer indexing
-
Prefer fixed-width arithmetic in tight loops
- Keep shifts/masks on /
Nat32intermediates; avoid widening toNat64mid-loop.Nat
- Keep shifts/masks on
-
Buildin larger chunks
Text- Avoid per-character + many small
Text.fromCharappends; emit ASCII/UTF‑8 blocks and append once per block.#
- Avoid per-character
-
Shape loops for the steady state
- Process uniform blocks in the main loop; handle a tiny tail separately.
- Hoist invariants (sizes, constants) and keep the inner loop straight-line.
-
Verify after each change
- Rebuild, run unit tests/property checks, then benchmark representative inputs.
-
减少热点路径中的内存分配
- 避免为了迭代而将整个缓冲区实例化(例如,优先直接索引)。
Blob - 复用长度和容量;将调用结果缓存到局部变量中。
size()
- 避免为了迭代而将整个缓冲区实例化(例如,优先直接索引
-
在紧凑循环中优先使用固定宽度算术
- 对/
Nat32中间值执行移位/掩码操作;避免在循环中途将值扩展为Nat64类型。Nat
- 对
-
以更大的块构建
Text- 避免逐个字符调用并进行多次小范围
Text.fromChar拼接;输出ASCII/UTF‑8块,每个块仅拼接一次。#
- 避免逐个字符调用
-
为稳定状态设计循环结构
- 在主循环中处理统一的块;单独处理少量剩余数据。
- 将不变量(大小、常量)提升到循环外,保持内部循环为直线代码。
-
每次修改后进行验证
- 重新构建代码,运行单元测试/属性检查,然后针对代表性输入进行基准测试。
Technique Areas Overview
技术领域概述
- Byte/Bit Hot‑Path Techniques (detailed below)
- Allocation & Memory Management (coming soon)
- Text/String Building Patterns (coming soon)
- Numeric/Arithmetic & Fixed‑Width Ops (coming soon)
- Data Structures & Algorithms (coming soon)
- Async/Await & Inter‑canister Patterns (coming soon)
- Candid/Serialization Efficiency (coming soon)
- Caching & Memoization (coming soon)
- 字节/位热点路径技术(下文详细介绍)
- 内存分配与内存管理(即将推出)
- 文本/字符串构建模式(即将推出)
- 数值/算术与固定宽度操作(即将推出)
- 数据结构与算法(即将推出)
- Async/Await与跨容器模式(即将推出)
- Candid/序列化效率(即将推出)
- 缓存与记忆化(即将推出)
Related Skills
相关技巧
- Benchmarks: skills/benchmarks-generation/SKILL.md
- Code Quality Cleanup: skills/code-improvements/SKILL.md
- Dot‑notation Migration: skills/dot-notation-migration/SKILL.md
- General Style: skills/motoko-general-style-guidelines/SKILL.md
- 基准测试:skills/benchmarks-generation/SKILL.md
- 代码质量清理:skills/code-improvements/SKILL.md
- 点符号迁移:skills/dot-notation-migration/SKILL.md
- 通用风格指南:skills/motoko-general-style-guidelines/SKILL.md
Byte/Bit Hot‑Path Techniques
字节/位热点路径技术
These techniques target code that iterates over or , performs bit packing/unpacking, or emits (e.g., encoders/decoders, checksums, binary parsers). They reduce allocations, avoid expensive integer widening, and reshape loops for better throughput while preserving correctness.
Blob[Nat8]Text这些技术针对遍历或、执行位打包/解包或生成的代码(例如编码器/解码器、校验和、二进制解析器)。它们在保证正确性的前提下,减少内存分配、避免昂贵的整数扩展,并重构循环以提升吞吐量。
Blob[Nat8]TextWhen To Use
适用场景
- You see hot loops over bytes and bit operations
- Code constructs character-by-character or via many small concatenations
Text - is converted to
Blobonly to iterate or index[Nat8]
- 代码中存在遍历字节和位操作的热点循环
- 代码逐个字符或通过多次小范围拼接构建
Text - 仅为了迭代或索引而转换为
Blob[Nat8]
Prerequisites
前置条件
- Motoko compiler (moc) at a reasonably recent version (1.3.0+ recommended)
- Familiarity with fixed-width integer types (,
Nat8,Nat16,Nat32) and wrapping arithmetic (Nat64)+% - For benchmarking guidance, see: skills/benchmarks-generation/SKILL.md
- 使用版本较新的Motoko编译器(moc),推荐1.3.0及以上版本
- 熟悉固定宽度整数类型(、
Nat8、Nat16、Nat32)和包装算术(Nat64)+% - 如需基准测试指导,请参考:skills/benchmarks-generation/SKILL.md
Quick Wins Checklist (Heuristics)
快速优化检查清单(启发式规则)
Scan candidate code for these patterns and replace accordingly:
-
Blob handling
- If you see just to iterate or index → Prefer indexing
let bytes = Blob.toArray(data)directly:Blob.data[i] - Cache to a local; choose an appropriate width (often
data.size()).Nat64
- If you see
-
Integer conversions & bit ops
- Avoid widening (arbitrary precision) for bit operations.
Nat8 -> Nat - Use staged widening on fixed-width types: before shifts and masks.
b.toNat16().toNat32() - Perform ,
<<,>>,&on|/Nat32intermediates, then narrow at the end if needed.Nat64 - Use wrapping arithmetic for loop indices when overflow cannot occur by invariant; prefer a wide counter (e.g.,
+%).Nat64
- Avoid widening
-
Alphabets / lookup tables
- Avoid /
Chartables in hot paths. Store alphabets asText(UTF-8/ASCII codes) for direct byte emission.[Nat8]
- Avoid
-
Text construction
- Avoid per-character plus repeated
Text.fromCharconcatenations.# - Emit text in blocks: build a small ASCII (e.g., 4–8 bytes),
Blobonce, then append once per block.Text.decodeUtf8 - Minimize operations; 1 append per 4–8 output chars is far cheaper than many appends.
#
- Avoid per-character
-
Loop shape
- Process input in uniform blocks (e.g., 3 bytes → 4 chars), then handle a small tail (0–2 bytes) separately.
- Hoist invariants (sizes, constants); keep inner loops straight-line.
-
Correctness guards
- For theoretically unreachable states (e.g., ASCII decode failure), insert a with a clear message.
Prim.trap("…") - Double-check off-by-one errors at block boundaries (e.g., ).
next_i <= sz
- For theoretically unreachable states (e.g., ASCII decode failure), insert a
扫描目标代码中的以下模式并进行替换:
-
Blob处理
- 如果看到仅用于迭代或索引 → 优先直接索引
let bytes = Blob.toArray(data):Blob。data[i] - 将缓存到局部变量;选择合适的宽度(通常为
data.size())。Nat64
- 如果看到
-
整数转换与位操作
- 避免为位操作将扩展为
Nat8(任意精度)。Nat - 对固定宽度类型进行分阶段扩展:在移位和掩码操作前执行。
b.toNat16().toNat32() - 在/
Nat32中间值上执行Nat64、<<、>>、&操作,必要时在最后缩窄类型。| - 当不变量保证不会发生溢出时,对循环索引使用包装算术;优先使用宽计数器(例如
+%)。Nat64
- 避免为位操作将
-
字母表/查找表
- 避免在热点路径中使用/
Char表。将字母表存储为Text(UTF-8/ASCII编码)以便直接输出字节。[Nat8]
- 避免在热点路径中使用
-
文本构建
- 避免逐个字符调用并重复执行
Text.fromChar拼接。# - 以块为单位输出文本:构建一个小型ASCII(例如4–8字节),调用一次
Blob,然后每个块仅拼接一次。Text.decodeUtf8 - 尽量减少操作;每4–8个输出字符拼接一次比多次拼接成本低得多。
#
- 避免逐个字符调用
-
循环结构
- 以统一的块处理输入(例如3字节→4个字符),然后单独处理少量剩余数据(0–2字节)。
- 将不变量(大小、常量)提升到循环外;保持内部循环为直线代码。
-
正确性防护
- 对于理论上不可达的状态(例如ASCII解码失败),插入带有清晰消息的。
Prim.trap("…") - 仔细检查块边界处的差一错误(例如)。
next_i <= sz
- 对于理论上不可达的状态(例如ASCII解码失败),插入带有清晰消息的
Before → After Patterns
优化前后示例
- Avoid Blob→Array conversion
motoko
// Before
let bytes = Blob.toArray(data);
var i = 0;
let b1 = bytes[i];
// After
let sz = Nat64.fromIntWrap(data.size());
var i : Nat64 = 0;
let b1 = data[i.toNat()];- Prefer fixed-width ints and staged widening
motoko
// Before (goes through arbitrary-precision Nat)
let n = (Nat32.fromNat(Nat8.toNat(b1)) << 16)
| (Nat32.fromNat(Nat8.toNat(b2)) << 8)
| Nat32.fromNat(Nat8.toNat(b3));
// After (fixed-width path)
let n = (b1.toNat16().toNat32() << 16)
| (b2.toNat16().toNat32() << 8)
| b3.toNat16().toNat32();- Replace /
Charalphabet + per‑char concat withText+ block decode[Nat8]
motoko
// Before
private let alphabet : [Char] = ['A', 'B', /*…*/, '/'];
let c1 = Text.fromChar(alphabet[idx1]);
let c2 = Text.fromChar(alphabet[idx2]);
result #= c1 # c2 # c3 # c4; // many small concats
// After
private let alphabet : [Nat8] = [65, 66, /*…*/, 47]; // ASCII bytes
let bytes = Blob.fromArray([
alphabet[idx1], alphabet[idx2], alphabet[idx3], alphabet[idx4]
]);
switch (Text.decodeUtf8(bytes)) {
case (?t) { result := result # t } // one append per block
case (_) { Prim.trap("Cannot happen: Utf8 decode error …") }
};- Blocked processing with tail
motoko
// Example: main loop over full blocks, then a small tail path
var i : Nat64 = 0;
var next_i : Nat64 = block; // e.g., 3, 6, etc.
while (next_i <= sz) {
// read <block> bytes, produce <k> output chars
i := next_i; next_i +%= block;
};
while (i < sz) {
// read remaining bytes (tail), produce padded output as needed
i +%= tailStep;
};- 避免Blob→数组转换
motoko
// Before
let bytes = Blob.toArray(data);
var i = 0;
let b1 = bytes[i];
// After
let sz = Nat64.fromIntWrap(data.size());
var i : Nat64 = 0;
let b1 = data[i.toNat()];- 优先使用固定宽度整数和分阶段扩展
motoko
// Before (goes through arbitrary-precision Nat)
let n = (Nat32.fromNat(Nat8.toNat(b1)) << 16)
| (Nat32.fromNat(Nat8.toNat(b2)) << 8)
| Nat32.fromNat(Nat8.toNat(b3));
// After (fixed-width path)
let n = (b1.toNat16().toNat32() << 16)
| (b2.toNat16().toNat32() << 8)
| b3.toNat16().toNat32();- 用+块解码替代
[Nat8]/Char字母表+逐个字符拼接Text
motoko
// Before
private let alphabet : [Char] = ['A', 'B', /*…*/, '/'];
let c1 = Text.fromChar(alphabet[idx1]);
let c2 = Text.fromChar(alphabet[idx2]);
result #= c1 # c2 # c3 # c4; // many small concats
// After
private let alphabet : [Nat8] = [65, 66, /*…*/, 47]; // ASCII bytes
let bytes = Blob.fromArray([
alphabet[idx1], alphabet[idx2], alphabet[idx3], alphabet[idx4]
]);
switch (Text.decodeUtf8(bytes)) {
case (?t) { result := result # t } // one append per block
case (_) { Prim.trap("Cannot happen: Utf8 decode error …") }
};- 块处理加剩余数据处理
motoko
// Example: main loop over full blocks, then a small tail path
var i : Nat64 = 0;
var next_i : Nat64 = block; // e.g., 3, 6, etc.
while (next_i <= sz) {
// read <block> bytes, produce <k> output chars
i := next_i; next_i +%= block;
};
while (i < sz) {
// read remaining bytes (tail), produce padded output as needed
i +%= tailStep;
};Step‑by‑Step Procedure (Agent Playbook)
分步操作流程(Agent操作手册)
-
Scoping
- Identify hot byte/bit paths: encoders/decoders, hashing, binary parsers.
- If needed, prepare a small benchmark harness (see skills/benchmarks-generation/SKILL.md).
-
Baseline
- Run a benchmark on realistic inputs (sequential bytes, random/mixed); record results.
-
Transformations (apply incrementally; keep each commit focused)
- Remove used solely for iteration or random access; index
Blob.toArraydirectly.Blob - Cache sizes; switch to index; use
Nat64where safe.+% - Replace /
Charalphabets withTexttables.[Nat8] - Convert per‑char concatenations into block emission using +
Blob.fromArray.Text.decodeUtf8 - Use intermediates for shifts and masks; avoid
Nat16/Nat32/Nat64widening in hot loops.Nat - Reshape loops into large uniform blocks plus a small tail path.
- Eliminate dead temporaries; prefer straight‑line code.
- Add explicit for logically unreachable decode errors.
Prim.trap
- Remove
-
Validation
- Unit tests: small ASCII examples, padding/edge cases, long sequential bytes; assert output length and alphabet membership where relevant.
- Optional property checks: random inputs vs a reference implementation.
- Benchmarks: verify improvements or parity across patterns and sizes (use skills/benchmarks-generation/SKILL.md).
-
范围界定
- 识别字节/位热点路径:编码器/解码器、哈希、二进制解析器。
- 如有需要,准备一个小型基准测试工具(参考skills/benchmarks-generation/SKILL.md)。
-
基准测试
- 使用真实输入(连续字节、随机/混合数据)运行基准测试;记录结果。
-
转换优化(逐步应用;每次提交聚焦单一修改)
- 移除仅用于迭代或随机访问的调用;直接索引
Blob.toArray。Blob - 缓存大小;切换为索引;在安全场景下使用
Nat64。+% - 将/
Char字母表替换为Text表。[Nat8] - 将逐个字符的拼接转换为使用+
Blob.fromArray的块输出。Text.decodeUtf8 - 对移位和掩码操作使用中间值;避免在热点循环中将值扩展为
Nat16/Nat32/Nat64。Nat - 将循环重构为大的统一块加少量剩余数据处理的结构。
- 消除无用的临时变量;优先使用直线代码。
- 为逻辑上不可达的解码错误添加显式。
Prim.trap
- 移除仅用于迭代或随机访问的
-
验证
- 单元测试:小型ASCII示例、填充/边界情况、长连续字节;在相关场景下断言输出长度和字母表成员资格。
- 可选属性检查:将随机输入与参考实现进行对比。
- 基准测试:验证不同模式和输入大小下的性能提升或一致性(参考skills/benchmarks-generation/SKILL.md)。
Common Pitfalls
常见陷阱
-
Unnecessary widening toin tight loops
Nat- Why it hurts: arbitrary-precision arithmetic is slower and allocates.
- Fix: keep operations on /
Nat32intermediates; stageNat64.Nat8 -> Nat16 -> Nat32
-
Per-characteroperations
Text- Why it hurts: + repeated
Text.fromCharcreates many small allocations.# - Fix: emit ASCII bytes into a block and
Blobonce per block, then append once.Text.decodeUtf8
- Why it hurts:
-
Materializingas
Blobto iterate[Nat8]- Why it hurts: full-buffer allocation and copy on the hot path.
- Fix: index directly and cache
Blob.size()
-
Overflow checks on loop counters
- Why it hurts: extra checks per iteration when they are provably unnecessary.
- Fix: widen counter to and use
Nat64under a clear no-overflow invariant.+%
-
Off-by-one at block boundaries
- Symptom: traps or incorrect output near the end of input.
- Fix: use for the main loop; handle the remaining tail explicitly.
next_i <= sz
-
紧凑循环中不必要地扩展为类型
Nat- 危害:任意精度算术速度较慢且会分配内存。
- 修复:在/
Nat32中间值上执行操作;分阶段执行Nat64扩展。Nat8 -> Nat16 -> Nat32
-
逐个字符的操作
Text- 危害:+ 重复
Text.fromChar拼接会产生大量小内存分配。# - 修复:将ASCII字节输出到块中,每个块调用一次
Blob,然后拼接一次。Text.decodeUtf8
- 危害:
-
将实例化为
Blob以进行迭代[Nat8]- 危害:在热点路径中进行全缓冲区分配和复制。
- 修复:直接索引并缓存
Blob。size()
-
对循环计数器进行溢出检查
- 危害:当可以证明溢出不会发生时,每次迭代的额外检查会增加开销。
- 修复:将计数器扩展为,并在明确无溢出的不变量下使用
Nat64。+%
-
块边界处的差一错误
- 症状:输入末尾附近出现陷阱或输出错误。
- 修复:主循环使用;显式处理剩余数据。
next_i <= sz
Safety & Edge Cases
安全性与边界情况
- Wrapping arithmetic: Use only when an invariant guarantees no overflow in the chosen width; prefer widening to
+%.Nat64 - : Safe for ASCII bytes (e.g., codec alphabets and
Text.decodeUtf8padding). Keep a defensive=for auditability.Prim.trap - Bounds: Cleanly separate the fast block loop and the tail; test edge sizes (e.g., 0, 1, 2, 3, 5, 6, 7 depending on block size).
- Allocation profile: Ensure you removed buffer materialization () and minimized concatenations.
Blob.toArray
- 包装算术:仅当不变量保证所选宽度不会发生溢出时才使用;优先扩展为
+%。Nat64 - :对ASCII字节(例如编码字母表和
Text.decodeUtf8填充)是安全的。保留防御性的=以提高可审计性。Prim.trap - 边界:清晰分离快速块循环和剩余数据处理;测试边界大小(例如根据块大小测试0、1、2、3、5、6、7)。
- 内存分配情况:确保已移除缓冲区实例化()并最小化拼接操作。
Blob.toArray
Validation Checklist (copy/paste)
验证检查清单(可复制粘贴)
- Tests cover: empty, minimal inputs, pad edges, long sequential bytes, multi-sentence ASCII where applicable.
- Output length formula holds (as defined by the codec or algorithm).
- Alphabet membership check passes; no stray chars (for encoders/decoders).
- Bench shows improvement or parity; no regressions for mixed vs zero patterns (see skills/benchmarks-generation/SKILL.md).
- No traps at block boundaries; indices don’t overflow.
- 测试覆盖:空输入、最小输入、填充边界、长连续字节、适用场景下的多句ASCII文本。
- 输出长度公式成立(由编码或算法定义)。
- 字母表成员资格检查通过;无多余字符(针对编码器/解码器)。
- 基准测试显示性能提升或一致性;混合模式与零模式无性能退化(参考skills/benchmarks-generation/SKILL.md)。
- 块边界处无陷阱;索引不会溢出。
Commit Message Templates (optional)
提交消息模板(可选)
Keep one change per commit when possible. Examples:
- Skip conversion from Blob to Array in encoder.
- Avoid unnecessary Char→Text conversion; use alphabet.
[Nat8] - Reuse and use
data.size()indices.Nat64 - Convert to
Nat8viaNat32for bit ops.Nat16 - Process input in uniform blocks and handle tail separately.
- Use for loop index under explicit invariant; widen index width.
+% - Replace many small concatenations with a single append of a decoded block.
# - Remove temporary variables in tight loops.
c1..c4 - Trap in unreachable error path.
decodeUtf8
尽可能每次提交仅包含一项修改。示例:
- 编码器中跳过Blob到数组的转换。
- 避免不必要的Char→Text转换;使用字母表。
[Nat8] - 复用并使用
data.size()索引。Nat64 - 将通过
Nat8转换为Nat16以进行位操作。Nat32 - 以统一块处理输入并单独处理剩余数据。
- 在明确不变量下对循环索引使用;扩展索引宽度。
+% - 将多次小范围拼接替换为解码块的单次拼接。
# - 移除紧凑循环中的临时变量。
c1..c4 - 在不可达的错误路径中添加陷阱。
decodeUtf8
Outcome
优化成果
Applying these techniques typically yields:
- Fewer allocations by avoiding full-buffer materialization and per-char ops
Text - Faster bit-packing/unpacking via fixed-width arithmetic
- Reduced concatenation overhead by emitting larger chunks
Text - Clearer separation of fast-path block processing and tail handling
- Safer, more auditable code via explicit invariants and traps
应用这些技术通常会带来以下效果:
- 通过避免全缓冲区实例化和逐个字符的操作减少内存分配
Text - 通过固定宽度算术加快位打包/解包速度
- 通过输出更大的块减少拼接开销
Text - 清晰分离快速路径块处理和剩余数据处理
- 通过显式不变量和陷阱使代码更安全、更易于审计
Block-Cipher / Hash-Function Hot Paths
块密码/哈希函数热点路径
Techniques in this section are specifically for MD-style block hash functions and similar block-structured primitives (RIPEMD-160, SHA-1/2, MD5, Blake, block ciphers, etc.). They are not general advice — most code shouldn't look like this.
本节技术专门针对MD风格块哈希函数和类似的块结构原语(RIPEMD-160、SHA-1/2、MD5、Blake、块密码等)。它们不是通用建议——大多数代码不应采用这种风格。
When These Techniques Apply
适用场景
ALL of the following must be true to justify this style of optimization:
- Fixed-size input blocks processed by a transform function. The algorithm specifies an N-byte block (e.g., 64 bytes for SHA-256/RIPEMD-160) that is repeatedly fed to a compression function.
- Many rounds per block, each updating a small fixed set of state words. Typically 64–160 rounds operating on 4–8 /
Nat32chaining variables. Tuple/record allocations on the per-round path dominate cost.Nat64 - Public streaming API (/
write+update/sum) where partial blocks must be buffered between calls.finalize - Throughput matters. The function is in a measurable hot path (e.g., signing, address derivation, large-message hashing).
If your code is one-shot, processes variable-size frames, or runs only a handful of times per request — stop. Use plain, readable code instead. Reference: was rewritten for ~2× instructions and ~3.6× less GC at 1 KiB inputs; the techniques cost ~300 lines of inlined rounds and are only worth it for primitives that consumers depend on heavily.
src/Ripemd160.mo必须同时满足以下所有条件才适合采用这种优化风格:
- 固定大小输入块由转换函数处理。算法指定一个N字节的块(例如SHA-256/RIPEMD-160为64字节),该块被反复输入到压缩函数中。
- 每个块包含多轮操作,每轮更新少量固定状态字。通常为64–160轮操作,针对4–8个/
Nat32链变量。每轮路径中的元组/记录分配占主导成本。Nat64 - 公开流式API(/
write+update/sum),其中部分块必须在调用之间缓冲。finalize - 吞吐量至关重要。该函数位于可测量的热点路径中(例如签名、地址推导、大消息哈希)。
如果你的代码是一次性的、处理可变大小帧的,或者每个请求仅运行几次——停止优化。使用简洁易读的代码即可。参考:经过重写后,1 KiB输入的指令数减少约2倍,GC开销减少约3.6倍;这些技术需要约300行内联轮次代码,仅对消费者高度依赖的原语才值得采用。
src/Ripemd160.moThe Reference Pattern (research-ag/sha2 style)
参考模式(research-ag/sha2风格)
The combined patterns below are the same shape used by 's SHA-256 and were applied successfully to . Apply them together — partial application leaves most of the gains on the table.
research-ag/sha2src/Ripemd160.mo以下组合模式与的SHA-256使用的模式相同,并成功应用于。需组合应用这些模式——部分应用无法获得大部分收益。
research-ag/sha2src/Ripemd160.moPattern 1 — Mutable chaining state in [var Nat32]
(or [var Nat64]
)
[var Nat32][var Nat64]模式1 — 使用[var Nat32]
(或[var Nat64]
)存储可变链状态
[var Nat32][var Nat64]motoko
// Persistent chaining state, single allocation reused across all blocks.
private let s : [var Nat32] = VarArray.repeat<Nat32>(0, 5);Don't use individual fields for the chaining variables when there are many of them — the array slot stores fixed-width words inline. Don't split into halves unless you measure a benefit; for ≤8 state words it's not worth it.
varNat16motoko
// 持久化链状态,单次分配可在所有块中复用。
private let s : [var Nat32] = VarArray.repeat<Nat32>(0, 5);当存在多个链变量时,不要为每个变量使用单独的字段——数组槽会直接存储固定宽度字。除非测量到收益,否则不要拆分为半字;对于≤8个状态字,这样做不值得。
varNat16Pattern 2 — Pre-decoded message schedule in [var Nat32]
(NOT [var Nat8]
)
[var Nat32][var Nat8]模式2 — 使用[var Nat32]
存储预解码消息调度(而非[var Nat8]
)
[var Nat32][var Nat8]motoko
// 16 little-endian (or big-endian) words for the current block.
// Bytes are folded in at write time; transform() reads words directly.
private let msg : [var Nat32] = VarArray.repeat<Nat32>(0, 16);This replaces the common (and slow) pattern:
motoko
// AVOID for hot-path block hashing
private let buf : [var Nat8] = ...; // 64-byte buffer
transform(buf.toArray(), 0); // allocates a 64-byte [Nat8] PER BLOCK
// transform then re-decodes 4 bytes → Nat32 word internallyThe design eliminates both the per-block copy and the read-side decoding inside .
[var Nat32]toArray()transform()motoko
// 当前块的16个小端(或大端)字。
// 字节在写入时折叠;transform()直接读取字。
private let msg : [var Nat32] = VarArray.repeat<Nat32>(0, 16);这替代了常见的(且缓慢的)模式:
motoko
// 热点路径块哈希应避免使用此模式
private let buf : [var Nat8] = ...; // 64字节缓冲区
transform(buf.toArray(), 0); // 每个块分配一个64字节的[Nat8]
// transform随后在内部重新解码4字节→Nat32字[var Nat32]toArray()transform()Pattern 3 — Unboxed Nat16
byte counter for partial-block position
Nat16模式3 — 使用未装箱的Nat16
字节计数器记录部分块位置
Nat16motoko
// 0..63 byte position within the current block.
// Nat16 is unboxed in mutable storage; Nat is heap-allocated per increment.
private var i_msg : Nat16 = 0;Use (or if the range fits) for any small mutable counter on a hot path — allocates a boxed bignum on every assignment. is also unboxed but wastes register width when the value is provably small.
Nat16Nat8var x : Nat = 0Nat64motoko
// 当前块内的0..63字节位置。
// Nat16在可变存储中是未装箱的;Nat每次递增都会在堆上分配。
private var i_msg : Nat16 = 0;对热点路径上的任何小型可变计数器使用(或如果范围合适则使用)——每次赋值都会分配一个装箱的大数。也是未装箱的,但当值可证明较小时会浪费寄存器宽度。
Nat16Nat8var x : Nat = 0Nat64Pattern 4 — writeByte
folds bytes into the word schedule
writeByte模式4 — writeByte
将字节折叠到字调度中
writeBytemotoko
private func writeByte(b : Nat8) {
let pos = i_msg;
let wi = Nat16.toNat(pos >> 2);
let lane = pos & 0x3;
let v : Nat32 = Nat32.fromNat16(b.toNat16()) << Nat32.fromNat16(lane << 3);
if (lane == 0) { msg[wi] := v } // first byte: overwrite stale word
else { msg[wi] := msg[wi] | v }; // subsequent bytes: OR in
let next = pos +% 1;
if (next == 64) { transform(); n_blocks +%= 1; i_msg := 0 }
else { i_msg := next };
};Adjust the lane shift formula for big-endian schedules (e.g., SHA-2): .
(3 - lane) << 3motoko
private func writeByte(b : Nat8) {
let pos = i_msg;
let wi = Nat16.toNat(pos >> 2);
let lane = pos & 0x3;
let v : Nat32 = Nat32.fromNat16(b.toNat16()) << Nat32.fromNat16(lane << 3);
if (lane == 0) { msg[wi] := v } // 第一个字节:覆盖旧字
else { msg[wi] := msg[wi] | v }; // 后续字节:按位或
let next = pos +% 1;
if (next == 64) { transform(); n_blocks +%= 1; i_msg := 0 }
else { i_msg := next };
};针对大端调度(例如SHA-2)调整通道移位公式:。
(3 - lane) << 3Pattern 5 — Fast-path block decoder in write
write模式5 — write
中的快速路径块解码器
writemotoko
public func write(data : [Nat8]) {
let n = data.size();
if (n == 0) return;
var i = 0;
// (1) Finish any partial block one byte at a time.
while (i_msg != 0 and i < n) { writeByte(data[i]); i += 1 };
// (2) Fast path: decode 16 LE words inline directly from input → msg.
while (i + 64 <= n) {
msg[0] := data[i].toNat16().toNat32()
| (data[i+1].toNat16().toNat32() << 8)
| (data[i+2].toNat16().toNat32() << 16)
| (data[i+3].toNat16().toNat32() << 24);
// ... msg[1] .. msg[15] (15 more identical lines)
transform();
n_blocks +%= 1;
i += 64;
};
// (3) Tail: remaining < 64 bytes go into the partial block.
while (i < n) { writeByte(data[i]); i += 1 };
};Three loops, not one. The fast path is what makes large inputs cheap.
motoko
public func write(data : [Nat8]) {
let n = data.size();
if (n == 0) return;
var i = 0;
// (1) 逐个字节完成部分块。
while (i_msg != 0 and i < n) { writeByte(data[i]); i += 1 };
// (2) 快速路径:直接从输入将16个小端字解码到msg中。
while (i + 64 <= n) {
msg[0] := data[i].toNat16().toNat32()
| (data[i+1].toNat16().toNat32() << 8)
| (data[i+2].toNat16().toNat32() << 16)
| (data[i+3].toNat16().toNat32() << 24);
// ... msg[1] .. msg[15](另外15行相同代码)
transform();
n_blocks +%= 1;
i += 64;
};
// (3) 剩余数据:少于64字节的部分进入部分块。
while (i < n) { writeByte(data[i]); i += 1 };
};三个循环,而非一个。快速路径是处理大输入时成本低廉的关键。
Pattern 6 — Inline ALL rounds inside transform()
transform()模式6 — 将所有轮次内联到transform()
中
transform()The single biggest win for multi-round primitives. The original RIPEMD-160 code looked like:
motoko
// AVOID: each call allocates a (Nat32, Nat32) tuple — 320 tuples per block
let (a, c) = r11(a, b, c, d, e, msg, 0, 11);
let (e, b) = r11(e, a, b, c, d, msg, 1, 14);
// ... 158 moreInline every round into a flat sequence of statement updates on local mutable vars:
motoko
var a1 : Nat32 = s[0]; var b1 : Nat32 = s[1]; /* ... */
// Left line round 1: f1(b,c,d) = b ^ c ^ d, K = 0
a1 := rol(a1 +% (b1 ^ c1 ^ d1) +% w0, 11) +% e1; c1 := rol(c1, 10);
e1 := rol(e1 +% (a1 ^ b1 ^ c1) +% w1, 14) +% d1; b1 := rol(b1, 10);
// ... 158 more, with K and rotation amounts varying per round
// Combine back to s without allocating temporaries
let t = s[0];
s[0] := s[1] +% c1 +% d2;
// ...Yes, this is hundreds of lines. Yes, it's mechanically transcribed from the spec or the original tuple-returning helpers. Comment each round group with its and . The compiler does not currently inline these helpers automatically, and the tuple allocation per call is real.
fK这是多轮原语的最大优化点。原始RIPEMD-160代码如下:
motoko
// 避免使用:每次调用都会分配一个(Nat32, Nat32)元组——每个块320个元组
let (a, c) = r11(a, b, c, d, e, msg, 0, 11);
let (e, b) = r11(e, a, b, c, d, msg, 1, 14);
// ... 另外158行将每一轮内联为对局部可变变量的扁平语句更新序列:
motoko
var a1 : Nat32 = s[0]; var b1 : Nat32 = s[1]; /* ... */
// 左线路轮次1:f1(b,c,d) = b ^ c ^ d, K = 0
a1 := rol(a1 +% (b1 ^ c1 ^ d1) +% w0, 11) +% e1; c1 := rol(c1, 10);
e1 := rol(e1 +% (a1 ^ b1 ^ c1) +% w1, 14) +% d1; b1 := rol(b1, 10);
// ... 另外158行,每轮的K和旋转量不同
// 将结果合并回s,不分配临时变量
let t = s[0];
s[0] := s[1] +% c1 +% d2;
// ...是的,这需要数百行代码。是的,它是从规范或原始返回元组的辅助函数机械转录而来的。为每个轮次组添加注释说明其和。编译器目前不会自动内联这些辅助函数,每次调用的元组分配成本是真实存在的。
fKPattern 7 — Boxing-free Nat64
→ Nat8
for length encoding
Nat64Nat8模式7 — 无装箱的Nat64
→Nat8
长度编码
Nat64Nat8Padding writes the bit-length as 8 little-endian bytes. The naive round-trips through arbitrary-precision , allocating per byte:
Nat8.fromNat(Nat64.toNat(x & 0xff))Natmotoko
// Stage all narrowing on fixed-width types — no Nat allocation.
private func lowByte64(v : Nat64) : Nat8 {
Nat8.fromNat16(Nat16.fromNat32(Nat32.fromNat64(v & 0xff)));
};填充操作会将位长度写入8个小端字节。朴素的会通过任意精度的进行往返转换,每个字节都会分配内存:
Nat8.fromNat(Nat64.toNat(x & 0xff))Natmotoko
// 在固定宽度类型上分阶段缩窄——不分配Nat。
private func lowByte64(v : Nat64) : Nat8 {
Nat8.fromNat16(Nat16.fromNat32(Nat32.fromNat64(v & 0xff)));
};Pattern 8 — Padding via writeByte
, not allocated [var Nat8]
writeByte[var Nat8]模式8 — 通过writeByte
进行填充,而非分配[var Nat8]
writeByte[var Nat8]motoko
public func sum() : [Nat8] {
let bitlen : Nat64 = ((n_blocks << 6) +% Nat64.fromNat(Nat16.toNat(i_msg))) << 3;
writeByte(0x80);
while (i_msg != 56) { writeByte(0) };
writeByte(lowByte64(bitlen));
writeByte(lowByte64(bitlen >> 8));
// ... 6 more length bytes; the 8th triggers transform()
// serialize state to 20/32 output bytes
};This eliminates the typical and allocations plus their copies.
pad : [var Nat8]sizedesc : [var Nat8].toArray()motoko
public func sum() : [Nat8] {
let bitlen : Nat64 = ((n_blocks << 6) +% Nat64.fromNat(Nat16.toNat(i_msg))) << 3;
writeByte(0x80);
while (i_msg != 56) { writeByte(0) };
writeByte(lowByte64(bitlen));
writeByte(lowByte64(bitlen >> 8));
// ... 另外6个长度字节;第8个会触发transform()
// 将状态序列化为20/32输出字节
};这消除了典型的和分配及其复制操作。
pad : [var Nat8]sizedesc : [var Nat8].toArray()Anti-Patterns To Remove When Adopting This Style
采用此风格时需移除的反模式
- on a per-byte path → use
var counter : Nat = 0/Nat16.Nat64 - Returning multi-value tuples from per-round helpers.
(Nat32, Nat32) - Per-block to hand bytes to
buf.toArray().transform - Allocating and
padarrays insizedesc/sum.finalize - Wrapping each round's bit operations with calls when you can keep the words pre-decoded in
Common.readLE32(arr, i).[var Nat32] - Going through for any byte-level conversion in the hot path.
Nat
- 逐字节路径上的→ 使用
var counter : Nat = 0/Nat16。Nat64 - 从每轮辅助函数返回多值元组。
(Nat32, Nat32) - 每个块调用以将字节传递给
buf.toArray()。transform - 在/
sum中分配finalize和pad数组。sizedesc - 当可以将字预解码到中时,使用
[var Nat32]调用包装每轮的位操作。Common.readLE32(arr, i) - 在热点路径中通过进行任何字节级转换。
Nat
Operator-Precedence Caution
运算符优先级注意事项
Motoko's (XOR), , , , , precedences may not match C/Rust intuition. Aggressively parenthesize f-functions and round expressions:
^&|+%<<>>motoko
a1 := rol(a1 +% ((b1 & c1) | (^b1 & d1)) +% w0 +% 0x5A827999, 11) +% e1;Don't trust the compiler to group as if you haven't checked. Adding parens costs nothing at runtime.
^b1 & d1(^b1) & d1Motoko的(异或)、、、、、优先级可能与C/Rust的直觉不符。积极添加括号包裹f函数和轮次表达式:
^&|+%<<>>motoko
a1 := rol(a1 +% ((b1 & c1) | (^b1 & d1)) +% w0 +% 0x5A827999, 11) +% e1;如果你未检查,不要信任编译器会将分组为。添加括号在运行时不会产生任何成本。
^b1 & d1(^b1) & d1Validation Workflow (Required)
验证流程(必需)
These optimizations are easy to mis-transcribe (160 round lines for RIPEMD-160 alone). Before claiming success:
- Full test suite must pass. Hash test vectors are non-negotiable — a single wrong rotation amount, K constant, or message-word index will corrupt all outputs but may still produce stable-looking bytes.
- Benchmark against the previous version. If instructions don't drop ≥1.5× and GC doesn't drop ≥2× at large inputs, something is wrong with the buffering or rounds are still allocating.
- Benchmark against an external reference if one exists (e.g., another mops package). Confirms you're not just shuffling cost around.
这些优化容易出现转录错误(仅RIPEMD-160就有160行轮次代码)。在宣称优化成功前:
- 必须通过完整测试套件。哈希测试向量是必不可少的——单个错误的旋转量、K常量或消息字索引会破坏所有输出,但仍可能产生看似稳定的字节。
- 与之前版本进行基准测试对比。如果大输入下指令数未减少≥1.5倍且GC开销未减少≥2倍,则缓冲或轮次仍存在分配问题。
- 与外部参考实现进行基准测试对比(如果存在,例如其他mops包)。确认你不是仅仅将成本转移到其他地方。
When NOT To Apply These Patterns
不适用场景
- One-shot or low-frequency code. A 5-line hash call invoked once per request: leave the readable version alone.
- Algorithms with variable-length blocks (e.g., compression, parsing). The schedule assumes fixed-width words at fixed positions.
msg : [var Nat32] - Algorithms with few rounds (e.g., ). Inlining is overkill; the tuple-allocation cost is small relative to other work.
< 16 - Code that needs to remain easy to audit against a spec. Inlined rounds are harder to read than a loop. For new primitives, write the clear version first, ship it, and only inline if benchmarks justify it.
for r in rounds - Generic/parametric implementations. This style hard-codes the algorithm; you can't easily share across hash variants.
transform()
- 一次性或低频率代码。每个请求仅调用一次的5行哈希代码:保留易读版本即可。
- 可变大小块算法(例如压缩、解析)。调度假设固定宽度字位于固定位置。
msg : [var Nat32] - 轮次较少的算法(例如<16轮)。内联是过度优化;元组分配成本相对于其他工作来说很小。
- 需要易于对照规范审计的代码。内联轮次比循环更难阅读。对于新原语,先编写清晰的版本,发布,仅当基准测试证明必要时再进行内联。
for r in rounds - 通用/参数化实现。这种风格硬编码了算法;无法轻松在哈希变体之间共享。
transform()
Outcome (Reference: RIPEMD-160 in this repo)
优化成果(参考:本仓库中的RIPEMD-160)
Measured at 1024-byte input:
| Metric | Before | After | Speedup |
|---|---|---|---|
| Instructions | 1,470,125 | 734,849 | 2.0× |
| GC traffic | 160.28 KiB | 43.88 KiB | 3.65× |
| Heap (steady) | 272 B | 272 B | 1.0× |
Public API and observable behavior unchanged; all 24 test files pass.
针对1024字节输入的测量结果:
| 指标 | 优化前 | 优化后 | 加速比 |
|---|---|---|---|
| 指令数 | 1,470,125 | 734,849 | 2.0× |
| GC流量 | 160.28 KiB | 43.88 KiB | 3.65× |
| 堆内存(稳定) | 272 B | 272 B | 1.0× |
公开API和可观察行为未改变;所有24个测试文件均通过。