motoko-performance-optimizations


Motoko Performance Optimizations

What This Is

An extensible guide for speeding up Motoko code safely and predictably. It focuses on mechanical, behavior-preserving improvements: allocation reduction, fixed-width arithmetic, block processing, efficient `Text` building, and clear loop shapes. Use this skill when you want to improve throughput/latency without changing semantics.
  • Benchmarking details and harnesses live in: skills/benchmarks-generation/SKILL.md
  • Style and safe refactors that often precede perf work: skills/code-improvements/SKILL.md
  • Dot-notation improvements that reduce verbosity/overhead: skills/dot-notation-migration/SKILL.md

Quick Wins (General)

  • Minimize allocations in hot paths
    • Avoid materializing entire buffers just to iterate (e.g., prefer indexing `Blob` directly).
    • Reuse lengths and capacities; cache `size()` calls to a local variable.
  • Prefer fixed-width arithmetic in tight loops
    • Keep shifts/masks on `Nat32`/`Nat64` intermediates; avoid widening to `Nat` mid-loop.
  • Build `Text` in larger chunks
    • Avoid per-character `Text.fromChar` + many small `#` appends; emit ASCII/UTF‑8 blocks and append once per block.
  • Shape loops for the steady state
    • Process uniform blocks in the main loop; handle a tiny tail separately.
    • Hoist invariants (sizes, constants) and keep the inner loop straight-line.
  • Verify after each change
    • Rebuild, run unit tests/property checks, then benchmark representative inputs.

Technique Areas Overview

  • Byte/Bit Hot‑Path Techniques (detailed below)
  • Allocation & Memory Management (coming soon)
  • Text/String Building Patterns (coming soon)
  • Numeric/Arithmetic & Fixed‑Width Ops (coming soon)
  • Data Structures & Algorithms (coming soon)
  • Async/Await & Inter‑canister Patterns (coming soon)
  • Candid/Serialization Efficiency (coming soon)
  • Caching & Memoization (coming soon)

Related Skills

  • Benchmarks: skills/benchmarks-generation/SKILL.md
  • Code Quality Cleanup: skills/code-improvements/SKILL.md
  • Dot‑notation Migration: skills/dot-notation-migration/SKILL.md
  • General Style: skills/motoko-general-style-guidelines/SKILL.md

Byte/Bit Hot‑Path Techniques

These techniques target code that iterates over `Blob` or `[Nat8]`, performs bit packing/unpacking, or emits `Text` (e.g., encoders/decoders, checksums, binary parsers). They reduce allocations, avoid expensive integer widening, and reshape loops for better throughput while preserving correctness.

When To Use

  • You see hot loops over bytes and bit operations
  • Code constructs `Text` character-by-character or via many small concatenations
  • `Blob` is converted to `[Nat8]` only to iterate or index

Prerequisites

  • Motoko compiler (moc) at a reasonably recent version (1.3.0+ recommended)
  • Familiarity with fixed-width integer types (`Nat8`, `Nat16`, `Nat32`, `Nat64`) and wrapping arithmetic (`+%`)
  • For benchmarking guidance, see: skills/benchmarks-generation/SKILL.md
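
As a quick refresher on the prerequisites above, `+%` wraps around instead of trapping on overflow, and widening is staged through the fixed-width types. A sketch with illustrative values; the dot-notation conversions assume a recent moc:

```motoko
// Wrapping (+%) vs trapping (+) arithmetic on fixed-width types,
// and staged widening Nat8 -> Nat16 -> Nat32 as used throughout this guide.
let b : Nat8 = 0xFF;
let wide : Nat32 = b.toNat16().toNat32() << 8; // widen in stages, then shift
var counter : Nat8 = 255;
counter +%= 1;                                 // wraps to 0; plain += would trap
assert (counter == 0);
```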

Quick Wins Checklist (Heuristics)

Scan candidate code for these patterns and replace accordingly:
  • Blob handling
    • If you see `let bytes = Blob.toArray(data)` just to iterate or index → Prefer indexing `Blob` directly: `data[i]`.
    • Cache `data.size()` to a local; choose an appropriate width (often `Nat64`).
  • Integer conversions & bit ops
    • Avoid widening `Nat8 -> Nat` (arbitrary precision) for bit operations.
    • Use staged widening on fixed-width types: `b.toNat16().toNat32()` before shifts and masks.
    • Perform `<<`, `>>`, `&`, `|` on `Nat32`/`Nat64` intermediates, then narrow at the end if needed.
    • Use wrapping arithmetic `+%` for loop indices when overflow cannot occur by invariant; prefer a wide counter (e.g., `Nat64`).
  • Alphabets / lookup tables
    • Avoid `Char`/`Text` tables in hot paths. Store alphabets as `[Nat8]` (UTF-8/ASCII codes) for direct byte emission.
  • Text construction
    • Avoid per-character `Text.fromChar` plus repeated `#` concatenations.
    • Emit text in blocks: build a small ASCII `Blob` (e.g., 4–8 bytes), `Text.decodeUtf8` once, then append once per block.
    • Minimize `#` operations; one append per 4–8 output chars is far cheaper than many appends.
  • Loop shape
    • Process input in uniform blocks (e.g., 3 bytes → 4 chars), then handle a small tail (0–2 bytes) separately.
    • Hoist invariants (sizes, constants); keep inner loops straight-line.
  • Correctness guards
    • For theoretically unreachable states (e.g., ASCII decode failure), insert a `Prim.trap("…")` with a clear message.
    • Double-check off-by-one errors at block boundaries (e.g., `next_i <= sz`).

Before → After Patterns

  1. Avoid Blob→Array conversion

```motoko
// Before
let bytes = Blob.toArray(data);
var i = 0;
let b1 = bytes[i];

// After
let sz = Nat64.fromIntWrap(data.size());
var i : Nat64 = 0;
let b1 = data[i.toNat()];
```

  2. Prefer fixed-width ints and staged widening

```motoko
// Before (goes through arbitrary-precision Nat)
let n = (Nat32.fromNat(Nat8.toNat(b1)) << 16)
      | (Nat32.fromNat(Nat8.toNat(b2)) << 8)
      |  Nat32.fromNat(Nat8.toNat(b3));

// After (fixed-width path)
let n = (b1.toNat16().toNat32() << 16)
      | (b2.toNat16().toNat32() << 8)
      |  b3.toNat16().toNat32();
```

  3. Replace `Char`/`Text` alphabet + per-char concat with `[Nat8]` + block decode

```motoko
// Before
private let alphabet : [Char] = ['A', 'B', /*…*/, '/'];
let c1 = Text.fromChar(alphabet[idx1]);
let c2 = Text.fromChar(alphabet[idx2]);
result #= c1 # c2 # c3 # c4; // many small concats

// After
private let alphabet : [Nat8] = [65, 66, /*…*/, 47]; // ASCII bytes
let bytes = Blob.fromArray([
  alphabet[idx1], alphabet[idx2], alphabet[idx3], alphabet[idx4]
]);
switch (Text.decodeUtf8(bytes)) {
  case (?t) { result := result # t } // one append per block
  case (_)  { Prim.trap("Cannot happen: Utf8 decode error …") }
};
```

  4. Blocked processing with tail

```motoko
// Example: main loop over full blocks, then a small tail path
var i : Nat64 = 0;
var next_i : Nat64 = block; // e.g., 3, 6, etc.
while (next_i <= sz) {
  // read <block> bytes, produce <k> output chars
  i := next_i; next_i +%= block;
};
while (i < sz) {
  // read remaining bytes (tail), produce padded output as needed
  i +%= tailStep;
};
```

Step‑by‑Step Procedure (Agent Playbook)

  1. Scoping
    • Identify hot byte/bit paths: encoders/decoders, hashing, binary parsers.
    • If needed, prepare a small benchmark harness (see skills/benchmarks-generation/SKILL.md).
  2. Baseline
    • Run a benchmark on realistic inputs (sequential bytes, random/mixed); record results.
  3. Transformations (apply incrementally; keep each commit focused)
    • Remove `Blob.toArray` used solely for iteration or random access; index `Blob` directly.
    • Cache sizes; switch to a `Nat64` index; use `+%` where safe.
    • Replace `Char`/`Text` alphabets with `[Nat8]` tables.
    • Convert per-char concatenations into block emission using `Blob.fromArray` + `Text.decodeUtf8`.
    • Use `Nat16`/`Nat32`/`Nat64` intermediates for shifts and masks; avoid `Nat` widening in hot loops.
    • Reshape loops into large uniform blocks plus a small tail path.
    • Eliminate dead temporaries; prefer straight-line code.
    • Add explicit `Prim.trap` for logically unreachable decode errors.
  4. Validation
    • Unit tests: small ASCII examples, padding/edge cases, long sequential bytes; assert output length and alphabet membership where relevant.
    • Optional property checks: random inputs vs a reference implementation.
    • Benchmarks: verify improvements or parity across patterns and sizes (use skills/benchmarks-generation/SKILL.md).

Common Pitfalls

  1. Unnecessary widening to `Nat` in tight loops
    • Why it hurts: arbitrary-precision arithmetic is slower and allocates.
    • Fix: keep operations on `Nat32`/`Nat64` intermediates; stage `Nat8 -> Nat16 -> Nat32`.
  2. Per-character `Text` operations
    • Why it hurts: `Text.fromChar` + repeated `#` creates many small allocations.
    • Fix: emit ASCII bytes into a `Blob` block and `Text.decodeUtf8` once per block, then append once.
  3. Materializing `Blob` as `[Nat8]` to iterate
    • Why it hurts: full-buffer allocation and copy on the hot path.
    • Fix: index `Blob` directly and cache `size()`.
  4. Overflow checks on loop counters
    • Why it hurts: extra checks per iteration when they are provably unnecessary.
    • Fix: widen the counter to `Nat64` and use `+%` under a clear no-overflow invariant.
  5. Off-by-one at block boundaries
    • Symptom: traps or incorrect output near the end of input.
    • Fix: use `next_i <= sz` for the main loop; handle the remaining tail explicitly.

Safety & Edge Cases

  • Wrapping arithmetic: Use `+%` only when an invariant guarantees no overflow in the chosen width; prefer widening to `Nat64`.
  • `Text.decodeUtf8`: Safe for ASCII bytes (e.g., codec alphabets and `=` padding). Keep a defensive `Prim.trap` for auditability.
  • Bounds: Cleanly separate the fast block loop and the tail; test edge sizes (e.g., 0, 1, 2, 3, 5, 6, 7 depending on block size).
  • Allocation profile: Ensure you removed buffer materialization (`Blob.toArray`) and minimized concatenations.

Validation Checklist (copy/paste)

  • Tests cover: empty, minimal inputs, pad edges, long sequential bytes, multi-sentence ASCII where applicable.
  • Output length formula holds (as defined by the codec or algorithm).
  • Alphabet membership check passes; no stray chars (for encoders/decoders).
  • Bench shows improvement or parity; no regressions for mixed vs zero patterns (see skills/benchmarks-generation/SKILL.md).
  • No traps at block boundaries; indices don’t overflow.
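
One way to run the property-style checks above is to compare the optimized function against a reference implementation on inputs whose lengths sweep across block boundaries. A sketch; `encodeFast`, `encodeRef`, and `nextByte` are hypothetical names:

```motoko
import Array "mo:base/Array";

// Property-check sketch: the optimized encoder must agree with a
// reference implementation, with lengths chosen to cross block boundaries.
func propertyCheck(rounds : Nat) {
  var r = 0;
  while (r < rounds) {
    let len = r % 67; // exercise empty input, tails, and full blocks
    let input = Array.tabulate<Nat8>(len, func _ = nextByte());
    assert (encodeFast(input) == encodeRef(input));
    r += 1;
  };
};
```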

Commit Message Templates (optional)

Keep one change per commit when possible. Examples:
  • Skip conversion from Blob to Array in encoder.
  • Avoid unnecessary Char→Text conversion; use `[Nat8]` alphabet.
  • Reuse `data.size()` and use `Nat64` indices.
  • Convert `Nat8` to `Nat32` via `Nat16` for bit ops.
  • Process input in uniform blocks and handle tail separately.
  • Use `+%` for loop index under explicit invariant; widen index width.
  • Replace many small `#` concatenations with a single append of a decoded block.
  • Remove temporary `c1..c4` variables in tight loops.
  • Trap in unreachable `decodeUtf8` error path.

Outcome

Applying these techniques typically yields:
  • Fewer allocations by avoiding full-buffer materialization and per-char `Text` ops
  • Faster bit-packing/unpacking via fixed-width arithmetic
  • Reduced concatenation overhead by emitting larger `Text` chunks
  • Clearer separation of fast-path block processing and tail handling
  • Safer, more auditable code via explicit invariants and traps

Block-Cipher / Hash-Function Hot Paths

Techniques in this section are specifically for MD-style block hash functions and similar block-structured primitives (RIPEMD-160, SHA-1/2, MD5, Blake, block ciphers, etc.). They are not general advice — most code shouldn't look like this.

When These Techniques Apply

ALL of the following must be true to justify this style of optimization:
  1. Fixed-size input blocks processed by a transform function. The algorithm specifies an N-byte block (e.g., 64 bytes for SHA-256/RIPEMD-160) that is repeatedly fed to a compression function.
  2. Many rounds per block, each updating a small fixed set of state words. Typically 64–160 rounds operating on 4–8 `Nat32`/`Nat64` chaining variables. Tuple/record allocations on the per-round path dominate cost.
  3. Public streaming API (`write`/`update` + `sum`/`finalize`) where partial blocks must be buffered between calls.
  4. Throughput matters. The function is in a measurable hot path (e.g., signing, address derivation, large-message hashing).
If your code is one-shot, processes variable-size frames, or runs only a handful of times per request — stop. Use plain, readable code instead. Reference: src/Ripemd160.mo was rewritten for ~2× fewer instructions and ~3.6× less GC at 1 KiB inputs; the techniques cost ~300 lines of inlined rounds and are only worth it for primitives that consumers depend on heavily.

The Reference Pattern (research-ag/sha2 style)

The combined patterns below are the same shape used by research-ag/sha2's SHA-256 and were applied successfully to src/Ripemd160.mo. Apply them together — partial application leaves most of the gains on the table.

Pattern 1 — Mutable chaining state in `[var Nat32]` (or `[var Nat64]`)


```motoko
// Persistent chaining state, single allocation reused across all blocks.
private let s : [var Nat32] = VarArray.repeat<Nat32>(0, 5);
```

Don't use individual `var` fields for the chaining variables when there are many of them — the array slot stores fixed-width words inline. Don't split into `Nat16` halves unless you measure a benefit; for ≤8 state words it's not worth it.

Pattern 2 — Pre-decoded message schedule in `[var Nat32]` (NOT `[var Nat8]`)

```motoko
// 16 little-endian (or big-endian) words for the current block.
// Bytes are folded in at write time; transform() reads words directly.
private let msg : [var Nat32] = VarArray.repeat<Nat32>(0, 16);
```

This replaces the common (and slow) pattern:

```motoko
// AVOID for hot-path block hashing
private let buf : [var Nat8] = ...;          // 64-byte buffer
transform(buf.toArray(), 0);                  // allocates a 64-byte [Nat8] PER BLOCK
// transform then re-decodes 4 bytes → Nat32 word internally
```

The `[var Nat32]` design eliminates both the per-block `toArray()` copy and the read-side decoding inside `transform()`.

Pattern 3 — Unboxed `Nat16` byte counter for partial-block position

```motoko
// 0..63 byte position within the current block.
// Nat16 is unboxed in mutable storage; Nat is heap-allocated per increment.
private var i_msg : Nat16 = 0;
```

Use `Nat16` (or `Nat8` if the range fits) for any small mutable counter on a hot path — `var x : Nat = 0` allocates a boxed bignum on every assignment. `Nat64` is also unboxed but wastes register width when the value is provably small.

Pattern 4 — `writeByte` folds bytes into the word schedule

```motoko
private func writeByte(b : Nat8) {
  let pos = i_msg;
  let wi = Nat16.toNat(pos >> 2);
  let lane = pos & 0x3;
  let v : Nat32 = Nat32.fromNat16(b.toNat16()) << Nat32.fromNat16(lane << 3);
  if (lane == 0) { msg[wi] := v }            // first byte: overwrite stale word
  else { msg[wi] := msg[wi] | v };           // subsequent bytes: OR in
  let next = pos +% 1;
  if (next == 64) { transform(); n_blocks +%= 1; i_msg := 0 }
  else { i_msg := next };
};
```

Adjust the lane shift formula for big-endian schedules (e.g., SHA-2): `(3 - lane) << 3`.

Pattern 5 — Fast-path block decoder in `write`

```motoko
public func write(data : [Nat8]) {
  let n = data.size();
  if (n == 0) return;
  var i = 0;

  // (1) Finish any partial block one byte at a time.
  while (i_msg != 0 and i < n) { writeByte(data[i]); i += 1 };

  // (2) Fast path: decode 16 LE words inline directly from input → msg.
  while (i + 64 <= n) {
    msg[0] := data[i].toNat16().toNat32()
            | (data[i+1].toNat16().toNat32() <<  8)
            | (data[i+2].toNat16().toNat32() << 16)
            | (data[i+3].toNat16().toNat32() << 24);
    // ... msg[1] .. msg[15] (15 more identical lines)
    transform();
    n_blocks +%= 1;
    i += 64;
  };

  // (3) Tail: remaining < 64 bytes go into the partial block.
  while (i < n) { writeByte(data[i]); i += 1 };
};
```

Three loops, not one. The fast path is what makes large inputs cheap.

Pattern 6 — Inline ALL rounds inside `transform()`

The single biggest win for multi-round primitives. The original RIPEMD-160 code looked like:

```motoko
// AVOID: each call allocates a (Nat32, Nat32) tuple — 320 tuples per block
let (a, c) = r11(a, b, c, d, e, msg, 0, 11);
let (e, b) = r11(e, a, b, c, d, msg, 1, 14);
// ... 158 more
```

Inline every round into a flat sequence of statement updates on local mutable vars:

```motoko
var a1 : Nat32 = s[0]; var b1 : Nat32 = s[1]; /* ... */
// Left line round 1: f1(b,c,d) = b ^ c ^ d, K = 0
a1 := rol(a1 +% (b1 ^ c1 ^ d1) +% w0, 11) +% e1; c1 := rol(c1, 10);
e1 := rol(e1 +% (a1 ^ b1 ^ c1) +% w1, 14) +% d1; b1 := rol(b1, 10);
// ... 158 more, with K and rotation amounts varying per round
// Combine back to s without allocating temporaries
let t = s[0];
s[0] := s[1] +% c1 +% d2;
// ...
```

Yes, this is hundreds of lines. Yes, it's mechanically transcribed from the spec or the original tuple-returning helpers. Comment each round group with its `f` and `K`. The compiler does not currently inline these helpers automatically, and the tuple allocation per call is real.

Pattern 7 — Boxing-free `Nat64` → `Nat8` for length encoding

Padding writes the bit-length as 8 little-endian bytes. The naive `Nat8.fromNat(Nat64.toNat(x & 0xff))` round-trips through arbitrary-precision `Nat`, allocating per byte:

```motoko
// Stage all narrowing on fixed-width types — no Nat allocation.
private func lowByte64(v : Nat64) : Nat8 {
  Nat8.fromNat16(Nat16.fromNat32(Nat32.fromNat64(v & 0xff)));
};
```

Pattern 8 — Padding via `writeByte`, not allocated `[var Nat8]`

```motoko
public func sum() : [Nat8] {
  let bitlen : Nat64 = ((n_blocks << 6) +% Nat64.fromNat(Nat16.toNat(i_msg))) << 3;
  writeByte(0x80);
  while (i_msg != 56) { writeByte(0) };
  writeByte(lowByte64(bitlen));
  writeByte(lowByte64(bitlen >> 8));
  // ... 6 more length bytes; the 8th triggers transform()
  // serialize state to 20/32 output bytes
};
```

This eliminates the typical `pad : [var Nat8]` and `sizedesc : [var Nat8]` allocations plus their `.toArray()` copies.

Anti-Patterns To Remove When Adopting This Style

  • `var counter : Nat = 0` on a per-byte path → use `Nat16`/`Nat64`.
  • Returning multi-value tuples `(Nat32, Nat32)` from per-round helpers.
  • Per-block `buf.toArray()` to hand bytes to `transform`.
  • Allocating `pad` and `sizedesc` arrays in `sum`/`finalize`.
  • Wrapping each round's bit operations with `Common.readLE32(arr, i)` calls when you can keep the words pre-decoded in `[var Nat32]`.
  • Going through `Nat` for any byte-level conversion in the hot path.

Operator-Precedence Caution

Motoko's `^` (XOR), `&`, `|`, `+%`, `<<`, `>>` precedences may not match C/Rust intuition. Aggressively parenthesize f-functions and round expressions:

```motoko
a1 := rol(a1 +% ((b1 & c1) | (^b1 & d1)) +% w0 +% 0x5A827999, 11) +% e1;
```

Don't trust the compiler to group `^b1 & d1` as `(^b1) & d1` if you haven't checked. Adding parens costs nothing at runtime.

Validation Workflow (Required)

These optimizations are easy to mis-transcribe (160 round lines for RIPEMD-160 alone). Before claiming success:
  1. Full test suite must pass. Hash test vectors are non-negotiable — a single wrong rotation amount, K constant, or message-word index will corrupt all outputs but may still produce stable-looking bytes.
  2. Benchmark against the previous version. If instructions don't drop ≥1.5× and GC doesn't drop ≥2× at large inputs, something is wrong with the buffering or rounds are still allocating.
  3. Benchmark against an external reference if one exists (e.g., another mops package). Confirms you're not just shuffling cost around.

When NOT To Apply These Patterns

  • One-shot or low-frequency code. A 5-line hash call invoked once per request: leave the readable version alone.
  • Algorithms with variable-length blocks (e.g., compression, parsing). The `msg : [var Nat32]` schedule assumes fixed-width words at fixed positions.
  • Algorithms with few rounds (e.g., < 16). Inlining is overkill; the tuple-allocation cost is small relative to other work.
  • Code that needs to remain easy to audit against a spec. Inlined rounds are harder to read than a `for r in rounds` loop. For new primitives, write the clear version first, ship it, and only inline if benchmarks justify it.
  • Generic/parametric implementations. This style hard-codes the algorithm; you can't easily share `transform()` across hash variants.

Outcome (Reference: RIPEMD-160 in this repo)

Measured at 1024-byte input:

| Metric | Before | After | Speedup |
|---|---|---|---|
| Instructions | 1,470,125 | 734,849 | 2.0× |
| GC traffic | 160.28 KiB | 43.88 KiB | 3.65× |
| Heap (steady) | 272 B | 272 B | 1.0× |

Public API and observable behavior unchanged; all 24 test files pass.