tailslayer-dram-hedged-reads

Tailslayer — DRAM Hedged Read Library

Skill by ara.so — Daily 2026 Skills collection.
Tailslayer is a C++ library that reduces tail latency in RAM reads caused by DRAM refresh stalls. It replicates data across multiple independent DRAM channels with uncorrelated refresh schedules, issues hedged reads to all replicas simultaneously, and uses whichever read completes first, so a refresh on a single channel no longer sets the worst-case read latency.
It works on AMD, Intel, and AWS Graviton by relying on undocumented channel scrambling offsets.

How It Works

  • Data is replicated N times, each copy placed on a different DRAM channel
  • Each replica is monitored by a worker pinned to a separate CPU core
  • When a read is triggered (via your signal function), all replicas are read simultaneously
  • Whichever channel responds first wins; the result is passed to your work function
  • A DRAM refresh on one channel does not stall the others, so a read is delayed only if every replica's channel is refreshing at the same moment; this removes most refresh-induced tail latency
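
The race described in the list above can be sketched with ordinary threads and an atomic flag. This is only an illustration of the first-responder-wins idea, not tailslayer's implementation: `hedged_read` and `HedgedResult` are hypothetical names, the two vectors are plain heap copies rather than channel-placed replicas, and threads are spawned per read purely for brevity, whereas the library keeps pinned workers spinning.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical illustration: two heap copies stand in for replicas that the
// real library would place on different DRAM channels.
struct HedgedResult {
    std::uint32_t value;  // value delivered by the winning replica
    int winner;           // index of the replica that answered first
};

inline HedgedResult hedged_read(
    const std::array<std::vector<std::uint32_t>, 2>& replicas,
    std::size_t index) {
    assert(index < replicas[0].size() && index < replicas[1].size());

    std::atomic<int> winner{-1};
    std::uint32_t result = 0;

    auto worker = [&](int id) {
        // volatile forces an actual load from this replica's memory
        const volatile std::uint32_t* slot =
            &replicas[static_cast<std::size_t>(id)][index];
        const std::uint32_t v = *slot;

        // Only the first thread to get here claims the result
        int expected = -1;
        if (winner.compare_exchange_strong(expected, id)) {
            result = v;
        }
    };

    std::thread a(worker, 0);
    std::thread b(worker, 1);
    a.join();
    b.join();
    return {result, winner.load()};
}
```

Both replicas hold the same data, so the returned value is identical whichever thread wins; only the latency differs.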

Installation

Copy the header into your project

bash
git clone https://github.com/LaurieWired/tailslayer.git
cp -r tailslayer/include/tailslayer /your/project/include/

Include in your code

cpp
#include <tailslayer/hedged_reader.hpp>

Build the provided example

bash
git clone https://github.com/LaurieWired/tailslayer.git
cd tailslayer
make
./tailslayer_example

Key API

tailslayer::HedgedReader<T, SignalFn, WorkFn, SignalArgs, WorkArgs>

Template parameters:

| Parameter | Description |
| --- | --- |
| T | Value type stored and read |
| SignalFn | Function that waits for a trigger and returns the index to read |
| WorkFn | Function called with the value immediately after the read |
| SignalArgs | (optional) tailslayer::ArgList<...> of compile-time args passed to the signal function |
| WorkArgs | (optional) tailslayer::ArgList<...> of compile-time args passed to the work function |

Constructor optional parameters

cpp
HedgedReader(
    uint64_t channel_offset = DEFAULT_OFFSET,  // undocumented channel scrambling offset
    uint64_t channel_bit    = DEFAULT_BIT,     // bit used for channel selection
    std::size_t n_replicas  = 2                // number of DRAM channel replicas
)

Methods

cpp
reader.insert(T value);       // Insert value, replicated across all channels
reader.start_workers();       // Launch per-channel worker threads (blocking)

Utilities

cpp
tailslayer::pin_to_core(core_id);        // Pin calling thread to a specific core
tailslayer::CORE_MAIN                    // Constant: recommended core for main thread
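
On Linux, pinning the calling thread can be done with `sched_setaffinity`. The sketch below is an assumption about how such a helper might look, not the body of `tailslayer::pin_to_core`; `pin_calling_thread` and `first_allowed_core` are hypothetical names, and the code relies on the glibc `CPU_*` macros.

```cpp
#include <sched.h>  // sched_getaffinity, sched_setaffinity, CPU_* macros (Linux/glibc)

#include <cassert>

// Hypothetical helper: pins the calling thread to one core and reports
// whether the kernel accepted the request.
inline bool pin_calling_thread(int core_id) {
    assert(core_id >= 0 && core_id < CPU_SETSIZE);
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core_id, &set);
    // A pid of 0 means "the calling thread"
    return sched_setaffinity(0, sizeof(set), &set) == 0;
}

// Returns the lowest-numbered core the calling thread may run on, or -1 on error.
inline int first_allowed_core() {
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0) {
        return -1;
    }
    for (int c = 0; c < CPU_SETSIZE; ++c) {
        if (CPU_ISSET(c, &set)) {
            return c;
        }
    }
    return -1;
}
```

Checking the return value matters in containers and under cgroup CPU limits, where a requested core may not be available to the process.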

Minimal Usage Pattern

cpp
#include <tailslayer/hedged_reader.hpp>
#include <cstdint>
#include <cstdio>

// Set by whichever thread produces events. Defined here so the example links;
// in a real program they may live in another translation unit.
volatile std::size_t g_index = 0;
volatile bool g_trigger = false;

// 1. Define your signal function: waits for your event, returns the index to read
[[gnu::always_inline]] inline std::size_t my_signal() {
    // Example: busy-wait for an external flag, then return the index
    while (!g_trigger) {}
    g_trigger = false;
    return g_index;
}

// 2. Define your work function — receives the read value immediately
template <typename T>
[[gnu::always_inline]] inline void my_work(T val) {
    // Process val as fast as possible
    printf("Read value: %u\n", (unsigned)val);
}

int main() {
    using T = uint8_t;

    // Pin main thread to recommended core
    tailslayer::pin_to_core(tailslayer::CORE_MAIN);

    // Construct reader with 2 replicas (default)
    tailslayer::HedgedReader<T, my_signal, my_work<T>> reader{};

    // Insert data — replicated across both DRAM channels automatically
    reader.insert(0x43);
    reader.insert(0x44);

    // Launch workers — blocks; workers spin until signal fires
    reader.start_workers();

    return 0;
}

Passing Arguments to Signal and Work Functions

Use tailslayer::ArgList<...> to pass compile-time integer arguments:
cpp
#include <tailslayer/hedged_reader.hpp>

// Signal function with args
[[gnu::always_inline]] inline std::size_t my_signal(int threshold, int channel) {
    // use threshold and channel...
    return 0;
}

// Work function with args
template <typename T>
[[gnu::always_inline]] inline void my_work(T val, int multiplier) {
    volatile int result = (int)val * multiplier;
    (void)result;
}

int main() {
    using T = uint8_t;
    tailslayer::pin_to_core(tailslayer::CORE_MAIN);

    tailslayer::HedgedReader<
        T,
        my_signal,
        my_work<T>,
        tailslayer::ArgList<10, 1>,   // args forwarded to my_signal: threshold=10, channel=1
        tailslayer::ArgList<2>        // args forwarded to my_work:   multiplier=2
    > reader{};

    reader.insert(0xAB);
    reader.start_workers();
}
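
To make the forwarding less opaque, here is one way a compile-time argument list could be unpacked into a call. This is a sketch under assumptions, not the library's definition: `ArgListSketch`, `Invoker`, `pick_index`, and `scale` are hypothetical names, and the real `tailslayer::ArgList` may be implemented differently.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for tailslayer::ArgList: a pack of compile-time ints.
template <int... Values>
struct ArgListSketch {
    static constexpr std::size_t size = sizeof...(Values);
};

// Primary template; the specialisation below unpacks the value pack.
template <auto Fn, typename Args>
struct Invoker;

template <auto Fn, int... Values>
struct Invoker<Fn, ArgListSketch<Values...>> {
    // Calls Fn(leading..., Values...), appending the compile-time values
    // after any runtime arguments such as the value that was just read.
    template <typename... Leading>
    static constexpr auto call(Leading... leading) {
        return Fn(leading..., Values...);
    }
};

// Callables with the same shapes as the signal and work functions above.
constexpr std::size_t pick_index(int threshold, int channel) {
    return static_cast<std::size_t>(threshold + channel);
}

constexpr int scale(int value, int multiplier) {
    return value * multiplier;
}

// The pack expands to scale(21, 2) at compile time.
static_assert(Invoker<scale, ArgListSketch<2>>::call(21) == 42,
              "compile-time argument is appended after the runtime one");
```

Because the values are template arguments, the compiler can constant-fold them into the inlined function, which is presumably why the arguments are restricted to compile-time integers.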

Custom Channel Configuration

Override channel offset, channel bit, and replica count in the constructor:
cpp
// Example: 4 replicas, custom channel bit 8 (see Platform Notes for typical values)
tailslayer::HedgedReader<T, my_signal, my_work<T>> reader{
    /* channel_offset */ 0,
    /* channel_bit    */ 8,
    /* n_replicas     */ 4
};
Note: N-way hedging (more than 2 replicas) requires the benchmark code in discovery/benchmark/. The main library header currently exposes 2 channels by default.
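
As a rough mental model for what the channel bit means, assume the channel is selected by a single physical address bit. Real memory controllers may hash several bits together (hence the scrambling offset), so the functions below are hypothetical and only illustrate the arithmetic. Bits 6 to 8 lie inside the 4 KiB page offset, so they are the same in virtual and physical addresses.

```cpp
#include <cassert>
#include <cstdint>

// Simplified, hypothetical model: one address bit selects the channel.
constexpr unsigned channel_of(std::uint64_t addr, unsigned channel_bit) {
    return static_cast<unsigned>((addr >> channel_bit) & 1u);
}

// Address that maps to the other channel under this model.
constexpr std::uint64_t sibling_address(std::uint64_t addr, unsigned channel_bit) {
    return addr ^ (std::uint64_t{1} << channel_bit);
}

// Byte distance between the two replicas: flipping bit N moves the address by 2^N.
constexpr std::uint64_t replica_stride(unsigned channel_bit) {
    return std::uint64_t{1} << channel_bit;
}

static_assert(replica_stride(8) == 256, "bit 8 corresponds to a 256-byte stride");
static_assert(channel_of(0x1000, 8) != channel_of(sibling_address(0x1000, 8), 8),
              "the sibling address lands on the other channel");
```

Under this model a wrong channel bit would put both replicas on the same channel, which would explain why hedging shows no benefit until the bit matches the platform.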

Running Benchmarks

Channel-hedged read benchmark (N-way)

bash
cd discovery/benchmark
make
sudo chrt -f 99 ./hedged_read_cpp --all --channel-bit 8
Flags:

| Flag | Description |
| --- | --- |
| --all | Run all channel configurations |
| --channel-bit N | Specify the DRAM channel selection bit (try 6, 7, or 8 for your platform) |

DRAM refresh spike timing probe

bash
cd discovery
gcc -O2 -o trefi_probe trefi_probe.c
sudo ./trefi_probe
This measures your DRAM's tREFI refresh interval and the worst-case stall duration — useful for calibrating expectations.
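
When comparing your own latency samples against the probe's numbers, it helps to look at percentiles rather than the mean, since refresh stalls show up only in the tail. The helper below is a generic nearest-rank percentile, not part of tailslayer or `trefi_probe`; the name `percentile` and the nanosecond unit are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it. Requires 0 < p <= 100.
inline std::uint64_t percentile(std::vector<std::uint64_t> samples_ns, double p) {
    assert(!samples_ns.empty() && p > 0.0 && p <= 100.0);
    std::sort(samples_ns.begin(), samples_ns.end());

    // 1-based rank = ceil(p * n / 100)
    const double exact = p * static_cast<double>(samples_ns.size()) / 100.0;
    std::size_t rank = static_cast<std::size_t>(exact);
    if (static_cast<double>(rank) < exact) {
        ++rank;
    }
    if (rank == 0) {
        rank = 1;
    }
    return samples_ns[rank - 1];
}
```

A large gap between `percentile(samples, 50.0)` and `percentile(samples, 99.9)` is the signature of occasional stalls; hedging is meant to shrink that gap, not the median.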

Platform Notes

| Platform | Typical Channel Bit | Notes |
| --- | --- | --- |
| AMD (Zen) | 6 or 7 | Verify with benchmark |
| Intel | 6, 7, or 8 | Run benchmark with --all |
| AWS Graviton | 8 | Confirmed working |

Use --all in the benchmark to auto-detect the best channel bit for your system.

Common Patterns

Low-latency trading / event-driven read

cpp
#include <tailslayer/hedged_reader.hpp>
#include <cstddef>
#include <cstdint>
#include <vector>

// Pre-load order book prices into hedged reader
// Signal on market data arrival, process immediately

// Provided elsewhere in your application (names are placeholders)
extern volatile std::size_t g_book_idx;
extern volatile bool g_tick;
extern std::vector<uint64_t> preloaded_prices;

[[gnu::always_inline]] inline std::size_t await_market_signal() {
    // __builtin_ia32_pause() is x86-only; on AArch64 (e.g. Graviton) use a yield hint instead
    while (!g_tick) { __builtin_ia32_pause(); }
    g_tick = false;
    return g_book_idx;
}

template <typename T>
[[gnu::always_inline]] inline void process_price(T price) {
    // Submit order using price with minimal latency
    extern void submit_order(T);
    submit_order(price);
}

int main() {
    tailslayer::pin_to_core(tailslayer::CORE_MAIN);
    tailslayer::HedgedReader<uint64_t, await_market_signal, process_price<uint64_t>> reader{};
    for (uint64_t price : preloaded_prices) {
        reader.insert(price);
    }
    reader.start_workers();
}

Preloading a lookup table across channels

cpp
// Each insert automatically maps to correct DRAM channel via address calculation
// Access is via logical index — tailslayer manages physical placement

tailslayer::HedgedReader<uint32_t, my_signal, my_work<uint32_t>> reader{};

std::vector<uint32_t> lut = {100, 200, 300, 400};
for (auto v : lut) {
    reader.insert(v);
}
reader.start_workers();

Troubleshooting

High latency still observed

  • Verify you are using the correct --channel-bit for your CPU. Run the benchmark with --all.
  • Ensure workers are pinned to isolated cores (use the isolcpus= kernel boot parameter).
  • Run with real-time scheduling: sudo chrt -f 99 ./your_binary

Build errors — missing headers

  • Confirm include/tailslayer/hedged_reader.hpp is on your include path.
  • Requires C++17 or later: add -std=c++17 to your compiler flags.

Workers don't start / deadlock

  • start_workers() is blocking. It launches threads and waits; your signal function must eventually return.
  • Ensure the signal function does not block indefinitely during testing.
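
For tests, one option is a signal helper that gives up after a deadline instead of spinning forever. The function below is a hypothetical test aid, not part of tailslayer: `wait_with_timeout`, its parameters, and the fallback-index convention are all assumptions, and it uses `std::atomic` rather than the `volatile` flags shown in the earlier examples.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical test helper: spins on a flag like the signal functions above,
// but returns `fallback` if no event arrives before `timeout` elapses.
inline std::size_t wait_with_timeout(std::atomic<bool>& trigger,
                                     const std::atomic<std::size_t>& index,
                                     std::chrono::milliseconds timeout,
                                     std::size_t fallback) {
    assert(timeout.count() >= 0);
    const auto deadline = std::chrono::steady_clock::now() + timeout;

    while (!trigger.load(std::memory_order_acquire)) {
        if (std::chrono::steady_clock::now() >= deadline) {
            return fallback;  // no event arrived in time
        }
    }

    trigger.store(false, std::memory_order_relaxed);  // consume the event
    return index.load(std::memory_order_relaxed);
}
```

Reading the clock on every iteration adds latency to the spin loop, so a helper like this is better suited to test builds than to the production hot path.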

Data corruption / wrong values

  • Each insert() replicates the value N times (one per channel). Logical indexing is handled internally; do not attempt to address replicas directly.
  • Do not modify inserted data after insert() is called.

Platform not supported

  • Tailslayer uses undocumented DRAM channel scrambling offsets. If your platform is not AMD, Intel, or Graviton, run the trefi_probe and benchmark tools to characterize refresh behavior before using the library in production.

Project Structure

tailslayer/
├── include/tailslayer/
│   └── hedged_reader.hpp       # Main library header (copy this)
├── tailslayer_example.cpp      # Usage example
├── discovery/
│   ├── trefi_probe.c           # DRAM refresh spike timing tool
│   └── benchmark/              # N-way channel hedging benchmark
└── Makefile