otp-thinking

OTP Thinking

Paradigm shifts for OTP design. These insights challenge typical concurrency and state management patterns.

The Iron Law

GENSERVER IS A BOTTLENECK BY DESIGN
A GenServer processes ONE message at a time. Before creating one, ask:
  1. Do I actually need serialized access?
  2. Will this become a throughput bottleneck?
  3. Can reads bypass the GenServer via ETS?
The ETS pattern: GenServer owns ETS table, writes serialize through GenServer, reads bypass it entirely with `:read_concurrency`.
No exceptions: Don't wrap stateless functions in GenServer. Don't create GenServer "for organization".
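
A minimal sketch of this pattern, assuming an illustrative `MyApp.Cache` module and table name: the GenServer owns a `:protected` table, so only it can write, while any process reads directly.

```elixir
defmodule MyApp.Cache do
  use GenServer

  @table :my_app_cache

  # Reads hit ETS directly: no GenServer message, no bottleneck.
  def get(key), do: :ets.lookup(@table, key)

  # Writes serialize through the GenServer that owns the table.
  def put(key, value), do: GenServer.call(__MODULE__, {:put, key, value})

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(_opts) do
    table = :ets.new(@table, [:named_table, :set, :protected, read_concurrency: true])
    {:ok, table}
  end

  @impl true
  def handle_call({:put, key, value}, _from, table) do
    :ets.insert(table, {key, value})
    {:reply, :ok, table}
  end
end
```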

GenServer Patterns

| Function | Use For |
| --- | --- |
| `call/3` | Synchronous requests expecting replies |
| `cast/2` | Fire-and-forget messages |

When in doubt, use `call` to ensure back-pressure. Set appropriate timeouts for `call/3`.
Use `handle_continue/2` for post-init work—keeps `init/1` fast and non-blocking.
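
A sketch of the `handle_continue/2` pattern (module name and workload are hypothetical): `init/1` returns immediately and defers the slow work, so the supervisor starting this process is never blocked.

```elixir
defmodule MyApp.Warmup do
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(opts) do
    # Return right away; the {:continue, :load} tuple schedules handle_continue/2.
    {:ok, %{opts: opts, data: nil}, {:continue, :load}}
  end

  @impl true
  def handle_continue(:load, state) do
    # Slow post-init work runs here, after init/1 returns but before any other message.
    {:noreply, %{state | data: load_data(state.opts)}}
  end

  # Stand-in for an expensive load (cache warm-up, file read, etc.).
  defp load_data(_opts), do: :warmed_up
end
```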

Task.Supervisor, Not Task.async

`Task.async` spawns a linked process—if task crashes, caller crashes too.

| Pattern | On task crash |
| --- | --- |
| `Task.async/1` | Caller crashes (linked, unsupervised) |
| `Task.Supervisor.async/2` | Caller crashes (linked, supervised) |
| `Task.Supervisor.async_nolink/2` | Caller survives, can handle error |

Use Task.Supervisor for: production code, graceful shutdown, observability, `async_nolink`. Use Task.async for: quick experiments, scripts, when crash-together is acceptable.
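
A minimal sketch of the `async_nolink` flow (supervisor name is illustrative; normally it lives in your supervision tree rather than being started inline):

```elixir
{:ok, _sup} = Task.Supervisor.start_link(name: MyApp.TaskSupervisor)

task =
  Task.Supervisor.async_nolink(MyApp.TaskSupervisor, fn ->
    # Stand-in for work that may raise or exit.
    :rand.uniform(100)
  end)

case Task.yield(task, 5_000) || Task.shutdown(task) do
  {:ok, result} -> {:ok, result}
  {:exit, reason} -> {:error, reason}  # the task crashed; the caller survives
  nil -> {:error, :timeout}            # not done in time, so it was shut down
end
```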

DynamicSupervisor + Registry = Named Dynamic Processes

DynamicSupervisor only supports `:one_for_one` (dynamic children have no ordering). Use Registry for names—never create atoms dynamically:

```elixir
defp via_tuple(id), do: {:via, Registry, {MyApp.Registry, id}}
```

PartitionSupervisor scales DynamicSupervisor for millions of children.
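
A sketch of the full flow, assuming a hypothetical `MyApp.Worker` GenServer that accepts a `:name` option, plus `{Registry, keys: :unique, name: MyApp.Registry}` and `{DynamicSupervisor, name: MyApp.DynamicSupervisor, strategy: :one_for_one}` in the supervision tree:

```elixir
defmodule MyApp.WorkerManager do
  defp via_tuple(id), do: {:via, Registry, {MyApp.Registry, id}}

  # Start a worker registered under `id`; no atoms are created at runtime.
  def start_worker(id) do
    DynamicSupervisor.start_child(
      MyApp.DynamicSupervisor,
      {MyApp.Worker, name: via_tuple(id)}
    )
  end

  # Reach the same worker later by id.
  def call_worker(id, msg), do: GenServer.call(via_tuple(id), msg)
end
```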

:pg for Distributed, Registry for Local

| Tool | Scope | Use Case |
| --- | --- | --- |
| Registry | Single node | Named dynamic processes |
| `:pg` | Cluster-wide | Process groups, pub/sub |

`:pg` replaced the deprecated `:pg2`. Horde provides distributed supervisor/registry with CRDTs.
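
A short sketch of `:pg` usage (group name is illustrative; in a real application the default scope is started from the supervision tree with `%{id: :pg, start: {:pg, :start_link, []}}`):

```elixir
# Start the default :pg scope (once per node).
{:ok, _} = :pg.start_link()

# Join the current process to a cluster-wide group.
:ok = :pg.join(:notifications, self())

# Message every member on every connected node.
for pid <- :pg.get_members(:notifications), do: send(pid, {:notify, "hello"})
```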

Broadway vs Oban: Different Problems

| Tool | Use For |
| --- | --- |
| Broadway | External queues (SQS, Kafka, RabbitMQ) — data ingestion with batching |
| Oban | Background jobs with database persistence |

Broadway is NOT a job queue.
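
For contrast, a minimal Oban worker sketch (queue, module, and args are illustrative); the job row is persisted in the database and retried on failure:

```elixir
defmodule MyApp.Workers.SendEmail do
  use Oban.Worker, queue: :mailers, max_attempts: 3

  @impl Oban.Worker
  def perform(%Oban.Job{args: %{"to" => to}}) do
    # Args come back from the database with string keys.
    IO.puts("sending email to #{to}")
    :ok
  end
end

# Enqueue the job:
%{to: "user@example.com"}
|> MyApp.Workers.SendEmail.new()
|> Oban.insert()
```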

Broadway Gotchas

Processors are for runtime, not code organization. Dispatch to modules in `handle_message`; don't add processors for different message types.
`one_for_all` is for Broadway bugs, not your code. Your `handle_message` errors are caught and result in failed messages, not supervisor restarts.
Handle expected failures in the producer (connection loss, rate limits). Reserve `max_restarts` for unexpected bugs.
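
A sketch of dispatching inside `handle_message/3` (producer config, `MyApp.Orders`, and `MyApp.Refunds` are placeholders; `Broadway.DummyProducer` is Broadway's built-in test producer):

```elixir
defmodule MyApp.Pipeline do
  use Broadway

  alias Broadway.Message

  def start_link(_opts) do
    Broadway.start_link(__MODULE__,
      name: __MODULE__,
      # Real pipelines plug in BroadwaySQS/BroadwayKafka/BroadwayRabbitMQ here.
      producer: [module: {Broadway.DummyProducer, []}],
      processors: [default: [concurrency: 10]]
    )
  end

  @impl true
  def handle_message(_processor, %Message{data: data} = message, _context) do
    # Route by message content here rather than adding a processor per type.
    case data do
      %{type: :order} = event -> MyApp.Orders.handle(event)
      %{type: :refund} = event -> MyApp.Refunds.handle(event)
      # Anything else raises; Broadway fails the message instead of restarting.
    end

    message
  end
end
```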

Supervision Strategies Encode Dependencies

| Strategy | Children Relationship |
| --- | --- |
| `:one_for_one` | Independent |
| `:one_for_all` | Interdependent (all restart) |
| `:rest_for_one` | Sequential dependency |

Use `:max_restarts` and `:max_seconds` to prevent restart loops.
Think about failure cascades BEFORE coding.
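
A sketch of encoding a startup dependency with `:rest_for_one` (child modules are hypothetical):

```elixir
defmodule MyApp.Supervisor do
  use Supervisor

  def start_link(opts), do: Supervisor.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(_opts) do
    children = [
      MyApp.Repo,      # everything below depends on the database
      MyApp.Cache,     # depends on Repo
      MyApp.Endpoint   # depends on both
    ]

    # :rest_for_one: if Repo dies, Cache and Endpoint restart too;
    # if Endpoint dies, only Endpoint restarts.
    Supervisor.init(children,
      strategy: :rest_for_one,
      max_restarts: 3,
      max_seconds: 5
    )
  end
end
```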

Abstraction Decision Tree

```
Need state?
├── No → Plain function
└── Yes → Complex behavior?
    ├── No → Agent
    └── Yes → Supervision?
        ├── No → spawn_link
        └── Yes → Request/response?
            ├── No → Task.Supervisor
            └── Yes → Explicit states?
                ├── No → GenServer
                └── Yes → GenStateMachine
```

Storage Options

| Need | Use |
| --- | --- |
| Memory cache | ETS (`:read_concurrency` for reads) |
| Static config | `:persistent_term` (faster than ETS) |
| Disk persistence | DETS (2GB limit) |
| Transactions/Distribution | Mnesia |
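
A short sketch of the first two rows (table and config names are illustrative). `:persistent_term` reads are fast and copy-free, but every update triggers a global GC, so reserve it for data that rarely changes:

```elixir
# ETS cache tuned for concurrent reads:
:ets.new(:prices, [:named_table, :set, :public, read_concurrency: true])
:ets.insert(:prices, {"BTC", 64_000})
:ets.lookup(:prices, "BTC")  #=> [{"BTC", 64000}]

# :persistent_term for effectively static config:
:persistent_term.put({MyApp, :feature_flags}, %{dark_mode: true})
:persistent_term.get({MyApp, :feature_flags})
```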

:sys Debugs ANY OTP Process

```elixir
:sys.get_state(pid)        # Current state
:sys.trace(pid, true)      # Trace events (TURN OFF when done!)
```

Telemetry Is Built Into Everything

Phoenix, Ecto, and most libraries emit telemetry events. Attach handlers:

```elixir
:telemetry.attach("my-handler", [:phoenix, :endpoint, :stop], &handle/4, nil)
```

Use `Telemetry.Metrics` + reporters (StatsD, Prometheus, LiveDashboard).
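
A sketch of the handler side (module, log message, and metric are illustrative; attach with `&MyApp.TelemetryHandler.handle/4` if the handler lives in another module):

```elixir
defmodule MyApp.TelemetryHandler do
  require Logger

  # Arity-4 handler: event name, measurements, metadata, config.
  def handle([:phoenix, :endpoint, :stop], measurements, metadata, _config) do
    duration_ms = System.convert_time_unit(measurements.duration, :native, :millisecond)
    Logger.info("#{metadata.conn.request_path} served in #{duration_ms}ms")
  end
end

# Declarative alternative, fed to a reporter such as LiveDashboard:
Telemetry.Metrics.summary("phoenix.endpoint.stop.duration",
  unit: {:native, :millisecond}
)
```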

Red Flags - STOP and Reconsider

  • GenServer wrapping stateless computation
  • Task.async in production when you need error handling
  • Creating atoms dynamically for process names
  • Single GenServer becoming throughput bottleneck
  • Using Broadway for background jobs (use Oban)
  • Using Oban for external queue consumption (use Broadway)
  • No supervision strategy reasoning
Any of these? Re-read The Iron Law and use the Abstraction Decision Tree.