RPC Selection and Resilience

Role framing: You are a Solana infra engineer. Your goal is to choose the right RPC mix and make clients robust to rate limits, outages, and latency spikes.

Initial Assessment

  • Traffic profile: reads vs writes, peak TPS, burstiness, geographic distribution.
  • Critical paths: which features are user-facing vs background? Latency SLOs?
  • Budget: monthly cap? willingness to pay for priority lanes?
  • Data needs: logs/blocks historical depth, filters, WebSocket support, state compression?
  • Clients: browser, server, bots? Using connection pooling? Using Jito? Alchemy/Helius/QuickNode/own node?
  • Observability stack: how to measure RPC errors/latency; alert thresholds.

Core Principles

  • Separate read/write: use high-quality paid endpoints for writes; cache-friendly endpoints for reads.
  • Multi-provider strategy: primary + failover with health checks; avoid single-vendor lock-in.
  • Backpressure and rate limits: exponential backoff, jitter, and circuit breakers instead of blind retries.
  • Timeouts tuned: short for UX-critical reads, longer for archival queries; prefer abortable fetch.
  • Deterministic commitment: specify processed/confirmed/finalized per use case; avoid defaults.
  • Security: pin RPC URLs; avoid leaking API keys to clients; prefer server-proxied writes.
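
The backpressure principle above can be sketched as a small per-endpoint circuit breaker. This is a minimal illustration under stated assumptions, not a prescribed implementation: the `CircuitBreaker` class, its method names, and the defaults (5 consecutive failures, 10-second cooldown) are hypothetical and should be tuned per provider.

```typescript
type BreakerState = "closed" | "open" | "half-open";

// Hypothetical per-endpoint breaker; thresholds are illustrative assumptions.
class CircuitBreaker {
  private state: BreakerState = "closed";
  private failures = 0;
  private openedAt = 0;
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(
    failureThreshold = 5,
    cooldownMs = 10_000,
    now: () => number = () => Date.now(),
  ) {
    this.failureThreshold = failureThreshold; // consecutive failures before opening
    this.cooldownMs = cooldownMs;             // time to stay open before probing again
    this.now = now;                           // injectable clock, useful in tests
  }

  /** True if a request may be sent to this endpoint right now. */
  canRequest(): boolean {
    if (this.state === "open" && this.now() - this.openedAt >= this.cooldownMs) {
      this.state = "half-open"; // cooldown elapsed: allow probe traffic
    }
    return this.state !== "open";
  }

  /** Any success closes the breaker and resets the failure count. */
  recordSuccess(): void {
    this.failures = 0;
    this.state = "closed";
  }

  /** A failed probe, or too many consecutive failures, opens the breaker. */
  recordFailure(): void {
    this.failures += 1;
    if (this.state === "half-open" || this.failures >= this.failureThreshold) {
      this.state = "open";
      this.openedAt = this.now();
    }
  }

  getState(): BreakerState {
    return this.state;
  }
}
```

A client would keep one breaker per endpoint, skip endpoints where `canRequest()` is false, and call `recordSuccess()` or `recordFailure()` after each response.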

Workflow

  1. Provider comparison
    • Evaluate: reliability SLA, included features (webhooks, state compression, enhanced APIs), geo, price per million requests, burst limits.
    • Select primary + two fallbacks; note auth schemes.
  2. Endpoint configuration
    • Define per-use-case endpoints: READ_PRIMARY, READ_CACHE, WRITE_TX, WEBSOCKET.
    • Store in env with rotation ability; never hardcode keys in client bundles.
  3. Client resilience
    • Implement health checks (ping, slot distance, error rate); fail over automatically when thresholds are breached.
    • Use request hedging for latency-sensitive reads (send to two endpoints, take the first successful response), with a cap on the extra requests.
    • For writes: set priority fees and run preflight; on BlockhashNotFound or NodeBehind, refresh the blockhash, re-sign, and retry against another endpoint.
  4. Performance controls
    • Batch RPC calls (getMultipleAccounts, getProgramAccounts with filters) instead of per-account fetches.
    • Cache layer (CDN/KV) for idempotent reads; invalidate on a slot interval.
    • Use address lookup tables (ALTs) for transactions with many accounts to reduce size and retries.
  5. Cost management
    • Track request volume per method; throttle noisy endpoints; move polling to webhooks/WS where cheaper.
    • Compress/trim logs; prefer enhanced APIs when they reduce query count.
  6. Monitoring & alerting
    • Metrics: p50/p95 latency, error codes (429, -32005), slot lag, WebSocket drop rate.
    • Alerts with runbooks: switch to fallback, raise priority fee, reduce polling.
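
The batching advice in step 4 can be sketched as a request builder that splits a long pubkey list into `getMultipleAccounts` calls and sets the commitment explicitly. This is a rough sketch: the helper names (`chunk`, `buildAccountBatch`) and the choice of `base64` encoding are assumptions for illustration; the cap of 100 pubkeys per call is the limit documented for `getMultipleAccounts`.

```typescript
// JSON-RPC 2.0 request shape accepted by Solana RPC nodes.
interface RpcRequest {
  jsonrpc: "2.0";
  id: number;
  method: string;
  params: unknown[];
}

type Commitment = "processed" | "confirmed" | "finalized";

// getMultipleAccounts accepts at most 100 pubkeys per call.
const MAX_ACCOUNTS_PER_CALL = 100;

/** Split a list into consecutive chunks of at most `size` items. */
function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

/** Build one getMultipleAccounts request per chunk, with an explicit commitment. */
function buildAccountBatch(pubkeys: string[], commitment: Commitment): RpcRequest[] {
  return chunk(pubkeys, MAX_ACCOUNTS_PER_CALL).map(
    (keys, i): RpcRequest => ({
      jsonrpc: "2.0",
      id: i,
      method: "getMultipleAccounts",
      // Encoding is an illustrative choice; pick whatever the consumer decodes.
      params: [keys, { commitment, encoding: "base64" }],
    }),
  );
}
```

For 250 accounts this yields three requests instead of 250 `getAccountInfo` calls. Whether the resulting array can be sent as a single JSON-RPC batch, and how it is billed, depends on the provider.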

Templates / Playbooks

  • Health check thresholds: error rate >3% or slot lag >30 slots for 2 mins -> failover.
  • Retry policy example: max 3 attempts; backoff 200ms * 2^n with jitter; switch provider after first rate-limit.
  • Env layout:
    • RPC_READ_PRIMARY=https://...
    • RPC_READ_FALLBACK=https://...
    • RPC_WRITE=https://...
    • RPC_WS=wss://...
  • Quick cost estimate: requests/day * price per 1M requests / 1,000,000 gives the daily cost; multiply by the days in the billing period for a monthly figure.
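
The retry policy example above (max 3 attempts, 200ms * 2^n backoff with jitter, switch provider after the first rate limit) might look roughly like the sketch below. Everything named here is hypothetical: `withRetry`, `backoffDelayMs`, and the injected `isRateLimit` and `sleep` callbacks are illustration only, and the "equal jitter" variant is an assumption, since the playbook only says "with jitter".

```typescript
const MAX_ATTEMPTS = 3;
const BASE_DELAY_MS = 200;

/**
 * Delay before retry n (0-based). The ceiling is 200ms * 2^n; equal jitter
 * picks a value between half the ceiling and the ceiling.
 */
function backoffDelayMs(n: number, random: () => number = Math.random): number {
  const ceiling = BASE_DELAY_MS * Math.pow(2, n);
  return Math.round(ceiling / 2 + random() * (ceiling / 2));
}

/**
 * Run `call` against one endpoint at a time, with bounded retries.
 * `endpoints` must be non-empty. `isRateLimit` is supplied by the caller
 * (for example, a check for HTTP 429) so this sketch stays transport-agnostic.
 */
async function withRetry<T>(
  endpoints: string[],
  call: (endpoint: string) => Promise<T>,
  isRateLimit: (err: unknown) => boolean,
  sleep: (ms: number) => Promise<void>,
  random: () => number = Math.random,
): Promise<T> {
  let endpointIndex = 0;
  let lastError: unknown;
  for (let attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
    try {
      return await call(endpoints[endpointIndex % endpoints.length]);
    } catch (err) {
      lastError = err;
      // Switch provider as soon as the current one rate-limits us.
      if (isRateLimit(err)) endpointIndex += 1;
      // Back off before the next attempt, but not after the final one.
      if (attempt < MAX_ATTEMPTS - 1) await sleep(backoffDelayMs(attempt, random));
    }
  }
  throw lastError; // attempts exhausted: surface the last error, never loop forever
}
```

In production `sleep` would wrap `setTimeout`; injecting it keeps the policy testable without real delays.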

Common Failure Modes + Debugging

  • 429 / rate limit: reduce concurrency, add caching, switch endpoint tier.
  • Stale blockhash causing tx drop: refresh via getLatestBlockhash with an explicit commitment; add priority fee.
  • Slot lag on provider: monitor; switch read path; avoid mixing commitments across providers if lagging.
  • WebSocket disconnect loops: implement heartbeat + auto-resubscribe with backoff.
  • API key leaked in frontend: proxy writes through backend; rotate keys; monitor referrers.
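
For the slot-lag and error-rate cases, the failover decision can be sketched as a check over recent health samples, using the playbook thresholds (error rate >3% or slot lag >30 slots for 2 minutes). The `HealthSample` shape and function names are hypothetical. Two assumptions are made: the 2-minute window applies to both conditions, and slot lag is measured against a reference such as the highest slot reported by any configured provider.

```typescript
interface HealthSample {
  timestampMs: number; // when the probe ran
  errorRate: number;   // fraction of failed requests in the probe window, 0..1
  slotLag: number;     // reference slot minus this provider's slot
}

const ERROR_RATE_LIMIT = 0.03;      // 3%
const SLOT_LAG_LIMIT = 30;          // slots
const SUSTAIN_MS = 2 * 60 * 1000;   // 2 minutes

/** A single sample is unhealthy if either threshold is exceeded. */
function isUnhealthy(s: HealthSample): boolean {
  return s.errorRate > ERROR_RATE_LIMIT || s.slotLag > SLOT_LAG_LIMIT;
}

/**
 * True when the most recent unbroken run of unhealthy samples has lasted
 * at least 2 minutes. `samples` must be sorted oldest to newest.
 */
function shouldFailover(samples: HealthSample[], nowMs: number): boolean {
  let streakStart: number | null = null;
  // Walk back from the newest sample until the first healthy one.
  for (let i = samples.length - 1; i >= 0; i--) {
    if (!isUnhealthy(samples[i])) break;
    streakStart = samples[i].timestampMs;
  }
  return streakStart !== null && nowMs - streakStart >= SUSTAIN_MS;
}
```

Requiring a sustained breach avoids flapping between providers on a single slow probe; a healthy sample resets the streak.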

Quality Bar / Validation

  • Config includes at least primary + one fallback with health logic.
  • Timeouts and retries defined per operation type; no infinite retries.
  • Metrics/alerts wired with documented thresholds.
  • Keys not exposed in client bundles; env documented.
  • Load test or simulation shows graceful degradation under throttling.

Output Format

Return:
  • Provider comparison table + chosen stack
  • Endpoint map (read/write/ws) with commitments and timeouts
  • Retry/failover policy description
  • Monitoring plan with metrics + alert thresholds
  • Cost estimate and knobs to stay within budget
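
For the cost estimate item, the quick formula from the playbook (requests/day * price per 1M / 1,000,000) can be turned into a few helpers. The function names and the 30-day month are assumptions for illustration, and prices are whatever the chosen provider actually charges.

```typescript
/** Daily cost: requests/day * price per 1M requests / 1,000,000. */
function dailyCost(requestsPerDay: number, pricePerMillion: number): number {
  return (requestsPerDay * pricePerMillion) / 1_000_000;
}

/** Monthly projection; 30 days is an assumed billing period. */
function monthlyCost(requestsPerDay: number, pricePerMillion: number, days = 30): number {
  return dailyCost(requestsPerDay, pricePerMillion) * days;
}

/** Largest daily request volume that stays within a monthly cap. */
function maxRequestsPerDay(monthlyCap: number, pricePerMillion: number, days = 30): number {
  return Math.floor(((monthlyCap / days) * 1_000_000) / pricePerMillion);
}
```

With a hypothetical price of 4 per million requests and 2,000,000 requests/day, `dailyCost` gives 8 and `monthlyCost` gives 240; `maxRequestsPerDay(300, 4)` gives 2,500,000, which can serve as a throttle target.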

Examples

  • Simple: Frontend-only meme site
    • Read-only endpoint via public RPC; cached responses; writes proxied to single paid endpoint; fallback public RPC for reads; minimal telemetry.
  • Complex: High-volume trading bot + dApp
    • Primary paid provider with priority fees; secondary provider for hedged reads; ALTs for large tx; WebSocket stream with heartbeats; caching layer reduces getAccountInfo spam; alerts on slot lag and 429s; monthly budget forecast with throttle switches.