fortify


MANDATORY PREPARATION

Invoke {{command_prefix}}agent-workflow — it contains workflow principles, anti-patterns, and the Context Gathering Protocol. Follow the protocol before proceeding — if no workflow context exists yet, you MUST run {{command_prefix}}teach-maestro first. Consult the guardrails-safety reference in the agent-workflow skill for defense-in-depth patterns and error boundary design.

Make the workflow resilient. Every external call will fail eventually — model APIs, tools, databases, third-party services. Fortify ensures the workflow handles failure gracefully.

Fortification Layers

Layer 1: Input Validation
  • Validate all inputs before processing
  • Return clear error messages for invalid input
  • Set size limits on all input fields
Layer 2: Retry with Backoff
For transient failures (network errors, rate limits, timeouts):
yaml
Retry strategy:
  max_retries: 3
  initial_delay: 1s
  backoff_multiplier: 2
  max_delay: 30s
  retryable_errors: [429, 500, 502, 503, 504, TIMEOUT, CONNECTION_ERROR]
  non_retryable_errors: [400, 401, 403, 404]
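As a rough sketch, the retry strategy above could be applied in Python as follows. The `CallFailed` exception and the `call_with_retry` helper are hypothetical names introduced for illustration; a real client library will raise its own error types, which would need to be mapped to retryable and non-retryable classes.

```python
import time

# Status codes treated as transient, taken from the retry strategy above.
RETRYABLE_STATUS = {429, 500, 502, 503, 504}


class CallFailed(Exception):
    """Hypothetical error type carrying an HTTP-style status code."""

    def __init__(self, status):
        super().__init__(f"call failed with status {status}")
        self.status = status


def is_retryable(exc):
    # Timeouts and connection errors are transient; so are the listed statuses.
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return True
    return isinstance(exc, CallFailed) and exc.status in RETRYABLE_STATUS


def backoff_delays(max_retries=3, initial_delay=1.0, multiplier=2.0, max_delay=30.0):
    """Delay before each retry: initial_delay * multiplier**i, capped at max_delay."""
    return [min(initial_delay * multiplier**i, max_delay) for i in range(max_retries)]


def call_with_retry(fn, *, max_retries=3, initial_delay=1.0, multiplier=2.0,
                    max_delay=30.0, sleep=time.sleep):
    """Call fn(); retry transient failures with exponential backoff."""
    delays = backoff_delays(max_retries, initial_delay, multiplier, max_delay)
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception as exc:
            # Non-retryable errors and exhausted retries propagate to the caller,
            # where the fallback layer can take over.
            if not is_retryable(exc) or attempt == max_retries:
                raise
            sleep(delays[attempt])
```

The `sleep` parameter is injectable so the behavior can be tested without waiting. Adding random jitter to each delay is a common refinement that the strategy above does not specify.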
Layer 3: Fallback Responses
When retries are exhausted:
  • Use a cached previous response (if applicable)
  • Use a simpler/cheaper model as fallback
  • Return a graceful degradation response
  • Escalate to human review
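One possible way to chain these fallbacks is sketched below. The ordering (cache, then cheaper model, then degraded response plus escalation) is an assumption, as are all names: `primary`, `cheaper_model`, and `escalate` stand in for callables the workflow would supply, and `cache` is any dict-like store.

```python
import logging

logger = logging.getLogger("fortify.fallback")

# Hypothetical degraded response shown when every other option has failed.
DEGRADED_MESSAGE = "The service is temporarily limited. Your request was forwarded for review."


def respond_with_fallback(prompt, primary, cache, cheaper_model, escalate):
    """Return (source, response), walking the fallback chain on failure.

    `primary` is expected to have already applied its own retries.
    """
    try:
        result = primary(prompt)
        cache[prompt] = result  # remember good answers for later fallback
        return "primary", result
    except Exception as exc:
        logger.warning("primary call failed after retries: %r", exc)

    # 1. A cached previous response, if one applies to this prompt.
    if prompt in cache:
        return "cache", cache[prompt]

    # 2. A simpler or cheaper model.
    try:
        return "cheaper_model", cheaper_model(prompt)
    except Exception as exc:
        logger.warning("cheaper model failed: %r", exc)

    # 3. Graceful degradation, plus escalation to human review.
    escalate(prompt)
    return "degraded", DEGRADED_MESSAGE
```

Returning the source alongside the response lets callers and logs distinguish a full answer from a degraded one.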
Layer 4: Circuit Breakers
When a service is consistently failing:
yaml
Circuit breaker:
  failure_threshold: 5 consecutive failures
  state: CLOSED → OPEN (after threshold) → HALF_OPEN (after cooldown)
  cooldown: 60 seconds
  half_open_max_requests: 1
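The state machine above can be sketched as a small class. This is a single-threaded illustration under stated assumptions: `CircuitBreaker` and `CircuitOpenError` are hypothetical names, and the one-probe limit in HALF_OPEN holds only because calls run one at a time; concurrent callers would need a lock.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, cooldown=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.clock = clock  # injectable so tests can control time
        self.state = "CLOSED"
        self.failures = 0  # consecutive failures since the last success
        self.opened_at = 0.0

    def call(self, fn):
        if self.state == "OPEN":
            if self.clock() - self.opened_at < self.cooldown:
                raise CircuitOpenError("circuit open; call skipped")
            self.state = "HALF_OPEN"  # cooldown elapsed: allow one probe request

        try:
            result = fn()
        except Exception:
            self._record_failure()
            raise

        # Any success closes the circuit and resets the failure count.
        self.state = "CLOSED"
        self.failures = 0
        return result

    def _record_failure(self):
        self.failures += 1
        # A failed probe reopens immediately; otherwise open at the threshold.
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"
            self.opened_at = self.clock()
```

While the circuit is open, calls fail fast with `CircuitOpenError` instead of waiting on a service that is known to be down, which is the point at which a fallback response is typically returned.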
Layer 5: Timeout Controls
Every external call needs a timeout:
  • Model API calls: 30-120s depending on task
  • Tool executions: 10-60s depending on tool
  • Database queries: 5-15s
  • Third-party APIs: 10-30s
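Where a client library exposes its own timeout parameter, that is usually the better option. For calls that do not, a generic wrapper can be sketched with the standard library. The `call_with_timeout` helper and the `DEFAULT_TIMEOUTS` table are hypothetical; the defaults simply take the upper bound of each range above.

```python
import concurrent.futures

# Hypothetical defaults in seconds: the upper bound of each range listed above.
DEFAULT_TIMEOUTS = {
    "model_api": 120.0,
    "tool": 60.0,
    "database": 15.0,
    "third_party_api": 30.0,
}


def call_with_timeout(fn, kind, timeout=None):
    """Run fn() in a worker thread and stop waiting after the time limit.

    Caveat: Python cannot kill the worker thread, so fn may keep running in
    the background after the timeout. The caller only stops waiting for it.
    """
    limit = DEFAULT_TIMEOUTS[kind] if timeout is None else timeout
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(fn)
    try:
        return future.result(timeout=limit)
    except concurrent.futures.TimeoutError:
        raise TimeoutError(f"{kind} call exceeded {limit}s") from None
    finally:
        # wait=False so a stuck call does not block the caller on shutdown.
        executor.shutdown(wait=False)
```

A timeout raised here is a transient failure, so it can feed directly into the retry and fallback layers.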

Fortification Audit

For each component, verify:
  • Input validation present
  • Retry logic for transient failures
  • Fallback for when retries fail
  • Timeout set
  • Error logged with context
  • User gets a meaningful error (not a stack trace)
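The last two checklist items (errors logged with context, and a meaningful message for the user) can be illustrated with a small error boundary. This is a sketch under assumptions: `run_step`, the log fields, and the wording of the user message are all hypothetical.

```python
import logging
import uuid

logger = logging.getLogger("fortify.errors")


def run_step(step_name, fn, payload):
    """Run one workflow step; log failures with context, return a safe message."""
    try:
        return {"ok": True, "result": fn(payload)}
    except Exception:
        # A short reference ID ties the user-facing message to the log entry.
        error_id = uuid.uuid4().hex[:8]
        # logger.exception records the stack trace in the log, not in the response.
        logger.exception(
            "step failed: step=%s error_id=%s payload_keys=%s",
            step_name, error_id, sorted(payload),
        )
        return {
            "ok": False,
            "error": f"We could not complete '{step_name}'. "
                     f"Please try again later (reference {error_id}).",
        }
```

Logging only the payload keys, rather than the values, is a deliberate choice here to keep potentially sensitive input out of the logs.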

Recommended Next Step

After fortification, run {{command_prefix}}evaluate to verify error handling works under realistic failure scenarios.

NEVER:
  • Retry non-retryable errors (authentication failures, validation errors)
  • Retry without backoff (you'll make the problem worse)
  • Swallow errors silently (log and handle, don't ignore)
  • Set infinite timeouts (they'll hang forever)
  • Skip the fallback (retries exhausted with no fallback = user sees an error)