Translation Comparison - fortify

MANDATORY PREPARATION

Invoke {{command_prefix}}agent-workflow, which contains the workflow principles, anti-patterns, and the Context Gathering Protocol. Follow the protocol before proceeding: if no workflow context exists yet, you MUST run {{command_prefix}}teach-maestro first. Consult the guardrails-safety reference in the agent-workflow skill for defense-in-depth patterns and error boundary design.

Make the workflow resilient. Every external call (model APIs, tools, databases, third-party services) will fail eventually. Fortify ensures the workflow handles those failures gracefully.

Fortification Layers

Layer 1: Input Validation

Validate all inputs before processing
Return clear error messages for invalid input
Set size limits on all input fields
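
The three rules above can be sketched as a single validation function. This is a minimal illustration, not a prescribed implementation: the payload shape (`prompt`, `top_k`), the `ValidationError` type, and the limits are all hypothetical.

```python
MAX_PROMPT_CHARS = 8_000  # assumed size limit; pick one that fits the workflow


class ValidationError(ValueError):
    """Carries a message that is safe to show to the caller."""


def validate_request(payload):
    """Validate a hypothetical {"prompt": str, "top_k": int} payload before any processing."""
    if not isinstance(payload, dict):
        raise ValidationError("payload must be an object")
    prompt = payload.get("prompt")
    if not isinstance(prompt, str) or not prompt.strip():
        raise ValidationError("'prompt' is required and must be a non-empty string")
    if len(prompt) > MAX_PROMPT_CHARS:
        raise ValidationError(
            f"'prompt' exceeds {MAX_PROMPT_CHARS} characters (got {len(prompt)})"
        )
    top_k = payload.get("top_k", 5)
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(top_k, bool) or not isinstance(top_k, int) or not 1 <= top_k <= 50:
        raise ValidationError("'top_k' must be an integer between 1 and 50")
    return {"prompt": prompt.strip(), "top_k": top_k}
```

Each error message names the field and the violated rule, so the caller can fix the input without reading logs.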

Layer 2: Retry with Backoff

For transient failures (network errors, rate limits, timeouts):

yaml

Retry strategy:
  max_retries: 3
  initial_delay: 1s
  backoff_multiplier: 2
  max_delay: 30s
  retryable_errors: [429, 500, 502, 503, 504, TIMEOUT, CONNECTION_ERROR]
  non_retryable_errors: [400, 401, 403, 404]
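
One way to read the strategy above as code is sketched below. The `CallError` type and the `call_with_retry` helper are hypothetical names introduced for illustration; only the numbers and error codes come from the configuration. Jitter is often added in practice but is omitted here because the configuration does not mention it.

```python
import time

RETRYABLE = {429, 500, 502, 503, 504, "TIMEOUT", "CONNECTION_ERROR"}


class CallError(Exception):
    """Hypothetical error carrying an HTTP status or a symbolic code."""

    def __init__(self, code):
        super().__init__(f"call failed: {code}")
        self.code = code


def backoff_delays(max_retries=3, initial_delay=1.0, multiplier=2.0, max_delay=30.0):
    """Delay before each retry: initial_delay * multiplier**i, capped at max_delay."""
    return [min(initial_delay * multiplier ** i, max_delay) for i in range(max_retries)]


def call_with_retry(fn, sleep=time.sleep, **policy):
    """Call fn, retrying retryable CallErrors; anything else propagates at once."""
    delays = backoff_delays(**policy)
    for attempt in range(len(delays) + 1):
        try:
            return fn()
        except CallError as err:
            # Give up on non-retryable codes or when retries are exhausted.
            if err.code not in RETRYABLE or attempt == len(delays):
                raise
            sleep(delays[attempt])
```

With the defaults, `backoff_delays()` yields waits of 1, 2, and 4 seconds. The `sleep` parameter is injectable so the policy can be exercised in tests without real waiting.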

Layer 3: Fallback Responses

When retries are exhausted:

Use a cached previous response (if applicable)
Use a simpler/cheaper model as fallback
Return a graceful degradation response
Escalate to human review
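
The options above can be chained in order. The sketch below assumes that `primary` already includes the retry logic, and that `cheaper_model`, `cache`, and `escalate` are caller-supplied; all of these names are hypothetical, and the ordering is one reasonable choice rather than the only one.

```python
def respond_with_fallbacks(query, primary, cheaper_model, cache, escalate):
    """Try the primary call, then each fallback in turn."""
    try:
        answer = primary(query)
        cache[query] = answer  # remember successes for later fallback use
        return {"source": "primary", "answer": answer}
    except Exception as primary_err:
        # 1. Cached previous response, if one applies to this query.
        if query in cache:
            return {"source": "cache", "answer": cache[query]}
        # 2. Simpler/cheaper model.
        try:
            return {"source": "fallback_model", "answer": cheaper_model(query)}
        except Exception:
            # 3 and 4. Degrade gracefully and escalate to human review.
            escalate(query, primary_err)
            return {
                "source": "degraded",
                "answer": "We could not complete this request; it has been sent for review.",
            }
```

Returning the `source` alongside the answer lets downstream steps and logs tell a fresh response from a cached or degraded one.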

Layer 4: Circuit Breakers

When a service is consistently failing:

yaml

Circuit breaker:
  failure_threshold: 5 consecutive failures
  state: CLOSED → OPEN (after threshold) → HALF_OPEN (after cooldown)
  cooldown: 60 seconds
  half_open_max_requests: 1
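
A minimal single-threaded sketch of the state machine above follows. The configuration does not say what happens after HALF_OPEN, so this assumes the common convention: a successful probe closes the circuit and a failed probe reopens it. The class and exception names are hypothetical, and the code is not thread-safe.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected without reaching the service."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, cooldown=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.clock = clock  # injectable so tests can control time
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, fn):
        if self.state == "OPEN":
            if self.clock() - self.opened_at < self.cooldown:
                raise CircuitOpenError("circuit open; skipping call")
            # Cooldown elapsed: let a single probe request through.
            self.state = "HALF_OPEN"
        try:
            result = fn()
        except Exception:
            self.failures += 1
            # A failed probe, or too many consecutive failures, opens the circuit.
            if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
                self.state = "OPEN"
                self.opened_at = self.clock()
            raise
        # Any success resets the consecutive-failure count and closes the circuit.
        self.state = "CLOSED"
        self.failures = 0
        return result
```

Because `call` is synchronous, at most one request runs in HALF_OPEN, which matches `half_open_max_requests: 1`. A concurrent version would need a lock or a counter to enforce that limit.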

Layer 5: Timeout Controls

Every external call needs a timeout:

Model API calls: 30-120s depending on task
Tool executions: 10-60s depending on tool
Database queries: 5-15s
Third-party APIs: 10-30s
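
As a rough sketch, the budgets above can be kept in one table and enforced by a wrapper. The table uses the upper bound of each range as a default, which is an assumption; the `call_with_timeout` helper and the category names are hypothetical. Prefer a client library's own timeout parameter when it has one.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# Default budgets in seconds (upper bounds of the ranges above); tune per task.
TIMEOUTS = {"model_api": 120, "tool": 60, "database": 15, "third_party_api": 30}

_pool = ThreadPoolExecutor(max_workers=4)


def call_with_timeout(kind, fn, *args, timeout=None):
    """Run fn(*args) in a worker thread and stop waiting after the budget for `kind`."""
    limit = TIMEOUTS[kind] if timeout is None else timeout
    future = _pool.submit(fn, *args)
    try:
        return future.result(timeout=limit)
    except FutureTimeout:
        # cancel() only helps if the task has not started; a running thread
        # cannot be interrupted, so the caller stops waiting but the work may continue.
        future.cancel()
        raise TimeoutError(f"{kind} call exceeded {limit}s") from None
```

The wrapper bounds how long the workflow waits, not how long the underlying work runs, so it complements rather than replaces timeouts set on the network client itself.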

Fortification Audit

For each component, verify:

Input validation present
Retry logic for transient failures
Fallback for when retries fail
Timeout set
Error logged with context
User gets a meaningful error (not a stack trace)
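
The checklist above can be tracked per component with a small helper. This is only an illustrative sketch: the item identifiers and the `audit_gaps` function are hypothetical, and deciding whether an item is satisfied still requires reading the component.

```python
AUDIT_ITEMS = (
    "input_validation",
    "retry_logic",
    "fallback",
    "timeout",
    "error_logging",
    "meaningful_user_error",
)


def audit_gaps(components):
    """Map each component name to the audit items it has not yet satisfied."""
    gaps = {}
    for name, satisfied in components.items():
        missing = [item for item in AUDIT_ITEMS if item not in satisfied]
        if missing:
            gaps[name] = missing
    return gaps
```

For example, `audit_gaps({"search_tool": {"timeout", "retry_logic"}})` reports that `search_tool` still lacks input validation, a fallback, error logging, and a meaningful user-facing error.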

Recommended Next Step

After fortification, run {{command_prefix}}evaluate to verify error handling works under realistic failure scenarios.

NEVER:

Retry non-retryable errors (authentication failures, validation errors)
Retry without backoff (you'll make the problem worse)
Swallow errors silently (log and handle, don't ignore)
Set infinite timeouts (a stalled call will hang forever)
Skip the fallback (retries exhausted with no fallback = user sees an error)
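
As a sketch of "log and handle, don't ignore", the wrapper below logs a failure with context and returns a message the user can act on instead of a stack trace. The `run_step` helper, the logger name, and the result shape are hypothetical.

```python
import logging
import uuid

logger = logging.getLogger("workflow")


def run_step(step_name, fn, *args):
    """Run one workflow step; log failures with context and return a user-safe result."""
    try:
        return {"ok": True, "value": fn(*args)}
    except Exception:
        error_id = uuid.uuid4().hex[:8]
        # logger.exception writes the stack trace to the log, not to the response.
        logger.exception(
            "step failed", extra={"step": step_name, "error_id": error_id}
        )
        return {
            "ok": False,
            "error": f"The '{step_name}' step failed. Reference: {error_id}.",
        }
```

The short reference ID ties the user-facing message to the log entry, so support can find the full trace without exposing it to the user.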
