n8n-production-readiness
n8n Production Readiness
Match workflow hardening to actual risk. Not every workflow needs the same rigor.
"The workflow you build locally... that's maybe 20% of what you actually need. The other 80% is security, validation, logging, and error handling."
How to Use This Skill (AI Instructions)
The tier system adapts to the user. Some users want to specify a tier, others just want to build. Support both modes while always being ready to recommend changes based on what you observe.
After the Initial Prompt: Ask Once
After the user's first message about an n8n workflow, ask once to establish the tier — but make it easy to skip:
"Quick question before we dive in — what level of hardening does this workflow need?
- Tier 1 (Internal/Prototype): Quick and simple. Basic error handling, minimal setup.
- Tier 2 (Production): Client-facing or business-critical. Full validation, logging, proper error responses.
- Tier 3 (Mission-Critical): Payments, compliance, high-volume. Everything plus monitoring, rollback plans, idempotency.
Or just say 'autopilot' and I'll figure it out as we go based on what you're building."
If the user picks a tier: Build to that tier. Respect their choice, but still monitor for signals that suggest they need to adjust (see Tier Change Recommendations below).
If the user says "autopilot" or skips the question: Infer the tier silently from context and adapt as you go. Don't mention tiers again unless recommending a change.
If the user doesn't respond to the tier question: Assume autopilot mode and proceed.
Respecting Explicit Tier Requests
If a user explicitly asks for a tier (now or later in the conversation):
- Build exactly to that tier's specifications
- If they ask "what's in Tier 2?" — explain it
- If they say "make this Tier 3" — upgrade accordingly
- If they say "just keep it Tier 1" — simplify and remove hardening
Users who understand tiers get full control. Don't second-guess them on every decision — but still flag concerns if you observe serious issues.
Tier Lookup Requests
If a user asks to see the tiers, explain them, or wants to understand the differences, provide the full breakdown:
Tier 1 (Internal/Prototype)
- Basic null checks, simple try-catch
- n8n's built-in error handling
- ~10% extra build time
- Example: Slack notifications, personal automations
Tier 2 (Production)
- Full input validation at entry point
- External logging (Supabase/Postgres)
- Proper HTTP status codes (400, 401, 403, 404, 500)
- Error notifications to team
- Pre-deployment breaking tests
- ~80% of total build effort goes to hardening (core logic is the other ~20%)
- Example: Client form handlers, business integrations
Tier 3 (Mission-Critical)
- Everything in Tier 2, plus:
- Monitoring dashboards, alerting (PagerDuty)
- Idempotency keys, rate limiting
- Rollback strategy, audit logging
- 2-3x build time
- Example: Payment processing, HIPAA compliance
Autopilot Mode: Silent Tier Assessment
When in autopilot (or if the user skipped tier selection), infer the tier from context clues:
| Context Clues | Inferred Tier | Your Approach |
|---|---|---|
| "quick", "test", "just for me", "internal", "prototype", "playing around" | Tier 1 | Build fast, basic error handling only |
| "client", "customer", "users", "production", "deploy", "business", "launch" | Tier 2 | Include validation, logging, status codes |
| "payment", "Stripe", "checkout", "HIPAA", "compliance", "SLA", "enterprise", "high volume", "can't fail" | Tier 3 | Full hardening + discuss monitoring/ops |
| Ambiguous or no clear signals | Tier 1 | Start simple, monitor for escalation triggers |
In autopilot mode, don't announce the tier. Just build appropriately and intervene if a tier change is needed.
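The context-clue table above could be approximated with a simple keyword match. This is only a rough sketch: the keyword lists are copied from the table, while the function names (`toPattern`, `inferTier`) and the "highest matching tier wins" rule are assumptions made for illustration, not part of n8n or this skill.

```javascript
// Keyword lists copied from the context-clue table; checked from highest tier down.
const TIER_SIGNALS = [
  { tier: 3, keywords: ["payment", "Stripe", "checkout", "HIPAA", "compliance", "SLA", "enterprise", "high volume", "can't fail"] },
  { tier: 2, keywords: ["client", "customer", "users", "production", "deploy", "business", "launch"] },
  { tier: 1, keywords: ["quick", "test", "just for me", "internal", "prototype", "playing around"] },
];

// Whole-word, case-insensitive pattern so "SLA" does not match inside "Slack".
function toPattern(keyword) {
  const escaped = keyword.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  return new RegExp(`\\b${escaped}\\b`, "i");
}

// Returns the highest tier whose keywords appear in the message.
// Ambiguous or empty messages fall back to Tier 1, as in the last table row.
function inferTier(message) {
  const text = String(message ?? "");
  for (const { tier, keywords } of TIER_SIGNALS) {
    if (keywords.some((keyword) => toPattern(keyword).test(text))) {
      return tier;
    }
  }
  return 1;
}
```

For example, `inferTier("Quick Slack notification, just for me")` returns `1`, and `inferTier("Deploy a Stripe checkout for my client")` returns `3` because the highest matching tier wins. Real conversations need more judgment than keyword matching, so treat this as a starting heuristic only.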
Tier Change Recommendations (Up or Down)
Whether the user picked a tier or is in autopilot, monitor the conversation and recommend tier changes in either direction when the task and outcome warrant it.
Recommend Moving UP a Tier
Tier 1 → Tier 2 triggers:
Troubleshooting is going in circles:
- 3+ back-and-forth messages on the same issue
- User repeats "it's still not working" or "same problem"
- You're guessing at what the data looks like
→ Recommend: "We're spending a lot of time guessing. I'd recommend adding logging so we can see exactly what data is coming in and where it's failing. This is a Tier 2 pattern — want me to add it?"
Silent or unclear failures:
- User says "it just doesn't work" or "nothing happens"
- User says "sometimes it works, sometimes it doesn't"
- No error messages to work with
→ Recommend: "The workflow is failing silently, which makes this hard to diagnose. I'd suggest adding entry-point validation and external logging — that way we'll see exactly what's coming in and where it breaks. Want me to upgrade this to Tier 2 patterns?"
User mentions deployment or sharing:
- "ready to deploy", "going live", "share with the team"
- "my client", "users will", "in production"
- Moving from test to real environment
→ Recommend: "Since this is going to production, I'd recommend hardening it first — input validation, proper error responses, and logging. It's the difference between 'works on my machine' and 'survives in the wild.' Should I add Tier 2 patterns?"
The "null vs empty string" pattern:
- Data that "should" be there is missing
- Workflow processes but outputs garbage
- Inconsistent behavior with same inputs
→ Recommend: "This looks like a data shape issue — probably null values or missing fields slipping through. Tier 2 validation would catch this at the entry point. Want me to add it?"
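To make the "null vs empty string" pattern concrete, here is a minimal sketch of a helper that reports which required fields are missing and in what way. The function names and the sample field names are hypothetical; in an n8n Code node the payload would typically come from the incoming item.

```javascript
// Distinguishes the states that "missing" data usually hides behind.
function classifyValue(value) {
  if (value === undefined) return "undefined";
  if (value === null) return "null";
  if (typeof value === "string" && value.trim() === "") return "empty";
  return "present";
}

// Lists every required field that is not present, together with its state.
function findMissingFields(payload, requiredFields) {
  const source = payload ?? {};
  return requiredFields
    .map((field) => ({ field, state: classifyValue(source[field]) }))
    .filter(({ state }) => state !== "present");
}
```

Calling `findMissingFields({ email: "", name: null, id: 7 }, ["email", "name", "id", "phone"])` reports `email` as `"empty"`, `name` as `"null"`, and `phone` as `"undefined"`, which is usually enough to explain why a workflow "processes but outputs garbage".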
Tier 2 → Tier 3 triggers:
Scale or volume concerns:
- User mentions request counts (1000+/day)
- User asks about performance or speed
- User asks "what if I get a lot of traffic?"
→ Recommend: "At that volume, you'll want Tier 3 patterns — rate limiting, idempotency handling for duplicate requests, and queue-based processing for traffic spikes. Want me to walk through what that looks like?"
Reliability concerns:
- User asks "what if [service] goes down?"
- User asks about retries or fallbacks
- User mentions uptime requirements or SLAs
→ Recommend: "For that level of reliability, I'd recommend Tier 3 — monitoring so you know when something's wrong before users tell you, intelligent retries, and a rollback plan. Should we upgrade?"
Financial or compliance context:
- Payment processing, billing, invoicing
- Healthcare, legal, financial data
- User mentions audits or compliance requirements
→ Recommend: "Since this handles [payments/sensitive data], I'd strongly recommend Tier 3 patterns — idempotency keys so duplicate requests don't double-charge, audit logging for compliance, and error handling that doesn't expose sensitive data. This is important — want me to add it?"
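One way to read "intelligent retries" is: retry only failures that are likely to be transient, and wait longer between attempts. The sketch below assumes exponential backoff with a cap and assumes that HTTP 429 and 5xx responses are the retryable ones; both function names are hypothetical.

```javascript
// Exponential backoff schedule: baseMs, 2*baseMs, 4*baseMs, ... capped at maxMs.
function backoffDelays(maxRetries, baseMs, maxMs) {
  return Array.from({ length: maxRetries }, (_, attempt) =>
    Math.min(baseMs * 2 ** attempt, maxMs)
  );
}

// Retry rate-limit and server errors; other 4xx responses will not improve on retry.
function shouldRetry(status) {
  return status === 429 || (status >= 500 && status <= 599);
}
```

With these assumptions, `backoffDelays(5, 500, 4000)` yields `[500, 1000, 2000, 4000, 4000]`, and `shouldRetry(400)` is `false` because a malformed request stays malformed no matter how often it is resent.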
Recommend Moving DOWN a Tier
Tier 3 → Tier 2 triggers:
Over-engineered for actual use case:
- User's volume is actually low (< 1000/day)
- No real compliance requirements after clarification
- User is spending too much time on ops concerns for a simple workflow
→ Recommend: "Looking at this more, I think we might be over-engineering. If you're not dealing with high volume or strict compliance requirements, we could drop to Tier 2 — still production-ready, but without the monitoring and idempotency overhead. Would that be simpler for your needs?"
User wants to ship faster:
- User expresses frustration with complexity
- User says "this is taking too long"
- Iteration speed is suffering
→ Recommend: "We've been building this with full Tier 3 hardening, but it's slowing us down. Want to drop to Tier 2 for now and add the monitoring/idempotency later once the core logic is stable?"
Tier 2 → Tier 1 triggers:
It's actually just a prototype:
- User reveals "this is just a test" or "I'm experimenting"
- User says "I just want to see if it works"
- The workflow is for personal/internal use only
→ Recommend: "If this is just for testing the concept, we don't need all this validation and logging yet. Want me to simplify to Tier 1? We can always add the production hardening later once you're happy with the logic."
Complexity is blocking progress:
- User is stuck on logging/validation setup, not core logic
- The hardening is more work than the actual workflow
- User seems overwhelmed
→ Recommend: "I think we're getting bogged down in the production hardening before the core workflow is even working. Let's drop to Tier 1, get the basic flow working, and then add validation and logging once we know the logic is right. Sound good?"
Scope changed:
- User initially said "client" but now it's "just for me"
- Requirements relaxed during conversation
- Stakes are lower than initially understood
→ Recommend: "Since this is actually just for internal use, we don't need the full Tier 2 treatment. Want me to strip out the external logging and simplify the error handling? It'll be easier to maintain."
When NOT to Recommend a Change
Don't recommend moving up if:
- User explicitly chose a lower tier and the workflow is working fine
- The issue is a simple bug, not a systemic pattern
- User is in early exploration/prototyping phase
Don't recommend moving down if:
- User explicitly chose a higher tier for good reasons
- The workflow handles sensitive data or money
- User has mentioned compliance or SLA requirements
Trust user judgment, but flag concerns:
"I know you want to keep this at Tier 1, and that's fine for now — just know that once this goes to production, you'll probably want to add logging at minimum. I can help with that when you're ready."
How to Recommend Changes
When suggesting tier changes, be direct but not pushy:
- State what you're observing — the pattern that triggered the recommendation
- Explain the benefit — why this tier's patterns would help
- Ask, don't mandate — "Want me to add it?" / "Should we upgrade?"
- Respect the answer — if they say no, continue at current tier
If the user explicitly chose a tier and you're recommending a change:
"I know you said Tier 1, but we've been debugging this for a while and I think Tier 2 logging would save us time here. Up to you — want me to add it, or keep it simple?"
If in autopilot mode:
"I'm going to add some validation and logging here — we're hitting the kind of silent failures that these patterns are designed to catch. This'll take a few extra minutes but should make debugging much faster."
Lifecycle-Aware Behavior
Adjust your approach based on where the user is in the workflow lifecycle:
| Lifecycle Stage | How to Detect | Your Approach |
|---|---|---|
| Exploring/Prototyping | "trying to figure out", "is this possible?", "how would I..." | Tier 1. Fast, minimal. Get it working first. |
| Building | "build me", "create", "I need a workflow that..." | Ask tier or infer from context. |
| Testing | "let me test", "trying it out", "it works but..." | Stay at current tier. Focus on the specific issue. |
| Debugging | "not working", "error", "broken", "help" | Monitor for escalation triggers. Recommend logging if stuck. |
| Pre-deployment | "ready to deploy", "going live", "production" | Recommend Tier 2 minimum if not already there. |
| Post-deployment issues | "was working, now broken", "users are reporting", "in production" | Tier 2+. Recommend logging immediately to diagnose. |
| Scaling | "more users", "growing", "volume increasing" | Discuss Tier 3 patterns as relevant. |
Tier Definitions (Reference)
Tier 1: Internal / Prototype
Context signals: "quick", "simple", "just for me", "internal", "prototype", "test", "playing around"
What to include:
- Basic null checks (|| {}, || '')
- Simple try-catch around risky operations
- n8n's built-in error handling (Error Trigger → Slack notification)
What to skip:
- External logging database
- Comprehensive input validation
- Full status code handling
- Extensive breaking tests
Example workflows: "New GitHub issue → Slack notification", "Daily weather → personal email", "RSS feed → Discord"
Build time impact: ~10% extra beyond core logic
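A Tier 1 workflow might get by with something as small as the following sketch. The item shape (`body.email`) and both function names are hypothetical; inside an n8n Code node the same checks would be applied to the incoming item's JSON.

```javascript
// Null-safe read with fallback defaults, in the "|| {}" and "|| ''" style.
function extractEmail(item) {
  const body = (item && item.body) || {};
  return body.email || "";
}

// Simple try-catch around a risky operation: parsing text that may not be JSON.
function safeParse(text) {
  try {
    return { ok: true, data: JSON.parse(text) };
  } catch (error) {
    return { ok: false, error: error.message };
  }
}
```

Note that `||` also replaces legitimate falsy values such as `0` or `false` with the default. That is usually acceptable for a prototype, and it is one reason Tier 2 asks for explicit null vs empty string handling instead.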
Tier 2: Production / Client-Facing
Context signals: "client", "customer", "production", "deploy", "users will...", "business", "launch", "going live"
Escalation triggers: Debugging going in circles, silent failures, user mentions deployment, "works sometimes" issues
What to include:
- Full entry-point validation (user exists? auth valid? data shaped right?)
- Explicit null vs empty string handling
- External logging to Supabase/Postgres
- Proper HTTP status codes (400, 401, 403, 404, 500)
- Error notifications to team (Slack/email)
- Pre-deployment breaking tests
- Tests run against a test database before touching production
What to skip:
- Real-time monitoring dashboards
- Automated rollback
- Rate limiting / idempotency (unless high volume)
Example workflows: "Customer form → CRM + email sequence", "Payment webhook → order fulfillment", "AI chatbot for client website"
Build time impact: hardening takes ~80% of total effort; core logic is the remaining ~20% (the 80/20 rule)
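Entry-point validation for a Tier 2 webhook could look roughly like this. It is a sketch under stated assumptions: the request shape (`headers.authorization`, `body.email`, `body.amount`), the bearer-token scheme, and the function name `validateRequest` are all hypothetical, and the returned `status` would still need to be wired into whatever node sends the HTTP response.

```javascript
// Returns { status, error } for a rejected request or { status: 200, data } for a valid one.
function validateRequest(request, expectedToken) {
  const headers = (request && request.headers) || {};
  const body = request ? request.body : undefined;

  // Auth valid? Missing and wrong tokens are both treated as unauthenticated.
  if (headers.authorization !== `Bearer ${expectedToken}`) {
    return { status: 401, error: "Missing or invalid Authorization header" };
  }
  // Data shaped right? Reject null, arrays, and non-objects.
  if (body === null || typeof body !== "object" || Array.isArray(body)) {
    return { status: 400, error: "Body must be a JSON object" };
  }
  // Explicit null vs empty string handling: both are rejected, not silently defaulted.
  if (typeof body.email !== "string" || body.email.trim() === "") {
    return { status: 400, error: "email is required and must be a non-empty string" };
  }
  // Type check: the string "5" is not accepted as a number.
  if (typeof body.amount !== "number" || !Number.isFinite(body.amount)) {
    return { status: 400, error: "amount must be a finite number" };
  }
  return { status: 200, data: { email: body.email.trim(), amount: body.amount } };
}
```

Validating once at the entry point means every later node can assume clean data, which is what makes the "works sometimes" class of bugs much easier to rule out.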
Tier 3: Mission-Critical / High-Volume
Context signals: "payment", "Stripe", "checkout", "HIPAA", "compliance", "SLA", "enterprise", "high volume", "can't fail", "audit"
Escalation triggers: Volume/performance questions, "what if X goes down?", financial transactions, compliance mentions
What to include:
- Everything from Tier 2, plus:
- Real-time monitoring dashboard (Grafana, Datadog)
- Automated alerting with escalation (PagerDuty)
- Idempotency keys for duplicate request handling
- Rate limiting to prevent cascade failures
- Request queuing for traffic spikes
- Rollback strategy (workflow versioning, feature flags)
- Audit logging for compliance
- Regular chaos testing (simulate failures)
- Documented runbooks for incident response
Example workflows: "Stripe payment → inventory + fulfillment + accounting", "HIPAA-compliant patient data sync", "High-traffic API gateway"
Build time impact: 2-3x the core logic development time
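The idempotency-key idea can be sketched as a wrapper that remembers the result for each key and replays it on duplicates. Everything here is an assumption for illustration: the name `createIdempotentHandler`, the synchronous handler, and especially the in-memory `Map`, which stands in for a durable store (for example a database table) that a real Tier 3 workflow would need so that keys survive restarts.

```javascript
// Wraps a handler so that repeated calls with the same key run it only once.
function createIdempotentHandler(handler, store = new Map()) {
  return function handle(idempotencyKey, payload) {
    if (typeof idempotencyKey !== "string" || idempotencyKey === "") {
      throw new Error("An idempotency key is required");
    }
    if (store.has(idempotencyKey)) {
      // Duplicate request: return the stored result instead of repeating side effects.
      return { replayed: true, result: store.get(idempotencyKey) };
    }
    // If the handler throws, nothing is stored, so a later retry can still succeed.
    const result = handler(payload);
    store.set(idempotencyKey, result);
    return { replayed: false, result };
  };
}
```

A typical choice of key is an identifier the sender already provides, such as a webhook event ID, so that a redelivered event maps to the same key and cannot double-charge.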
Tier Escalation Triggers Summary
You've outgrown Tier 1 when:
- You're debugging the same workflow for the third time
- You find yourself saying "it works sometimes"
- You can't tell what data the workflow actually received
- Someone other than you will use or depend on it
- A failure would cause more than minor annoyance
You've outgrown Tier 2 when:
- You're processing 10,000+ requests/day
- You're handling payments or sensitive PII
- You have contractual SLAs
- A 1-hour outage would cause significant revenue loss or legal exposure
- You're asking "what happens if [external service] goes down?"
Downgrade is okay too:
- Built Tier 2 for a prototype, realized it's overkill? Strip it back.
- The goal is right-sized investment, not maximum hardening.
The 80/20 Rule by Tier
| Component | Tier 1 | Tier 2 | Tier 3 |
|---|---|---|---|
| Core logic | 70% | 20% | 10% |
| Validation | 10% | 20% | 15% |
| Error handling | 10% | 20% | 20% |
| Logging | 5% | 20% | 20% |
| Testing | 5% | 20% | 15% |
| Monitoring/Ops | — | — | 20% |
The workflow logic is the easy part. The hard part is everything else — but only invest in "everything else" proportional to your risk.
Pre-Deployment Checklists
Tier 1 Checklist
- Basic null/undefined checks on critical fields
- Try-catch around external API calls
- Error Trigger workflow sends Slack/email on failure
- Tested manually with happy path
Tier 2 Checklist
- Validation: Every input field is validated
- Null handling: Explicitly handle null vs empty string vs undefined
- Type checking: Verify data types match expectations
- Logging: External logging configured at entry, decisions, output, errors
- Error responses: Proper HTTP status codes for all failure modes
- Error notifications: Team gets alerted on failures (Slack, email)
- Empty data test: Workflow handles empty inputs gracefully
- Wrong type test: Workflow rejects malformed data
- Auth test: Missing/invalid auth returns 401/403
- Downstream failure test: External service failures handled
- Test database: All tests run against test environment first
- Documentation: Workflow purpose and data flow documented
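For the logging item in the checklist above, one possible shape for a log row is sketched below. The four stages come from the checklist (entry, decisions, output, errors); the column names and the function name `buildLogEntry` are hypothetical and would need to match your own Supabase/Postgres table. Writing the row to the database is left to the workflow's database node.

```javascript
// Stages named in the Tier 2 checklist.
const LOG_STAGES = ["entry", "decision", "output", "error"];

// Builds one log row; the timestamp is injectable to keep the function testable.
function buildLogEntry(workflowName, executionId, stage, details, now = new Date()) {
  if (!LOG_STAGES.includes(stage)) {
    throw new Error(`Unknown log stage: ${stage}`);
  }
  return {
    workflow_name: workflowName,
    execution_id: executionId,
    stage,
    details: JSON.stringify(details ?? {}),
    logged_at: now.toISOString(),
  };
}
```

Logging field names and states rather than raw values, for example `{ fields: ["email"] }`, is one way to see what arrived without copying sensitive data into the log table.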
Tier 3 Checklist
- Everything from Tier 2, plus:
- Monitoring: Real-time dashboard configured
- Alerting: PagerDuty/escalation set up
- Idempotency: Duplicate requests handled safely
- Rate limiting: Traffic spikes won't cascade
- Rollback plan: Can revert to previous version quickly
- Runbook: Incident response documented
- Load test: Tested at 2-3x expected volume
- Chaos test: Simulated downstream failures
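As an illustration of the rate-limiting item, here is a minimal fixed-window limiter. It is a sketch, not a recommendation of a specific algorithm: the name `createRateLimiter` is hypothetical, and the counters live in process memory, so a setup with several workers would need a shared store instead (an assumption about your deployment).

```javascript
// Fixed-window limiter: at most `limit` calls per `windowMs` for each key.
// The clock is injectable so the behavior can be tested without waiting.
function createRateLimiter(limit, windowMs, clock = Date.now) {
  const windows = new Map(); // key -> { start, count }
  return function allow(key) {
    const now = clock();
    const current = windows.get(key);
    if (!current || now - current.start >= windowMs) {
      // First call for this key, or the previous window has expired.
      windows.set(key, { start: now, count: 1 });
      return true;
    }
    if (current.count < limit) {
      current.count += 1;
      return true;
    }
    return false; // over the limit: the caller should reject or queue the request
  };
}
```

A caller would typically key the limiter by client or API token and answer rejected requests with HTTP 429, so one noisy sender cannot cascade into downstream failures for everyone else.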
Summary: Adaptive Tier Management
- Ask once after initial prompt — let user pick tier or choose autopilot
- Respect explicit tier choices — if they ask for a tier, build to it
- Support tier lookups — explain tiers when asked
- Infer silently in autopilot — don't mention tiers unless recommending a change
- Monitor for tier change triggers — both UP (more hardening needed) and DOWN (over-engineered)
- Recommend, don't mandate — ask permission, respect the answer
- Trust user judgment — but flag serious concerns even if they decline
The goal: Users who know the tiers get control. Users who don't know them get invisible guidance. The AI adapts to the user, the task, and the project, recommending more hardening when complexity demands it and simplification when the hardening is getting in the way.
Related Skills
For implementation patterns, see:
- ../n8n-workflow-patterns/webhook_processing.md — Validation, logging, status codes
- ../n8n-code-javascript/COMMON_PATTERNS.md — Validation and logging code templates
- ../n8n-code-javascript/ERROR_PATTERNS.md — Null handling, silent failure prevention
"Learn this stuff before your first emergency call. I've broken enough things to have most of the answers now."