system-design-generator

System Design Generator

Create comprehensive system architecture plans from requirements.

System Design Document Template

System Design: [Feature/Product Name]

Overview

Brief description of what we're building and why.

Requirements

Functional

  • User can upload videos (max 1GB)
  • System processes video within 5 minutes
  • User receives notification when complete

Non-Functional

  • Handle 1000 uploads/day
  • 99.9% uptime
  • Process videos in <5 minutes (p95)
  • Cost: <$0.50 per video

High-Level Architecture

┌─────────┐      ┌──────────┐      ┌─────────────┐
│ Client  │─────▶│   API    │─────▶│   Upload    │
│         │      │ Gateway  │      │   Service   │
└─────────┘      └──────────┘      └─────────────┘
                                   ┌─────────────┐
                                   │   Storage   │
                                   │    (S3)     │
                                   └─────────────┘
                                   ┌─────────────┐
                                   │ Processing  │◀─┐
                                   │    Queue    │  │
                                   └─────────────┘  │
                                          │         │
                                          ▼         │
                                   ┌─────────────┐  │
                                   │  Processor  │──┘
                                   │   Workers   │
                                   └─────────────┘
                                   ┌─────────────┐
                                   │Notification │
                                   │   Service   │
                                   └─────────────┘

Components

1. API Gateway

Responsibilities:
  • Authentication
  • Rate limiting
  • Request routing
Technology: Kong/AWS API Gateway
Scaling: Auto-scale based on requests/sec

2. Upload Service

Responsibilities:
  • Generate pre-signed S3 URLs
  • Validate file metadata
  • Enqueue processing jobs
API:

POST /uploads
Request: { filename, size, content_type }
Response: { upload_url, upload_id }
Technology: Node.js + Express
Scaling: Horizontal (stateless)
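
The "Validate file metadata" responsibility could look roughly like the sketch below. It assumes the request body carries `size` in bytes, that the 1GB limit from the functional requirements means 1024^3 bytes, and that only `video/*` content types are accepted; `validateUploadRequest` and `MAX_UPLOAD_BYTES` are hypothetical names, not part of the design above.

```typescript
// Hypothetical request shape, mirroring the POST /uploads body above.
interface UploadRequest {
  filename: string;
  size: number; // assumed to be bytes
  content_type: string;
}

// Assumption: "max 1GB" is interpreted as 1024^3 bytes.
const MAX_UPLOAD_BYTES = 1024 * 1024 * 1024;

// Returns a list of validation errors; an empty list means the request is acceptable.
function validateUploadRequest(req: UploadRequest): string[] {
  const errors: string[] = [];
  if (req.filename.trim().length === 0) {
    errors.push("filename must not be empty");
  }
  if (!Number.isInteger(req.size) || req.size <= 0) {
    errors.push("size must be a positive integer number of bytes");
  } else if (req.size > MAX_UPLOAD_BYTES) {
    errors.push("size exceeds the 1 GB limit");
  }
  if (!req.content_type.startsWith("video/")) {
    errors.push("content_type must be a video MIME type");
  }
  return errors;
}
```

Returning a list rather than throwing on the first problem lets the service report every issue in a single response.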

3. Storage (S3)

Responsibilities:
  • Store raw videos
  • Store processed outputs
  • Serve content via CDN
Structure:

/uploads/{user_id}/{upload_id}/original.mp4
/processed/{user_id}/{upload_id}/output.mp4
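
A small helper can keep the path layout above in one place. This is a sketch only: `objectKey` and `StorageArea` are hypothetical names, and the leading slash is kept to match the structure as written (drop it if object keys should not start with `/`).

```typescript
type StorageArea = "uploads" | "processed";

// Builds the object path for an upload, following the layout shown above.
function objectKey(area: StorageArea, userId: string, uploadId: string): string {
  // Reject separators so one ID cannot escape into another user's prefix.
  if (userId.includes("/") || uploadId.includes("/")) {
    throw new Error("IDs must not contain '/'");
  }
  const file = area === "uploads" ? "original.mp4" : "output.mp4";
  return `/${area}/${userId}/${uploadId}/${file}`;
}
```

For example, `objectKey("processed", "u1", "up9")` yields `/processed/u1/up9/output.mp4`.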

4. Processing Queue

Responsibilities:
  • Buffer processing jobs
  • Ensure at-least-once delivery
  • DLQ for failed jobs
Technology: AWS SQS
Configuration:
  • Visibility timeout: 15 minutes
  • DLQ after 3 retries
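
The configuration above maps onto two SQS queue attributes, `VisibilityTimeout` (in seconds) and `RedrivePolicy` (a JSON string holding `deadLetterTargetArn` and `maxReceiveCount`). The sketch below only builds those attribute values; `buildQueueAttributes` is a hypothetical helper and the ARN in the usage note is a placeholder. Note that `maxReceiveCount` counts total deliveries, so whether "3 retries" should be 3 or 4 is an assumption to confirm.

```typescript
// Attribute names follow SQS queue attributes; all values are strings.
interface QueueAttributes {
  VisibilityTimeout: string; // seconds
  RedrivePolicy: string; // JSON-encoded policy
}

function buildQueueAttributes(
  dlqArn: string,
  visibilityMinutes: number,
  maxReceiveCount: number
): QueueAttributes {
  return {
    VisibilityTimeout: String(visibilityMinutes * 60),
    RedrivePolicy: JSON.stringify({
      deadLetterTargetArn: dlqArn,
      maxReceiveCount: maxReceiveCount,
    }),
  };
}
```

Calling `buildQueueAttributes("arn:aws:sqs:us-east-1:123456789012:video-dlq", 15, 3)` produces a `VisibilityTimeout` of `"900"`. The 15-minute timeout is three times the 5-minute processing target, which leaves room for slow jobs before a message becomes visible again.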

5. Processor Workers

Responsibilities:
  • Transcode videos
  • Generate thumbnails
  • Update database
Technology: Python + FFmpeg
Scaling: Auto-scale on queue depth

Data Flow

Upload Flow

  1. Client requests upload URL from Upload Service
  2. Upload Service generates pre-signed S3 URL
  3. Client uploads directly to S3
  4. Client notifies Upload Service of completion
  5. Upload Service enqueues processing job
  6. Returns upload_id to client

Processing Flow

  1. Worker polls queue for jobs
  2. Downloads video from S3
  3. Processes video (transcode, thumbnail)
  4. Uploads results to S3
  5. Updates database status
  6. Sends notification
  7. Deletes message from queue
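
The ordering of the seven steps matters: the message is deleted last so that a failure in any earlier step leaves it in the queue for redelivery. A minimal sketch of one iteration follows, written in TypeScript for consistency with the data model even though the design names Python for the workers. `WorkerDeps` and all of its members are hypothetical ports, and the calls are synchronous only for brevity; real queue, storage, and database calls would be asynchronous.

```typescript
interface Job {
  upload_id: string;
  receiptHandle: string;
}

// Hypothetical ports; a real worker would back these with the queue,
// object storage, FFmpeg, the database, and the notification service.
interface WorkerDeps {
  receive(): Job | undefined;
  download(uploadId: string): string; // returns a local input path
  transcode(localPath: string): string; // returns a local output path
  upload(uploadId: string, outputPath: string): string; // returns processed_url
  markComplete(uploadId: string, processedUrl: string): void;
  notify(uploadId: string): void;
  deleteMessage(receiptHandle: string): void;
}

// Handles at most one job; returns false when the queue had nothing to do.
function handleNextJob(deps: WorkerDeps): boolean {
  const job = deps.receive();
  if (job === undefined) {
    return false;
  }
  const localPath = deps.download(job.upload_id);
  const outputPath = deps.transcode(localPath);
  const processedUrl = deps.upload(job.upload_id, outputPath);
  deps.markComplete(job.upload_id, processedUrl);
  deps.notify(job.upload_id);
  // Deleted last: if any step above throws, the message is never deleted
  // and becomes visible again once the visibility timeout expires.
  deps.deleteMessage(job.receiptHandle);
  return true;
}
```

Because a failed run can be retried from the start, each step should be safe to repeat.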

Data Model

typescript
interface Upload {
  id: string;
  user_id: string;
  filename: string;
  size: number;
  status: 'pending' | 'processing' | 'complete' | 'failed';
  original_url: string;
  processed_url?: string;
  created_at: Date;
  processed_at?: Date;
}

interface ProcessingJob {
  upload_id: string;
  attempts: number;
  error?: string;
}

API Contract

Upload Endpoints

POST   /uploads           - Request upload URL
GET    /uploads/:id       - Get upload status
DELETE /uploads/:id       - Cancel upload
GET    /uploads           - List user uploads

Webhooks

POST {webhook_url}
{
  "event": "upload.completed",
  "upload_id": "...",
  "status": "complete",
  "processed_url": "..."
}
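
The webhook body can be given a type and built from an upload record, as in the sketch below. `UploadCompletedEvent` and `buildCompletedEvent` are hypothetical names; the field names come from the payload above.

```typescript
interface UploadCompletedEvent {
  event: "upload.completed";
  upload_id: string;
  status: "complete";
  processed_url: string;
}

// Builds the webhook body; refuses uploads that are not actually finished.
function buildCompletedEvent(upload: {
  id: string;
  status: string;
  processed_url?: string;
}): UploadCompletedEvent {
  if (upload.status !== "complete" || upload.processed_url === undefined) {
    throw new Error("upload " + upload.id + " is not complete");
  }
  return {
    event: "upload.completed",
    upload_id: upload.id,
    status: "complete",
    processed_url: upload.processed_url,
  };
}
```

Serializing the result with `JSON.stringify` gives the body to POST to `{webhook_url}`.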

Scaling Considerations

Current Capacity

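  • 1000 uploads/day = ~0.7 per minute on average (1,000 / 1,440 minutes), rounded up to ~1 per minute for planning
  • Single worker can process 1 video every 5 minutes
  • Need 5 workers for current load (1 upload/minute × 5 minutes per video)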
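
The worker counts in this section follow from arrival rate multiplied by processing time. The sketch below makes that arithmetic explicit, assuming uploads arrive evenly across the day (a simplification; real traffic peaks). `workersNeeded` and its `headroom` parameter are hypothetical. With unrounded average rates it gives 4, 35, and 348 workers for the three tiers; the figures of 5 and 50 in this section come from rounding the per-minute rate up first, which builds in headroom.

```typescript
const MINUTES_PER_DAY = 24 * 60;

// Workers required to keep up with the average arrival rate.
// headroom > 1 adds spare capacity for peaks (assumed, not from the design).
function workersNeeded(
  uploadsPerDay: number,
  minutesPerVideo: number,
  headroom: number = 1
): number {
  const arrivalsPerMinute = uploadsPerDay / MINUTES_PER_DAY;
  return Math.ceil(arrivalsPerMinute * minutesPerVideo * headroom);
}

const currentTier = workersNeeded(1000, 5); // average load at 1000 uploads/day
const tenXTier = workersNeeded(10000, 5);
const hundredXTier = workersNeeded(100000, 5);
```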

10x Scale (10,000/day)

  • ~7 uploads per minute on average, rounded up to ~10 for planning
  • Need 50 workers (10 uploads/minute × 5 minutes per video)
  • Use spot instances for cost savings
  • Add Redis cache for status checks

100x Scale (100,000/day)

  • ~70 uploads per minute on average, rounded up to ~100 for planning
  • Partition by region
  • Use Kafka instead of SQS
  • Database sharding by user_id

Failure Modes

S3 Unavailable

  • Impact: Uploads fail
  • Mitigation: Multi-region S3 replication

Queue Backed Up

  • Impact: Processing delays
  • Mitigation: Auto-scale workers faster

Worker Crash During Processing

  • Impact: Job is retried once the message's visibility timeout expires
  • Mitigation: Idempotent processing
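
One way to make processing idempotent is to check the recorded status before doing any work, so a redelivered message for a finished upload becomes a no-op. The sketch below uses an in-memory `Map` as a stand-in for the database; `processOnce` is a hypothetical name. It does not guard against two workers handling the same job at the same moment, which would need a conditional update in the real datastore.

```typescript
type UploadStatus = "pending" | "processing" | "complete" | "failed";

// Runs the work only if the upload is not already complete.
// statusStore is a stand-in for the database, keyed by upload_id.
function processOnce(
  statusStore: Map<string, UploadStatus>,
  uploadId: string,
  work: (uploadId: string) => void
): "processed" | "skipped" {
  if (statusStore.get(uploadId) === "complete") {
    return "skipped"; // duplicate delivery: nothing left to do
  }
  statusStore.set(uploadId, "processing");
  work(uploadId);
  statusStore.set(uploadId, "complete");
  return "processed";
}
```

Writing results to the fixed path `/processed/{user_id}/{upload_id}/output.mp4` helps as well: a repeated run overwrites the same object instead of creating a second copy.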

Cost Estimate

Monthly (1000 uploads/day):
  • S3 Storage: $50
  • S3 Transfer: $100
  • SQS: $10
  • Workers (EC2): $300
  • Database: $100
Total: ~$560/month
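
The monthly total can be checked against the non-functional target of under $0.50 per video. The sketch below assumes a 30-day month; `monthlyTotal` and `costPerVideo` are hypothetical helpers, and the line items are the estimates listed above.

```typescript
// Monthly line items in USD, copied from the estimate above.
const monthlyLineItems: Array<[string, number]> = [
  ["S3 Storage", 50],
  ["S3 Transfer", 100],
  ["SQS", 10],
  ["Workers (EC2)", 300],
  ["Database", 100],
];

function monthlyTotal(items: Array<[string, number]>): number {
  return items.reduce((sum, item) => sum + item[1], 0);
}

// Assumption: a 30-day month.
function costPerVideo(
  totalPerMonth: number,
  uploadsPerDay: number,
  daysPerMonth: number = 30
): number {
  return totalPerMonth / (uploadsPerDay * daysPerMonth);
}

const totalUsd = monthlyTotal(monthlyLineItems);
const perVideoUsd = costPerVideo(totalUsd, 1000);
```

At 1000 uploads/day this works out to roughly $0.02 per video, well inside the $0.50 budget.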

Security

  • Pre-signed URLs expire in 1 hour
  • Videos in private S3 buckets
  • CloudFront signed URLs for delivery
  • Rate limiting per user

Monitoring

Metrics:
  • Upload success rate
  • Processing time (p50, p95, p99)
  • Queue depth
  • Worker CPU/memory
  • Error rate by type
Alerts:
  • Queue depth >1000
  • Processing time p95 >10 minutes
  • Error rate >5%
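
The three alert rules can be evaluated against a metrics snapshot as sketched below. The percentile uses the nearest-rank method, which is one of several common definitions and an assumption here; `MetricsSnapshot`, `percentile`, and `firingAlerts` are hypothetical names.

```typescript
// Nearest-rank percentile; p is a whole number from 1 to 100.
function percentile(values: number[], p: number): number {
  if (values.length === 0) {
    throw new Error("no samples");
  }
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

interface MetricsSnapshot {
  queueDepth: number;
  processingMinutes: number[]; // one sample per processed video
  errorRate: number; // fraction between 0 and 1
}

// Returns the names of the alert rules above that are currently firing.
function firingAlerts(snapshot: MetricsSnapshot): string[] {
  const alerts: string[] = [];
  if (snapshot.queueDepth > 1000) {
    alerts.push("queue_depth");
  }
  if (
    snapshot.processingMinutes.length > 0 &&
    percentile(snapshot.processingMinutes, 95) > 10
  ) {
    alerts.push("processing_time_p95");
  }
  if (snapshot.errorRate > 0.05) {
    alerts.push("error_rate");
  }
  return alerts;
}
```

In practice the monitoring system would compute these from its own time series; the sketch only shows how the thresholds relate to the listed metrics.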

Open Questions

  • Video retention policy? (30 days? 1 year?)
  • Maximum video duration? (affects processing time)
  • Regional data residency requirements?

Component Template

Component Name

Responsibilities:
  • Primary responsibility
  • Secondary responsibility
Technology Stack:
  • Language: [Python/Node/Go]
  • Framework: [Express/FastAPI/Gin]
  • Database: [PostgreSQL/MongoDB]
API/Interface:
typescript
interface ComponentAPI {
  method(params): ReturnType;
}
Scaling Strategy:
  • Horizontal: Stateless, load balanced
  • Vertical: Cache layer, connection pooling
Dependencies:
  • Service A (for X)
  • Database B (for persistence)
Failure Handling:
  • Retry with exponential backoff
  • Circuit breaker for downstream services
  • Fallback to cached data
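
For the "Retry with exponential backoff" item, the delay schedule could be computed as in the sketch below. The 1-second base, 60-second cap, and the optional full-jitter variant are assumptions chosen for illustration; `backoffDelayMs` and `jitteredDelayMs` are hypothetical names.

```typescript
// Delay before retry number `attempt` (0-based): base, 2x base, 4x base, ... up to the cap.
function backoffDelayMs(
  attempt: number,
  baseMs: number = 1000,
  capMs: number = 60000
): number {
  return Math.min(capMs, baseMs * Math.pow(2, attempt));
}

// Full jitter: a random delay between 0 and the backoff ceiling,
// which spreads out retries from many clients failing at once.
function jitteredDelayMs(
  attempt: number,
  random: () => number = Math.random
): number {
  return Math.floor(random() * backoffDelayMs(attempt));
}
```

Passing `random` as a parameter keeps the function deterministic under test.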

Best Practices

  1. Start with requirements: Functional + non-functional
  2. Draw diagrams first: Visual clarity
  3. Define boundaries: What's in scope vs out
  4. Document tradeoffs: Every choice has costs
  5. Plan for failure: What breaks and how to handle
  6. Consider scale: Current, 10x, 100x
  7. Estimate costs: Build vs buy decisions
  8. Leave open questions: Don't pretend to know everything

Output Checklist

  • Requirements documented (functional + non-functional)
  • High-level architecture diagram
  • Component breakdown (3-7 components)
  • Data flow documented
  • Data model defined
  • API contracts specified
  • Scaling considerations (1x, 10x, 100x)
  • Failure modes identified
  • Cost estimate provided
  • Security considerations
  • Monitoring plan