logging-strategy
Logging Strategy
When to Use
When the user is designing a logging approach for an application, reviewing existing log statements for quality, setting up log aggregation (ELK, Loki, CloudWatch), adding correlation IDs to a distributed system, or asking about what to log and what to avoid logging.
Instructions
Log Levels
Use log levels consistently and intentionally:
- TRACE — Very fine-grained diagnostic detail (loop iterations, variable state). Never enable in production.
- DEBUG — Information useful during development (SQL queries, cache hits/misses, config values loaded). Off by default in production.
- INFO — Normal operational events that confirm the system is working (service started, request handled, job completed). This is the default production level.
- WARN — Something unexpected that the system can handle but that deserves attention (retry succeeded, deprecated API called, approaching rate limit). Not page-worthy.
- ERROR — A failure that prevented an operation from completing (unhandled exception, downstream service unreachable, write failed). Should trigger an alert or ticket.
- FATAL — The process cannot continue and is shutting down (missing required config, port already bound, unrecoverable state). Extremely rare.
Rule of thumb: if you're unsure between two levels, pick the lower one. Over-logging at DEBUG is better than missing context at ERROR.
Structured vs Unstructured Logging
Always prefer structured logging (JSON or key-value pairs) over unstructured text:
- Structured logs are machine-parseable, searchable, and filterable
- Unstructured logs require regex parsing and break when message formats change
- Every log line should include: timestamp, level, service, and a message, plus a request_id or trace_id
- Use the references/structured-logging.md reference file for language-specific libraries and patterns
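A minimal sketch of such a formatter using only the stdlib; the service name "checkout" is an assumption, and in practice a library like structlog or python-json-logger handles this:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S%z"),
            "level": record.levelname,
            "service": "checkout",          # assumed service name
            "message": record.getMessage(),
        }
        # Attach request_id when supplied via the `extra` argument
        if hasattr(record, "request_id"):
            entry["request_id"] = record.request_id
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("checkout")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("request handled", extra={"request_id": "req-42"})
```

Each line is now a self-contained JSON object that an aggregator can index without regex parsing.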
Correlation IDs and Context Propagation
关联ID与上下文传播
- Generate a unique request_id (UUID or ULID) at the system boundary (API gateway, message consumer)
- Propagate it through all downstream calls via headers (X-Request-ID) or message metadata
- Include the request_id in every log line so all logs for a single request can be correlated
- In distributed systems, also propagate trace_id and span_id (W3C Trace Context format)
- Use MDC (Mapped Diagnostic Context) or equivalent to avoid passing IDs through every function signature
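In Python, contextvars plays the role MDC plays on the JVM. A sketch with invented logger and function names:

```python
import contextvars
import logging
import uuid

# Set once at the boundary; readable from any code on the same request context
request_id_var = contextvars.ContextVar("request_id", default="-")

class RequestIdFilter(logging.Filter):
    """Stamp every record with the current request_id."""
    def filter(self, record):
        record.request_id = request_id_var.get()
        return True

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(levelname)s %(request_id)s %(message)s"))
handler.addFilter(RequestIdFilter())
log = logging.getLogger("api")
log.addHandler(handler)
log.setLevel(logging.INFO)

def handle_request():
    # System boundary: generate the ID once per request
    request_id_var.set(str(uuid.uuid4()))
    log.info("request received")
    do_work()

def do_work():
    log.info("processing")  # no request_id parameter needed anywhere

handle_request()
```

Both log lines carry the same request_id even though `do_work` never sees it explicitly.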
What to Log
应记录的内容
- Request/response summaries: method, path, status code, duration
- State transitions: order placed, payment processed, user activated
- Decision points: why a branch was taken, which cache was hit
- Error details: stack trace, input that caused the failure, downstream error message
- Performance data: query duration, queue depth, batch size
- Lifecycle events: service started, config loaded, graceful shutdown initiated
- Retry attempts: which operation, attempt number, backoff duration
- External calls: downstream service name, response time, response status
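The first bullet, a request/response summary, can be illustrated with a hypothetical wrapper that logs method, path, status code, and duration for every request:

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("http")

def log_request_summary(method, path, handler):
    """Run handler() and emit one summary line, even if it raises."""
    start = time.perf_counter()
    status = 500  # assume failure until the handler returns
    try:
        status = handler()
        return status
    finally:
        duration_ms = (time.perf_counter() - start) * 1000
        log.info("request method=%s path=%s status=%d duration_ms=%.1f",
                 method, path, status, duration_ms)

log_request_summary("GET", "/orders", lambda: 200)
```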
What NOT to Log
- PII: names, emails, phone numbers, addresses, IP addresses (unless required and compliant)
- Secrets: passwords, API keys, tokens, session IDs, credit card numbers
- Sensitive business data: salary figures, health records, financial details
- High-cardinality user input: full request bodies (log a hash or truncation instead)
- When in doubt, redact or mask: email=b***@example.com, card=****1234
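A sketch of masking helpers matching the examples above; the exact masking rules are a policy choice, not a standard:

```python
import re

def mask_email(email: str) -> str:
    """Keep only the first character of the local part: b***@example.com"""
    local, _, domain = email.partition("@")
    return f"{local[:1]}***@{domain}"

def mask_card(number: str) -> str:
    """Keep only the last four digits: ****1234"""
    digits = re.sub(r"\D", "", number)
    return f"****{digits[-4:]}"
```

Apply these at the logging call site so the raw value never reaches the log pipeline.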
Log Aggregation and Rotation
日志聚合与轮转
- Ship logs to a centralized system (ELK stack, Grafana Loki, CloudWatch Logs)
- Use structured format (JSON) for ingestion; avoid multi-line logs where possible
- Set retention policies: hot storage (7-30 days searchable), warm (90 days), cold archive (1+ year if required)
- Rotate local log files by size (100MB) or time (daily) to prevent disk exhaustion
- In containers, log to stdout/stderr and let the orchestrator handle collection
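Size-based rotation of local files can be configured with the stdlib handlers; the 100 MB threshold mirrors the guideline above, and the filename and backup count are illustrative:

```python
import logging
from logging.handlers import RotatingFileHandler

# Rotate at ~100 MB, keeping 5 old files: app.log, app.log.1, ..., app.log.5
handler = RotatingFileHandler(
    "app.log",
    maxBytes=100 * 1024 * 1024,
    backupCount=5,
    delay=True,  # don't open the file until the first record is emitted
)
# For daily rotation instead, use:
# TimedRotatingFileHandler("app.log", when="midnight", backupCount=14)
log = logging.getLogger("app")
log.addHandler(handler)
```

In containers, skip file handlers entirely and write to stdout/stderr.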
Performance Considerations
性能注意事项
- Logging has real cost: I/O, serialization, memory allocation
- Use lazy evaluation for expensive log arguments (don't compute a debug string if debug is off)
- Avoid logging inside tight loops; aggregate and log summaries instead
- Sample high-volume logs (e.g., log 1 in 100 health check requests)
- Async log appenders prevent log I/O from blocking request processing
Common Anti-Patterns
- Log and throw: logging an error and then throwing it causes duplicate log entries up the call stack
- Logging everything at INFO: makes production logs noisy and important events invisible
- String concatenation in log calls: wastes CPU when the level is disabled; use parameterized messages
- Missing context: "Error occurred" with no details about what, where, or why
- Inconsistent formats: mixing structured and unstructured logs in the same service
- Logging in a loop: prefer aggregating counts and logging a summary after the loop
- Timestamps without timezone: always use UTC with ISO 8601 format to avoid ambiguity
- Swallowing exceptions: catching an exception and logging only a generic message loses the stack trace
Examples
User: "We have a REST API in Python and no structured logging. How should we set it up?"
Agent: Recommends replacing the stdlib logging module's default formatter with structlog. Configures a processor chain that adds timestamp, level, service, and request_id, and outputs JSON. Adds middleware to generate and propagate a request_id for every incoming request. Sets the log level via environment variable (LOG_LEVEL=INFO for production, DEBUG for development). Refers to references/structured-logging.md for the structlog setup pattern.
User: "I'm logging user data for debugging, is this okay?"
Agent: Reviews the log statements and identifies PII exposure (email addresses and IP addresses logged at INFO level). Recommends masking emails (u***@domain.com), removing IP logging unless legally required, and moving detailed user data logging to DEBUG level with a note that DEBUG must never be enabled in production without a data processing agreement.
User: "Our logs are huge and costing us a fortune in storage."
Agent: Analyzes log volume by level and source. Identifies that health check endpoints generate 80% of INFO logs. Recommends: (1) reduce health check logging to DEBUG or sample at 1%, (2) move verbose middleware logs from INFO to DEBUG, (3) set retention to 14 days hot / 90 days cold, (4) add a log.skip tag for high-frequency low-value events. Estimates 60-70% cost reduction.