logging-strategy


Logging Strategy


When to Use


When the user is designing a logging approach for an application, reviewing existing log statements for quality, setting up log aggregation (ELK, Loki, CloudWatch), adding correlation IDs to a distributed system, or asking about what to log and what to avoid logging.

Instructions


Log Levels


Use log levels consistently and intentionally:
  1. TRACE — Very fine-grained diagnostic detail (loop iterations, variable state). Never enable in production.
  2. DEBUG — Information useful during development (SQL queries, cache hits/misses, config values loaded). Off by default in production.
  3. INFO — Normal operational events that confirm the system is working (service started, request handled, job completed). This is the default production level.
  4. WARN — Something unexpected that the system can handle but that deserves attention (retry succeeded, deprecated API called, approaching rate limit). Not page-worthy.
  5. ERROR — A failure that prevented an operation from completing (unhandled exception, downstream service unreachable, write failed). Should trigger an alert or ticket.
  6. FATAL — The process cannot continue and is shutting down (missing required config, port already bound, unrecoverable state). Extremely rare.
Rule of thumb: if you're unsure between two levels, pick the lower one. Over-logging at DEBUG is better than missing context at ERROR.

Structured vs Unstructured Logging


Always prefer structured logging (JSON or key-value pairs) over unstructured text:
  • Structured logs are machine-parseable, searchable, and filterable
  • Unstructured logs require regex parsing and break when message formats change
  • Every log line should include: timestamp, level, service, message, and a request_id or trace_id
  • Use the reference file references/structured-logging.md for language-specific libraries and patterns
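One way to get those required fields is a custom JSON formatter over stdlib logging, sketched below (the service name "checkout" is a placeholder; structlog or python-json-logger would cover the same ground with less code):

```python
import json
import logging
import sys
import time
import uuid

class JsonFormatter(logging.Formatter):
    """Emit each record as one JSON object per line."""
    def format(self, record):
        entry = {
            "timestamp": time.strftime(
                "%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)
            ),
            "level": record.levelname,
            "service": "checkout",  # hypothetical service name
            "message": record.getMessage(),
            "request_id": getattr(record, "request_id", None),
        }
        return json.dumps(entry)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("checkout")
log.addHandler(handler)
log.setLevel(logging.INFO)

# The extra dict attaches structured fields to the record.
log.info("order placed", extra={"request_id": str(uuid.uuid4())})
```

Because every line is one self-contained JSON object, aggregators can index and filter on any field without regex parsing.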

Correlation IDs and Context Propagation


  1. Generate a unique request_id (UUID or ULID) at the system boundary (API gateway, message consumer)
  2. Propagate it through all downstream calls via headers (X-Request-ID) or message metadata
  3. Include the request_id in every log line so all logs for a single request can be correlated
  4. In distributed systems, also propagate trace_id and span_id (W3C Trace Context format)
  5. Use MDC (Mapped Diagnostic Context) or equivalent to avoid passing IDs through every function signature
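In Python, contextvars serves as the MDC equivalent: the request_id set once at the boundary is visible to every log call in the same (async) context, with no threading of IDs through function signatures. A minimal sketch, assuming a dict of inbound headers:

```python
import contextvars
import logging
import uuid

# Context-local storage: each request/task sees its own value.
request_id_var = contextvars.ContextVar("request_id", default="-")

class RequestIdFilter(logging.Filter):
    """Attach the current request_id to every record this logger emits."""
    def filter(self, record):
        record.request_id = request_id_var.get()
        return True

logging.basicConfig(
    format="%(asctime)s %(levelname)s [%(request_id)s] %(message)s"
)
log = logging.getLogger("api")
log.addFilter(RequestIdFilter())

def handle_request(headers):
    # Reuse the inbound X-Request-ID if present, else mint one at the boundary.
    rid = headers.get("X-Request-ID") or str(uuid.uuid4())
    request_id_var.set(rid)
    log.warning("request handled")  # carries the request_id automatically
    return rid
```

Deeper helper functions log through the same logger and pick up the ID for free, which is the whole point of step 5.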

What to Log


  • Request/response summaries: method, path, status code, duration
  • State transitions: order placed, payment processed, user activated
  • Decision points: why a branch was taken, which cache was hit
  • Error details: stack trace, input that caused the failure, downstream error message
  • Performance data: query duration, queue depth, batch size
  • Lifecycle events: service started, config loaded, graceful shutdown initiated
  • Retry attempts: which operation, attempt number, backoff duration
  • External calls: downstream service name, response time, response status
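The first bullet, a request/response summary, can be captured with a small wrapper that logs one line per request even when the handler fails (the handler callable and field names here are illustrative, not from any particular framework):

```python
import logging
import time

log = logging.getLogger("http")

def log_request_summary(method, path, handler):
    """Run handler() and log one summary line: method, path, status, duration."""
    start = time.perf_counter()
    status = 500  # assume failure until the handler returns
    try:
        status = handler()
        return status
    finally:
        duration_ms = (time.perf_counter() - start) * 1000
        log.info(
            "request handled method=%s path=%s status=%d duration_ms=%.1f",
            method, path, status, duration_ms,
        )
```

The finally block guarantees the summary is emitted on both success and exception paths, so the status code and duration are never silently missing.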

What NOT to Log


  • PII: names, emails, phone numbers, addresses, IP addresses (unless required and compliant)
  • Secrets: passwords, API keys, tokens, session IDs, credit card numbers
  • Sensitive business data: salary figures, health records, financial details
  • High-cardinality user input: full request bodies (log a hash or truncation instead)
  • When in doubt, redact or mask:
    email=b***@example.com
    ,
    card=****1234

Log Aggregation and Rotation


  1. Ship logs to a centralized system (ELK stack, Grafana Loki, CloudWatch Logs)
  2. Use structured format (JSON) for ingestion; avoid multi-line logs where possible
  3. Set retention policies: hot storage (7-30 days searchable), warm (90 days), cold archive (1+ year if required)
  4. Rotate local log files by size (100MB) or time (daily) to prevent disk exhaustion
  5. In containers, log to stdout/stderr and let the orchestrator handle collection
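For step 4, stdlib logging ships a size-based rotating handler; a minimal sketch (the filename and logger name are placeholders, and delay=True just postpones opening the file until the first write):

```python
import logging
from logging.handlers import RotatingFileHandler

# Roll over at 100 MB and keep 5 old files, so disk usage for this
# log is bounded at roughly 600 MB (current file + 5 backups).
handler = RotatingFileHandler(
    "app.log",
    maxBytes=100 * 1024 * 1024,
    backupCount=5,
    delay=True,
)
log = logging.getLogger("app")
log.addHandler(handler)
```

For time-based rotation (step 4's "daily" option), logging.handlers.TimedRotatingFileHandler with when="midnight" is the analogous choice; in containers, skip file handlers entirely per step 5.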

Performance Considerations


  • Logging has real cost: I/O, serialization, memory allocation
  • Use lazy evaluation for expensive log arguments (don't compute a debug string if debug is off)
  • Avoid logging inside tight loops; aggregate and log summaries instead
  • Sample high-volume logs (e.g., log 1 in 100 health check requests)
  • Async log appenders prevent log I/O from blocking request processing
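The lazy-evaluation and sampling points can be sketched together (the expensive_summary function stands in for any costly serialization; the 1% rate matches the health check example):

```python
import logging
import random

log = logging.getLogger("perf")

def expensive_summary(items):
    # Stand-in for a costly serialization of a large structure.
    return ",".join(map(str, items))

items = list(range(1000))

# Lazy evaluation: guard the expensive call so it only runs when
# DEBUG is actually enabled; otherwise the string is never built.
if log.isEnabledFor(logging.DEBUG):
    log.debug("batch contents: %s", expensive_summary(items))

# Sampling: emit roughly 1 in 100 high-volume events.
if random.random() < 0.01:
    log.info("health check ok")
```

Parameterized messages (`%s` with arguments) give a cheaper form of the same laziness for simple values; the explicit isEnabledFor guard is for arguments that are expensive to compute at all.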

Common Anti-Patterns


  • Log and throw: logging an error and then throwing it causes duplicate log entries up the call stack
  • Logging everything at INFO: makes production logs noisy and important events invisible
  • String concatenation in log calls: wastes CPU when the level is disabled; use parameterized messages
  • Missing context: "Error occurred" with no details about what, where, or why
  • Inconsistent formats: mixing structured and unstructured logs in the same service
  • Logging in a loop: prefer aggregating counts and logging a summary after the loop
  • Timestamps without timezone: always use UTC with ISO 8601 format to avoid ambiguity
  • Swallowing exceptions: catching an exception and logging only a generic message loses the stack trace
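Two of these anti-patterns and their fixes, sketched in Python (the charge function is a hypothetical example):

```python
import logging

log = logging.getLogger("orders")
order_id = "ord-42"

# Anti-pattern: f-string concatenation builds the message even
# when DEBUG is disabled.
#   log.debug(f"loaded order {order_id}")
# Better: parameterized message; formatting is deferred until the
# record is actually going to be emitted.
log.debug("loaded order %s", order_id)

def charge(amount):
    if amount <= 0:
        # Anti-pattern would be: log.error(...) here AND raise, which
        # duplicates entries up the call stack. Instead, raise with
        # context and let one top-level handler log it with
        # exc_info=True so the stack trace is preserved.
        raise ValueError(f"invalid charge amount: {amount}")
    return amount
```

The single top-level handler pattern also fixes "swallowing exceptions": `log.error("charge failed", exc_info=True)` in that one place keeps the full stack trace.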

Examples


User: "We have a REST API in Python and no structured logging. How should we set it up?"
Agent: Recommends replacing the stdlib logging module's default formatter with structlog. Configures a processor chain that adds timestamp, level, service, and request_id, and outputs JSON. Adds middleware to generate and propagate a request_id for every incoming request. Sets the log level via environment variable (LOG_LEVEL=INFO for production, DEBUG for development). Refers to references/structured-logging.md for the structlog setup pattern.

User: "I'm logging user data for debugging, is this okay?"
Agent: Reviews the log statements and identifies PII exposure (email addresses and IP addresses logged at INFO level). Recommends masking emails (u***@domain.com), removing IP logging unless legally required, and moving detailed user-data logging to DEBUG level, with a note that DEBUG must never be enabled in production without a data processing agreement.

User: "Our logs are huge and costing us a fortune in storage."
Agent: Analyzes log volume by level and source. Identifies that health check endpoints generate 80% of INFO logs. Recommends: (1) reduce health check logging to DEBUG or sample at 1%, (2) move verbose middleware logs from INFO to DEBUG, (3) set retention to 14 days hot / 90 days cold, (4) add a log.skip tag for high-frequency low-value events. Estimates a 60-70% cost reduction.