logging-strategy


Logging Strategy


When to Use


When the user is designing a logging approach for an application, reviewing existing log statements for quality, setting up log aggregation (ELK, Loki, CloudWatch), adding correlation IDs to a distributed system, or asking about what to log and what to avoid logging.

Instructions


Log Levels


Use log levels consistently and intentionally:
  1. TRACE — Very fine-grained diagnostic detail (loop iterations, variable state). Never enable in production.
  2. DEBUG — Information useful during development (SQL queries, cache hits/misses, config values loaded). Off by default in production.
  3. INFO — Normal operational events that confirm the system is working (service started, request handled, job completed). This is the default production level.
  4. WARN — Something unexpected that the system can handle but that deserves attention (retry succeeded, deprecated API called, approaching rate limit). Not page-worthy.
  5. ERROR — A failure that prevented an operation from completing (unhandled exception, downstream service unreachable, write failed). Should trigger an alert or ticket.
  6. FATAL — The process cannot continue and is shutting down (missing required config, port already bound, unrecoverable state). Extremely rare.
Rule of thumb: if you're unsure between two levels, pick the lower one. Over-logging at DEBUG is better than missing context at ERROR.

Structured vs Unstructured Logging


Always prefer structured logging (JSON or key-value pairs) over unstructured text:
  • Structured logs are machine-parseable, searchable, and filterable
  • Unstructured logs require regex parsing and break when message formats change
  • Every log line should include: timestamp, level, service, message, and a request_id or trace_id
  • Use the reference file references/structured-logging.md for language-specific libraries and patterns
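One way to get those required fields is a custom JSON formatter over stdlib logging, sketched below (the service name "checkout" is a placeholder; structlog or python-json-logger would cover the same ground with less code):

```python
import json
import logging
import sys
import time
import uuid

class JsonFormatter(logging.Formatter):
    """Emit each record as one JSON object per line."""
    def format(self, record):
        entry = {
            "timestamp": time.strftime(
                "%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)
            ),
            "level": record.levelname,
            "service": "checkout",  # hypothetical service name
            "message": record.getMessage(),
            "request_id": getattr(record, "request_id", None),
        }
        return json.dumps(entry)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("checkout")
log.addHandler(handler)
log.setLevel(logging.INFO)

# The extra dict attaches structured fields to the record.
log.info("order placed", extra={"request_id": str(uuid.uuid4())})
```

Because every line is one self-contained JSON object, aggregators can index and filter on any field without regex parsing.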

Correlation IDs and Context Propagation


  1. Generate a unique request_id (UUID or ULID) at the system boundary (API gateway, message consumer)
  2. Propagate it through all downstream calls via headers (X-Request-ID) or message metadata
  3. Include the request_id in every log line so all logs for a single request can be correlated
  4. In distributed systems, also propagate trace_id and span_id (W3C Trace Context format)
  5. Use MDC (Mapped Diagnostic Context) or equivalent to avoid passing IDs through every function signature
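In Python, contextvars serves as the MDC equivalent: the request_id set once at the boundary is visible to every log call in the same (async) context, with no threading of IDs through function signatures. A minimal sketch, assuming a dict of inbound headers:

```python
import contextvars
import logging
import uuid

# Context-local storage: each request/task sees its own value.
request_id_var = contextvars.ContextVar("request_id", default="-")

class RequestIdFilter(logging.Filter):
    """Attach the current request_id to every record this logger emits."""
    def filter(self, record):
        record.request_id = request_id_var.get()
        return True

logging.basicConfig(
    format="%(asctime)s %(levelname)s [%(request_id)s] %(message)s"
)
log = logging.getLogger("api")
log.addFilter(RequestIdFilter())

def handle_request(headers):
    # Reuse the inbound X-Request-ID if present, else mint one at the boundary.
    rid = headers.get("X-Request-ID") or str(uuid.uuid4())
    request_id_var.set(rid)
    log.warning("request handled")  # carries the request_id automatically
    return rid
```

Deeper helper functions log through the same logger and pick up the ID for free, which is the whole point of step 5.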

What to Log


  • Request/response summaries: method, path, status code, duration
  • State transitions: order placed, payment processed, user activated
  • Decision points: why a branch was taken, which cache was hit
  • Error details: stack trace, input that caused the failure, downstream error message
  • Performance data: query duration, queue depth, batch size
  • Lifecycle events: service started, config loaded, graceful shutdown initiated
  • Retry attempts: which operation, attempt number, backoff duration
  • External calls: downstream service name, response time, response status
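The first bullet, a request/response summary, can be captured with a small wrapper that logs one line per request even when the handler fails (the handler callable and field names here are illustrative, not from any particular framework):

```python
import logging
import time

log = logging.getLogger("http")

def log_request_summary(method, path, handler):
    """Run handler() and log one summary line: method, path, status, duration."""
    start = time.perf_counter()
    status = 500  # assume failure until the handler returns
    try:
        status = handler()
        return status
    finally:
        duration_ms = (time.perf_counter() - start) * 1000
        log.info(
            "request handled method=%s path=%s status=%d duration_ms=%.1f",
            method, path, status, duration_ms,
        )
```

The finally block guarantees the summary is emitted on both success and exception paths, so the status code and duration are never silently missing.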

What NOT to Log


  • PII: names, emails, phone numbers, addresses, IP addresses (unless required and compliant)
  • Secrets: passwords, API keys, tokens, session IDs, credit card numbers
  • Sensitive business data: salary figures, health records, financial details
  • High-cardinality user input: full request bodies (log a hash or truncation instead)
  • When in doubt, redact or mask:
    email=b***@example.com
    ,
    card=****1234

Log Aggregation and Rotation


  1. Ship logs to a centralized system (ELK stack, Grafana Loki, CloudWatch Logs)
  2. Use structured format (JSON) for ingestion; avoid multi-line logs where possible
  3. Set retention policies: hot storage (7-30 days searchable), warm (90 days), cold archive (1+ year if required)
  4. Rotate local log files by size (100MB) or time (daily) to prevent disk exhaustion
  5. In containers, log to stdout/stderr and let the orchestrator handle collection
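For step 4, stdlib logging ships a size-based rotating handler; a minimal sketch (the filename and logger name are placeholders, and delay=True just postpones opening the file until the first write):

```python
import logging
from logging.handlers import RotatingFileHandler

# Roll over at 100 MB and keep 5 old files, so disk usage for this
# log is bounded at roughly 600 MB (current file + 5 backups).
handler = RotatingFileHandler(
    "app.log",
    maxBytes=100 * 1024 * 1024,
    backupCount=5,
    delay=True,
)
log = logging.getLogger("app")
log.addHandler(handler)
```

For time-based rotation (step 4's "daily" option), logging.handlers.TimedRotatingFileHandler with when="midnight" is the analogous choice; in containers, skip file handlers entirely per step 5.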

Performance Considerations


  • Logging has real cost: I/O, serialization, memory allocation
  • Use lazy evaluation for expensive log arguments (don't compute a debug string if debug is off)
  • Avoid logging inside tight loops; aggregate and log summaries instead
  • Sample high-volume logs (e.g., log 1 in 100 health check requests)
  • Async log appenders prevent log I/O from blocking request processing
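The lazy-evaluation and sampling points can be sketched together (the expensive_summary function stands in for any costly serialization; the 1% rate matches the health check example):

```python
import logging
import random

log = logging.getLogger("perf")

def expensive_summary(items):
    # Stand-in for a costly serialization of a large structure.
    return ",".join(map(str, items))

items = list(range(1000))

# Lazy evaluation: guard the expensive call so it only runs when
# DEBUG is actually enabled; otherwise the string is never built.
if log.isEnabledFor(logging.DEBUG):
    log.debug("batch contents: %s", expensive_summary(items))

# Sampling: emit roughly 1 in 100 high-volume events.
if random.random() < 0.01:
    log.info("health check ok")
```

Parameterized messages (`%s` with arguments) give a cheaper form of the same laziness for simple values; the explicit isEnabledFor guard is for arguments that are expensive to compute at all.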

Common Anti-Patterns


  • Log and throw: logging an error and then throwing it causes duplicate log entries up the call stack
  • Logging everything at INFO: makes production logs noisy and important events invisible
  • String concatenation in log calls: wastes CPU when the level is disabled; use parameterized messages
  • Missing context: "Error occurred" with no details about what, where, or why
  • Inconsistent formats: mixing structured and unstructured logs in the same service
  • Logging in a loop: prefer aggregating counts and logging a summary after the loop
  • Timestamps without timezone: always use UTC with ISO 8601 format to avoid ambiguity
  • Swallowing exceptions: catching an exception and logging only a generic message loses the stack trace
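Two of these anti-patterns and their fixes, sketched in Python (the charge function is a hypothetical example):

```python
import logging

log = logging.getLogger("orders")
order_id = "ord-42"

# Anti-pattern: f-string concatenation builds the message even
# when DEBUG is disabled.
#   log.debug(f"loaded order {order_id}")
# Better: parameterized message; formatting is deferred until the
# record is actually going to be emitted.
log.debug("loaded order %s", order_id)

def charge(amount):
    if amount <= 0:
        # Anti-pattern would be: log.error(...) here AND raise, which
        # duplicates entries up the call stack. Instead, raise with
        # context and let one top-level handler log it with
        # exc_info=True so the stack trace is preserved.
        raise ValueError(f"invalid charge amount: {amount}")
    return amount
```

The single top-level handler pattern also fixes "swallowing exceptions": `log.error("charge failed", exc_info=True)` in that one place keeps the full stack trace.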

Examples


User: "We have a REST API in Python and no structured logging. How should we set it up?"
Agent: Recommends replacing the stdlib logging module's default formatter with structlog. Configures a processor chain that adds timestamp, level, service, and request_id, and outputs JSON. Adds middleware to generate and propagate a request_id for every incoming request. Sets the log level via environment variable (LOG_LEVEL=INFO for production, DEBUG for development). Refers to references/structured-logging.md for the structlog setup pattern.

User: "I'm logging user data for debugging, is this okay?"
Agent: Reviews the log statements and identifies PII exposure (email addresses and IP addresses logged at INFO level). Recommends masking emails (u***@domain.com), removing IP logging unless legally required, and moving detailed user-data logging to DEBUG level, with a note that DEBUG must never be enabled in production without a data processing agreement.

User: "Our logs are huge and costing us a fortune in storage."
Agent: Analyzes log volume by level and source. Identifies that health check endpoints generate 80% of INFO logs. Recommends: (1) reduce health check logging to DEBUG or sample at 1%, (2) move verbose middleware logs from INFO to DEBUG, (3) set retention to 14 days hot / 90 days cold, (4) add a log.skip tag for high-frequency low-value events. Estimates a 60-70% cost reduction.