gcloud-usage

GCP Observability Best Practices

Structured Logging

JSON Log Format

Use structured JSON logging for better queryability:

{
  "severity": "ERROR",
  "message": "Payment failed",
  "httpRequest": { "requestMethod": "POST", "requestUrl": "/api/payment" },
  "labels": { "user_id": "123", "transaction_id": "abc" },
  "timestamp": "2025-01-15T10:30:00Z"
}
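
A minimal sketch of producing an entry like the one above from application code. It assumes the runtime (for example Cloud Run) forwards JSON lines written to stdout to Cloud Logging. `format_log_entry` and its parameters are hypothetical names chosen for illustration, not part of any Google library.

```python
import json
from datetime import datetime, timezone


def format_log_entry(severity, message, labels=None, http_request=None, now=None):
    """Build one JSON log line shaped like the example above.

    format_log_entry is a hypothetical helper, not a Google API.
    `now` is assumed to be a timezone-aware UTC datetime when provided.
    """
    now = now or datetime.now(timezone.utc)
    entry = {
        "severity": severity,
        "message": message,
        # RFC 3339 UTC timestamp with a trailing "Z", as in the example.
        "timestamp": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    if http_request:
        entry["httpRequest"] = http_request
    if labels:
        entry["labels"] = labels
    # json.dumps emits a single line, so each entry can be parsed on its own.
    return json.dumps(entry)


if __name__ == "__main__":
    print(format_log_entry(
        "ERROR",
        "Payment failed",
        labels={"user_id": "123", "transaction_id": "abc"},
        http_request={"requestMethod": "POST", "requestUrl": "/api/payment"},
    ))
```

Keeping one entry per line matters more than the helper itself: a multi-line JSON document would typically be split into several unstructured entries.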

Severity Levels

Use appropriate severity for filtering:

DEBUG: Detailed diagnostic info
INFO: Normal operations, milestones
NOTICE: Normal but significant events
WARNING: Potential issues, degraded performance
ERROR: Failures that don't stop the service
CRITICAL: Failures requiring immediate action
ALERT: A person must take action immediately
EMERGENCY: System is unusable
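
The levels above are ordered, which is what makes threshold comparisons such as `severity >= WARNING` meaningful. The sketch below mirrors that ordering locally, using the numeric values of the Cloud Logging `LogSeverity` enum. The `at_least` helper and the sample entries are hypothetical and exist only to illustrate the comparison.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Severity names with their numeric LogSeverity values."""
    DEBUG = 100
    INFO = 200
    NOTICE = 300
    WARNING = 400
    ERROR = 500
    CRITICAL = 600
    ALERT = 700
    EMERGENCY = 800


def at_least(entries, minimum):
    """Return entries whose severity is >= minimum (hypothetical helper)."""
    return [e for e in entries if Severity[e["severity"]] >= minimum]


# Made-up sample entries for illustration only.
entries = [
    {"severity": "INFO", "message": "Order created"},
    {"severity": "WARNING", "message": "Slow upstream response"},
    {"severity": "ERROR", "message": "Payment failed"},
]
print([e["message"] for e in at_least(entries, Severity.WARNING)])
```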

Log Filtering Queries

Common Filters

By severity

severity >= WARNING

By resource

resource.type="cloud_run_revision" AND resource.labels.service_name="my-service"

By time

timestamp >= "2025-01-15T00:00:00Z"

By text content

textPayload =~ "error.*timeout"

By JSON field

jsonPayload.user_id = "123"

Combined

severity >= ERROR AND resource.labels.service_name="api"
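
When filters like these are assembled in scripts, joining optional clauses with `AND` keeps them readable. The following is a sketch only: `build_filter` is a hypothetical helper, and it assumes the values passed in contain no double quotes that would need escaping.

```python
def build_filter(min_severity=None, resource_type=None, service_name=None, since=None):
    """Join the clauses that were provided with AND (hypothetical helper).

    Assumes values contain no double quotes; no escaping is performed.
    """
    clauses = []
    if min_severity:
        clauses.append(f"severity >= {min_severity}")
    if resource_type:
        clauses.append(f'resource.type="{resource_type}"')
    if service_name:
        clauses.append(f'resource.labels.service_name="{service_name}"')
    if since:
        clauses.append(f'timestamp >= "{since}"')
    return " AND ".join(clauses)


print(build_filter(min_severity="ERROR", service_name="api"))
```

The resulting string can be pasted into the Logs Explorer or passed as the filter argument to `gcloud logging read`.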

Advanced Queries

Regex matching

textPayload =~ "status=[45][0-9]{2}"

Substring search

textPayload : "connection refused"

Multiple values

severity = (ERROR OR CRITICAL)
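
Before putting a regular expression into a filter, it can help to try it against a few sample lines locally. The sketch below does that for the status-code pattern above. It assumes the pattern uses only syntax that Python's `re` module and the RE2 syntax used by Cloud Logging have in common, and the sample lines are made up.

```python
import re

# Same pattern as the regex filter above. It is applied with search(),
# so it may match anywhere in the line rather than only at the start.
STATUS_4XX_5XX = re.compile(r"status=[45][0-9]{2}")

# Made-up sample log lines for illustration only.
samples = [
    "GET /api/orders status=200 duration=12ms",
    "POST /api/payment status=502 duration=30001ms",
    "GET /api/users status=404 duration=3ms",
]

matches = [line for line in samples if STATUS_4XX_5XX.search(line)]
print(matches)
```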

Metrics vs Logs vs Traces

When to Use Each

Metrics: Aggregated numeric data over time

Request counts, latency percentiles
Resource utilization (CPU, memory)
Business KPIs (orders/minute)

Logs: Detailed event records

Error details and stack traces
Audit trails
Debugging specific requests

Traces: Request flow across services

Latency breakdown by service
Identifying bottlenecks
Distributed system debugging

Alert Policy Design

Alert Best Practices

Avoid alert fatigue: Only alert on actionable issues
Use multi-condition alerts: Reduce noise from transient spikes
Set appropriate windows: 5-15 min for most metrics
Include runbook links: Help responders act quickly

Common Alert Patterns

Error rate:

Condition: Error rate > 1% for 5 minutes
Good for: Service health monitoring

Latency:

Condition: P99 latency > 2s for 10 minutes
Good for: Performance degradation detection

Resource exhaustion:

Condition: Memory > 90% for 5 minutes
Good for: Capacity planning triggers
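
Each condition above has the form "value above a threshold for N minutes", which is what filters out transient spikes. The sketch below illustrates that semantics on per-minute samples. It is not how Cloud Monitoring evaluates policies internally; `breaches_for` and the sample data are hypothetical.

```python
def breaches_for(samples, threshold, window):
    """Return True if the last `window` samples all exceed `threshold`.

    `samples` holds one value per minute, oldest first (hypothetical helper).
    """
    if window <= 0 or len(samples) < window:
        return False
    return all(value > threshold for value in samples[-window:])


# Per-minute error rates as fractions (0.02 means 2%). Made-up data.
sustained = [0.002, 0.015, 0.02, 0.03, 0.018, 0.012]
spike = [0.001, 0.001, 0.05, 0.001, 0.001]

# "Error rate > 1% for 5 minutes"
print(breaches_for(sustained, threshold=0.01, window=5))
print(breaches_for(spike, threshold=0.01, window=5))
```

With these inputs the sustained series satisfies the condition and the single-minute spike does not, which is the behavior the longer window is meant to buy.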

Cost Optimization

Reducing Log Costs

Exclusion filters: Drop verbose logs at ingestion
Sampling: Log only a percentage of high-volume events
Shorter retention: Reduce the default 30-day retention where logs are not needed that long
Downgrade logs: Route low-value logs to cheaper storage, such as a Cloud Storage bucket
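
Sampling can also be done in the application before a log line is ever written. One possible approach, sketched below, hashes a stable key such as a request ID so that a sampled request keeps all of its log lines. `should_log` is a hypothetical helper, and the choice of SHA-256 is an assumption made for illustration.

```python
import hashlib


def should_log(key, sample_rate):
    """Deterministically keep roughly `sample_rate` of keys (0.0 to 1.0).

    The same key always gets the same answer, so every log line for a
    sampled request is kept rather than a random subset of its lines.
    """
    if sample_rate >= 1.0:
        return True
    if sample_rate <= 0.0:
        return False
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate


# Hypothetical request IDs; about 10% of them should be kept.
kept = sum(should_log(f"request-{i}", 0.1) for i in range(10_000))
print(f"kept {kept} of 10000")
```

Errors and warnings are usually better left unsampled; this kind of sampling fits high-volume INFO or DEBUG events.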

Exclusion Filter Examples

Exclude health checks

resource.type="cloud_run_revision" AND httpRequest.requestUrl:"/health"

Exclude debug logs in production

severity = DEBUG

Debugging Workflow

Start with metrics: Identify when issues started
Correlate with logs: Filter logs around problem time
Use traces: Follow specific requests across services
Check resource logs: Look for infrastructure issues
Compare baselines: Check against known-good periods
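
The "correlate with logs" step usually means turning the time at which a metric went wrong into a bounded timestamp filter. A minimal sketch, assuming the incident time is a timezone-aware UTC datetime; `window_filter` and its default window sizes are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def window_filter(incident_time, before_minutes=15, after_minutes=15, min_severity="WARNING"):
    """Build a filter covering a window around `incident_time`.

    `incident_time` is assumed to be a timezone-aware UTC datetime.
    """
    start = incident_time - timedelta(minutes=before_minutes)
    end = incident_time + timedelta(minutes=after_minutes)
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return (
        f"severity >= {min_severity}"
        f' AND timestamp >= "{start.strftime(fmt)}"'
        f' AND timestamp <= "{end.strftime(fmt)}"'
    )


incident = datetime(2025, 1, 15, 10, 30, tzinfo=timezone.utc)
print(window_filter(incident))
```

Shifting the same window to a known-good period, for example the same time on the previous day, gives a baseline to compare against.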
