gcloud-usage


GCP Observability Best Practices

Structured Logging

JSON Log Format

Use structured JSON logging so that individual fields can be filtered and queried:
{
  "severity": "ERROR",
  "message": "Payment failed",
  "httpRequest": { "requestMethod": "POST", "requestUrl": "/api/payment" },
  "labels": { "user_id": "123", "transaction_id": "abc" },
  "timestamp": "2025-01-15T10:30:00Z"
}
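
As a rough illustration, an entry like the one above could be produced from application code as follows. This is a minimal sketch, assuming the runtime (for example Cloud Run) captures stdout and parses each line as JSON. The helper names `format_log_entry` and `log` are hypothetical and not part of any Google library.

```python
import json
import sys
from datetime import datetime, timezone


def format_log_entry(severity, message, **fields):
    """Return one JSON log line with severity, message, and a UTC timestamp."""
    entry = {
        "severity": severity,
        "message": message,
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    # Extra structured fields, e.g. labels or httpRequest, are merged as-is.
    entry.update(fields)
    return json.dumps(entry)


def log(severity, message, **fields):
    """Write the entry to stdout, one JSON object per line."""
    print(format_log_entry(severity, message, **fields), file=sys.stdout, flush=True)


log(
    "ERROR",
    "Payment failed",
    httpRequest={"requestMethod": "POST", "requestUrl": "/api/payment"},
    labels={"user_id": "123", "transaction_id": "abc"},
)
```

Writing one JSON object per line keeps every entry independently parseable, which is what makes field-level filtering possible later.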

Severity Levels

Assign an appropriate severity so that logs can be filtered by level:
  • DEBUG: Detailed diagnostic info
  • INFO: Normal operations, milestones
  • NOTICE: Normal but significant events
  • WARNING: Potential issues, degraded performance
  • ERROR: Failures that don't stop the service
  • CRITICAL: Failures requiring immediate action
  • ALERT: A person must take action immediately
  • EMERGENCY: System is unusable
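
The levels are ordered, which is what makes threshold comparisons on severity meaningful. The sketch below emulates such a comparison locally. The numeric ranks follow the Cloud Logging `LogSeverity` enum as I understand it, and `at_least` is a hypothetical helper used only for illustration.

```python
# Numeric ranks; a higher number means a more severe entry.
SEVERITY_RANK = {
    "DEFAULT": 0,
    "DEBUG": 100,
    "INFO": 200,
    "NOTICE": 300,
    "WARNING": 400,
    "ERROR": 500,
    "CRITICAL": 600,
    "ALERT": 700,
    "EMERGENCY": 800,
}


def at_least(entries, threshold):
    """Return the entries whose severity is at or above the threshold level."""
    minimum = SEVERITY_RANK[threshold]
    return [
        entry
        for entry in entries
        # Entries without a severity are treated as DEFAULT (an assumption).
        if SEVERITY_RANK[entry.get("severity", "DEFAULT")] >= minimum
    ]


entries = [
    {"severity": "INFO", "message": "Order created"},
    {"severity": "WARNING", "message": "Slow upstream response"},
    {"severity": "ERROR", "message": "Payment failed"},
]
print([e["message"] for e in at_least(entries, "WARNING")])
```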

Log Filtering Queries

Common Filters


By severity

severity >= WARNING

By resource

resource.type="cloud_run_revision" AND resource.labels.service_name="my-service"

By time

timestamp >= "2025-01-15T00:00:00Z"

By text content

textPayload =~ "error.*timeout"

By JSON field

jsonPayload.user_id = "123"

Combined

severity >= ERROR AND resource.labels.service_name="api"

Advanced Queries


Regex matching

textPayload =~ "status=[45][0-9]{2}"

Substring search

textPayload : "connection refused"

Multiple values

severity = (ERROR OR CRITICAL)
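
Filters like the ones in this section are plain strings, so they can be assembled programmatically when the same query is reused with different values. The following is a minimal sketch. `quote` and `build_filter` are hypothetical helpers, and the backslash escaping of quotes is an assumption about how the query language treats string literals.

```python
def quote(value):
    """Wrap a value in double quotes, escaping backslashes and quotes."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def build_filter(min_severity=None, service_name=None, since=None, text=None):
    """Join the clauses that were supplied with AND; omitted ones are skipped."""
    clauses = []
    if min_severity:
        clauses.append(f"severity >= {min_severity}")
    if service_name:
        clauses.append(f"resource.labels.service_name={quote(service_name)}")
    if since:
        clauses.append(f"timestamp >= {quote(since)}")
    if text:
        # ':' is the substring ("has") operator used earlier in this section.
        clauses.append(f"textPayload : {quote(text)}")
    return " AND ".join(clauses)


print(build_filter(min_severity="ERROR", service_name="api"))
# prints: severity >= ERROR AND resource.labels.service_name="api"
```

The resulting string can be pasted into the Logs Explorer or passed as the filter argument to `gcloud logging read`.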

Metrics vs Logs vs Traces

When to Use Each

Metrics: Aggregated numeric data over time
  • Request counts, latency percentiles
  • Resource utilization (CPU, memory)
  • Business KPIs (orders/minute)
Logs: Detailed event records
  • Error details and stack traces
  • Audit trails
  • Debugging specific requests
Traces: Request flow across services
  • Latency breakdown by service
  • Identifying bottlenecks
  • Distributed system debugging

Alert Policy Design

Alert Best Practices

  • Avoid alert fatigue: Only alert on actionable issues
  • Use multi-condition alerts: Reduce noise from transient spikes
  • Set appropriate windows: 5-15 min for most metrics
  • Include runbook links: Help responders act quickly

Common Alert Patterns

Error rate:
  • Condition: Error rate > 1% for 5 minutes
  • Good for: Service health monitoring
Latency:
  • Condition: P99 latency > 2s for 10 minutes
  • Good for: Performance degradation detection
Resource exhaustion:
  • Condition: Memory > 90% for 5 minutes
  • Good for: Capacity planning triggers
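
Each condition above has the same shape: a value must stay above a threshold for a whole duration before the alert fires, which is what filters out transient spikes. The sketch below shows that logic over per-minute samples. It is an illustration of the idea only, not how Cloud Monitoring evaluates policies, and `breaches_for` is a hypothetical name.

```python
def breaches_for(samples, threshold, duration):
    """Return True if the most recent `duration` samples all exceed threshold.

    `samples` is a list of per-minute values, oldest first.
    """
    if len(samples) < duration:
        return False  # not enough history to decide yet
    return all(value > threshold for value in samples[-duration:])


# Error rate above 1% for 5 consecutive minutes: alert fires.
sustained = [0.002, 0.015, 0.020, 0.030, 0.012, 0.011]
print(breaches_for(sustained, threshold=0.01, duration=5))  # prints: True

# A single dip below the threshold inside the window: no alert.
spiky = [0.020, 0.020, 0.005, 0.020, 0.020]
print(breaches_for(spiky, threshold=0.01, duration=5))  # prints: False
```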

Cost Optimization

Reducing Log Costs

  • Exclusion filters: Drop verbose logs at ingestion
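  • Sampling: Log only a percentage of high-volume events
  • Shorter retention: Reduce retention where the default 30 days (or a longer custom period) isn't needed
  • Downgrade logs: Route rarely queried logs to cheaper storage, for example Cloud Storage buckets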
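
Sampling can also be applied in the application before a log line is ever written. The following is a minimal sketch of deterministic, hash-based sampling. `keep_sample` is a hypothetical helper, and the choice of key (a request or trace ID, for instance) is an assumption. Hashing rather than drawing a random number means every log line that shares a key gets the same keep-or-drop decision.

```python
import hashlib


def keep_sample(key, rate):
    """Deterministically keep roughly `rate` (0.0 to 1.0) of all keys."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to an integer in [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < rate * 2**64


# Keep about 10% of requests; the same ID always gives the same answer.
request_ids = [f"req-{n}" for n in range(10_000)]
kept = [rid for rid in request_ids if keep_sample(rid, 0.10)]
print(f"kept {len(kept)} of {len(request_ids)}")
```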

Exclusion Filter Examples


Exclude health checks

resource.type="cloud_run_revision" AND httpRequest.requestUrl:"/health"

Exclude debug logs in production

severity = DEBUG

Debugging Workflow

  1. Start with metrics: Identify when issues started
  2. Correlate with logs: Filter logs around problem time
  3. Use traces: Follow specific requests across services
  4. Check resource logs: Look for infrastructure issues
  5. Compare baselines: Check against known-good periods
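
Step 2 usually comes down to a timestamp range around the moment the metrics first looked wrong. The sketch below builds such a filter. `window_filter` and its default 15-minute margins are hypothetical choices, and the input is assumed to be a timezone-aware `datetime`.

```python
from datetime import datetime, timedelta, timezone


def window_filter(incident_time, minutes_before=15, minutes_after=15):
    """Return a timestamp filter covering a window around incident_time."""
    incident_utc = incident_time.astimezone(timezone.utc)
    start = incident_utc - timedelta(minutes=minutes_before)
    end = incident_utc + timedelta(minutes=minutes_after)
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return (
        f'timestamp >= "{start.strftime(fmt)}" '
        f'AND timestamp <= "{end.strftime(fmt)}"'
    )


incident = datetime(2025, 1, 15, 10, 30, tzinfo=timezone.utc)
print(window_filter(incident))
# prints: timestamp >= "2025-01-15T10:15:00Z" AND timestamp <= "2025-01-15T10:45:00Z"
```

The same function can be reused for step 5 by passing a timestamp from a known-good period and comparing the two result sets.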