gcloud-usage
GCP Observability Best Practices
Structured Logging
JSON Log Format
Use structured JSON logging for better queryability:
{
  "severity": "ERROR",
  "message": "Payment failed",
  "httpRequest": { "requestMethod": "POST", "requestUrl": "/api/payment" },
  "labels": { "user_id": "123", "transaction_id": "abc" },
  "timestamp": "2025-01-15T10:30:00Z"
}
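One way to emit entries in this shape from application code is to write one JSON object per line to stdout, which environments such as Cloud Run can ingest as structured logs. The following is a minimal Python sketch; the helper name `log_entry` and its parameters are illustrative assumptions, not part of any Google library.

```python
import json
import sys
from datetime import datetime, timezone


def log_entry(severity, message, **fields):
    """Build one structured log entry as a single-line JSON string.

    `log_entry` is a hypothetical helper; extra keyword arguments are
    copied into the entry as additional top-level fields.
    """
    entry = {
        "severity": severity,
        "message": message,
        # RFC 3339 timestamp in UTC, e.g. 2025-01-15T10:30:00Z
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    entry.update(fields)
    return json.dumps(entry)


if __name__ == "__main__":
    line = log_entry(
        "ERROR",
        "Payment failed",
        httpRequest={"requestMethod": "POST", "requestUrl": "/api/payment"},
        labels={"user_id": "123", "transaction_id": "abc"},
    )
    # One JSON object per line on stdout
    print(line, file=sys.stdout)
```

Keeping each entry on a single line matters here: a pretty-printed object spread over several lines would likely be ingested as several unrelated text entries.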
Severity Levels
Use appropriate severity for filtering:
- DEBUG: Detailed diagnostic info
- INFO: Normal operations, milestones
- NOTICE: Normal but significant events
- WARNING: Potential issues, degraded performance
- ERROR: Failures that don't stop the service
- CRITICAL: Failures requiring immediate action
- ALERT: A person must take action immediately
- EMERGENCY: System is unusable
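Because severities are ordered, a filter such as `severity >= WARNING` also matches ERROR and everything above it. The sketch below mirrors that comparison locally in Python, using the numeric values that the Cloud Logging LogSeverity enum assigns to each level; the function name `at_least` and the sample entries are illustrative assumptions.

```python
# Numeric values of the Cloud Logging LogSeverity enum for the levels above
SEVERITY = {
    "DEBUG": 100,
    "INFO": 200,
    "NOTICE": 300,
    "WARNING": 400,
    "ERROR": 500,
    "CRITICAL": 600,
    "ALERT": 700,
    "EMERGENCY": 800,
}


def at_least(severity, threshold):
    """Return True if `severity` would match `severity >= threshold`."""
    return SEVERITY[severity] >= SEVERITY[threshold]


# Made-up entries as (severity, message) pairs
entries = [
    ("INFO", "started"),
    ("WARNING", "slow response"),
    ("ERROR", "payment failed"),
]
noisy = [msg for sev, msg in entries if at_least(sev, "WARNING")]
print(noisy)  # ['slow response', 'payment failed']
```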
Log Filtering Queries
Common Filters
By severity:
    severity >= WARNING
By resource:
    resource.type="cloud_run_revision"
    resource.labels.service_name="my-service"
By time:
    timestamp >= "2025-01-15T00:00:00Z"
By text content:
    textPayload =~ "error.*timeout"
By JSON field:
    jsonPayload.user_id = "123"
Combined:
    severity >= ERROR AND resource.labels.service_name="api"
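When filters like these are assembled in code, joining individual clauses with AND keeps each condition readable and easy to toggle. The following is a minimal Python sketch; `build_filter` is a hypothetical helper that only produces the filter string and does not call any Google API.

```python
def build_filter(*clauses):
    """Join non-empty filter clauses with AND, parenthesizing each one."""
    parts = [f"({c.strip()})" for c in clauses if c and c.strip()]
    return " AND ".join(parts)


log_filter = build_filter(
    "severity >= ERROR",
    'resource.type="cloud_run_revision"',
    'resource.labels.service_name="api"',
)
print(log_filter)
# (severity >= ERROR) AND (resource.type="cloud_run_revision") AND (resource.labels.service_name="api")
```

The resulting string can be pasted into the Logs Explorer query box or passed as the filter argument to `gcloud logging read`.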
Advanced Queries
Regex matching:
    textPayload =~ "status=[45][0-9]{2}"
Substring search:
    textPayload : "connection refused"
Multiple values:
    severity = (ERROR OR CRITICAL)
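Regex filters are easy to get subtly wrong, so it can help to try a pattern against sample lines before using it in a query. The Python sketch below does that for the status-code pattern above. The sample lines are invented for illustration, and Python's `re` module is used as a stand-in for the RE2 engine behind Cloud Logging, which should behave the same for a simple pattern like this one.

```python
import re

pattern = re.compile(r"status=[45][0-9]{2}")

# Invented sample payloads
samples = [
    "GET /api/payment status=200",
    "POST /api/payment status=404",
    "POST /api/payment status=503",
]

# An unanchored pattern matches anywhere in the text, so use re.search
matches = [line for line in samples if pattern.search(line)]
print(matches)
# ['POST /api/payment status=404', 'POST /api/payment status=503']
```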
Metrics vs Logs vs Traces
When to Use Each
Metrics: Aggregated numeric data over time
- Request counts, latency percentiles
- Resource utilization (CPU, memory)
- Business KPIs (orders/minute)
Logs: Detailed event records
- Error details and stack traces
- Audit trails
- Debugging specific requests
Traces: Request flow across services
- Latency breakdown by service
- Identifying bottlenecks
- Distributed system debugging
Alert Policy Design
Alert Best Practices
- Avoid alert fatigue: Only alert on actionable issues
- Use multi-condition alerts: Reduce noise from transient spikes
- Set appropriate windows: 5-15 min for most metrics
- Include runbook links: Help responders act quickly
Common Alert Patterns
Error rate:
- Condition: Error rate > 1% for 5 minutes
- Good for: Service health monitoring
Latency:
- Condition: P99 latency > 2s for 10 minutes
- Good for: Performance degradation detection
Resource exhaustion:
- Condition: Memory > 90% for 5 minutes
- Good for: Capacity planning triggers
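The "for N minutes" part of these conditions means the threshold has to be breached across the whole window, which is what filters out transient spikes. The following is a minimal Python sketch of that logic, assuming one sample per minute. `breached_for` and the sample data are illustrative and do not reflect how Cloud Monitoring evaluates policies internally.

```python
def breached_for(samples, threshold, minutes):
    """Return True if the last `minutes` samples all exceed `threshold`.

    Assumes `samples` holds one value per minute, oldest first.
    """
    if len(samples) < minutes:
        return False
    return all(value > threshold for value in samples[-minutes:])


# Error rate per minute as a fraction of requests (made-up data)
spike = [0.002, 0.05, 0.003, 0.002, 0.004, 0.003]
sustained = [0.002, 0.02, 0.03, 0.02, 0.015, 0.04]

print(breached_for(spike, 0.01, 5))      # False: only one minute is above 1%
print(breached_for(sustained, 0.01, 5))  # True: all of the last 5 minutes are above 1%
```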
Cost Optimization
Reducing Log Costs
- Exclusion filters: Drop verbose logs at ingestion
- Sampling: Log only a percentage of high-volume events
- Shorter retention: Reduce default 30-day retention
- Downgrade logs: Route to cheaper storage buckets
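Sampling can be done in the application before a log line is ever written. The sketch below keeps a fixed fraction of events by hashing a stable key, so that all lines sharing a key (for example, a request ID) are kept or dropped together. `should_log` and the choice of key are illustrative assumptions.

```python
import hashlib


def should_log(key, sample_rate):
    """Deterministically keep roughly `sample_rate` (0.0 to 1.0) of keys."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Map the first 4 bytes of the hash to a float in [0, 1)
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < sample_rate


# Hypothetical request IDs; roughly 10% of them should be kept
request_ids = [f"req-{i}" for i in range(10_000)]
kept = sum(should_log(rid, 0.10) for rid in request_ids)
print(f"kept {kept} of {len(request_ids)}")
```

Hash-based sampling is one option among several. Random sampling is simpler, but it can keep some lines of a request and drop others, which makes individual requests harder to follow.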
Exclusion Filter Examples
Exclude health checks:
    resource.type="cloud_run_revision" AND httpRequest.requestUrl="/health"
Exclude debug logs in production:
    severity = DEBUG
Debugging Workflow
- Start with metrics: Identify when issues started
- Correlate with logs: Filter logs around problem time
- Use traces: Follow specific requests across services
- Check resource logs: Look for infrastructure issues
- Compare baselines: Check against known-good periods
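For the "correlate with logs" step, a time-bounded filter around the moment the metrics went wrong keeps the result set small. The following Python sketch builds such a filter; `window_filter`, the 15-minute margin, and the incident time are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone


def window_filter(incident_time, margin_minutes=15, min_severity="WARNING"):
    """Build a log filter covering `margin_minutes` before and after an incident.

    Assumes `incident_time` is a timezone-aware datetime in UTC.
    """
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    start = (incident_time - timedelta(minutes=margin_minutes)).strftime(fmt)
    end = (incident_time + timedelta(minutes=margin_minutes)).strftime(fmt)
    return (
        f'timestamp >= "{start}" AND timestamp <= "{end}" '
        f"AND severity >= {min_severity}"
    )


incident = datetime(2025, 1, 15, 10, 30, tzinfo=timezone.utc)
print(window_filter(incident))
# timestamp >= "2025-01-15T10:15:00Z" AND timestamp <= "2025-01-15T10:45:00Z" AND severity >= WARNING
```

Calling the same function with a timestamp from a known-good period gives a matching filter for the baseline comparison.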