qa-observability
QA Observability and Performance Engineering
Use telemetry (logs, metrics, traces, profiles) as a QA signal and a debugging substrate.
Core references (see data/sources.json): OpenTelemetry, W3C Trace Context, and SLO practices (Google SRE).
Quick Start (Default)
If key context is missing, ask for: critical user journeys, service/dependency inventory, environments (local/staging/prod), current telemetry stack, and current SLO/SLA commitments (if any).
- Establish the minimum bar: correlation IDs + structured logs + traces + golden metrics (latency, traffic, errors, saturation).
- Verify propagation: confirm traceparent (and your request ID) flow across boundaries end-to-end.
- Make failures diagnosable: every test failure captures a trace link (or trace ID) plus the correlated logs.
- Define SLIs/SLOs and error budget policy; wire burn-rate alerts (prefer multi-window burn rates).
- Produce artifacts: a readiness checklist plus an SLO definition and alert rules (use assets/checklists/template-observability-readiness-checklist.md and assets/monitoring/slo/*).
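As one way to make the propagation check concrete, the sketch below validates a traceparent value against the W3C Trace Context version 00 layout (version-traceid-parentid-flags) and compares the trace ID seen on the way in with the one seen on the way out. The function names `parse_traceparent` and `same_trace` are hypothetical helpers, not part of any library, and the sketch deliberately ignores future header versions and tracestate.

```python
import re

# W3C Trace Context (version 00): version-traceid-parentid-flags, lowercase hex.
TRACEPARENT_RE = re.compile(
    r"(?P<version>[0-9a-f]{2})-(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<parent_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})"
)


def parse_traceparent(header):
    """Return the header fields as a dict, or None if the value is invalid."""
    match = TRACEPARENT_RE.fullmatch(header.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # The spec forbids version "ff" and all-zero trace or parent IDs.
    if fields["version"] == "ff":
        return None
    if set(fields["trace_id"]) == {"0"} or set(fields["parent_id"]) == {"0"}:
        return None
    # Bit 0 of the flags byte is the "sampled" flag.
    fields["sampled"] = bool(int(fields["flags"], 16) & 0x01)
    return fields


def same_trace(inbound_header, outbound_header):
    """Hypothetical check: both headers are valid and share one trace ID."""
    inbound = parse_traceparent(inbound_header)
    outbound = parse_traceparent(outbound_header)
    return bool(inbound and outbound and inbound["trace_id"] == outbound["trace_id"])
```

In an integration test, a helper like this could be applied to the header sent by the test client and the header observed at a downstream stub; a changed trace ID suggests that a hop dropped or regenerated the context.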
Default QA stance
- Treat telemetry as part of acceptance criteria (especially for integration/E2E tests).
- Require correlation: request_id + trace_id (traceparent) across boundaries.
- Prefer SLO-based release gating and burn-rate alerting over raw infra thresholds.
- Budget overhead: sampling, cardinality, retention, and cost are quality constraints.
- Redact PII/secrets by default (logs and attributes).
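The correlation and redaction points above can be sketched together as a small log-line builder. This is only an illustration under stated assumptions: the `SENSITIVE_KEYS` deny-list, the `[REDACTED]` marker, and the `log_line` field layout are hypothetical choices, and a key-based deny-list will not catch secrets embedded in free-text messages.

```python
import json

# Hypothetical deny-list; a real project would maintain and review its own.
SENSITIVE_KEYS = {"password", "token", "authorization", "email", "ssn"}


def redact(value, key=""):
    """Recursively replace values stored under sensitive keys."""
    if str(key).lower() in SENSITIVE_KEYS:
        return "[REDACTED]"
    if isinstance(value, dict):
        return {k: redact(v, k) for k, v in value.items()}
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value


def log_line(message, request_id, trace_id, **attributes):
    """Build one JSON log line that carries both correlation IDs."""
    record = {
        "message": message,
        "request_id": request_id,
        "trace_id": trace_id,
        "attributes": redact(attributes),
    }
    return json.dumps(record, sort_keys=True)
```

Redacting before serialization, rather than filtering the output string, keeps the rule independent of the log format and makes it straightforward to assert on in tests.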
Core workflows
- Establish the minimum bar (logs + metrics + traces + correlation).
- Instrument with OpenTelemetry (auto-instrument first, then add manual spans for key paths).
- Verify context propagation across service boundaries (traceparent in/out).
- Define SLIs/SLOs and error budget policy; wire burn-rate alerts.
- Make failures diagnosable: capture a trace link + key logs on every test failure.
- Profile and load test only after telemetry is reliable; validate against baselines.
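For the burn-rate step in the workflow above, the arithmetic is simple enough to sketch. Burn rate is the observed error ratio divided by the error budget (1 minus the SLO target); a multi-window rule fires only when both a long and a short window exceed the threshold. The default threshold of 14.4 corresponds to spending 2% of a 30-day budget in one hour, as commonly paired with 1-hour and 5-minute windows in Google's SRE guidance; the function names and the choice of inputs are assumptions for illustration.

```python
def burn_rate(error_ratio, slo_target):
    """Burn rate = observed error ratio divided by the error budget (1 - SLO)."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_page(long_window_errors, short_window_errors, slo_target, threshold=14.4):
    """Multi-window rule: both windows must burn faster than the threshold."""
    return (
        burn_rate(long_window_errors, slo_target) > threshold
        and burn_rate(short_window_errors, slo_target) > threshold
    )
```

Requiring the short window as well means the alert stops firing soon after the error ratio recovers, which is the main reason to prefer multi-window rules over a single long window.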
Quick reference
| Task | Recommended default | Notes |
|---|---|---|
| Tracing | OpenTelemetry + Jaeger/Tempo | Prefer OTLP exporters via Collector when possible |
| Metrics | Prometheus + Grafana | Use histograms for latency; watch cardinality |
| Logging | Structured JSON + correlation IDs | Never log secrets/PII; redact aggressively |
| Reliability gates | SLOs + error budgets + burn-rate alerts | Gate releases on sustained burn/regressions |
| Performance | Profiling + load tests + budgets | Add continuous profiling for intermittent issues |
| Zero-code visibility | eBPF (OpenTelemetry zero-code) + continuous profiling (Parca/Pyroscope) | Use when code changes are not feasible |
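To illustrate why the table recommends histograms for latency, the sketch below estimates a quantile from cumulative bucket counts using linear interpolation inside the matching bucket. It is similar in spirit to what Prometheus does for `histogram_quantile`, but it is a simplified stand-alone approximation, not that implementation; the function name and the `(upper_bound, cumulative_count)` input shape are assumptions.

```python
def estimate_quantile(q, buckets):
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    Assumes the last bucket has upper bound +Inf and holds the total count,
    and that the lowest bucket starts at 0.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            # Cannot interpolate into +Inf or an empty bucket: fall back
            # to the highest finite bound seen so far.
            if upper_bound == float("inf") or count == lower_count:
                return lower_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
```

The estimate is only as precise as the bucket boundaries, so bucket layout around the SLO threshold matters more than the total number of buckets; each extra bucket also adds a time series, which is where the cardinality note in the table applies.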
Navigation
Open these guides when needed:
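| If the user needs... | Read | Also use |
|---|---|---|
| A minimal, production-ready baseline | references/core-observability-patterns.md | |
| Node/Python instrumentation setup | references/opentelemetry-best-practices.md | |
| Working trace propagation across services | references/distributed-tracing-patterns.md | |
| SLOs, burn-rate alerts, and release gates | references/slo-design-guide.md | |
| Profiling/load testing with evidence | references/performance-profiling-guide.md | |
| A maturity model and roadmap | references/observability-maturity-model.md | |
| What to avoid and how to fix it | references/anti-patterns-best-practices.md | |
| Alert design and fatigue reduction | references/alerting-strategies.md | |
| Dashboard hierarchy and layout | references/dashboard-design-patterns.md | |
| Structured logging and cost control | references/log-aggregation-patterns.md | |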
Implementation guides (deep dives):
- references/core-observability-patterns.md
- references/opentelemetry-best-practices.md
- references/distributed-tracing-patterns.md
- references/slo-design-guide.md
- references/performance-profiling-guide.md
- references/observability-maturity-model.md
- references/anti-patterns-best-practices.md
- references/alerting-strategies.md
- references/dashboard-design-patterns.md
- references/log-aggregation-patterns.md
Templates (copy/paste):
- assets/checklists/template-observability-readiness-checklist.md
- assets/opentelemetry/nodejs/opentelemetry-nodejs-setup.md
- assets/opentelemetry/python/opentelemetry-python-setup.md
- assets/monitoring/slo/slo-definition.yaml
- assets/monitoring/slo/prometheus-alert-rules.yaml
- assets/monitoring/grafana/grafana-dashboard-slo.json
- assets/monitoring/grafana/template-grafana-dashboard-observability.json
- assets/load-testing/load-testing-k6.js
- assets/load-testing/template-load-test-artillery.yaml
- assets/performance/frontend/template-lighthouse-ci.json
- assets/performance/backend/template-nodejs-profiling-config.js
Curated sources:
data/sources.json
Scope boundaries (handoffs)
- Pure infrastructure monitoring (Kubernetes, Docker, CI/CD): ../ops-devops-platform/SKILL.md
- Database query optimization (SQL tuning, indexing): ../data-sql-optimization/SKILL.md
- Application-level debugging (stack traces, breakpoints): ../qa-debugging/SKILL.md
- Test strategy design (coverage, test pyramids): ../qa-testing-strategy/SKILL.md
- Resilience patterns (retries, circuit breakers): ../qa-resilience/SKILL.md
- Architecture decisions (microservices, event-driven): ../software-architecture-design/SKILL.md
Tool selection notes (2026)
- Default to OpenTelemetry + OTLP + Collector where possible.
- Prefer burn-rate alerting against SLOs over alerting on raw infra metrics.
- Treat sampling, cardinality, and retention as part of quality (not an afterthought).
- When asked to pick vendors/tools, start from data/sources.json and validate time-sensitive claims with current docs/releases if the environment allows it.
Fact-Checking
- Use web search/web fetch to verify current external facts, versions, pricing, deadlines, regulations, or platform behavior before final answers.
- Prefer primary sources; report source links and dates for volatile information.
- If web access is unavailable, state the limitation and mark guidance as unverified.