Expert site reliability engineer specializing in SLOs, error budgets, observability, chaos engineering, and toil reduction for production systems at scale.
npx skill4agent add sharadchaturveda-coder/agency-agents-codex agency-sre-site-reliability-engineer

```yaml
# SLO Definition
service: payment-api
slos:
  - name: Availability
    description: Successful responses to valid requests
    sli: count(status < 500) / count(total)
    target: 99.95%
    window: 30d
    burn_rate_alerts:
      - severity: critical
        short_window: 5m
        long_window: 1h
        factor: 14.4
      - severity: warning
        short_window: 30m
        long_window: 6h
        factor: 6
  - name: Latency
    description: Request duration at p99
    sli: count(duration < 300ms) / count(total)
    target: 99%
    window: 30d
```

| Pillar | Purpose | Key Questions |
|---|---|---|
| Metrics | Trends, alerting, SLO tracking | Is the system healthy? Is the error budget burning? |
| Logs | Event details, debugging | What happened at 14:32:07? |
| Traces | Request flow across services | Where is the latency? Which service failed? |
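
The burn-rate factors in the SLO definition can be checked with a little arithmetic. The following is a minimal sketch, not part of the original skill: the function names (`error_budget`, `burn_rate`, `budget_consumed`, `should_alert`) are hypothetical, and it assumes the common multi-window convention in which an alert fires only when both the short and the long window exceed the factor.

```python
# Illustrative helpers; all names here are assumptions, not an existing API.

WINDOW_HOURS = 30 * 24  # the 30d SLO window from the definition


def error_budget(target: float) -> float:
    """Fraction of requests allowed to fail, e.g. 0.9995 -> about 0.0005."""
    return 1.0 - target


def burn_rate(error_ratio: float, target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / error_budget(target)


def budget_consumed(factor: float, long_window_hours: float) -> float:
    """Share of the 30d budget spent if `factor` is sustained for the long window."""
    return factor * long_window_hours / WINDOW_HOURS


def should_alert(short_ratio: float, long_ratio: float,
                 target: float, factor: float) -> bool:
    """Fire only when both windows burn at or above the factor.

    The short window confirms the problem is still happening;
    the long window confirms it is significant.
    """
    return (burn_rate(short_ratio, target) >= factor
            and burn_rate(long_ratio, target) >= factor)


if __name__ == "__main__":
    target = 0.9995  # Availability SLO
    print("critical tier spends", budget_consumed(14.4, 1), "of the budget in 1h")
    print("warning tier spends", budget_consumed(6, 6), "of the budget in 6h")
    # 1% errors over 5m and 0.8% over 1h: both well above 14.4x the budget rate
    print("page?", should_alert(0.01, 0.008, target, 14.4))
```

Under these assumptions, a factor of 14.4 sustained for 1h corresponds to roughly 2% of the 30-day budget, and a factor of 6 sustained for 6h to roughly 5%, which is why the critical tier pairs the higher factor with the shorter windows.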