Loading...
Loading...
Found 93 Skills
Implements error handling patterns, structured logging, retry strategies, circuit breakers, and graceful degradation. Use when designing error handling, setting up logging, implementing retries, adding error tracking, or when asked about error boundaries, log aggregation, alerting, or resilience patterns.
Validate, lint, audit, or fix PromQL queries and alerting rules; detects anti-patterns.
SRE patterns for production service reliability: SLOs, error budgets, postmortems, and incident response. Use when defining reliability targets, writing postmortems, implementing SLO alerting, or establishing on-call practices. NOT for initial service development (use scaffolding skills instead).
Set up comprehensive observability for Mistral AI integrations with metrics, traces, and alerts. Use when implementing monitoring for Mistral AI operations, setting up dashboards, or configuring alerting for Mistral AI integration health. Trigger with phrases like "mistral monitoring", "mistral metrics", "mistral observability", "monitor mistral", "mistral alerts", "mistral tracing".
Golang everyday observability — the always-on signals in production. Covers structured logging with slog, Prometheus metrics, OpenTelemetry distributed tracing, continuous profiling with pprof/Pyroscope, server-side RUM event tracking, alerting, and Grafana dashboards. Apply when instrumenting Go services for production monitoring, setting up metrics or alerting, adding OpenTelemetry tracing, correlating logs with traces, migrating legacy loggers (zap/logrus/zerolog) to slog, adding observability to new features, or implementing GDPR/CCPA-compliant tracking with Customer Data Platforms (CDP). Not for temporary deep-dive performance investigation (→ See golang-benchmark and golang-performance skills).
Use this skill whenever planning, designing, reviewing, or improving search and recommendation systems for a two-sided trust marketplace built on OpenSearch — covers user-intent framing, product-surface architecture, index design, query understanding, retrieval strategy, ranking, search-plus-recs blending, measurement, and a dashboard-and-alerting layer for ongoing decision making. Triggers on tasks involving marketplace search, homefeeds, ranking, relevance tuning, OpenSearch query DSL, analyzers, synonyms, golden sets, NDCG, A/B testing, or diagnosing an existing retrieval system. Use this skill BEFORE marketplace-personalisation when planning new work; hand off when the diagnosed bottleneck is personalisation-specific.
Comprehensive logging and observability patterns for production systems including structured logging, distributed tracing, metrics collection, log aggregation, and alerting. Triggers for this skill - log, logging, logs, trace, tracing, traces, metrics, observability, OpenTelemetry, OTEL, Jaeger, Zipkin, structured logging, log level, debug, info, warn, error, fatal, correlation ID, span, spans, ELK, Elasticsearch, Loki, Datadog, Prometheus, Grafana, distributed tracing, log aggregation, alerting, monitoring, JSON logs, telemetry.
Prometheus monitoring and alerting with PromQL. Use for metrics collection.
Deployment & Operations Expert responsible for securely, rollbackable, and observably deploying builds that pass Reviewer and QA gates to servers (PM2 3-process cluster + Nginx reverse proxy + BT Panel). Adheres to engineering baselines including zero-downtime deployment, health checks, rollback within ≤3 minutes, and post-release smoke testing. Handles deployment orchestration, configuration management, traffic management, and monitoring & alerting. Applicable when receiving task cards from the Deploy department or needing to release to production.
Grafana OnCall and Incident Response Management (IRM) — alert routing, escalation chains, on-call schedules, Jinja2 routing templates, Slack/mobile notifications, integrations (Alertmanager, Grafana Alerting, webhooks, PagerDuty), and incident lifecycle management. Use when setting up on-call rotations, configuring escalation policies, routing alerts to the right team, declaring and managing incidents, integrating with Alertmanager or Grafana Alerting, or configuring Slack-based alert workflows.
DataWorks Operations Center assistant for task and workflow operations, alert rule creation and management. Covers troubleshooting, failure recovery, baseline assurance, monitoring and alerting. Supports periodic, manual, and triggered tasks/workflows (excludes real-time/streaming tasks). Uses aliyun CLI to call dataworks-public OpenAPI (2024-05-18). Trigger keywords: query task, task instance, instance log, workflow, workflow instance, alert rule, operations center, task failure, instance status, upstream/downstream dependency, rerun, monitoring alert, custom monitoring, alert rule, task instance, workflow instance, operation log, baseline assurance, failure recovery, DataWorks operations. Do NOT trigger: data source management, compute resources, resource groups, data development, MaxCompute table management, ECS/RDS/OSS operations, workspace member management, data quality, data lineage, data preview.
This skill should be used when the user asks to "manage alerts", "create alert", "list alerts", "check alert status", "enable alert", "disable alert", "investigate firing alerts", "check which alerts are active", "find alerting rules", "set up an alert", "configure alerting", "mute an alert", "silence an alert", "see alert definitions", "check alert priority", or wants to manage Coralogix alert definitions using the cx CLI.