Search Results: sre

Found 38 Skills

DevOps & Cloud Services | aliyun/alibabacloud-aiops...

alibabacloud-emr-starrocks-manage

Manage the full lifecycle of Alibaba Cloud EMR Serverless StarRocks instances — create, scale, configure, maintain and diagnose. Use this Skill when operations engineers, SREs, or architects need to manage StarRocks instances. Typical scenarios include: "create a StarRocks", "check instance status", "scale up CU", "modify configuration", "restart instance", "diagnose issues", etc. Not applicable for: writing SQL/DDL, data import/export, query tuning, materialized view configuration, or managing non-StarRocks products (EMR clusters, Spark, Milvus, ClickHouse, Doris, RDS, ECS).

DevOps & Cloud Services | absolutelyskilled/absolut...

site-reliability

Use this skill when implementing SRE practices, defining error budgets, reducing toil, planning capacity, or improving service reliability. Triggers on SRE, error budgets, SLOs, SLAs, toil automation, incident management, postmortems, on-call rotation, capacity planning, chaos engineering, and any task requiring reliability engineering decisions.

AI & Machine Learning | daemon-blockint-tech/agen...

ai-lead-ops

Guides AI ops leadership—LLM SRE, model/prompt releases, eval/incidents, cost/capacity, vendors, and cross-functional cadence. Use for AI platform ops, LLM SLAs, incidents, rollout governance, unit economics, red-team/eval gates, and team rituals—not memory (ai-memory-developer), context code (ai-context-engineer), security programs (cybersecurity), token roadmaps (ai-token-improvement-plan-engineer), solution architecture (applied-ai-architect-commercial-enterprise), skills portfolio (ai-skill-manager), or vertical AI product eng management (engineering-manager-vertical-ai-products). Prompt/eval team management and golden-set release policy: engineering-manager-agent-prompts-evals. Safeguard inference platform: ml-infrastructure-engineer-safeguards. Safeguard model research: ml-research-engineer-safeguards.

DevOps & Cloud Services | wshobson/agents

slo-implementation

Define and implement Service Level Indicators (SLIs) and Service Level Objectives (SLOs) with error budgets and alerting. Use when establishing reliability targets, implementing SRE practices, or measuring service performance.

Backend Development | daemon-blockint-tech/agen...

high-concurrency-scalability

Design and optimize systems for high concurrency, throughput, scalability, and elastic scale—concurrency models (threads, async/await, actors), lock-free patterns, connection pooling, caching stampede mitigation, horizontal scaling, load balancing, backpressure, queueing, rate limiting, bulkheads, read replicas, sharding, pool tuning, profiling, capacity planning, SLO-driven autoscaling, multi-region and CDN edge architecture. Use when the user asks about high concurrency, scalability, throughput, horizontal scaling, connection pooling, backpressure, rate limiting, caching stampede, read replica, sharding, autoscaling, capacity planning, lock contention, async scalability, or load balancing—not service decomposition (microservices-developer), event buses only (event-driven-architecture), generic CRUD (senior-software-engineer), SRE on-call only (site-reliability-engineer), load tests without architecture (performance-engineer), or cost-only FinOps (cloud-economist).

AI & Machine Learning | jezweb/claude-skills

claude-agent-sdk

Build autonomous AI agents with Claude Agent SDK. Structured outputs guarantee JSON schema validation, with a plugin system and hooks for event-driven workflows. Prevents 14 documented errors. Use when: building coding agents, SRE systems, or security auditors, or when troubleshooting "CLI not found" errors, structured output validation, session forking errors, MCP config issues, or subagent cleanup.

DevOps & Cloud Services | microsoftdocs/agent-skill...

azure-reliability

Expert knowledge for Azure Reliability development, including best practices, decision making, architecture & design patterns, limits & quotas, and deployment. Use when designing availability-zone or zone-redundant setups, resilient Functions, AKS, MySQL HA migrations, or Queue Storage limits, and for other Azure Reliability-related development tasks. Not for Azure Resiliency (use azure-resiliency), Azure Monitor (use azure-monitor), Azure Service Health (use azure-service-health), or Azure SRE Agent (use azure-sre-agent).

Testing & QA | daemon-blockint-tech/agen...

zero-tolerance-for-failure

Guides failure-prevention culture and operational excellence for mission-critical engineering: zero-defect aspiration vs error budgets; HRO principles; defense-in-depth; fail-safe/fail-closed; verification gates and independent checks; redundancy and graceful degradation; pre-mortems and FMEA; stop-the-line; defect escape, near-miss, and repeat-incident metrics; leadership against normalization of deviance (not blame culture). Use for failure-prevention programs, HRO practices, verification gates, fail-safe design, pre-mortem/FMEA, stop-the-line, near-miss reporting, or defect-escape metrics. Not for: SRE error budgets only (site-reliability-engineer), incident command only (incident-management-engineer), backup/restore only (cyber-resilience-engineer), CI lint only (build-validator), agile coaching, HR discipline, or classified ATO without ops-excellence lens (classified-cyber-security-senior-manager).

Code Quality | swyxio/skills

antislop-codebase

Analyze and transform messy, prototype, overgrown, slop-prone, or hard-to-maintain software repositories into maintainable product-shaped codebases while preserving existing product behavior. Use when the user asks to antislop a codebase, clean up a messy repo, run a maintainability migration, write a refactor plan, modernize structure, improve TypeScript/type boundaries, harden tests, reduce large files, clean architecture, coordinate subagent-driven refactors, or produce a final migration audit/report/microsite. Do not use for broader production-readiness specialties such as security audits, observability/logging programs, compliance hardening, SRE/runbook work, or reliability engineering unless the user explicitly scopes those as part of the maintainability refactor.

AI & Machine Learning | rysweet/amplihack

goal-seeking-agent-pattern

Guides architects on when and how to use goal-seeking agents as a design pattern. This skill helps evaluate whether autonomous agents are appropriate for a given problem, how to structure their objectives, integrate with goal_agent_generator, and reference real amplihack examples like AKS SRE automation, CI diagnostics, pre-commit workflows, and fix-agent pattern matching.

DevOps & Cloud Services | dawiddutoit/custom-claude

clickhouse-operations

Complete ClickHouse operations guide for DevOps and SRE teams managing production deployments. Provides practical guidance on monitoring essential metrics (query latency, throughput, memory, disk), introspecting system tables, performance analysis, scaling strategies (vertical and horizontal), backup/disaster recovery, tuning at query/server/table levels, and troubleshooting common issues. Use when diagnosing ClickHouse problems, optimizing performance, planning capacity, setting up monitoring, implementing backups, or managing production clusters. Includes resource management strategies for disk space, connections, and background operations plus production checklists.

Tools & Utilities | trainwithshubham/skills

resume-review

Review a resume like a professional resume writer AND an ATS (Applicant Tracking System) — produce an ATS score (0-100), a level detection (fresher / mid / senior), a breakdown of what's working, what's failing, the 7 Deadly Sins check, priority fixes, concrete before→after bullet rewrites, and missing keywords. Use this skill whenever the user asks to review, audit, critique, score, rate, roast, or improve a resume or CV, or types /resume-review, or drops a resume PDF with a job description. Trigger EVEN IF the user just says "look at my CV", "check my resume", "is this resume good", "why am I not getting interviews", "roast my resume", or shares a resume file without explicit review instructions — any interaction involving a resume file or CV should invoke this skill. Especially strong for DevOps, SRE, Cloud, Platform, and Infrastructure engineering resumes, but works for any technical resume.
