Found 3 Skills
Multi-model agent orchestration using specialized agents for planning, coding, research, math/science, visual analysis, and adversarial review. Use when tasks are complex enough to benefit from different models' strengths, when you want adversarial review to catch blind spots, or when coordinating multi-step workflows across agent roles. Triggers on complex projects, multi-step tasks, architecture decisions, or when explicitly requested.
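For illustration, a minimal sketch of the plan/code/review orchestration pattern this skill describes. The role-to-model mapping and call_model() are hypothetical placeholders, not the skill's actual configuration:

```python
# Minimal orchestration sketch; role assignments and call_model()
# are illustrative stand-ins for real provider API/CLI calls.

ROLE_MODELS = {
    "planner": "claude",   # planning and task decomposition
    "coder": "gpt",        # implementation
    "reviewer": "gemini",  # adversarial review
}

def call_model(model: str, prompt: str) -> str:
    """Placeholder: swap in a real API or CLI call per provider."""
    return f"[{model}] response to: {prompt[:60]}"

def orchestrate(task: str) -> str:
    # 1. Planner decomposes the task into steps.
    plan = call_model(ROLE_MODELS["planner"], f"Break this task into steps:\n{task}")
    # 2. Coder implements the plan.
    draft = call_model(ROLE_MODELS["coder"], f"Implement this plan:\n{plan}")
    # 3. Adversarial reviewer attacks the draft to surface blind spots.
    critique = call_model(ROLE_MODELS["reviewer"],
                          f"Find flaws, edge cases, and risks in:\n{draft}")
    # 4. Coder revises using the critique.
    return call_model(ROLE_MODELS["coder"],
                      f"Revise this draft:\n{draft}\naddressing:\n{critique}")

if __name__ == "__main__":
    print(orchestrate("Add rate limiting to the public API"))
```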
Implement comprehensive evaluation strategies for LLM applications using automated metrics, human feedback, and benchmarking. Use when testing LLM performance, measuring AI application quality, or establishing evaluation frameworks.
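A minimal evaluation-harness sketch of the automated-metrics piece; run_app() and the test cases are hypothetical stand-ins, and for open-ended outputs an LLM judge would replace the exact-match scorer:

```python
# Minimal eval-harness sketch; run_app() is a placeholder for the
# LLM application under test, and the cases are illustrative.

from dataclasses import dataclass

@dataclass
class Case:
    prompt: str
    expected: str

CASES = [
    Case("What is 2 + 2?", "4"),
    Case("Capital of France?", "Paris"),
]

def run_app(prompt: str) -> str:
    """Placeholder: replace with the real application call."""
    return "4" if "2 + 2" in prompt else "Paris"

def exact_match(output: str, expected: str) -> bool:
    # Swap in an LLM-judge scorer here for open-ended outputs.
    return output.strip().lower() == expected.strip().lower()

def evaluate(cases: list[Case]) -> float:
    scores = [exact_match(run_app(c.prompt), c.expected) for c in cases]
    return sum(scores) / len(scores)

if __name__ == "__main__":
    print(f"exact-match accuracy: {evaluate(CASES):.0%}")
```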
Cross-model benchmark for gstack skills. Runs the same prompt through Claude, GPT (via Codex CLI), and Gemini side-by-side — compares latency, tokens, cost, and optionally quality via LLM judge. Answers "which model is actually best for this skill?" with data instead of vibes. Separate from /benchmark, which measures web page performance. Use when: "benchmark models", "compare models", "which model is best for X", "cross-model comparison", "model shootout". (gstack) Voice triggers (speech-to-text aliases): "compare models", "model shootout", "which model is best".
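A rough sketch of the side-by-side timing loop this skill implies. The CLI invocations are assumptions about a local install (claude, codex, and gemini on PATH; verify flags against your versions), and token/cost accounting is omitted since it depends on each CLI's output format:

```python
# Cross-model latency sketch; command templates are assumptions
# about locally installed CLIs, not the skill's actual invocation.

import subprocess
import time

PROMPT = "Summarize the tradeoffs of optimistic vs pessimistic locking."

COMMANDS = {
    "claude": ["claude", "-p", PROMPT],
    "gpt": ["codex", "exec", PROMPT],
    "gemini": ["gemini", "-p", PROMPT],
}

def time_model(name: str, cmd: list[str]) -> tuple[str, float, int]:
    # Wall-clock latency for one prompt through one model's CLI.
    start = time.perf_counter()
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=300)
    elapsed = time.perf_counter() - start
    return name, elapsed, len(result.stdout)

if __name__ == "__main__":
    for name, cmd in COMMANDS.items():
        model, secs, chars = time_model(name, cmd)
        print(f"{model:8s} {secs:7.2f}s  {chars} output chars")
```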