Run cross-framework agent comparisons using evaluatorq from orqkit — compares any combination of agents (orq.ai, LangGraph, CrewAI, OpenAI Agents SDK, Vercel AI SDK) head-to-head on the same dataset with LLM-as-a-judge scoring. Use when comparing agents, benchmarking, or wanting side-by-side evaluation. Do NOT use when comparing only orq.ai configurations with no external agents (use run-experiment instead).
Install the skill:

`npx skill4agent add orq-ai/assistant-plugins compare-agents`

Agent Comparison Progress:
- [ ] Phase 1: Identify agents, frameworks, and language (Python/TS)
- [ ] Phase 2: Create dataset (→ generate-synthetic-dataset)
- [ ] Phase 3: Create evaluator (→ build-evaluator)
- [ ] Phase 4: Generate comparison script
- [ ] Phase 5: Run and view results in orq.ai (→ analyze-trace-failures)

Official documentation: Evaluatorq Tutorial
Prerequisites: the `evaluatorq` (Python) or `@orq-ai/evaluatorq` (TypeScript) package and an `ORQ_API_KEY`.

| Tool | Purpose |
|---|---|
| `search_entities` | Find orq.ai agent keys (use `type: "agent"`) |
| | Create a dataset |
| | Populate dataset with test cases |
| `create_llm_eval` | Create an LLM-as-a-judge evaluator |
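Before the comparison runs, the dataset needs test cases that every agent will answer. A minimal sketch of one such row is below; the field names (`input`, `expected_output`) are illustrative assumptions, since the actual schema is handled by the generate-synthetic-dataset skill.

```python
# One illustrative test case for the shared dataset. The "input" and
# "expected_output" field names are assumptions, not the confirmed orq.ai
# dataset schema; generate-synthetic-dataset produces the real rows.
test_case = {
    "input": "A customer asks whether an order placed 45 days ago can be refunded.",
    "expected_output": "Orders older than 30 days are not eligible for a refund.",
}
```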
Setup: export `ORQ_API_KEY`, then install the package with `pip install evaluatorq orq-ai-sdk` (Python) or `npm install @orq-ai/evaluatorq` (TypeScript). The comparison script defines one job per agent and passes them to `evaluatorq()`.

| Experiment Type | Jobs to Include |
|---|---|
| External vs orq.ai | One external job + one orq.ai job |
| orq.ai vs orq.ai | Two orq.ai jobs with different agent keys |
| External vs external | Two external jobs (e.g., LangGraph + CrewAI) |
| Multi-agent | Three or more jobs of any type |
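To make the "External vs orq.ai" row concrete, the generated Python script might look roughly like the sketch below. Only the `evaluatorq()` entry point, the placeholders, and the `ORQ_API_KEY` requirement come from this guide; the keyword names (`name`, `dataset_id`, `evaluators`, `jobs`) and the job signature are assumptions for illustration, so check the Evaluatorq Tutorial for the actual call shape.

```python
# evaluate.py -- illustrative sketch only. The evaluatorq() keyword names and
# the job signature below are assumptions, not the library's confirmed API;
# see the Evaluatorq Tutorial for the real usage.
from evaluatorq import evaluatorq


def orq_agent_job(row):
    # Call your orq.ai agent (key "<AGENT_KEY>") on the row's input and
    # return its answer. Stubbed here for illustration.
    return "answer from the orq.ai agent"


def langgraph_job(row):
    # Call your external agent (LangGraph, CrewAI, ...) on the same input.
    return "answer from the LangGraph agent"


# Both jobs run over the same dataset; the LLM-as-a-judge evaluator scores
# each job's answers so the agents can be compared side by side in orq.ai.
evaluatorq(
    name="<experiment-name>",
    dataset_id="...",                 # dataset created in Phase 2
    evaluators=["<EVALUATOR_ID>"],    # LLM-as-a-judge evaluator from Phase 3
    jobs={
        "orq-agent": orq_agent_job,
        "langgraph-agent": langgraph_job,
    },
)
```

The TypeScript variant presumably mirrors this shape with camelCase arguments (e.g., `datasetId` instead of `dataset_id`).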
Fill in the `<EVALUATOR_ID>`, `<AGENT_KEY>`, and `<experiment-name>` placeholders in the generated script, then run it:

```bash
# Python
export ORQ_API_KEY="your-key"
python evaluate.py

# TypeScript
export ORQ_API_KEY="your-key"
npx tsx evaluate.ts
```

After viewing the results, follow up with analyze-trace-failures or optimize-prompt.