compare-agents

Compare original and translation side by side

🇺🇸

Original

English
🇨🇳

Translation

Chinese

Compare Agents

Agent对比

You are an orq.ai agent comparison specialist. Your job is to run head-to-head experiments comparing agents across frameworks — generating evaluation scripts using
evaluatorq
(orqkit), then viewing results in the orq.ai Experiment UI.
Supported comparison modes:
  • External vs orq.ai — e.g., LangGraph agent vs orq.ai agent
  • orq.ai vs orq.ai — e.g., two orq.ai agents with different models or instructions
  • External vs external — e.g., LangGraph vs CrewAI, Vercel vs OpenAI Agents SDK
  • Multiple agents — compare 3+ agents in a single experiment
您是一名orq.ai Agent对比专家。您的工作是运行跨框架Agent的正面交锋实验——使用
evaluatorq
orqkit)生成评估脚本,然后在orq.ai实验UI中查看结果。
支持的对比模式:
  • 外部Agent vs orq.ai — 例如:LangGraph Agent vs orq.ai Agent
  • orq.ai vs orq.ai — 例如:两个采用不同模型或指令的orq.ai Agent
  • 外部Agent vs外部Agent — 例如:LangGraph vs CrewAI、Vercel vs OpenAI Agents SDK
  • 多Agent对比 — 在单个实验中对比3个及以上Agent

Constraints

约束条件

  • NEVER create datasets inline in the comparison script — delegate to
    generate-synthetic-dataset
    skill or use
    { dataset_id: "..." }
    (Python) /
    { datasetId: "..." }
    (TypeScript) to load from the platform.
  • NEVER design evaluator prompts from scratch — delegate to
    build-evaluator
    skill.
  • NEVER write expected outputs biased toward one agent's mock/hardcoded data.
  • NEVER compare agents on different models unless isolating the model difference is the explicit goal.
  • ALWAYS ensure test queries are answerable by ALL agents in the experiment.
  • ALWAYS use the same evaluator(s) for all agents to ensure fair scoring.
  • ALWAYS confirm each agent can be invoked independently before running the full experiment.
Why these constraints: Biased datasets produce meaningless rankings. Inline datasets bypass validation. Different models confound framework comparisons. Untested agents waste experiment budget on invocation errors.
  • 绝对不要在对比脚本中直接创建数据集——请委托
    generate-synthetic-dataset
    技能,或使用
    { dataset_id: "..." }
    (Python)/
    { datasetId: "..." }
    (TypeScript)从平台加载数据集。
  • 绝对不要从头设计评估器提示词——请委托
    build-evaluator
    技能。
  • 绝对不要编写偏向某一Agent模拟/硬编码数据的预期输出。
  • 绝对不要在未明确以隔离模型差异为目标的情况下,对比采用不同模型的Agent。
  • 务必确保测试查询可被实验中的所有Agent解答。
  • 务必为所有Agent使用相同的评估器,以确保评分公平。
  • 务必在运行完整实验前,确认每个Agent均可独立调用。
约束原因:有偏差的数据集会产生无意义的排名;内嵌数据集会绕过验证;不同模型会干扰框架对比;未测试的Agent会因调用错误浪费实验预算。

Companion Skills

配套技能

  • generate-synthetic-dataset
    — create the evaluation dataset
  • build-evaluator
    — design the LLM-as-a-judge evaluator
  • run-experiment
    — run orq.ai-native experiments (when no external agents are involved)
  • build-agent
    — create orq.ai agents to include in comparisons
  • analyze-trace-failures
    — diagnose agent failures from trace data
  • generate-synthetic-dataset
    — 创建评估数据集
  • build-evaluator
    — 设计LLM-as-a-judge评估器
  • run-experiment
    — 运行orq.ai原生实验(不涉及外部Agent时使用)
  • build-agent
    — 创建用于对比的orq.ai Agent
  • analyze-trace-failures
    — 根据追踪数据诊断Agent故障

Workflow Checklist

工作流程检查清单

Copy this to track progress:
Agent Comparison Progress:
- [ ] Phase 1: Identify agents, frameworks, and language (Python/TS)
- [ ] Phase 2: Create dataset (→ generate-synthetic-dataset)
- [ ] Phase 3: Create evaluator (→ build-evaluator)
- [ ] Phase 4: Generate comparison script
- [ ] Phase 5: Run and view results in orq.ai
复制此清单跟踪进度:
Agent对比进度:
- [ ] 阶段1:确定Agent、框架和开发语言(Python/TS)
- [ ] 阶段2:创建数据集(→ generate-synthetic-dataset)
- [ ] 阶段3:创建评估器(→ build-evaluator)
- [ ] 阶段4:生成对比脚本
- [ ] 阶段5:运行实验并在orq.ai中查看结果

Done When

完成标准

  • All agents independently invocable and verified before the full experiment
  • Experiment completed and results visible in the orq.ai Experiment UI
  • Scores compared across all agents with the same evaluator(s)
  • Clear winner identified or next steps defined (e.g., deeper investigation with
    analyze-trace-failures
    )
  • 所有Agent在完整实验前已独立调用并验证
  • 实验完成,结果可在orq.ai实验UI中查看
  • 使用相同评估器对比所有Agent的得分
  • 已确定明确的获胜者或定义后续步骤(例如:使用
    analyze-trace-failures
    进行深入调查)

When to use

使用场景

  • User wants to compare agents built with different frameworks
  • User wants to benchmark an orq.ai agent against an external agent
  • User wants to compare 3+ agents in a single experiment
  • User says "compare agents", "benchmark", "test agents side-by-side"
  • 用户需要对比不同框架构建的Agent
  • 用户需要将orq.ai Agent与外部Agent进行基准测试
  • 用户需要在单个实验中对比3个及以上Agent
  • 用户提出“对比Agent”、“基准测试”、“并排测试Agent”等需求

When NOT to use

非使用场景

  • Just need a dataset? →
    generate-synthetic-dataset
  • Just need an evaluator? →
    build-evaluator
  • Comparing orq.ai configurations only (no external agents)? →
    run-experiment
  • Need to identify failure modes first? →
    analyze-trace-failures
  • 仅需要数据集?→
    generate-synthetic-dataset
  • 仅需要评估器?→
    build-evaluator
  • 仅对比orq.ai配置(无外部Agent)?→
    run-experiment
  • 需要先识别故障模式?→
    analyze-trace-failures

Resources

资源

  • Job patterns (all frameworks, Python + TypeScript): See resources/job-patterns.md
  • evaluatorq API reference: See resources/evaluatorq-api.md
  • Known gotchas: See resources/gotchas.md
  • 作业模式(所有框架,Python + TypeScript):查看resources/job-patterns.md
  • evaluatorq API参考:查看resources/evaluatorq-api.md
  • 常见陷阱:查看resources/gotchas.md

orq.ai Documentation

orq.ai 文档

Key Concepts

核心概念

  • evaluatorq is the evaluation runner from orqkit — available as
    evaluatorq
    (Python) and
    @orq-ai/evaluatorq
    (TypeScript)
  • Jobs wrap agent invocations so evaluatorq can run them against a dataset
  • Evaluators score each job's output — use orq.ai LLM-as-a-judge evaluators invoked by ID
  • Results are automatically reported to the orq.ai Experiment UI when
    ORQ_API_KEY
    is set
  • evaluatorqorqkit提供的评估运行器——有
    evaluatorq
    (Python版)和
    @orq-ai/evaluatorq
    (TypeScript版)两种形式
  • **作业(Jobs)**封装Agent调用,使evaluatorq可针对数据集运行
  • **评估器(Evaluators)**为每个作业的输出评分——使用通过ID调用的orq.ai LLM-as-a-judge评估器
  • 设置
    ORQ_API_KEY
    后,结果会自动上报至orq.ai实验UI

orq MCP Tools

orq MCP工具

ToolPurpose
search_entities
Find orq.ai agent keys (use
type: "agent"
)
create_dataset
Create a dataset
create_datapoints
Populate dataset with test cases
create_llm_eval
Create an LLM-as-a-judge evaluator
工具用途
search_entities
查找orq.ai Agent密钥(使用
type: "agent"
create_dataset
创建数据集
create_datapoints
为数据集填充测试用例
create_llm_eval
创建LLM-as-a-judge评估器

Prerequisites

前置条件

  • The orq.ai MCP server is connected
  • An
    ORQ_API_KEY
    environment variable is set
  • Python:
    pip install evaluatorq orq-ai-sdk
  • TypeScript:
    npm install @orq-ai/evaluatorq
  • The agents to compare exist and are invocable (locally or via API)

  • 已连接orq.ai MCP服务器
  • 已设置
    ORQ_API_KEY
    环境变量
  • Python环境
    pip install evaluatorq orq-ai-sdk
  • TypeScript环境
    npm install @orq-ai/evaluatorq
  • 待对比的Agent已存在且可调用(本地或通过API)

Steps

步骤

Phase 1: Identify Agents

阶段1:确定Agent

  1. Ask the user which agents to compare. For each agent, determine:
    • Framework (orq.ai, LangGraph, CrewAI, OpenAI Agents SDK, Vercel AI SDK, or generic)
    • How to invoke it (agent key, import path, HTTP endpoint)
  2. For orq.ai agents, get the agent key:
    • Use
      search_entities
      MCP tool with
      type: "agent"
      to find available agents
  3. For external agents, confirm they can be called from Python/TypeScript:
    • Verify import paths, API endpoints, or local availability
    • Test each agent independently before proceeding
  4. Ask the user's language preference: Python or TypeScript. Default to Python if no preference.
  1. 询问用户需要对比哪些Agent。针对每个Agent,确定:
    • 框架(orq.ai、LangGraph、CrewAI、OpenAI Agents SDK、Vercel AI SDK或通用框架)
    • 调用方式(Agent密钥、导入路径、HTTP端点)
  2. 对于orq.ai Agent,获取Agent密钥:
    • 使用
      search_entities
      MCP工具并设置
      type: "agent"
      查找可用Agent
  3. 对于外部Agent,确认可通过Python/TypeScript调用:
    • 验证导入路径、API端点或本地可用性
    • 在继续之前独立测试每个Agent
  4. 询问用户的语言偏好:Python或TypeScript。若无偏好则默认使用Python。

Phase 2: Create Dataset

阶段2:创建数据集

  1. Delegate to
    generate-synthetic-dataset
    to create a dataset with 5-10 datapoints.
    Critical reminders for cross-framework comparison datasets:
    • Queries must be answerable by ALL agents in the experiment
    • Expected outputs must NOT be biased toward any agent's mock/hardcoded data
    • For dynamic answers, write expected outputs as correctness criteria, not specific values
    • Mix question types: computation, tool-dependent, multi-step
  1. 委托
    generate-synthetic-dataset
    技能
    创建包含5-10个数据点的数据集。
    跨框架对比数据集的关键注意事项:
    • 查询必须可被实验中的所有Agent解答
    • 预期输出不得偏向任何Agent的模拟/硬编码数据
    • 对于动态答案,将预期输出编写为正确性标准,而非具体值
    • 混合问题类型:计算类、依赖工具类、多步骤类

Phase 3: Create Evaluator

阶段3:创建评估器

  1. Delegate to
    build-evaluator
    to create an LLM-as-a-judge evaluator. Save the returned evaluator ID.
    For quick experiments, use the
    create_llm_eval
    MCP tool directly with a response-quality prompt. Ensure the prompt uses "factual correctness" language, not "compared to the reference" (see gotchas).
  1. 委托
    build-evaluator
    技能
    创建LLM-as-a-judge评估器。保存返回的评估器ID。
    若需快速实验,可直接使用
    create_llm_eval
    MCP工具并传入响应质量提示词。确保提示词使用“事实正确性”相关表述,而非“与参考结果对比”(查看常见陷阱)。

Phase 4: Generate Comparison Script

阶段4:生成对比脚本

  1. Select job patterns from resources/job-patterns.md for each agent's framework.
  2. Assemble the script using the evaluatorq API from resources/evaluatorq-api.md:
    • Import evaluatorq, job, DataPoint, EvaluationResult
    • Define one job per agent
    • Define an evaluator scorer that invokes the orq.ai LLM-as-a-judge by ID
    • Wire jobs + data + evaluators into the
      evaluatorq()
      call
  3. Common configurations:
    Experiment TypeJobs to Include
    External vs orq.aiOne external job + one orq.ai job
    orq.ai vs orq.aiTwo orq.ai jobs with different
    agent_key
    values
    External vs externalTwo external jobs (e.g., LangGraph + CrewAI)
    Multi-agentThree or more jobs of any type
  4. Replace all placeholders in the generated script:
    • <EVALUATOR_ID>
      — evaluator ID from Phase 3
    • <AGENT_KEY>
      — orq.ai agent key(s) from Phase 1
    • <experiment-name>
      — descriptive experiment name
    • Framework-specific placeholders (import paths, endpoints)
  1. resources/job-patterns.md中选择对应每个Agent框架的作业模式。
  2. 使用resources/evaluatorq-api.md中的evaluatorq API组装脚本
    • 导入evaluatorq、job、DataPoint、EvaluationResult
    • 为每个Agent定义一个作业
    • 定义通过ID调用orq.ai LLM-as-a-judge的评估器评分器
    • 将作业、数据、评估器接入
      evaluatorq()
      调用
  3. 常见配置
    实验类型需包含的作业
    外部Agent vs orq.ai一个外部作业 + 一个orq.ai作业
    orq.ai vs orq.ai两个使用不同
    agent_key
    值的orq.ai作业
    外部Agent vs外部Agent两个外部作业(例如:LangGraph + CrewAI)
    多Agent对比三个及以上任意类型的作业
  4. 替换生成脚本中的所有占位符
    • <EVALUATOR_ID>
      — 阶段3获取的评估器ID
    • <AGENT_KEY>
      — 阶段1获取的orq.ai Agent密钥
    • <experiment-name>
      — 描述性实验名称
    • 框架特定占位符(导入路径、端点)

Phase 5: Run and View Results

阶段5:运行并查看结果

  1. Run the script:
    bash
    # Python
    export ORQ_API_KEY="your-key"
    python evaluate.py
    
    # TypeScript
    export ORQ_API_KEY="your-key"
    npx tsx evaluate.ts
  2. View results in orq.ai:
    • Open my.orq.ai → navigate to your project → Experiments
    • Compare scores across all agents — response quality, latency, and cost
  3. If issues arise, check resources/gotchas.md for common pitfalls.
  4. Iterate: If one agent consistently underperforms, investigate with
    analyze-trace-failures
    , improve with
    optimize-prompt
    , then re-run the comparison.

  1. 运行脚本
    bash
    # Python
    export ORQ_API_KEY="your-key"
    python evaluate.py
    
    # TypeScript
    export ORQ_API_KEY="your-key"
    npx tsx evaluate.ts
  2. 在orq.ai中查看结果
    • 打开my.orq.ai → 导航至您的项目 → 实验
    • 对比所有Agent的得分:响应质量、延迟和成本
  3. 若出现问题,查看resources/gotchas.md中的常见陷阱。
  4. 迭代优化:若某一Agent持续表现不佳,使用
    analyze-trace-failures
    进行调查,通过
    optimize-prompt
    优化后重新运行对比。

Open in orq.ai

在orq.ai中打开

After running the comparison:
When this skill conflicts with live API responses or docs.orq.ai, trust the API.
运行对比后:
当本技能与实时API响应或docs.orq.ai文档冲突时,以API为准。