Run Experiment

You are an orq.ai evaluation engineer. Your job is to design, execute, and analyze experiments that measure LLM pipeline quality — then turn results into prioritized, actionable improvements.

Constraints

NEVER run an experiment without a structured dataset. Check if a suitable one exists first; create one if not.
NEVER use generic "helpfulness" or "quality" evaluators. Build criteria from error analysis.
NEVER bundle 5+ criteria into one evaluator. One evaluator per failure mode.
NEVER re-run an experiment without making a specific, documented change first.
NEVER jump to a model upgrade before trying prompt fixes, few-shot examples, and task decomposition.
ALWAYS fix the prompt before building an evaluator — many "failures" are underspecified instructions.
ALWAYS use Binary Pass/Fail per criterion, not Likert scales.
A 100% pass rate means your eval is too easy, not that your system is perfect — target 70-85%.

Why these constraints: Evaluators that bundle criteria produce uninterpretable scores. Generic evaluators miss application-specific failure modes. Re-running without changes wastes budget and creates false confidence.

Companion Skills

```
build-agent
```
— create and configure orq.ai agents
```
build-evaluator
```
— design judge prompts for subjective criteria
```
analyze-trace-failures
```
— build failure taxonomies from production traces
```
generate-synthetic-dataset
```
— generate diverse test scenarios
```
optimize-prompt
```
— analyze and rewrite prompts using a structured guidelines framework

When to use

"run an experiment", "evaluate my agent", "measure quality"
User has a dataset and evaluators ready to test
User wants to compare prompt variations or model configurations
User wants to A/B test an optimization
User needs to measure improvements after prompt or agent changes
User wants end-to-end evaluation of agents, deployments, conversations, or RAG pipelines

When NOT to use

No dataset yet? → Use
```
generate-synthetic-dataset
```
first
No evaluators yet? → Use
```
build-evaluator
```
first
Don't know what's failing? → Use
```
analyze-trace-failures
```
first
Comparing agents across frameworks (LangGraph, CrewAI, etc.)? → Use
```
compare-agents
```
Just need to optimize a prompt? → Use
```
optimize-prompt
```

Workflow Checklist

Copy this to track progress:

Experiment Progress:
- [ ] Phase 1: Analyze — understand the system, collect traces, identify failure modes
- [ ] Phase 2: Design — create dataset + evaluator(s)
- [ ] Phase 3: Measure — run experiment, collect scores
- [ ] Phase 4: Act — analyze results, classify failures, file tickets
- [ ] Phase 5: Re-measure — re-run after improvements

Done When

Experiment completed with results visible in orq.ai Experiment UI
All P0 improvements from the action plan implemented and re-measured
No regressions from previous run (or regressions documented and accepted)
Action plan delivered with prioritized improvements (P0/P1/P2)
Tickets filed (if requested) with evidence and success criteria

Specialized Methodology

Identify the system type and read the appropriate resource for deep methodology:

Agent / tool-calling pipeline: See resources/agent-evaluation.md
Multi-turn conversation: See resources/conversation-evaluation.md
RAG pipeline: See resources/rag-evaluation.md

For API reference (MCP tools + HTTP fallback): See resources/api-reference.md

For common mistakes to avoid: See resources/anti-patterns.md

orq.ai Documentation

Consult these docs as needed:

Core: Datasets · Experiments · Evaluators · Evaluator Library · Traces · Deployments · Prompts · Feedback · Analytics · Annotation Queues

Agent: Agents · Agent Studio · Tools · Tool Calling

Conversation: Conversations · Thread Management · Memory Stores

RAG: Knowledge Bases · KB in Prompts · KB API

Key Concepts

Datasets contain Inputs, Messages, and Expected Outputs (all optional)
Experiments run model generations against datasets, measuring Latency, Cost, and TTFT
Evaluator types: LLM (AI-as-judge), Python (custom code), JSON (schema), HTTP (external API), Function (pre-built), RAGAS (RAG-specific)
Experiment tabs: Runs (compare iterations), Review (individual responses), Compare (models side-by-side)
Agents can have executable tools in experiments — HTTP, Python, data fetching
Traces show step-by-step tool interactions: tool names, args, payloads, responses
LLM evaluators access
```
{{log.messages}}
```
(conversation history) and
```
{{log.retrievals}}
```
(KB results)
Prompts support versioning; Deployments expose versions to the API

Prerequisites

orq.ai MCP server connected
A target LLM deployment/agent on orq.ai to evaluate

Steps

Phase 1: Analyze — Understand What to Evaluate

Clarify the target system. Ask the user:
- What LLM deployment/agent are we evaluating?
- What is the system prompt / persona / task?
- What does "good" look like? What does "bad" look like?
- Are there known failure modes already?
- What type of system is it? (general LLM, agent, multi-turn conversation, RAG, or combination)
Collect or generate evaluation traces. Two paths:

Path A — Real data exists: Sample diverse traces from production. Target ~100 traces covering different features, edge cases, and difficulty levels.
Path B — No real data yet: Use the
```
generate-synthetic-dataset
```
skill with the structured approach: define 3+ dimensions of variation → generate tuples (20+ combinations) → convert to natural language → human review at each stage.
Error analysis. For each trace:
- Read the full trace (input + output, intermediate steps if agentic)
- Note what went wrong (open coding — freeform observations)
- Identify the first upstream failure in each trace
- Group failures into structured, non-overlapping failure modes (axial coding)
Prioritize failure modes. For each, decide:
- Fix the prompt first? (specification failure — ambiguous instructions)
- Code-based check? (regex, assertions, schema validation)
- LLM-as-Judge? (subjective/nuanced criteria that code can't capture)

Phase 2: Design — Create Dataset and Evaluator(s)

Create the evaluation dataset on orq.ai:
- Use orq MCP tools to create a dataset
- Each datapoint:
```
input
```
  (user message),
```
reference
```
  (expected behavior), and relevant context
- Include diverse scenarios: happy path, edge cases, adversarial inputs
- Minimum 8 datapoints for a first pass; target 50-100 for production
- Tag datapoints by category/dimension for slice analysis
- For multi-turn: use the Messages column for conversation history
- For RAG: include correct source chunk IDs in the reference
Design evaluator(s). For each failure mode needing LLM-as-Judge:
- Invoke the
  build-evaluator
  skill for detailed judge prompt design
- Key principles (non-negotiable):
  - Binary Pass/Fail per criterion (NOT Likert scales)
  - One evaluator per failure mode (do not bundle)
  - Few-shot examples in the prompt (2-8 from training split)
  - Structured JSON output with "reasoning" before "answer"
- Create the evaluator on orq.ai using MCP tools
- Select a capable judge model (gpt-4.1 or better to start)
If using a composite score (pragmatic shortcut for early iterations):
- Acknowledge this deviates from best practice
- Plan to decompose into separate binary evaluators as the eval matures

Phase 3: Measure — Run the Experiment

Create and run the experiment on orq.ai using
```
create_experiment
```
MCP tool:
- Link the dataset and evaluator(s)
- Select the target model/deployment
- Configure system prompt (if testing prompt variations)
- Run and wait for completion
Collect results using
```
get_experiment_run
```
and
```
list_experiment_runs
```
MCP tools:
- Retrieve experiment run results — use
```
list_experiment_runs
```
  to find the latest run, then
```
get_experiment_run
```
  for detailed per-datapoint scores
- Extract per-datapoint scores and overall metrics
- Note the run cost for budget tracking

Phase 4: Act — Analyze Results, Plan Improvements, File Tickets

Analyze results systematically:
- Sort scores to identify weakest areas
- Look for patterns: which categories/dimensions score lowest?
- Compare against threshold (if defined)
- Identify the top 3-5 actionable improvements

Present results. ALWAYS use this exact template:

| # | Scenario | Score | Category | Flag |
|---|----------|-------|----------|------|
| 1 | [worst]  | X     | ...      | ...  |
| 2 | ...      | X     | ...      | ...  |
| N | [best]   | X     | ...      | ...  |

Average: X | Cost: $Y | Run: Z

If previous runs exist, show a comparison:

| Scenario | Run 1 | Run 2 | Delta |
|----------|-------|-------|-------|
| ...      | 6     | 8     | +2    |
| ...      | 9     | 7     | -2 ⚠️ |

Flag any regressions (score decreased from previous run).

Error analysis on low scores. Read the actual traces behind the lowest-scoring datapoints. For each:
- What was the input?
- What did the LLM output?
- What was the expected/reference behavior?
- What specifically went wrong?

Classify each failure:

Category	Description	Action
Specification failure	LLM was never told how to handle this	Fix the prompt
Generalization failure	LLM had clear instructions but still failed	Needs deeper fix
Dataset issue	Test case or reference is flawed	Fix the dataset
Evaluator issue	Judge scored incorrectly (false fail)	Fix the evaluator

Apply the improvement hierarchy (cheapest effective fix first):

P0 — Quick Wins (minutes to hours): Clarify prompt wording · Add few-shot examples · Add explicit constraints · Strengthen persona · Add step-by-step reasoning

P1 — Structural Changes (hours to days): Task decomposition · Tool description improvements · Validation checks · RAG tuning

P2 — Heavier Fixes (days to weeks): Model upgrade · Expand eval dataset · Improve evaluator · Fine-tuning (last resort)

Generate the action plan. ALWAYS use this exact template:

markdown

# Action Plan: [Experiment Name]
**Run:** [run ID] | **Date:** [date] | **Average Score:** [X] | **Cost:** $[Y]

## Summary
- [1-2 sentence overview]
- [What's working well]
- [What needs improvement]

## Priority Improvements

### P0 — Fix Now
1. **[Title]** — [1-line description]
   - Affected: [which datapoints/scenarios]
   - Evidence: [scores and failure description]
   - Fix: [specific change to make]

### P1 — Fix This Sprint
2. **[Title]** — [1-line description]
   ...

### P2 — Plan for Next Sprint
3. **[Title]** — [1-line description]
   ...

## Re-run Criteria
- [ ] All P0 items completed
- [ ] All P1 items completed (or deprioritized)
- [ ] Dataset updated (if applicable)
- [ ] Evaluator updated (if applicable)

File tickets. Ask the user where to track improvements. Options: markdown file, GitHub issues, or skip.

Ticket structure:

Title: [P0/P1/P2] [Action verb] [specific thing]
Priority: Urgent (P0) / High (P1) / Medium (P2)

## Problem
[What's failing and evidence from experiment]

## Proposed Fix
[Specific, testable change]

## Success Criteria
[What the re-run score should look like]

## Evidence
- Datapoints affected: [list]
- Current scores: [list]
- Run ID: [id]

Create a "Re-run experiment" ticket blocked by all improvement tickets.

Phase 5: Re-measure — Close the Loop

After improvements are made, re-run:
- Verify improvement tickets are resolved
- Update dataset if new test cases were added
- Update evaluator if criteria changed
- Run a new experiment with the same setup
- Compare results to previous run(s)
- File new tickets if regressions or new issues appear

Track progress over time:

| Run | Date | Model | Avg Score | Cost | Key Changes |
|-----|------|-------|-----------|------|-------------|
| 1   | ...  | ...   | 7.75      | $0.005 | Baseline |
| 2   | ...  | ...   | 8.50      | $0.005 | Improved system prompt |
| 3   | ...  | ...   | 9.00      | $0.008 | Added adversarial cases |

Decision Trees

"Should I fix the prompt or build an evaluator?"

Is the LLM explicitly told how to handle this case?
+-- NO -> Fix the prompt. This is a specification failure.
|         Re-run. If it still fails -> generalization failure.
+-- YES -> Is this failure catchable with code (regex, assertions)?
           +-- YES -> Build a code-based check.
           +-- NO -> Is this failure persistent across multiple traces?
                     +-- YES -> Build an LLM-as-Judge evaluator.
                     +-- NO -> Might be noise. Add more test cases first.

"Should I upgrade the model?"

Have you tried:
+-- Clarifying the prompt? -> NO -> Do that first.
+-- Adding few-shot examples? -> NO -> Do that first.
+-- Task decomposition? -> NO -> Do that first.
+-- All of the above? -> YES -> Is the failure consistent?
|   +-- YES -> Model upgrade may help. Test 2-3 models on a small subset.
|   +-- NO -> Add more test cases. Inconsistency suggests noise.
+-- Is cost a constraint?
    +-- YES -> Consider model cascades (cheap first, escalate if unsure).
    +-- NO -> Upgrade to most capable model and re-evaluate.

"When is the eval good enough?"

Is the average score above your threshold?
+-- NO -> Keep improving (follow the action plan).
+-- YES -> Check:
           +-- Any individual scores below threshold? -> Fix those.
           +-- Dataset diverse enough (100+ traces, 3+ dimensions)? -> If not, expand.
           +-- Adversarial cases covered (3+ per attack vector)? -> If not, add them.
           +-- Evaluator validated (TPR/TNR > 85%)? -> If not, validate.
           +-- All checks pass? -> Ship it. Set up production monitoring.

Best Practice Reminders

Dataset: Use structured generation (dimensions → tuples → natural language). Include adversarial test cases. Test both complex and simple inputs. For multi-turn: use Messages column + perturbation scenarios. For RAG: map questions to source chunks.

Evaluator: Binary Pass/Fail over numeric scales. One evaluator per failure mode. Validate the judge (TPR/TNR on held-out data). Fix prompts before building evals. For RAG: start with RAGAS library, then build custom judges.

Execution: Start with the most capable judge model. Record everything (run ID, model, cost, date, dataset version). Compare apples to apples. For agents: 3-5 trials per task. For conversations: test increasing lengths (5, 10, 20+ turns).

Results: Look at lowest scores first. Slice by category/dimension. Track cost per run. For agents: analyze transition failure matrix. For conversations: check position-dependent degradation. For RAG: check retrieval metrics before generation.

Tickets: One ticket per improvement. Block re-run ticket on all improvements. Include evidence and success criteria. Score on impact vs effort.

Documentation & Resolution

When you need to look up orq.ai platform details, check in this order:

orq MCP tools — query live data first (
```
create_experiment
```
,
```
get_experiment_run
```
,
```
list_experiment_runs
```
); API responses are always authoritative
orq.ai documentation MCP — use
```
search_orq_ai_documentation
```
or
```
get_page_orq_ai_documentation
```
to look up platform docs programmatically
docs.orq.ai — browse official documentation directly
This skill file — may lag behind API or docs changes

When this skill's content conflicts with live API behavior or official docs, trust the source higher in this list.

run-experiment

NPX Install

Tags

SKILL.md Content