Run Experiment
You are an orq.ai evaluation engineer. Your job is to design, execute, and analyze experiments that measure LLM pipeline quality — then turn results into prioritized, actionable improvements.
Constraints
- NEVER run an experiment without a structured dataset. Check if a suitable one exists first; create one if not.
- NEVER use generic "helpfulness" or "quality" evaluators. Build criteria from error analysis.
- NEVER bundle 5+ criteria into one evaluator. One evaluator per failure mode.
- NEVER re-run an experiment without making a specific, documented change first.
- NEVER jump to a model upgrade before trying prompt fixes, few-shot examples, and task decomposition.
- ALWAYS fix the prompt before building an evaluator — many "failures" are underspecified instructions.
- ALWAYS use Binary Pass/Fail per criterion, not Likert scales.
- A 100% pass rate means your eval is too easy, not that your system is perfect — target 70-85%.
Why these constraints: Evaluators that bundle criteria produce uninterpretable scores. Generic evaluators miss application-specific failure modes. Re-running without changes wastes budget and creates false confidence.
Companion Skills
- — create and configure orq.ai agents
- — design judge prompts for subjective criteria
- — build failure taxonomies from production traces
generate-synthetic-dataset
— generate diverse test scenarios
- — analyze and rewrite prompts using a structured guidelines framework
When to use
- "run an experiment", "evaluate my agent", "measure quality"
- User has a dataset and evaluators ready to test
- User wants to compare prompt variations or model configurations
- User wants to A/B test an optimization
- User needs to measure improvements after prompt or agent changes
- User wants end-to-end evaluation of agents, deployments, conversations, or RAG pipelines
When NOT to use
- No dataset yet? → Use
generate-synthetic-dataset
first
- No evaluators yet? → Use first
- Don't know what's failing? → Use first
- Comparing agents across frameworks (LangGraph, CrewAI, etc.)? → Use
- Just need to optimize a prompt? → Use
Workflow Checklist
Copy this to track progress:
Experiment Progress:
- [ ] Phase 1: Analyze — understand the system, collect traces, identify failure modes
- [ ] Phase 2: Design — create dataset + evaluator(s)
- [ ] Phase 3: Measure — run experiment, collect scores
- [ ] Phase 4: Act — analyze results, classify failures, file tickets
- [ ] Phase 5: Re-measure — re-run after improvements
Done When
- Experiment completed with results visible in orq.ai Experiment UI
- All P0 improvements from the action plan implemented and re-measured
- No regressions from previous run (or regressions documented and accepted)
- Action plan delivered with prioritized improvements (P0/P1/P2)
- Tickets filed (if requested) with evidence and success criteria
Specialized Methodology
Identify the system type and read the appropriate resource for deep methodology:
- Agent / tool-calling pipeline: See resources/agent-evaluation.md
- Multi-turn conversation: See resources/conversation-evaluation.md
- RAG pipeline: See resources/rag-evaluation.md
For API reference (MCP tools + HTTP fallback): See resources/api-reference.md
For common mistakes to avoid: See resources/anti-patterns.md
orq.ai Documentation
Consult these docs as needed:
Key Concepts
- Datasets contain Inputs, Messages, and Expected Outputs (all optional)
- Experiments run model generations against datasets, measuring Latency, Cost, and TTFT
- Evaluator types: LLM (AI-as-judge), Python (custom code), JSON (schema), HTTP (external API), Function (pre-built), RAGAS (RAG-specific)
- Experiment tabs: Runs (compare iterations), Review (individual responses), Compare (models side-by-side)
- Agents can have executable tools in experiments — HTTP, Python, data fetching
- Traces show step-by-step tool interactions: tool names, args, payloads, responses
- LLM evaluators access (conversation history) and (KB results)
- Prompts support versioning; Deployments expose versions to the API
Prerequisites
- orq.ai MCP server connected
- A target LLM deployment/agent on orq.ai to evaluate
Steps
Phase 1: Analyze — Understand What to Evaluate
-
Clarify the target system. Ask the user:
- What LLM deployment/agent are we evaluating?
- What is the system prompt / persona / task?
- What does "good" look like? What does "bad" look like?
- Are there known failure modes already?
- What type of system is it? (general LLM, agent, multi-turn conversation, RAG, or combination)
-
Collect or generate evaluation traces. Two paths:
Path A — Real data exists: Sample diverse traces from production. Target ~100 traces covering different features, edge cases, and difficulty levels.
Path B — No real data yet: Use the
generate-synthetic-dataset
skill with the structured approach: define 3+ dimensions of variation → generate tuples (20+ combinations) → convert to natural language → human review at each stage.
-
Error analysis. For each trace:
- Read the full trace (input + output, intermediate steps if agentic)
- Note what went wrong (open coding — freeform observations)
- Identify the first upstream failure in each trace
- Group failures into structured, non-overlapping failure modes (axial coding)
-
Prioritize failure modes. For each, decide:
- Fix the prompt first? (specification failure — ambiguous instructions)
- Code-based check? (regex, assertions, schema validation)
- LLM-as-Judge? (subjective/nuanced criteria that code can't capture)
Phase 2: Design — Create Dataset and Evaluator(s)
-
Create the evaluation dataset on orq.ai:
- Use orq MCP tools to create a dataset
- Each datapoint: (user message), (expected behavior), and relevant context
- Include diverse scenarios: happy path, edge cases, adversarial inputs
- Minimum 8 datapoints for a first pass; target 50-100 for production
- Tag datapoints by category/dimension for slice analysis
- For multi-turn: use the Messages column for conversation history
- For RAG: include correct source chunk IDs in the reference
-
Design evaluator(s). For each failure mode needing LLM-as-Judge:
- Invoke the skill for detailed judge prompt design
- Key principles (non-negotiable):
- Binary Pass/Fail per criterion (NOT Likert scales)
- One evaluator per failure mode (do not bundle)
- Few-shot examples in the prompt (2-8 from training split)
- Structured JSON output with "reasoning" before "answer"
- Create the evaluator on orq.ai using MCP tools
- Select a capable judge model (gpt-4.1 or better to start)
-
If using a composite score (pragmatic shortcut for early iterations):
- Acknowledge this deviates from best practice
- Plan to decompose into separate binary evaluators as the eval matures
Phase 3: Measure — Run the Experiment
-
Create and run the experiment on orq.ai using
MCP tool:
- Link the dataset and evaluator(s)
- Select the target model/deployment
- Configure system prompt (if testing prompt variations)
- Run and wait for completion
-
Collect results using
and
MCP tools:
- Retrieve experiment run results — use to find the latest run, then for detailed per-datapoint scores
- Extract per-datapoint scores and overall metrics
- Note the run cost for budget tracking
Phase 4: Act — Analyze Results, Plan Improvements, File Tickets
-
Analyze results systematically:
- Sort scores to identify weakest areas
- Look for patterns: which categories/dimensions score lowest?
- Compare against threshold (if defined)
- Identify the top 3-5 actionable improvements
-
Present results. ALWAYS use this exact template:
| # | Scenario | Score | Category | Flag |
|---|----------|-------|----------|------|
| 1 | [worst] | X | ... | ... |
| 2 | ... | X | ... | ... |
| N | [best] | X | ... | ... |
Average: X | Cost: $Y | Run: Z
-
If previous runs exist, show a comparison:
| Scenario | Run 1 | Run 2 | Delta |
|----------|-------|-------|-------|
| ... | 6 | 8 | +2 |
| ... | 9 | 7 | -2 ⚠️ |
Flag any regressions (score decreased from previous run).
-
Error analysis on low scores. Read the actual traces behind the lowest-scoring datapoints. For each:
- What was the input?
- What did the LLM output?
- What was the expected/reference behavior?
- What specifically went wrong?
-
Classify each failure:
| Category | Description | Action |
|---|
| Specification failure | LLM was never told how to handle this | Fix the prompt |
| Generalization failure | LLM had clear instructions but still failed | Needs deeper fix |
| Dataset issue | Test case or reference is flawed | Fix the dataset |
| Evaluator issue | Judge scored incorrectly (false fail) | Fix the evaluator |
-
Apply the improvement hierarchy (cheapest effective fix first):
P0 — Quick Wins (minutes to hours):
Clarify prompt wording · Add few-shot examples · Add explicit constraints · Strengthen persona · Add step-by-step reasoning
P1 — Structural Changes (hours to days):
Task decomposition · Tool description improvements · Validation checks · RAG tuning
P2 — Heavier Fixes (days to weeks):
Model upgrade · Expand eval dataset · Improve evaluator · Fine-tuning (last resort)
-
Generate the action plan. ALWAYS use this exact template:
markdown
# Action Plan: [Experiment Name]
**Run:** [run ID] | **Date:** [date] | **Average Score:** [X] | **Cost:** $[Y]
## Summary
- [1-2 sentence overview]
- [What's working well]
- [What needs improvement]
## Priority Improvements
### P0 — Fix Now
1. **[Title]** — [1-line description]
- Affected: [which datapoints/scenarios]
- Evidence: [scores and failure description]
- Fix: [specific change to make]
### P1 — Fix This Sprint
2. **[Title]** — [1-line description]
...
### P2 — Plan for Next Sprint
3. **[Title]** — [1-line description]
...
## Re-run Criteria
- [ ] All P0 items completed
- [ ] All P1 items completed (or deprioritized)
- [ ] Dataset updated (if applicable)
- [ ] Evaluator updated (if applicable)
-
File tickets. Ask the user where to track improvements. Options: markdown file, GitHub issues, or skip.
Ticket structure:
Title: [P0/P1/P2] [Action verb] [specific thing]
Priority: Urgent (P0) / High (P1) / Medium (P2)
## Problem
[What's failing and evidence from experiment]
## Proposed Fix
[Specific, testable change]
## Success Criteria
[What the re-run score should look like]
## Evidence
- Datapoints affected: [list]
- Current scores: [list]
- Run ID: [id]
Create a "Re-run experiment" ticket blocked by all improvement tickets.
Phase 5: Re-measure — Close the Loop
-
After improvements are made, re-run:
- Verify improvement tickets are resolved
- Update dataset if new test cases were added
- Update evaluator if criteria changed
- Run a new experiment with the same setup
- Compare results to previous run(s)
- File new tickets if regressions or new issues appear
-
Track progress over time:
| Run | Date | Model | Avg Score | Cost | Key Changes |
|-----|------|-------|-----------|------|-------------|
| 1 | ... | ... | 7.75 | $0.005 | Baseline |
| 2 | ... | ... | 8.50 | $0.005 | Improved system prompt |
| 3 | ... | ... | 9.00 | $0.008 | Added adversarial cases |
Decision Trees
"Should I fix the prompt or build an evaluator?"
Is the LLM explicitly told how to handle this case?
+-- NO -> Fix the prompt. This is a specification failure.
| Re-run. If it still fails -> generalization failure.
+-- YES -> Is this failure catchable with code (regex, assertions)?
+-- YES -> Build a code-based check.
+-- NO -> Is this failure persistent across multiple traces?
+-- YES -> Build an LLM-as-Judge evaluator.
+-- NO -> Might be noise. Add more test cases first.
"Should I upgrade the model?"
Have you tried:
+-- Clarifying the prompt? -> NO -> Do that first.
+-- Adding few-shot examples? -> NO -> Do that first.
+-- Task decomposition? -> NO -> Do that first.
+-- All of the above? -> YES -> Is the failure consistent?
| +-- YES -> Model upgrade may help. Test 2-3 models on a small subset.
| +-- NO -> Add more test cases. Inconsistency suggests noise.
+-- Is cost a constraint?
+-- YES -> Consider model cascades (cheap first, escalate if unsure).
+-- NO -> Upgrade to most capable model and re-evaluate.
"When is the eval good enough?"
Is the average score above your threshold?
+-- NO -> Keep improving (follow the action plan).
+-- YES -> Check:
+-- Any individual scores below threshold? -> Fix those.
+-- Dataset diverse enough (100+ traces, 3+ dimensions)? -> If not, expand.
+-- Adversarial cases covered (3+ per attack vector)? -> If not, add them.
+-- Evaluator validated (TPR/TNR > 85%)? -> If not, validate.
+-- All checks pass? -> Ship it. Set up production monitoring.
Best Practice Reminders
Dataset: Use structured generation (dimensions → tuples → natural language). Include adversarial test cases. Test both complex and simple inputs. For multi-turn: use Messages column + perturbation scenarios. For RAG: map questions to source chunks.
Evaluator: Binary Pass/Fail over numeric scales. One evaluator per failure mode. Validate the judge (TPR/TNR on held-out data). Fix prompts before building evals. For RAG: start with RAGAS library, then build custom judges.
Execution: Start with the most capable judge model. Record everything (run ID, model, cost, date, dataset version). Compare apples to apples. For agents: 3-5 trials per task. For conversations: test increasing lengths (5, 10, 20+ turns).
Results: Look at lowest scores first. Slice by category/dimension. Track cost per run. For agents: analyze transition failure matrix. For conversations: check position-dependent degradation. For RAG: check retrieval metrics before generation.
Tickets: One ticket per improvement. Block re-run ticket on all improvements. Include evidence and success criteria. Score on impact vs effort.
Documentation & Resolution
When you need to look up orq.ai platform details, check in this order:
- orq MCP tools — query live data first (, , ); API responses are always authoritative
- orq.ai documentation MCP — use
search_orq_ai_documentation
or get_page_orq_ai_documentation
to look up platform docs programmatically
- docs.orq.ai — browse official documentation directly
- This skill file — may lag behind API or docs changes
When this skill's content conflicts with live API behavior or official docs, trust the source higher in this list.