Cekura Metric Design
Purpose
Guide the creation of effective Cekura metrics that accurately evaluate AI voice agent call quality. Metrics measure call quality after the fact by evaluating transcripts against defined criteria. Each metric targets a specific workflow or KPI that needs tracking per call.
Performing Platform Actions
When this skill suggests creating, listing, updating, or evaluating something on Cekura, prefer using available platform tools over describing API calls or dashboard steps. In Claude Code with the Cekura plugin installed, these tools are auto-configured and handle authentication, parameter validation, and error handling for you. Fall back to direct API endpoints or dashboard guidance only when no tools are available in the current session.
Core Terminology
- Main agent: The client's AI voice agent being tested
- Testing agent: Cekura's simulated caller that exercises the main agent
- Metric: A post-call evaluation that scores a transcript
- Evaluator/Scenario: A test case that simulates a caller (separate concept — see cekura-eval-design skill)
The Metric Creation Workflow
Follow this workflow every time. Skipping steps (especially step 2) leads to metrics that miss edge cases.
- Gather context — Understand the client's use case, what they care about, get sample conversation IDs with expected outcomes
- Fetch real transcripts: Pull 3-5 actual records from the Cekura API. Study which roles appear, what timestamps are available, how tool calls are structured, and what the conversation flow looks like in practice. Metrics written without reading real data will miss edge cases.
- Identify the signal — What specific thing in the transcript indicates pass vs fail? A tool call, a timestamp gap, a phrase, a behavioral pattern?
- Write the prompt — Use proven structures (see below), grounded in what the real transcripts look like
- Deploy and test — Create the metric via API, run on sample conversations, compare to expected outcomes
- Iterate — Adjust the prompt based on results, re-run, repeat until the metric matches expectations on all samples. Plan for at least one iteration — the first run reveals measurement issues.
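The deploy-and-test and iterate steps come down to comparing what the metric returned against what was expected for each sample conversation. A minimal sketch of that comparison follows. The function name, the call IDs, and the result strings are hypothetical, and fetching results from Cekura is out of scope here.

```python
def compare_to_expected(results, expected):
    """Return (agreement_rate, mismatched_call_ids) over the calls listed in `expected`."""
    mismatches = [
        call_id
        for call_id, want in expected.items()
        if results.get(call_id) != want
    ]
    total = len(expected)
    rate = (total - len(mismatches)) / total if total else 0.0
    return rate, mismatches


# Hypothetical sample: expected outcomes gathered in step 1, results from step 5.
expected = {"call_1": "TRUE", "call_2": "FALSE", "call_3": "N/A"}
results = {"call_1": "TRUE", "call_2": "TRUE", "call_3": "N/A"}

rate, mismatches = compare_to_expected(results, expected)
print(f"agreement={rate:.0%}, review={mismatches}")
```

Each mismatched ID is a conversation to re-read before adjusting the prompt.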
Metric Types
llm_judge (preferred default)
An LLM evaluates the prompt in the
field against the call transcript.
Prefer llm_judge over custom_code. Custom_code seems appealing for "objective" checks (timestamps, tool call presence) but is brittle in practice. Voice AI transcripts have messy timing — agents transfer mid-tool-chain, background tasks complete after speech resumes, timestamps overlap. An LLM reading the transcript handles these nuances naturally. Express measurements in natural language, not code.
Critical: The evaluation prompt goes in the
field, NOT the
field.
custom_code (Python on Lambda)
Python code in
field. Has access to
dict,
, and upstream metric results. Reserve for cases that genuinely need programmatic logic.
Use only when:
- Gating on upstream metric results (data.get("Exact Metric Name"))
- Section extraction from agent description (Pythonic pattern)
- Multiple LLM calls with different prompts based on conditions
- N/A short-circuiting before calling the LLM
Metric Evolution Path
Start as llm_judge for rapid iteration. Once the prompt is validated (through labs/feedback), convert to custom_code with section extraction for production. Cekura allows a metric to hold both an llm_judge prompt and custom code at the same time; the active type is toggled. This means the LLM prompt can be refined through labs, and the custom_code version then uses that same prompt with targeted context extraction.
Eval Types
| Eval Type | Output | Use For |
|---|---|---|
| | TRUE/FALSE | Soft skills, quality assessments |
| binary_workflow_adherence | TRUE/FALSE | Flow compliance checks |
| | String from defined values | Classification tasks |
| | Float score | Scoring tasks |
| | Continuous score | Continuous quality assessment |
LLM Judge Prompt Structure
Two proven structures exist. See references/prompt-patterns.md for full templates.
Structure A: Sectioned (best for multi-criteria metrics)
- SCOPE & FOCUS — What this metric evaluates ONLY + what to IGNORE
- DO NOT FLAG — Common false positives: behavioral patterns that look like fails but aren't for THIS metric
- INPUTS — Only relevant template variables
- SECTIONS — Numbered evaluation criteria with pass/fail examples
- FAILURE CONDITIONS (Only These Count) — Narrow, closed list of what constitutes a failure
- SAFEGUARDING NOTES — Spirit vs letter overrides
- OUTPUT INSTRUCTIONS — Return format, timestamps for failures
Structure B: Narrative (best for behavioral/timing metrics)
- SCOPE & FOCUS — What this metric evaluates ONLY + what to IGNORE
- DO NOT FLAG — Common false positives specific to this metric
- CONTEXT — What this call type looks like, why the metric matters
- WHAT TO LOOK FOR — Specific items in the transcript (tool names, phrases, patterns)
- FAILURE CONDITIONS (Only These Count) — Narrow, closed list of specific failure patterns
- NUANCES — Edge cases, overrides, things that look like fails but aren't
- OUTPUT — TRUE/FALSE/N/A with timestamps and evidence
Being explicit about PASS vs FAIL with real examples from actual conversations is the single most impactful thing for prompt quality. Vague criteria like "agent should be responsive" produce inconsistent results.
Anti-Cross-Pollination Scoping (when using )
The most common source of false failures: a metric that reads the agent description fails based on rules from an unrelated flow (e.g., an Emergency metric fails because of a Booking Flow rule).
Three-layer scoping pattern: SCOPE & FOCUS ("evaluates X ONLY"), DO NOT FLAG (enumerate false positives by behavioral pattern), FAILURE CONDITIONS (narrow closed list).
See references/advanced-patterns.md for full structure and the rule that all scoping language must be concept-based, never hardcoded to a specific agent's section names.
Available Template Variables
| Variable | Description |
|---|---|
| | Full conversation text |
| | Structured transcript with timestamps |
| | Full blob of custom variables from calls |
| {{dynamic_variables.keyName}} | Specific dynamic variable by key (dot notation preferred) |
| | Main agent's system prompt |
| | Call metadata |
| | How the call ended |
Include only variables relevant to the specific metric. Listing all variables creates noise and dilutes evaluation focus.
When using the call metadata variable, point to the specific metadata fields the LLM judge should reference (e.g., name the exact field to check to verify that a booking was created).
The Spirit vs Letter Principle
This is the most critical concept in metric design.
Agent descriptions describe the intended functionality of the main agent, but must not be taken literally by the evaluator. Understand the intent behind each instruction and write the metric to capture the spirit, not the literal text.
Example: Agent description says "ask only 1 question at a time"
- Spirit: Prevent cognitive overload on the caller
- Literal (wrong): Fail any turn with more than one question mark
- Correct metric behavior:
- PASS: "Are you the owner of 123 Easy St? Can I get your name?" (related data cluster)
- PASS: "Is this a new issue, or an existing one?" (A/B rephrasing = single question)
- FAIL: "Does Thursday work? Also, did you get our text message?" (unrelated questions)
When uncertain about the intent behind an instruction, ask the user to clarify before encoding it into the metric. Include explicit safeguarding examples in the prompt showing what should and should not be penalized.
Trigger Design
| Trigger Type | When to Use |
|---|---|
| | Metrics that apply to every call (soft skills, business context) |
| with trigger | Conditional metrics (booking flow only fires when booking intent detected) |
| with trigger | Complex trigger logic requiring code |
| | Let Cekura auto-determine (less control) |
Use the generate_evaluation_trigger endpoint (see references/api-reference.md) to auto-generate trigger prompts from metric descriptions. Triggers can be layered in specificity: for example, one trigger fires on any onboarding call, another fires only when the user gets stuck.
Two-Layer N/A Strategy
Triggers and metric descriptions handle N/A at different levels:
- Trigger-level N/A (first defense): The trigger gates out obviously irrelevant calls BEFORE the metric runs. This saves LLM cost. Example: a Booking Flow metric's trigger checks if booking intent exists — if not, the metric doesn't fire and outputs N/A.
- Description-level N/A (nuanced cases): The metric prompt itself handles edge cases that need transcript context to determine. Example: a call has booking intent (trigger fires) but the caller hangs up before the flow starts — the metric description returns N/A with "VALID_SKIP: caller disconnected before booking could begin."
Design triggers to catch the obvious non-applicable calls; design the metric prompt to handle the nuanced edge cases that require reading the transcript.
Trigger Prompt Template
Write triggers with the positive-then-negative pattern:
Evaluate whether this call involves [specific scenario].
Return TRUE if ANY of these indicators are present:
- [Positive indicator 1]
- [Positive indicator 2]
Do NOT trigger if ANY of these apply:
- Call is under 30 seconds or contains no substantive interaction beyond a greeting
- Line disconnection / voicemail / outbound non-engagement
- [Specific exclusion for this metric — e.g., "Emergency-flow transfers (covered by Emergency metric)"]
- [Another exclusion]
Be inclusive — if there's reasonable evidence the scenario occurred, return TRUE.
Always include the short-call exclusion. Calls under ~30 seconds (hang-ups, wrong numbers, voicemails) produce false positives/negatives on every metric. Gate them out at the trigger level.
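One way to keep the short-call exclusion from being forgotten is to assemble trigger prompts from a helper that always prepends it. The sketch below does that with plain string building. `build_trigger_prompt` and `SHORT_CALL_EXCLUSIONS` are hypothetical names, not part of Cekura, and the resulting text would still be pasted or sent as the trigger prompt by whatever means the platform provides.

```python
# Exclusions that every trigger should carry, taken from the template above.
SHORT_CALL_EXCLUSIONS = [
    "Call is under 30 seconds or contains no substantive interaction beyond a greeting",
    "Line disconnection / voicemail / outbound non-engagement",
]


def build_trigger_prompt(scenario, indicators, exclusions=()):
    """Assemble a positive-then-negative trigger prompt as a single string."""
    if not indicators:
        raise ValueError("at least one positive indicator is required")
    lines = [
        f"Evaluate whether this call involves {scenario}.",
        "Return TRUE if ANY of these indicators are present:",
    ]
    lines += [f"- {item}" for item in indicators]
    lines.append("Do NOT trigger if ANY of these apply:")
    lines += [f"- {item}" for item in [*SHORT_CALL_EXCLUSIONS, *exclusions]]
    lines.append(
        "Be inclusive: if there's reasonable evidence the scenario occurred, return TRUE."
    )
    return "\n".join(lines)


prompt = build_trigger_prompt(
    "a booking request",
    ["Caller asks to schedule an appointment"],
    ["Emergency-flow transfers (covered by Emergency metric)"],
)
print(prompt)
```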
Trigger Produces N/A Output
When evaluation_trigger: "custom" and the trigger returns false, the metric outputs N/A and is not evaluated. This means even binary metrics (True/False) can have three outcomes: True, False, or N/A. This is correct behavior for conditional metrics.
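The three-outcome behavior can be pictured as a gate in front of the judge. The sketch below is only a mental model: `trigger` and `judge` are keyword-matching stand-ins for what would be LLM calls on the platform, and `None` is used here to represent N/A.

```python
from typing import Callable, Optional


def evaluate_conditional_metric(
    transcript: str,
    trigger: Callable[[str], bool],
    judge: Callable[[str], bool],
) -> Optional[bool]:
    """Return the judge's True/False, or None (N/A) when the trigger does not fire."""
    if not trigger(transcript):
        return None  # gated out: the judge never runs, so no evaluation cost
    return judge(transcript)


# Stand-in trigger and judge for illustration only.
def has_booking_intent(transcript: str) -> bool:
    return "book" in transcript.lower()


def confirmed_time(transcript: str) -> bool:
    return "confirmed" in transcript.lower()


print(evaluate_conditional_metric("What are your hours?", has_booking_intent, confirmed_time))
print(evaluate_conditional_metric("I want to book. Confirmed for 3pm.", has_booking_intent, confirmed_time))
print(evaluate_conditional_metric("I want to book a visit.", has_booking_intent, confirmed_time))
```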
Key Patterns
VALID_SKIP Pattern
Use for legitimate deviations where the metric should not apply (tool failures, user hangup before the flow starts, caller requesting a transfer immediately). The LLM returns an explanation starting with "VALID_SKIP:" and the custom_code wrapper converts to
.
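A wrapper for this pattern might look like the sketch below. It is an illustration under assumptions: the function name is hypothetical, and the string "N/A" stands in for whatever value the platform actually expects for a skipped metric.

```python
VALID_SKIP_PREFIX = "VALID_SKIP:"
NOT_APPLICABLE = "N/A"  # assumption: placeholder for the platform's real N/A value


def apply_valid_skip(llm_result, explanation):
    """Override the LLM result with N/A when the explanation starts with VALID_SKIP:."""
    cleaned = explanation.strip()
    if cleaned.startswith(VALID_SKIP_PREFIX):
        return NOT_APPLICABLE, cleaned
    return llm_result, cleaned


skipped = apply_valid_skip(
    False, "VALID_SKIP: caller disconnected before booking could begin."
)
failed = apply_valid_skip(False, "Agent skipped address verification at 01:12.")
print(skipped[0], failed[0])
```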
Gated Metrics
Access upstream metric results via data.get("Exact Metric Name"). The key must match the upstream metric's name exactly. Use this to branch evaluation logic based on prior classification.
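As a rough illustration of gating, the sketch below branches on an upstream classification read with `data.get(...)`. The upstream metric name "Call Intent", its "booking" value, and the branch labels are all hypothetical. The second call shows why the key must match exactly: a differently cased key simply returns nothing.

```python
UPSTREAM_METRIC = "Call Intent"  # hypothetical upstream metric name


def select_branch(data: dict) -> str:
    """Pick an evaluation branch from the upstream metric's result."""
    intent = data.get(UPSTREAM_METRIC)
    if intent == "booking":
        return "evaluate_booking_flow"
    # Missing key (upstream did not run, or the name does not match) or another intent.
    return "not_applicable"


print(select_branch({"Call Intent": "booking"}))
print(select_branch({"call intent": "booking"}))  # key casing differs, so no match
print(select_branch({"Call Intent": "emergency"}))
```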
Pythonic Section Extraction
Extract only the relevant sections from the agent description before passing it to the LLM. This prevents context drift from irrelevant description sections and reduces token usage. See references/pythonic-patterns.md for the extraction utility.
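The referenced utility is not reproduced here. As a stand-in, the sketch below shows the general idea under one assumption: the agent description uses Markdown-style "## " headings to separate sections. The function name and the keyword matching are hypothetical.

```python
def extract_sections(agent_description, wanted_keywords):
    """Keep only sections whose '## ' heading contains one of the keywords."""
    kept, keeping = [], False
    for line in agent_description.splitlines():
        if line.startswith("## "):
            # A new section starts: decide whether to keep it based on its title.
            title = line[3:].strip().lower()
            keeping = any(keyword.lower() in title for keyword in wanted_keywords)
        if keeping:
            kept.append(line)
    return "\n".join(kept).strip()


description = (
    "## Greeting\nSay hello.\n"
    "## Booking Flow\nCollect date and time.\n"
    "## Emergency\nTransfer immediately."
)
print(extract_sections(description, ["booking"]))
```

Matching on concept keywords rather than exact section names keeps the helper usable across agents whose descriptions label sections differently.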
N/A Conditions
Check first for conditions where the metric should not apply:
- Immediate transfer/human request within first 1-2 exchanges
- Caller hangup before flow begins
- Out-of-scope caller (wrong number, sales call)
- Infrastructure failure preventing flow execution
- Agent description lacks the relevant section (for optional workflows)
Dynamic Variable-Driven Generalized Metrics
For clients that inject per-call dynamic variables (e.g., per-node system prompts, feature flags, employment types), create metrics that adapt to each call instead of hardcoding behavior.
Pattern: one metric per injected prompt variable. Each metric references ONLY its specific {{dynamic_variables.promptName}}, not the full blob or
.
See references/advanced-patterns.md for the example prompt structure and the discovery workflow for finding dynamic variables in real calls.
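To make the one-metric-per-variable pattern concrete, the sketch below generates a scoped prompt stub for each injected prompt key. The keys (`greetingPrompt`, `bookingPrompt`) and the stub wording are hypothetical; the real prompt structure lives in the referenced file.

```python
def prompts_per_variable(prompt_keys):
    """Return {key: prompt stub}, each referencing only its own dynamic variable."""
    prompts = {}
    for key in prompt_keys:
        variable = "{{dynamic_variables." + key + "}}"
        prompts[key] = (
            "SCOPE & FOCUS: Evaluate ONLY whether the agent followed the "
            "instructions in " + variable + ". Ignore every other instruction source."
        )
    return prompts


prompts = prompts_per_variable(["greetingPrompt", "bookingPrompt"])
for key, text in prompts.items():
    print(key, "->", text)
```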
Tool Call Hallucination Metrics
For agents with detailed tool definitions, build a metric that evaluates whether the agent called the correct tool for each situation — "action hallucination" (wrong action) vs "fact hallucination" (wrong information). Pattern: extract tool→scenario mapping from agent description, encode as explicit FAILURE CONDITIONS (closed list), DO NOT FLAG API errors / known quirks.
See references/advanced-patterns.md for the full prompt structure and TOOL-TO-SCENARIO MAPPING template.
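The step of encoding a tool-to-scenario mapping as a closed list can be sketched as below. This is not the template from the referenced file: the scenario descriptions and tool names (`create_booking`, `transfer_call`) are invented for illustration, and the wording of each condition is an assumption.

```python
def failure_conditions(tool_map):
    """Render a {scenario: expected_tool} mapping as a numbered closed list."""
    lines = ["FAILURE CONDITIONS (Only These Count):"]
    for number, (scenario, tool) in enumerate(tool_map.items(), start=1):
        lines.append(
            f"{number}. {scenario}, but the agent did not call {tool} "
            "or called a different tool."
        )
    return "\n".join(lines)


# Hypothetical mapping extracted from an agent description.
tool_map = {
    "Caller confirms an appointment time": "create_booking",
    "Caller asks for a human": "transfer_call",
}
section = failure_conditions(tool_map)
print(section)
```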
Baseline Metrics — Always Recommend
Every agent should have at minimum these predefined metrics enabled for both observability and simulations:
| Metric | Purpose | Why It Matters |
|---|---|---|
| Expected Outcome | Checks if the agent achieved the scenario's expected result | Without this, runs pass/fail based only on call completion — not correctness |
| Infrastructure Issues | Flags silent periods, connection drops, agent non-response | Catches issues like agent going silent for 10+ seconds that aren't visible in pass/fail |
| Tool Call Success | Monitors whether tool calls succeed or fail | Requires provider integration (assistant IDs + API keys) to get toolcall data in transcripts |
| Latency | Measures response time | Identifies performance degradation |
Two-step activation required: Metrics must be (1) toggled on for simulations at the project level AND (2) added to individual evaluators. Missing either step means metrics won't fire. Without metrics enabled, users get false passes and must manually review every run.
Expected Outcome is transcript-only — it cannot evaluate audio-layer behavior. Expected Outcome reads the conversation text to determine whether the agent achieved its goal. It has no visibility into silences, interruptions, barge-ins, audio dropouts, or other voice-channel signals. Do not rely on Expected Outcome to catch these. For anything that depends on the audio stream rather than conversation content, use predefined metrics instead.
Toolcall data prerequisite: Tool Call Success and advanced monitoring require the agent to have its provider assistant ID configured on Cekura and complete call data being sent. If transcripts are missing toolcall data, recommend the user configure their provider integration.
Output Requirements
All metrics must require:
- Brief explanation of the result (what happened and why)
- For failures: specific timestamps in MM:SS format where violations occurred
- For metadata-based checks: reference the specific metadata fields examined
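When reviewing sample outputs, a quick check like the one below can flag results that break these requirements. It is a local review aid under assumptions: results are represented as "TRUE"/"FALSE" strings, and the function name is hypothetical.

```python
import re

# Two-digit minutes and seconds, e.g. 01:12.
MMSS = re.compile(r"\b\d{2}:[0-5]\d\b")


def output_problems(result, explanation):
    """List the output requirements that a metric result fails to meet."""
    problems = []
    if not explanation.strip():
        problems.append("missing explanation")
    if result == "FALSE" and not MMSS.search(explanation):
        problems.append("failure without MM:SS timestamp")
    return problems


print(output_problems("FALSE", "Agent skipped verification."))
print(output_problems("FALSE", "Agent skipped verification at 01:12."))
```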
Common Custom Metrics Worth Suggesting
Beyond the baseline predefined metrics, these are commonly valuable custom metrics based on patterns seen across clients:
- Question stacking / information dumping — Agent asking 3+ unrelated questions in one turn or dumping large blocks of information. Poor UX that overwhelms callers.
- Workflow adherence — Agent follows the defined flow steps in order (booking, verification, cancellation, etc.)
- Soft skills — Tone, empathy, appropriate greetings, not exposing system internals
- Business context accuracy — Agent provides correct business information (hours, addresses, pricing)
- Transfer/callback handling — Agent follows proper protocol when transferring or scheduling callbacks
Operational Rules
Cost Guard — Never Evaluate >100 Calls Without Confirmation
Each evaluation costs the client real money. Before evaluating metrics on a batch of calls, ALWAYS query the call count first (read it from the response) and report the number to the user. If the count exceeds 100, stop and ask for explicit approval before proceeding. Fetch up to 200 calls per request instead of paginating, and use server-side filters to scope calls.
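The guard itself is a simple threshold check. The sketch below shows the decision only; obtaining `call_count` from the API and asking the user for approval are left out, and the function name is hypothetical.

```python
CONFIRMATION_THRESHOLD = 100  # from the rule above: more than 100 calls needs approval


def may_evaluate(call_count, user_confirmed=False):
    """Return True when a batch evaluation may proceed."""
    if call_count < 0:
        raise ValueError("call_count cannot be negative")
    return call_count <= CONFIRMATION_THRESHOLD or user_confirmed


for count, confirmed in [(100, False), (101, False), (101, True)]:
    print(count, confirmed, may_evaluate(count, confirmed))
```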
Manual Fix First, Then Labs
When metrics have systemic issues (high false-fail rates), do NOT jump straight to labs feedback. Instead:
- Read failure explanations and categorize root causes (e.g., extra_questions, end_protocol, should_be_na)
- Write manual prompt fixes targeting the dominant failure categories
- PATCH the updated descriptions via API
- Re-evaluate a sample of 20-30 calls per metric to validate the fixes
- THEN use labs feedback for remaining edge cases that manual fixes didn't catch
This avoids wasting labs iterations on issues that are clearly fixable by prompt editing.
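The first step, categorizing root causes, can be approximated with a keyword pass before any manual reading. The sketch below is a rough triage aid: the keyword rules are hypothetical and would need to be derived from the actual failure explanations.

```python
from collections import Counter

# Hypothetical keyword rules; category names follow the examples in the list above.
CATEGORY_KEYWORDS = {
    "extra_questions": ["extra question", "multiple questions"],
    "end_protocol": ["closing", "goodbye"],
    "should_be_na": ["hung up", "disconnected", "voicemail"],
}


def categorize(explanation):
    """Return the first category whose keywords appear in the explanation."""
    text = explanation.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return category
    return "uncategorized"


def dominant_categories(explanations):
    """Count categories across explanations, most frequent first."""
    return Counter(categorize(e) for e in explanations).most_common()


explanations = [
    "Agent asked multiple questions at 00:42.",
    "Caller hung up before booking began.",
    "Agent asked an extra question at 01:10.",
    "No goodbye was given at 03:05.",
]
ranked = dominant_categories(explanations)
print(ranked)
```

The top entries indicate which prompt fixes to write first.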
Common Pitfalls
- Writing metrics without reading real transcripts first — always fetch and study actual transcript_json before writing
- Putting the llm_judge prompt in the wrong field
- Using deprecated types (the API returns 400)
- Using custom_code for checks the LLM can handle naturally (timestamps, tool call detection)
- Not matching upstream metric name exactly for gated metrics
- Passing full agent description when only a section is relevant (context drift)
- Missing VALID_SKIP handling in custom_code metrics
- No N/A conditions for conditional metrics
- Taking agent description instructions literally instead of capturing their spirit
- Not including safeguarding examples for nuanced evaluation criteria
- Omitting timestamps in failure explanations
Next Steps
After creating a metric, the user typically needs:
- Validate it on real calls → use the evaluate-calls flow (see references/api-reference.md)
- Iterate on accuracy → invoke cekura-metric-improvement to run the labs feedback cycle
- Design test scenarios that exercise this metric → invoke cekura-eval-design
Documentation
See references/api-reference.md for complete endpoint documentation and field schemas.
Additional Resources
Reference Files (loaded on demand)
- references/prompt-patterns.md: Full LLM judge prompt templates with real examples
- references/pythonic-patterns.md: Section extraction utility and custom_code patterns
- references/advanced-patterns.md: Anti-cross-pollination scoping, dynamic-variable-driven metrics, tool-call hallucination metrics
- references/api-reference.md: Complete Cekura metrics API endpoints and schemas
Example Files
- examples/llm-judge-metric.md: Complete llm_judge metric example (sectioned structure)
- examples/narrative-metric.md: Complete llm_judge metric example (narrative structure)
- examples/custom-code-metric.py: Complete custom_code metric with gating and VALID_SKIP
- examples/section-extraction-metric.py: Pythonic metric with agent description scoping