Cekura Metric Improvement (Labs Workflow)
Purpose
Guide the metric improvement cycle: identify misaligned metric results, leave structured feedback, run the labs improvement pipeline, and validate changes. This workflow transforms metric quality from initial draft to production-ready through systematic iteration.
Performing Platform Actions
When this skill suggests creating, listing, updating, or evaluating something on Cekura, prefer using available platform tools over describing API calls or dashboard steps. In Claude Code with the Cekura plugin installed, these tools are auto-configured and handle authentication, parameter validation, and error handling for you. Fall back to direct API endpoints or dashboard guidance only when no tools are available in the current session.
Manual Fix First, Then Labs
When metrics have systemic issues (high false-fail rates), do NOT jump straight to labs feedback. Instead:
- Read failure explanations and categorize root causes (e.g., cross-pollination from other flows, extra_questions flagged, end-of-call protocol violations, should-be-N/A cases)
- Write manual prompt fixes targeting the dominant failure categories: add SCOPE & FOCUS and DO NOT FLAG sections, and narrow the FAILURE CONDITIONS
- PATCH the updated descriptions via API
- Re-evaluate a sample of 20-30 calls per metric to validate the fixes
- THEN use labs feedback for remaining edge cases that manual fixes didn't catch
This avoids wasting labs iterations on issues that are clearly fixable by prompt editing. Labs is for nuanced edge cases, not systemic prompt design flaws.
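The categorization step above can be sketched as a simple keyword tally. This is a minimal illustration, not part of Cekura: the category names mirror the examples in the list, while the keyword lists and both helper functions are hypothetical and should be tuned to the failure explanations you actually see.

```python
from collections import Counter

# Hypothetical keyword rules; adjust them to match real failure explanations.
CATEGORY_KEYWORDS = {
    "cross_pollination": ["other flow", "different flow"],
    "extra_questions": ["extra question", "additional question"],
    "end_of_call": ["end of call", "closing"],
    "should_be_na": ["not applicable", "never reached"],
}


def categorize(explanation: str) -> str:
    """Return the first category whose keyword appears in the explanation."""
    text = explanation.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return category
    return "uncategorized"


def dominant_categories(explanations: list[str]) -> list[tuple[str, int]]:
    """Tally categories across explanations, most frequent first."""
    return Counter(categorize(e) for e in explanations).most_common()
```

Categories that dominate the tally are the candidates for a manual prompt fix; a long "uncategorized" tail suggests the keyword rules need another pass before drawing conclusions.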
The Labs Improvement Cycle
For edge case refinement after manual fixes are validated:
- Identify misalignment: Find calls where metric results seem wrong
- Leave feedback: Vote agree/disagree with explanation on specific results
- Accumulate feedback: Collect 6+ feedback instances so labs has enough signal to find patterns
- Run auto-improve: Labs uses feedback to suggest metric prompt changes
- Validate changes: Re-run the improved metric on the same calls to verify
- Deploy: Once satisfied, optionally convert to Pythonic custom_code for production
Step 1: Identify Misaligned Results
Manual Approach
Review recent call evaluations to find suspicious results:
List recent calls (with agent filters) and retrieve specific calls to get evaluation results — see "API Endpoints Reference" below.
Look for:
- FALSE results that seem like they should be TRUE (false negatives)
- TRUE results that seem wrong (false positives)
- Unexpected N/A results
- Inconsistent results across similar calls
Guided Approach (Simulate Labs)
To systematically find misalignment:
- Pull metric results for recent calls (see "API Endpoints Reference" below)
- Focus on metrics with high variance or unexpected result distributions
- Read the transcript alongside the metric explanation
- Ask the user: "This call was marked [TRUE/FALSE]. The explanation says [X]. Does this seem correct?"
- If the user disagrees, proceed to leave feedback
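One way to spot metrics with unexpected result distributions is to tally outcomes per metric across the pulled calls. The sketch below assumes each result has already been flattened into a (metric name, result) pair; that shape and the `fail_rates` helper are illustrative assumptions, not the API's response format.

```python
from collections import Counter, defaultdict


def fail_rates(results: list[tuple[str, str]]) -> dict[str, float]:
    """Return the share of FALSE results per metric, ignoring N/A."""
    counts: dict[str, Counter] = defaultdict(Counter)
    for metric, outcome in results:
        counts[metric][outcome] += 1

    rates = {}
    for metric, tally in counts.items():
        scored = tally["TRUE"] + tally["FALSE"]
        if scored:  # skip metrics that only produced N/A
            rates[metric] = tally["FALSE"] / scored
    return rates
```

Metrics whose fail rate is far from what the team expects are a reasonable place to start reading transcripts.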
Step 2: Leave Feedback
Use the mark_metric_vote endpoint to leave structured feedback. First retrieve the call to find the metric result, then POST the feedback (see "API Endpoints Reference" below).
Good Feedback Patterns
- Reference specific transcript moments: "At 02:15, the agent said X which should be [PASS/FAIL] because..."
- Explain the reasoning: "The metric is being too literal about the one-question rule"
- Suggest the correct outcome: "This should be TRUE because the agent was confirming related details"
- Point out missing context: "The metric didn't account for the tool failure at 01:30"
Bad Feedback Patterns
- Vague: "This is wrong" (no explanation)
- No reference: "Should be TRUE" (no transcript evidence)
- Contradictory: Disagreeing without explaining what the correct behavior should be
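The good and bad patterns above can be turned into a quick pre-submit check. This is a rough heuristic sketch: the `feedback_problems` helper, the word-count cutoff, and the mm:ss timestamp convention are assumptions made for illustration.

```python
import re

TIMESTAMP = re.compile(r"\b\d{1,2}:\d{2}\b")  # matches e.g. 02:15


def feedback_problems(text: str) -> list[str]:
    """Return reasons the feedback may be too weak to act on."""
    problems = []
    if len(text.split()) < 8:  # arbitrary cutoff for "vague"
        problems.append("too short to explain the reasoning")
    if not TIMESTAMP.search(text):
        problems.append("no transcript timestamp referenced")
    if "because" not in text.lower():
        problems.append("no stated reason (missing 'because')")
    return problems
```

An empty list does not prove the feedback is good; it only means none of these three basic checks fired.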
Step 3: Accumulate Feedback
Collect at least 6 feedback instances before running auto-improve. This gives labs enough signal to identify patterns in the feedback and make meaningful prompt adjustments.
Track feedback progress:
- How many agree/disagree votes have been left
- What patterns emerge (e.g., "metric is consistently too strict about X")
- Whether feedback is balanced across different call types
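Progress tracking of this kind can be kept in a small local structure. The sketch below is a hypothetical helper, not a Cekura API; only the threshold of 6 comes from this workflow.

```python
from dataclasses import dataclass, field

MIN_FEEDBACK = 6  # threshold stated in this workflow


@dataclass
class FeedbackTracker:
    """Counts votes for one metric and notes recurring themes."""

    agree: int = 0
    disagree: int = 0
    themes: dict[str, int] = field(default_factory=dict)

    def record(self, vote: str, theme: str = "") -> None:
        """Register one vote and, optionally, the pattern it illustrates."""
        if vote not in ("agree", "disagree"):
            raise ValueError(f"unknown vote: {vote!r}")
        if vote == "agree":
            self.agree += 1
        else:
            self.disagree += 1
        if theme:
            self.themes[theme] = self.themes.get(theme, 0) + 1

    @property
    def ready(self) -> bool:
        """True once enough feedback exists to run auto-improve."""
        return self.agree + self.disagree >= MIN_FEEDBACK
```

Recording a theme with each vote makes the emerging patterns visible before auto-improve is triggered.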
Step 4: Run Auto-Improve
Once 6+ feedback instances are accumulated:
Trigger auto-improvement via POST /test_framework/metric-reviews/process_feedbacks/ with the metric ID.
Labs analyzes the feedback and suggests changes to the metric prompt. Review the suggested changes carefully:
- Do the changes address the feedback patterns?
- Are the changes too broad (might break other cases)?
- Do the safeguarding examples align with the feedback?
Cost Guard — Never Evaluate >100 Calls Without Confirmation
Each evaluation costs the client real money. Before triggering any bulk evaluation:
- Query the call count first (list the matching calls and read the count from the response)
- Report the number to the user
- If count > 100, stop and ask for explicit approval before proceeding
Use the page-size parameter (up to 200) instead of paginating through multiple pages. Use server-side filters to scope calls before evaluating.
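The cost guard reduces to a single check before any bulk evaluation. This is a minimal sketch; the `may_evaluate` helper is hypothetical, and only the limit of 100 comes from the rule above.

```python
MAX_UNCONFIRMED_CALLS = 100  # limit stated in the cost guard


def may_evaluate(call_count: int, user_approved: bool = False) -> bool:
    """Return True only if a bulk evaluation is allowed to proceed."""
    if call_count < 0:
        raise ValueError("call_count must be non-negative")
    if call_count > MAX_UNCONFIRMED_CALLS:
        return user_approved  # above the limit, explicit approval is required
    return True
```

Defaulting `user_approved` to False means a caller that forgets to ask the user is blocked rather than billed.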
Step 5: Validate Changes
Re-run the improved metric on the same calls that had misaligned results:
Use POST /observability/v1/call-logs/rerun_evaluation/ with the call IDs and metric ID.
Check:
- Do the previously misaligned calls now produce correct results?
- Do previously correct calls still produce correct results (no regression)?
- Are the explanations clearer and more accurate?
If validation fails, leave additional feedback and iterate.
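The first two checks can be automated by comparing results before and after the rerun against the outcome the user expects. The sketch below assumes results are held as plain dicts keyed by call ID; that shape and the `compare_runs` helper are illustrative assumptions.

```python
def compare_runs(
    before: dict[str, str], after: dict[str, str], expected: dict[str, str]
) -> dict[str, list[str]]:
    """Classify each call by how the rerun changed its correctness."""
    report: dict[str, list[str]] = {"fixed": [], "regressed": [], "still_wrong": []}
    for call_id, want in expected.items():
        was_ok = before.get(call_id) == want
        is_ok = after.get(call_id) == want
        if not was_ok and is_ok:
            report["fixed"].append(call_id)
        elif was_ok and not is_ok:
            report["regressed"].append(call_id)
        elif not was_ok and not is_ok:
            report["still_wrong"].append(call_id)
    return report
```

Any entry under "regressed" or "still_wrong" is a candidate for another round of feedback.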
Step 6: Deploy (Optional Pythonic Conversion)
Once the metric prompt is validated through labs, consider converting to a Pythonic custom_code metric for production:
- Take the final validated prompt from the llm_judge field
- Wrap it in custom_code with section extraction (see metric-design skill)
- Set both fields on the metric: the prompt and the Python wrapper
- Switch the active type to custom_code
This gives the benefit of the labs-refined prompt with the performance advantage of targeted context extraction.
Interactive Labs Simulation
When the user wants to simulate the labs workflow interactively:
- Fetch recent calls and metric results
- Present potentially misaligned results to the user one at a time
- Ask: "Call [ID] was scored [result]. Here's the explanation: [explanation]. The transcript shows: [relevant excerpt]. Do you agree with this result?"
- If user disagrees, generate the feedback payload and submit via mark_metric_vote
- Track feedback count and notify when 6+ is reached
- Offer to trigger auto-improve
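The question in the loop above follows a fixed template, so it can be generated rather than retyped. A minimal sketch, with a hypothetical `review_question` helper:

```python
def review_question(call_id: str, result: str, explanation: str, excerpt: str) -> str:
    """Fill the review prompt template used in the interactive loop."""
    return (
        f"Call {call_id} was scored {result}. "
        f"Here's the explanation: {explanation}. "
        f"The transcript shows: {excerpt}. "
        "Do you agree with this result?"
    )
```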
Next Steps
After improving a metric, the user typically needs:
- Re-run on a fresh call sample to confirm the fix sticks (use the evaluate-calls / rerun_evaluation endpoints)
- Apply similar patterns to other metrics → invoke cekura-metric-design for new metrics that need the same scoping
- Validate against test scenarios → invoke cekura-eval-design if metric behavior depends on specific eval flows
API Endpoints Reference
| Endpoint | Purpose |
|---|---|
| GET /observability/v1/call-logs-external/?agent=ID | List calls |
| GET /observability/v1/call-logs-external/{id}/ | Get call details + evaluation results |
| POST /observability/v1/call-logs-external/{id}/mark_metric_vote/ | Leave feedback |
| POST /test_framework/metric-reviews/process_feedbacks/ | Run labs auto-improve (see below) |
| GET /test_framework/metric-reviews/process_feedbacks_progress/ | Poll improvement progress |
| POST /observability/v1/call-logs/evaluate_metrics/ | Evaluate specific metrics on calls |
| POST /observability/v1/call-logs/rerun_evaluation/ | Re-run evaluation on calls |
| POST /test_framework/test-sets/create_from_call_log/ | Create test set from call log |
Labs Auto-Improve (process_feedbacks)
POST /test_framework/metric-reviews/process_feedbacks/
{
  "metric_id": 123,
  "test_set_ids": [456, 789]
}
Three optional fields are also accepted: one that defaults to "llm_judge", one bool, and one string.
Returns {"progress_id": "<uuid>"}. Poll at GET /test_framework/metric-reviews/process_feedbacks_progress/?progress_id=<uuid>.
When complete, the response includes the improved metric fields. You must PATCH the metric to apply the changes; they are not auto-applied.
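The polling step can be sketched as a bounded loop. The response shape is not documented here, so the `status` key and its "completed" and "failed" values are assumptions, and `fetch_progress` is a caller-supplied function that performs the GET and returns the parsed JSON.

```python
import time
from typing import Callable


def wait_for_improvement(
    fetch_progress: Callable[[str], dict],
    progress_id: str,
    interval_s: float = 5.0,
    max_polls: int = 60,
) -> dict:
    """Poll until the job reports completion, then return that response."""
    for _ in range(max_polls):
        response = fetch_progress(progress_id)
        status = response.get("status")  # assumed field name
        if status == "completed":  # assumed value
            return response
        if status == "failed":  # assumed value
            raise RuntimeError(f"auto-improve failed: {response}")
        time.sleep(interval_s)
    raise TimeoutError(f"no result after {max_polls} polls")
```

Passing the fetch function in keeps authentication and HTTP details out of the loop and makes it easy to exercise with canned responses.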
Create Test Set from Call Log
POST /test_framework/test-sets/create_from_call_log/
{
  "call_log_id": 3358270,
  "metrics": [{"metric": 123, "feedback": "The metric incorrectly failed this call because..."}]
}
Note: metrics must be an array of objects [{"metric": <id>, "feedback": "<text>"}], NOT bare metric IDs. Passing bare IDs returns 500.
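To avoid sending bare IDs by accident, the request body can be built from a mapping of metric ID to feedback text. This is a small sketch; the `build_test_set_payload` helper is hypothetical, while the field names follow the example payload above.

```python
def build_test_set_payload(call_log_id: int, feedback_by_metric: dict[int, str]) -> dict:
    """Build the request body with metrics as a list of objects."""
    if not feedback_by_metric:
        raise ValueError("at least one metric is required")
    return {
        "call_log_id": call_log_id,
        "metrics": [
            {"metric": metric_id, "feedback": text}
            for metric_id, text in feedback_by_metric.items()
        ],
    }
```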
Additional Resources
Reference Files
- references/feedback-examples.md: Examples of good feedback for different metric types