cekura-metric-improvement

Cekura Metric Improvement (Labs Workflow)

Purpose

Guide the metric improvement cycle: identify misaligned metric results, leave structured feedback, run the labs improvement pipeline, and validate changes. This workflow transforms metric quality from initial draft to production-ready through systematic iteration.

Performing Platform Actions

When this skill suggests creating, listing, updating, or evaluating something on Cekura, prefer using available platform tools over describing API calls or dashboard steps. In Claude Code with the Cekura plugin installed, these tools are auto-configured and handle authentication, parameter validation, and error handling for you. Fall back to direct API endpoints or dashboard guidance only when no tools are available in the current session.

Manual Fix First, Then Labs

When metrics have systemic issues (high false-fail rates), do NOT jump straight to labs feedback. Instead:
  1. Read failure explanations and categorize root causes (e.g., cross-pollination from other flows, extra_questions flagged, end-of-call protocol violations, should-be-N/A cases)
  2. Write manual prompt fixes targeting the dominant failure categories — add SCOPE & FOCUS, DO NOT FLAG, narrow FAILURE CONDITIONS
  3. PATCH the updated descriptions via API
  4. Re-evaluate a sample of 20-30 calls per metric to validate the fixes
  5. THEN use labs feedback for remaining edge cases that manual fixes didn't catch
This avoids wasting labs iterations on issues that are clearly fixable by prompt editing. Labs is for nuanced edge cases, not systemic prompt design flaws.
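Step 1 of this list (categorizing root causes) can be roughed out in code before any prompt is edited. The following is a minimal sketch, assuming failure explanations are available as plain strings; the category names mirror the examples above, while the trigger phrases and function names are hypothetical and would need tuning against real explanations.

```python
from collections import Counter

# Hypothetical keyword rules. The categories follow the examples in the text;
# the trigger phrases are illustrative assumptions, not Cekura output.
CATEGORY_KEYWORDS = {
    "cross_pollination": ["other flow", "different flow"],
    "extra_questions": ["extra question", "additional question"],
    "end_of_call": ["end of call", "closing", "goodbye"],
    "should_be_na": ["not applicable", "never reached", "did not occur"],
}


def categorize(explanation: str) -> str:
    """Return the first category whose keyword appears in the explanation."""
    text = explanation.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return category
    return "uncategorized"


def dominant_categories(explanations: list[str]) -> list[tuple[str, int]]:
    """Tally categories across failure explanations, most frequent first."""
    return Counter(categorize(e) for e in explanations).most_common()


sample = [
    "Agent asked an extra question about billing.",
    "Failure relates to a step from the other flow.",
    "Agent asked an additional question before confirming.",
    "The booking step was never reached in this call.",
]
print(dominant_categories(sample))
```

The most frequent categories are the ones worth a manual prompt fix; anything left in "uncategorized" or in the long tail is a candidate for labs feedback.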

The Labs Improvement Cycle

For edge case refinement after manual fixes are validated:
  1. Identify misalignment: find calls where metric results seem wrong
  2. Leave feedback: vote agree/disagree on specific results, with an explanation
  3. Accumulate feedback: collect 6+ feedback instances so labs has enough signal to detect patterns
  4. Run auto-improve: labs uses the feedback to suggest metric prompt changes
  5. Validate changes: re-run the improved metric on the same calls to verify
  6. Deploy: once satisfied, optionally convert to Pythonic custom_code for production

Step 1: Identify Misaligned Results

Manual Approach

Review recent call evaluations to find suspicious results:
List recent calls (with agent filters) and retrieve specific calls to get evaluation results — see "API Endpoints Reference" below.
Look for:
  • FALSE results that seem like they should be TRUE (false negatives)
  • TRUE results that seem wrong (false positives)
  • Unexpected N/A results
  • Inconsistent results across similar calls

Guided Approach (Simulate Labs)

To systematically find misalignment:
  1. Pull metric results for recent calls via evaluate_calls or list_calls
  2. Focus on metrics with high variance or unexpected result distributions
  3. Read the transcript alongside the metric explanation
  4. Ask the user: "This call was marked [TRUE/FALSE]. The explanation says [X]. Does this seem correct?"
  5. If the user disagrees, proceed to leave feedback
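Step 2 of this list (spotting unexpected result distributions) can be sketched as a small tally. This is an illustration only: the row shape with "metric" and "result" keys and the 50% FALSE threshold are assumptions made for the example, not the shape the platform tools return.

```python
from collections import Counter, defaultdict


def result_distribution(results: list[dict]) -> dict[str, dict[str, float]]:
    """Group rows by metric and return the share of each outcome per metric."""
    counts: dict[str, Counter] = defaultdict(Counter)
    for row in results:
        counts[row["metric"]][row["result"]] += 1
    return {
        metric: {outcome: n / sum(c.values()) for outcome, n in c.items()}
        for metric, c in counts.items()
    }


def suspicious_metrics(results: list[dict], max_false_share: float = 0.5) -> list[str]:
    """Return metrics whose FALSE share exceeds the (assumed) threshold."""
    dist = result_distribution(results)
    return sorted(
        metric for metric, shares in dist.items()
        if shares.get("FALSE", 0.0) > max_false_share
    )


# Hypothetical rows: one per (call, metric) evaluation.
rows = (
    [{"metric": "greeting", "result": "TRUE"}] * 3
    + [{"metric": "greeting", "result": "FALSE"}]
    + [{"metric": "one_question", "result": "FALSE"}] * 3
    + [{"metric": "one_question", "result": "TRUE"}]
)
print(suspicious_metrics(rows))
```

A metric flagged here is not necessarily wrong; it only marks where reading transcripts alongside explanations is most likely to pay off.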

Step 2: Leave Feedback

Use the mark_metric_vote endpoint to leave structured feedback. First retrieve the call to find the metric result, then POST the feedback (see "API Endpoints Reference" below).

Good Feedback Patterns

  • Reference specific transcript moments: "At 02:15, the agent said X which should be [PASS/FAIL] because..."
  • Explain the reasoning: "The metric is being too literal about the one-question rule"
  • Suggest the correct outcome: "This should be TRUE because the agent was confirming related details"
  • Point out missing context: "The metric didn't account for the tool failure at 01:30"

Bad Feedback Patterns

  • Vague: "This is wrong" (no explanation)
  • No reference: "Should be TRUE" (no transcript evidence)
  • Incomplete: Disagreeing without explaining what the correct behavior should be
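The good and bad patterns above can be turned into a quick pre-submission check. The sketch below is a heuristic, not part of the Cekura API: the eight-word minimum, the mm:ss timestamp pattern, and the outcome keywords are assumptions chosen to mirror the examples in this section.

```python
import re

# Assumed heuristics derived from the patterns above.
TIMESTAMP = re.compile(r"\b\d{1,2}:\d{2}\b")          # e.g. 02:15
OUTCOME = re.compile(r"\b(TRUE|FALSE|N/A|PASS|FAIL)\b")
MIN_WORDS = 8


def feedback_problems(text: str) -> list[str]:
    """Return the reasons a feedback note looks weak; empty list means it passes."""
    problems = []
    if len(text.split()) < MIN_WORDS:
        problems.append("too short to explain the reasoning")
    if not TIMESTAMP.search(text):
        problems.append("no transcript timestamp referenced")
    if not OUTCOME.search(text):
        problems.append("no expected outcome stated")
    return problems


print(feedback_problems("This is wrong"))
print(feedback_problems(
    "At 02:15 the agent confirmed the address, so this should be TRUE "
    "because it is a related detail."
))
```

A note that passes these checks can still be poor feedback; the check only catches the vague and unreferenced cases listed above.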

Step 3: Accumulate Feedback

Collect at least 6 feedback instances before running auto-improve. This gives labs enough signal to identify patterns in the feedback and make meaningful prompt adjustments.
Track feedback progress:
  • How many agree/disagree votes have been left
  • What patterns emerge (e.g., "metric is consistently too strict about X")
  • Whether feedback is balanced across different call types
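The tracking described above fits in a small in-memory structure. This is a sketch under stated assumptions: the class name, the call_type labels, and keeping the tally locally (rather than reading it back from the platform) are all hypothetical; only the threshold of 6 comes from the text.

```python
from collections import Counter
from dataclasses import dataclass, field

MIN_FEEDBACK = 6  # threshold stated in this step


@dataclass
class FeedbackTracker:
    """Local tally of votes left for one metric (hypothetical helper)."""
    votes: Counter = field(default_factory=Counter)
    call_types: Counter = field(default_factory=Counter)

    def record(self, agree: bool, call_type: str) -> None:
        """Record one vote and the kind of call it was left on."""
        self.votes["agree" if agree else "disagree"] += 1
        self.call_types[call_type] += 1

    @property
    def total(self) -> int:
        return sum(self.votes.values())

    def ready(self) -> bool:
        """True once enough feedback exists to run auto-improve."""
        return self.total >= MIN_FEEDBACK


tracker = FeedbackTracker()
for call_type in ["booking", "booking", "cancel", "billing", "cancel"]:
    tracker.record(agree=False, call_type=call_type)
print(tracker.total, tracker.ready(), dict(tracker.call_types))
```

The call_types counter makes it easy to see whether feedback is clustered on one kind of call before triggering auto-improve.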

Step 4: Run Auto-Improve

Once 6+ feedback instances are accumulated:
Trigger auto-improvement via POST /test_framework/metric-reviews/process_feedbacks/ with the metric ID.
Labs analyzes the feedback and suggests changes to the metric prompt. Review the suggested changes carefully:
  • Do the changes address the feedback patterns?
  • Are the changes too broad (might break other cases)?
  • Do the safeguarding examples align with the feedback?

Cost Guard — Never Evaluate >100 Calls Without Confirmation

Each evaluation costs the client real money. Before triggering any bulk evaluation:
  1. Query the call count first (use page_size=1 and read the response count)
  2. Report the number to the user
  3. If count > 100, stop and ask for explicit approval before proceeding
Use the page_size parameter (up to 200) instead of paginating through multiple pages. Use server-side filters (agent_id, project, timestamp__gte / timestamp__lte) to scope calls before evaluating.
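The guard above can be expressed as two small helpers: one that builds the count-only query and one that decides whether to stop. This is a sketch, assuming the list endpoint from the reference table accepts the filters named here and that the response exposes the total under a "count" key; both helper names are hypothetical, and no request is actually sent.

```python
from urllib.parse import urlencode

CONFIRMATION_THRESHOLD = 100  # limit stated in this section
LIST_CALLS_PATH = "/observability/v1/call-logs-external/"


def count_query(filters: dict) -> str:
    """Build a URL that fetches one row, so only the total count is paid for."""
    return f"{LIST_CALLS_PATH}?{urlencode({**filters, 'page_size': 1})}"


def needs_confirmation(response: dict) -> bool:
    """True when the call count exceeds the threshold.

    Assumption: the list response carries the total under "count".
    """
    return response["count"] > CONFIRMATION_THRESHOLD


url = count_query({"agent_id": 42, "timestamp__gte": "2024-01-01"})
print(url)
print(needs_confirmation({"count": 250}))
```

When needs_confirmation returns True, report the count to the user and wait for explicit approval before triggering the evaluation.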

Step 5: Validate Changes

Re-run the improved metric on the same calls that had misaligned results:
Use POST /observability/v1/call-logs/rerun_evaluation/ with the call IDs and metric ID.
Check:
  • Do the previously misaligned calls now produce correct results?
  • Do previously correct calls still produce correct results (no regression)?
  • Are the explanations clearer and more accurate?
If validation fails, leave additional feedback and iterate.
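The first two checks above (fixes and regressions) amount to comparing results before and after the change against the outcomes the user expects. A minimal sketch, assuming results are kept as plain dictionaries keyed by call ID; the function name and data shape are hypothetical.

```python
def compare_runs(expected: dict, before: dict, after: dict) -> dict[str, list]:
    """Classify each call as fixed, regressed, or still wrong after the change."""
    report: dict[str, list] = {"fixed": [], "regressed": [], "still_wrong": []}
    for call_id, want in expected.items():
        was_ok = before.get(call_id) == want
        is_ok = after.get(call_id) == want
        if is_ok and not was_ok:
            report["fixed"].append(call_id)
        elif was_ok and not is_ok:
            report["regressed"].append(call_id)
        elif not is_ok:
            report["still_wrong"].append(call_id)
    return report


# Hypothetical outcomes keyed by call ID.
expected = {101: "TRUE", 102: "TRUE", 103: "FALSE", 104: "N/A"}
before = {101: "FALSE", 102: "TRUE", 103: "FALSE", 104: "FALSE"}
after = {101: "TRUE", 102: "FALSE", 103: "FALSE", 104: "FALSE"}
print(compare_runs(expected, before, after))
```

Any entry under "regressed" or "still_wrong" is a candidate for another round of feedback; calls that were correct both times are deliberately left out of the report.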

Step 6: Deploy (Optional Pythonic Conversion)

Once the metric prompt is validated through labs, consider converting to a Pythonic custom_code metric for production:
  1. Take the final validated prompt from the llm_judge description field
  2. Wrap it in custom_code with section extraction (see metric-design skill)
  3. Set both description (the prompt) and custom_code (the Python wrapper)
  4. Toggle to custom_code as the active type
This gives the benefit of the labs-refined prompt with the performance advantage of targeted context extraction.

Interactive Labs Simulation

When the user wants to simulate the labs workflow interactively:
  1. Fetch recent calls and metric results
  2. Present potentially misaligned results to the user one at a time
  3. Ask: "Call [ID] was scored [result]. Here's the explanation: [explanation]. The transcript shows: [relevant excerpt]. Do you agree with this result?"
  4. If user disagrees, generate the feedback payload and submit via mark_metric_vote
  5. Track feedback count and notify when 6+ is reached
  6. Offer to trigger auto-improve

Next Steps

After improving a metric, the user typically needs:
  • Re-run on a fresh call sample to confirm the fix sticks (use the evaluate-calls / rerun_evaluation endpoints)
  • Apply similar patterns to other metrics → invoke cekura-metric-design for new metrics that need the same scoping
  • Validate against test scenarios → invoke cekura-eval-design if metric behavior depends on specific eval flows

API Endpoints Reference

| Endpoint | Purpose |
| --- | --- |
| GET /observability/v1/call-logs-external/?agent=ID | List calls |
| GET /observability/v1/call-logs-external/{id}/ | Get call details + evaluation results |
| POST /observability/v1/call-logs-external/{id}/mark_metric_vote/ | Leave feedback |
| POST /test_framework/metric-reviews/process_feedbacks/ | Run labs auto-improve (see below) |
| GET /test_framework/metric-reviews/process_feedbacks_progress/ | Poll improvement progress |
| POST /observability/v1/call-logs/evaluate_metrics/ | Evaluate specific metrics on calls |
| POST /observability/v1/call-logs/rerun_evaluation/ | Re-run evaluation on calls |
| POST /test_framework/test-sets/create_from_call_log/ | Create test set from call log |

Labs Auto-Improve (process_feedbacks)

```json
POST /test_framework/metric-reviews/process_feedbacks/
{
  "metric_id": 123,
  "test_set_ids": [456, 789]
}
```
Optional fields: metric_type (default "llm_judge"), skip_evaluation (bool), simplified_prompt (string).
Returns {"progress_id": "<uuid>"}. Poll at GET /test_framework/metric-reviews/process_feedbacks_progress/?progress_id=<uuid>.
When complete, the response includes the improved description and evaluation_trigger. You must PATCH the metric to apply the changes (they are not auto-applied).
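The trigger-then-poll flow can be sketched with the HTTP call injected as a function, so the loop stays testable without network access. Several things here are assumptions: the helper names, the polling interval and limit, and above all the completion signal, which this sketch takes to be a non-empty "description" in the progress response because the text only says the field is present when the run is complete.

```python
import time
from typing import Callable


def build_request(metric_id: int, test_set_ids: list[int]) -> dict:
    """Body for POST /test_framework/metric-reviews/process_feedbacks/."""
    return {"metric_id": metric_id, "test_set_ids": list(test_set_ids)}


def wait_for_improvement(
    fetch_progress: Callable[[str], dict],
    progress_id: str,
    interval: float = 2.0,
    max_polls: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the improved prompt appears, then return the fields to PATCH."""
    for _ in range(max_polls):
        progress = fetch_progress(progress_id)
        # Assumption: "description" is only populated once the run is complete.
        if progress.get("description"):
            return {
                "description": progress["description"],
                "evaluation_trigger": progress.get("evaluation_trigger"),
            }
        sleep(interval)
    raise TimeoutError(f"no result for progress_id {progress_id} after {max_polls} polls")


# Stand-in for the real GET call: two pending responses, then a finished one.
responses = iter([{}, {}, {"description": "Improved prompt", "evaluation_trigger": "always"}])
patch_body = wait_for_improvement(lambda _id: next(responses), "demo-id", sleep=lambda _s: None)
print(patch_body)
```

The returned dictionary is what would go into the PATCH on the metric, after the suggested changes have been reviewed.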

Create Test Set from Call Log

```json
POST /test_framework/test-sets/create_from_call_log/
{
  "call_log_id": 3358270,
  "metrics": [{"metric": 123, "feedback": "The metric incorrectly failed this call because..."}]
}
```
Note: metrics must be an array of objects [{"metric": <id>, "feedback": "<text>"}], NOT bare metric IDs. Passing bare IDs returns 500.
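One way to avoid the bare-ID mistake is to build the body from a mapping of metric ID to feedback text and validate it before sending. A minimal sketch; the helper name and the validation rules are hypothetical, and only the payload shape comes from the example above.

```python
def build_test_set_payload(call_log_id: int, feedback_by_metric: dict[int, str]) -> dict:
    """Build the create_from_call_log body with metrics as objects, never bare IDs."""
    metrics = []
    for metric_id, feedback in feedback_by_metric.items():
        if not isinstance(metric_id, int):
            raise ValueError(f"metric ID must be an int, got {metric_id!r}")
        if not isinstance(feedback, str) or not feedback.strip():
            raise ValueError(f"metric {metric_id} needs non-empty feedback text")
        metrics.append({"metric": metric_id, "feedback": feedback})
    return {"call_log_id": call_log_id, "metrics": metrics}


payload = build_test_set_payload(
    3358270,
    {123: "The metric incorrectly failed this call because..."},
)
print(payload)
```

Rejecting empty feedback locally is a design choice of this sketch, in line with the feedback patterns described earlier.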

Additional Resources

Reference Files

  • references/feedback-examples.md: Examples of good feedback for different metric types