online-evals

AI Config Online Evaluations

Attach judges to AI Config variations for automatic quality scoring using LLM-as-a-judge methodology. Judges evaluate responses and return scores between 0.0 and 1.0.

Prerequisites

LaunchDarkly account with AI Configs enabled
API access token with write permissions
Existing AI Config with variations (use `configs-create` skill)
For automatic metric recording and the consolidated judge-result API: Python AI SDK v0.20.0+ or Node.js AI SDK v0.20.0+

API Key Detection

1. Check environment variables - `LAUNCHDARKLY_API_KEY`, `LAUNCHDARKLY_API_TOKEN`, `LD_API_KEY`
2. Check MCP config - Claude: `~/.claude/config.json` → `mcpServers.launchdarkly.env.LAUNCHDARKLY_API_KEY`
3. Prompt user - Only if detection fails
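
The detection order above can be sketched in Python as follows. This is a minimal illustration, not part of any LaunchDarkly tooling: the function name `detect_api_key` and its parameters are hypothetical, and the MCP config is assumed to be plain JSON with the nested keys listed above. A return value of `None` means the caller should fall back to prompting the user.

```python
import json
import os
from pathlib import Path
from typing import Mapping, Optional

# Environment variables, in the order they are checked.
ENV_VARS = ("LAUNCHDARKLY_API_KEY", "LAUNCHDARKLY_API_TOKEN", "LD_API_KEY")

# Path of keys inside the MCP config file.
MCP_KEY_PATH = ("mcpServers", "launchdarkly", "env", "LAUNCHDARKLY_API_KEY")


def detect_api_key(env: Optional[Mapping[str, str]] = None,
                   mcp_config: Optional[Path] = None) -> Optional[str]:
    """Return the first API key found, or None if the user must be prompted."""
    env = os.environ if env is None else env
    for name in ENV_VARS:
        value = env.get(name)
        if value:
            return value

    path = mcp_config or Path.home() / ".claude" / "config.json"
    try:
        node = json.loads(path.read_text())
    except (OSError, json.JSONDecodeError):
        return None  # missing or unreadable config: detection failed

    # Walk the nested keys, bailing out if the structure is not as expected.
    for part in MCP_KEY_PATH:
        if not isinstance(node, dict):
            return None
        node = node.get(part)
    return node if isinstance(node, str) and node else None
```

Passing `env` and `mcp_config` explicitly keeps the helper easy to exercise without touching the real environment.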

Core Concepts

What Are Judges?

Judges are specialized AI Configs in judge mode that evaluate responses from other AI Configs. They use an LLM to score outputs and return structured results:

```json
{
  "score": 0.85,
  "reasoning": "Answered correctly with one minor omission"
}
```
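
If you handle this structure yourself, for example when logging judge output, it can help to check the score range before using it. The sketch below is illustrative only: `JudgeScore` and `parse_judge_result` are hypothetical names, not SDK types, and the only assumption about the payload is the `score`/`reasoning` shape shown above.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgeScore:
    score: float
    reasoning: str


def parse_judge_result(raw: str) -> JudgeScore:
    """Parse a judge result and reject scores outside 0.0 to 1.0."""
    data = json.loads(raw)
    score = float(data["score"])
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score {score} is outside [0.0, 1.0]")
    # Treat reasoning as optional text; the score is the required part.
    return JudgeScore(score=score, reasoning=str(data.get("reasoning", "")))
```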

Built-in Judges

LaunchDarkly provides three pre-configured judges:

| Judge | Metric Key | Measures |
|-------|------------|----------|
| Accuracy | `$ld:ai:judge:accuracy` | How correct and grounded the response is |
| Relevance | `$ld:ai:judge:relevance` | How well it addresses the user request |
| Toxicity | `$ld:ai:judge:toxicity` | Harmful or unsafe phrasing (lower = safer) |

Completion Mode Only

Judges can only be attached to completion mode AI Configs in the UI. For agent mode or custom pipelines, use programmatic evaluation via the SDK.

Restrictions

Cannot attach judges to judges (no recursion)
Cannot attach multiple judges with the same metric key to a single variation
Cannot view/edit model parameters or tools on judge variations
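
The first two restrictions can be checked locally before calling the API. The following is a rough sketch under stated assumptions: `validate_attachments` is a hypothetical helper, `metric_key_by_judge` is a lookup you would build yourself from your judge configs, and the 0.0 to 1.0 bound on `samplingRate` is inferred from the examples in this document rather than from an API specification.

```python
from typing import Mapping, Sequence


def validate_attachments(
    judges: Sequence[Mapping[str, object]],
    metric_key_by_judge: Mapping[str, str],
    target_mode: str = "completion",
) -> list[str]:
    """Return a list of problems; an empty list means nothing was flagged."""
    problems: list[str] = []
    if target_mode == "judge":
        problems.append("cannot attach judges to a judge config")

    seen: dict[str, str] = {}  # metric key -> judge config key that claimed it
    for judge in judges:
        key = str(judge.get("judgeConfigKey", ""))
        rate = judge.get("samplingRate")
        is_number = isinstance(rate, (int, float)) and not isinstance(rate, bool)
        if not is_number or not 0.0 <= rate <= 1.0:
            problems.append(f"{key}: samplingRate must be between 0.0 and 1.0")

        metric = metric_key_by_judge.get(key)
        if metric is None:
            problems.append(f"{key}: unknown judge config")
        elif metric in seen:
            problems.append(f"{key}: metric key '{metric}' already used by {seen[metric]}")
        else:
            seen[metric] = key
    return problems
```

The server remains the source of truth; a check like this only surfaces the obvious conflicts earlier.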

Workflow

Step 1: Create Custom Judges (Optional)

For domain-specific evaluation, create judge AI Configs:

```bash
# Create judge config
curl -X POST "https://app.launchdarkly.com/api/v2/projects/{projectKey}/ai-configs" \
  -H "Authorization: {api_token}" \
  -H "Content-Type: application/json" \
  -H "LD-API-Version: beta" \
  -d '{
    "key": "security-judge",
    "name": "Security Judge",
    "mode": "judge",
    "evaluationMetricKey": "security",
    "isInverted": false
  }'
```


> **Note:** Set `isInverted: true` for metrics like toxicity where 0.0 is better.

Then add a variation with the evaluation prompt:

```bash
curl -X POST "https://app.launchdarkly.com/api/v2/projects/{projectKey}/ai-configs/security-judge/variations" \
  -H "Authorization: {api_token}" \
  -H "Content-Type: application/json" \
  -H "LD-API-Version: beta" \
  -d '{
    "key": "default",
    "name": "Default",
    "messages": [
      {
        "role": "system",
        "content": "You are a security auditor. Score from 0.0 to 1.0:\n- 1.0: No security issues\n- 0.7-0.9: Minor issues\n- 0.4-0.6: Moderate issues\n- 0.1-0.3: Serious vulnerabilities\n- 0.0: Critical vulnerabilities\n\nCheck for: SQL injection, XSS, hardcoded secrets, command injection."
      }
    ],
    "modelConfigKey": "OpenAI.gpt-4o-mini",
    "model": {
      "parameters": {
        "temperature": 0.3
      }
    }
  }'
```

Step 2: Attach Judges to Variations

Use the variation PATCH endpoint:

```bash
curl -X PATCH "https://app.launchdarkly.com/api/v2/projects/{projectKey}/ai-configs/{configKey}/variations/{variationKey}" \
  -H "Authorization: {api_token}" \
  -H "Content-Type: application/json" \
  -H "LD-API-Version: beta" \
  -d '{
    "judgeConfiguration": {
      "judges": [
        {"judgeConfigKey": "security-judge", "samplingRate": 1.0},
        {"judgeConfigKey": "api-contract-judge", "samplingRate": 0.5}
      ]
    }
  }'
```

Important: The `judges` array replaces all existing judge attachments. An empty array removes all judges.
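
Because the array is replaced wholesale, adding one judge means resending the ones you want to keep. A minimal sketch of that merge is below. `merge_judges` is a hypothetical helper, and how you obtain the `existing` list (for example by reading the variation back first) is not covered here, so it is simply passed in.

```python
def merge_judges(existing: list[dict], changes: list[dict]) -> list[dict]:
    """Combine current attachments with additions or updates, keyed by judgeConfigKey."""
    merged = {judge["judgeConfigKey"]: dict(judge) for judge in existing}
    for judge in changes:
        # A change for an already-attached judge overrides its sampling rate.
        merged[judge["judgeConfigKey"]] = dict(judge)
    return list(merged.values())


current = [{"judgeConfigKey": "security-judge", "samplingRate": 1.0}]
update = [{"judgeConfigKey": "api-contract-judge", "samplingRate": 0.5}]
payload = {"judgeConfiguration": {"judges": merge_judges(current, update)}}
```

The resulting `payload` has the same shape as the PATCH body above and keeps `security-judge` attached.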

Step 3: Set Fallthrough on Judges

Each judge AI Config needs its fallthrough set to the enabled variation. AI Configs default to the "disabled" variation (index 0).

Note: `turnTargetingOn` does not work for AI Configs. Use `updateFallthroughVariationOrRollout` instead.

```bash
# First get the variation ID for "Default" from GET targeting response
curl -X PATCH "https://app.launchdarkly.com/api/v2/projects/{projectKey}/ai-configs/security-judge/targeting" \
  -H "Authorization: {api_token}" \
  -H "Content-Type: application/json; domain-model=launchdarkly.semanticpatch" \
  -H "LD-API-Version: beta" \
  -d '{
    "environmentKey": "production",
    "instructions": [{
      "kind": "updateFallthroughVariationOrRollout",
      "variationId": "your-default-variation-uuid"
    }]
  }'
```
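
The lookup-then-patch step can be separated from the HTTP calls, which makes it easier to test. The sketch below assumes the GET targeting response lists variations with `key`, `name`, and `_id` fields, which is the shape the Python implementation in this document relies on. Both function names are hypothetical.

```python
from typing import Optional


def find_variation_id(targeting: dict, variation_key: str) -> Optional[str]:
    """Return the _id of the variation matching by key or name, if any."""
    for variation in targeting.get("variations", []):
        if variation_key in (variation.get("key"), variation.get("name")):
            return variation.get("_id")
    return None


def build_fallthrough_patch(targeting: dict, environment_key: str,
                            variation_key: str = "default") -> dict:
    """Build the semantic patch body that points fallthrough at a variation."""
    variation_id = find_variation_id(targeting, variation_key)
    if variation_id is None:
        raise LookupError(f"variation '{variation_key}' not found")
    return {
        "environmentKey": environment_key,
        "instructions": [{
            "kind": "updateFallthroughVariationOrRollout",
            "variationId": variation_id,
        }],
    }
```

The returned dict is the JSON body for the PATCH request; the `domain-model=launchdarkly.semanticpatch` content type still has to be set on the request itself.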

Python Implementation

```python
import requests


class AIConfigJudges:
    """Manager for AI Config judge attachments"""

    def __init__(self, api_token: str, project_key: str):
        self.api_token = api_token
        self.project_key = project_key
        self.base_url = "https://app.launchdarkly.com/api/v2"
        self.headers = {
            "Authorization": api_token,
            "Content-Type": "application/json",
            "LD-API-Version": "beta"
        }

    def attach_judges(self, config_key: str, variation_key: str,
                      judges: list[dict]) -> dict:
        """
        Attach judges to a variation.

        Args:
            config_key: AI Config key
            variation_key: Variation key
            judges: List of {"judgeConfigKey": str, "samplingRate": float}
        """
        url = f"{self.base_url}/projects/{self.project_key}/ai-configs/{config_key}/variations/{variation_key}"

        response = requests.patch(url, headers=self.headers, json={
            "judgeConfiguration": {"judges": judges}
        })

        if response.status_code == 200:
            print(f"[OK] Attached {len(judges)} judges to {config_key}/{variation_key}")
            return response.json()
        print(f"[ERROR] {response.status_code}: {response.text}")
        return {}

    def create_judge(self, key: str, name: str, metric_key: str,
                     system_prompt: str, model: str = "OpenAI.gpt-4o-mini",
                     is_inverted: bool = False) -> dict:
        """
        Create a judge AI Config.

        Args:
            key: Judge config key
            name: Display name
            metric_key: Metric key for scoring (appears as $ld:ai:judge:{metric_key})
            system_prompt: Evaluation instructions
            model: Model config key for the judge variation
            is_inverted: True if lower scores are better (e.g., toxicity)
        """
        # Create config
        config_url = f"{self.base_url}/projects/{self.project_key}/ai-configs"
        response = requests.post(config_url, headers=self.headers, json={
            "key": key,
            "name": name,
            "mode": "judge",
            "evaluationMetricKey": metric_key,
            "isInverted": is_inverted
        })

        if response.status_code not in [200, 201]:
            print(f"[ERROR] Creating config: {response.text}")
            return {}

        # Create variation
        var_url = f"{self.base_url}/projects/{self.project_key}/ai-configs/{key}/variations"
        response = requests.post(var_url, headers=self.headers, json={
            "key": "default",
            "name": "Default",
            "messages": [{"role": "system", "content": system_prompt}],
            "modelConfigKey": model,
            "model": {"parameters": {"temperature": 0.3}}
        })

        if response.status_code in [200, 201]:
            print(f"[OK] Created judge: {key}")
            return response.json()
        print(f"[ERROR] Creating variation: {response.text}")
        return {}

    def set_fallthrough(self, config_key: str, environment: str,
                        variation_key: str = "default") -> bool:
        """
        Set fallthrough to enable a judge config.

        Note: turnTargetingOn doesn't work for AI Configs. Instead, set the
        fallthrough from disabled (index 0) to the enabled variation.
        """
        # Get variation ID
        url = f"{self.base_url}/projects/{self.project_key}/ai-configs/{config_key}/targeting"
        response = requests.get(url, headers=self.headers)

        if response.status_code != 200:
            print(f"[ERROR] {response.status_code}: {response.text}")
            return False

        targeting = response.json()
        variation_id = None
        for var in targeting.get("variations", []):
            if var.get("key") == variation_key or var.get("name") == variation_key:
                variation_id = var.get("_id")
                break

        if not variation_id:
            print(f"[ERROR] Variation '{variation_key}' not found")
            return False

        # Set fallthrough (semantic patch requires its own Content-Type)
        response = requests.patch(url, headers={
            **self.headers,
            "Content-Type": "application/json; domain-model=launchdarkly.semanticpatch"
        }, json={
            "environmentKey": environment,
            "instructions": [{
                "kind": "updateFallthroughVariationOrRollout",
                "variationId": variation_id
            }]
        })

        if response.status_code == 200:
            print(f"[OK] Fallthrough set for {config_key}")
            return True
        print(f"[ERROR] {response.status_code}: {response.text}")
        return False
```

SDK: Automatic Evaluation

When using `create_model()` + `run()`, attached judges evaluate automatically:

```python
import os
import json
import asyncio
import ldclient
from ldclient import Context
from ldclient.config import Config
from ldai import LDAIClient, AICompletionConfigDefault

sdk_key = os.getenv('LAUNCHDARKLY_SDK_KEY')
ai_config_key = os.getenv('LAUNCHDARKLY_AI_CONFIG_KEY', 'sample-ai-config')

async def async_main():
    ldclient.set_config(Config(sdk_key))
    aiclient = LDAIClient(ldclient.get())

    context = (
        Context.builder('example-user-key')
        .kind('user')
        .name('Sandy')
        .build()
    )

    default_value = AICompletionConfigDefault(enabled=False)

    # create_model() initializes with judges from AI Config
    model = await aiclient.create_model(ai_config_key, context, default_value, {})

    if not model:
        print(f"AI configuration not enabled for: {ai_config_key}")
        return

    user_input = 'How can LaunchDarkly help me?'

    # run() automatically evaluates with attached judges
    result = await model.run(user_input)
    print("Response:", result.content)

    # Await evaluation results
    if result.evaluations and len(result.evaluations) > 0:
        eval_results = await asyncio.gather(*result.evaluations)
        results_to_display = [
            r.to_dict() if r is not None else "not evaluated"
            for r in eval_results
        ]
        print("Judge results:")
        print(json.dumps(results_to_display, indent=2, default=str))

    # Always flush events before closing; trailing events are at risk of being
    # lost otherwise, in short-lived scripts and long-running services alike.
    ldclient.get().flush()
    ldclient.get().close()

if __name__ == "__main__":
    # Entry point: run the async example to completion.
    asyncio.run(async_main())
```

SDK: Direct Judge Evaluation

For agent mode or custom pipelines, evaluate input/output pairs directly:

```python
import os
import json
import asyncio
import ldclient
from ldclient import Context
from ldclient.config import Config
from ldai import LDAIClient, AIJudgeConfigDefault

sdk_key = os.getenv('LAUNCHDARKLY_SDK_KEY')
judge_key = os.getenv('LAUNCHDARKLY_AI_JUDGE_KEY', 'sample-ai-judge-accuracy')

async def async_main():
    ldclient.set_config(Config(sdk_key))
    aiclient = LDAIClient(ldclient.get())

    context = (
        Context.builder('example-user-key')
        .kind('user')
        .name('Sandy')
        .build()
    )

    judge_default_value = AIJudgeConfigDefault(enabled=False)

    # Get judge configuration from LaunchDarkly
    judge = aiclient.create_judge(judge_key, context, judge_default_value)

    if not judge:
        print(f"AI judge configuration not enabled for key: {judge_key}")
        return

    input_text = 'You are a helpful assistant. How can you help me?'
    output_text = 'I can answer any question you have.'

    # Evaluate the input/output pair; returns a JudgeResult.
    judge_result = await judge.evaluate(input_text, output_text)

    if not judge_result.sampled:
        print("Judge evaluation was skipped (sample rate or configuration issue)")
        return

    # Track the consolidated result on the AI Config tracker if needed:
    # tracker = ai_config.create_tracker()
    # tracker.track_judge_result(judge_result)

    print("Judge Result:")
    print(json.dumps(judge_result.to_dict(), default=str))

    # Always flush events before closing; trailing events are at risk of being
    # lost otherwise, in short-lived scripts and long-running services alike.
    ldclient.get().flush()
    ldclient.get().close()

if __name__ == "__main__":
    # Entry point: run the async example to completion.
    asyncio.run(async_main())
```

Note: Direct evaluation does not automatically record metrics. Obtain a tracker via `ai_config.create_tracker()` / `aiConfig.createTracker()` and call `tracker.track_judge_result(result)` / `tracker.trackJudgeResult(result)` to record scores for the AI Config you're evaluating.

Sampling Rates

Each evaluated response sends an additional request to your model provider, increasing token usage and costs. Start with a lower sampling percentage and increase only if you need more evaluation coverage.

You can adjust sampling rates at any time from the Judges section of a variation, or disable a judge by setting its sampling to 0%.
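
A quick back-of-the-envelope estimate makes the cost trade-off concrete. The sketch below assumes each attached judge samples independently and issues one extra model request per sampled response. `expected_judge_calls` is a hypothetical helper, not an SDK function.

```python
def expected_judge_calls(responses: int, sampling_rates: list[float]) -> float:
    """Expected number of extra model requests across all attached judges."""
    return responses * sum(sampling_rates)


# Two judges at 100% and 50% sampling over 10,000 responses.
full_coverage = expected_judge_calls(10_000, [1.0, 0.5])

# The same judges, both lowered to 25% sampling.
reduced = expected_judge_calls(10_000, [0.25, 0.25])
```

Under these assumptions the first setup adds about 15,000 judge requests and the second about 5,000, so lowering the rates cuts evaluation traffic to a third.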

Viewing Results

1. Navigate to AI Configs > select your config
2. Click Monitoring tab
3. Select Evaluator metrics from dropdown
4. View scores by variation and time range

Results appear within 1-2 minutes of evaluation.

Use in Guardrails and Experiments

Evaluation metrics integrate with:

Guarded rollouts: Pause/revert when scores fall below threshold
Experiments: Compare variations using evaluation metrics as goals
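
When reasoning about thresholds, remember that inverted metrics such as toxicity improve as they fall. The sketch below only illustrates that direction handling. It is not how LaunchDarkly's guarded rollouts decide anything, and `score_regressed` and its `tolerance` parameter are hypothetical.

```python
def score_regressed(baseline: float, current: float, *,
                    inverted: bool = False, tolerance: float = 0.05) -> bool:
    """Return True if the score got worse by more than the tolerance."""
    delta = current - baseline
    if inverted:
        # For inverted metrics (lower is better), a rise is the bad direction.
        delta = -delta
    return delta < -tolerance
```

For example, accuracy dropping from 0.90 to 0.80 counts as a regression, and so does toxicity rising from 0.05 to 0.20 when `inverted=True`.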

Error Handling

| Status | Cause | Solution |
|--------|-------|----------|
| 404 | Config/variation not found | Verify keys exist |
| 400 | Invalid judge config | Check judgeConfigKey exists |
| 403 | Insufficient permissions | Check API token permissions |
| 422 | Duplicate metric key | Cannot attach multiple judges with same metric key |
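
In scripts, the table above can be turned into a small lookup so failures print an actionable hint. This is an illustrative sketch: `explain_status` is a hypothetical helper, and the fallback message for unlisted codes is an assumption.

```python
# Status code -> (cause, suggested fix), taken from the table above.
HINTS = {
    404: ("Config/variation not found", "Verify keys exist"),
    400: ("Invalid judge config", "Check judgeConfigKey exists"),
    403: ("Insufficient permissions", "Check API token permissions"),
    422: ("Duplicate metric key",
          "Cannot attach multiple judges with same metric key"),
}


def explain_status(status_code: int) -> str:
    """Return a one-line hint for a failed judge API call."""
    cause, fix = HINTS.get(
        status_code, ("Unexpected status", "Inspect the response body")
    )
    return f"{status_code}: {cause}. {fix}."
```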

Next Steps

After attaching judges:

Set fallthrough on judge configs to an enabled variation (required)
Monitor results in Monitoring tab
Adjust sampling based on cost/coverage needs
Set up guarded rollouts for automatic regression detection

Related Skills

References

Python SDK examples:

create_judge_example.py - Evaluate input/output pairs directly via `create_judge` + `evaluate`
create_model_example.py - Automatic evaluation with `create_model` + `run` (attached judges fire during the run)

Node.js SDK examples:

features/create-judge - Evaluate input/output pairs directly via `createJudge` + `evaluate`
features/create-model - Automatic evaluation with `createModel` + `run` (attached judges fire during the run)
