llm-obs-eval-bootstrap

Compare original and translation side by side

🇺🇸

Original

English

🇨🇳

Translation

Chinese

Backend

后端

Detection — At the start of every invocation, before taking any action, determine which backend to use:

If the user passed
```
--backend pup
```
anywhere in their invocation → use pup mode immediately, regardless of whether MCP tools are present. Skip steps 2–4.
Check whether MCP tools are present in your active tool list. The canonical signal is whether
```
mcp__datadog-llmo-mcp__list_llmobs_evals
```
appears in your available tools.
If MCP tools are present → use MCP mode throughout. Call MCP tools exactly as named in this skill's workflow sections.
If MCP tools are absent → check whether
```
pup
```
is executable: run
```
pup --version
```
via Bash. A JSON response containing
```
"version"
```
confirms pup is available.
If pup responds → use pup mode throughout. Translate every MCP tool call to its pup equivalent using the Tool Reference appendix at the bottom of this file.
If neither is available → stop and tell the user:
"Neither the Datadog MCP server nor the pup CLI is available. Connect the MCP server (
```
claude mcp add --scope user --transport http datadog-llmo-mcp 'https://mcp.datadoghq.com/api/unstable/mcp-server/mcp?toolsets=llmobs'
```
) or install pup."

--backend pup

is accepted anywhere in the invocation arguments and is stripped before passing remaining args to the skill logic.

pup invocation rules:

Invoke via Bash:
```
pup llm-obs <subcommand> [flags]
```
pup always outputs JSON. Parse directly — no content-block unwrapping (unlike MCP results, which may wrap JSON in
```
[{"type": "text", "text": "<json>"}]
```
).
If pup returns an auth error, tell the user to run
```
pup auth login
```
and stop.
Parallelization: issue multiple Bash tool calls in a single message (one pup command per call).
Time flags: pup accepts bare duration strings (
```
1h
```
,
```
7d
```
,
```
30m
```
) and RFC3339 timestamps. Do not use
```
now-
```
-prefixed strings — strip the prefix when converting from a skill
```
--timeframe
```
argument:
```
now-7d
```
→
```
7d
```
,
```
now-24h
```
→
```
24h
```
,
```
now-30d
```
→
```
30d
```
.
```
--summary
```
on
```
pup llm-obs spans search
```
strips payload fields to essential metadata only. Use it in bulk/search phases where content is not needed.

Invocation ID: At the very start of each invocation, before any MCP tool call, generate an 8-character hex invocation ID (e.g.,

3a9f1c2b

). Keep it constant for the entire invocation.

Intent tagging: On every MCP tool call, prefix

telemetry.intent

with

skill:llm-obs-eval-bootstrap[<inv_id>] —

followed by a description of why the tool is being called. On the first MCP tool call only, use

skill:llm-obs-eval-bootstrap:start[<inv_id>] —

instead (note the

:start

suffix). Example first call:

skill:llm-obs-eval-bootstrap:start[3a9f1c2b] — Phase 0: map existing eval coverage for task-cruncher

检测逻辑 — 在每次调用开始、执行任何操作前，确定要使用的后端：

如果用户在调用中任何位置传入
```
--backend pup
```
→ 立即使用pup模式，无论是否存在MCP工具。跳过步骤2-4。
检查活跃工具列表中是否存在MCP工具。标准判断信号是可用工具中是否包含
```
mcp__datadog-llmo-mcp__list_llmobs_evals
```
。
如果存在MCP工具 → 全程使用MCP模式。严格按照本技能工作流章节中指定的名称调用MCP工具。
如果不存在MCP工具 → 检查
```
pup
```
是否可执行：通过Bash运行
```
pup --version
```
。返回包含
```
"version"
```
的JSON响应即确认pup可用。
如果pup响应正常 → 全程使用pup模式。使用本文件底部的工具参考附录，将每个MCP工具调用转换为对应的pup等效命令。
如果两者都不可用 → 停止操作并告知用户：
"Datadog MCP服务器和pup CLI均不可用。请连接MCP服务器（
```
claude mcp add --scope user --transport http datadog-llmo-mcp 'https://mcp.datadoghq.com/api/unstable/mcp-server/mcp?toolsets=llmobs'
```
）或安装pup。"

--backend pup

可在调用参数的任何位置使用，在将剩余参数传递给技能逻辑前会被剥离。

pup调用规则：

通过Bash调用：
```
pup llm-obs <subcommand> [flags]
```
pup始终输出JSON。直接解析即可——无需解包内容块（与MCP结果不同，MCP结果可能将JSON包裹在
```
[{"type": "text", "text": "<json>"}]
```
中）。
如果pup返回认证错误，告知用户运行
```
pup auth login
```
并停止操作。
并行化：在单个消息中发起多个Bash工具调用（每个pup命令对应一次调用）。
时间参数：pup接受纯时长字符串（
```
1h
```
、
```
7d
```
、
```
30m
```
）和RFC3339时间戳。不要使用
```
now-
```
前缀的字符串——转换技能的
```
--timeframe
```
参数时需移除前缀：
```
now-7d
```
→
```
7d
```
，
```
now-24h
```
→
```
24h
```
，
```
now-30d
```
→
```
30d
```
。
在
```
pup llm-obs spans search
```
中使用
```
--summary
```
会将负载字段精简为核心元数据。在批量/搜索阶段不需要内容时使用该参数。

调用ID：在每次调用的最开始、发起任何MCP工具调用前，生成一个8字符的十六进制调用ID（例如

3a9f1c2b

）。整个调用过程中保持该ID不变。

意图标记：在每个MCP工具调用中，将

telemetry.intent

前缀设置为

skill:llm-obs-eval-bootstrap[<inv_id>] —

，后跟调用该工具的原因描述。仅在第一次MCP工具调用时，使用

skill:llm-obs-eval-bootstrap:start[<inv_id>] —

（注意

:start

后缀）。示例首次调用：

skill:llm-obs-eval-bootstrap:start[3a9f1c2b] — Phase 0: map existing eval coverage for task-cruncher

Eval Bootstrap — Generate Evaluators from Production Traces

评估器引导——从生产Trace生成评估器

Given a sample of production LLM traces, analyze input/output patterns and quality dimensions, then emit a ready-to-use evaluator suite. Three output modes:

sdk_code
(default) — Python
```
.py
```
file using the Datadog Evals SDK (
```
BaseEvaluator
```
/
```
LLMJudge
```
) for offline experiments.
data_only
— self-contained JSON spec, framework-agnostic.
publish
— write online LLM-judge evaluators directly to Datadog via
```
create_or_update_llmobs_evaluator
```
. These run automatically on matching production spans or traces (no dataset, no task function). The skill auto-classifies each proposed evaluator as span-scoped or trace-scoped based on what the judgment requires (a per-LLM-call tone check vs. an agent goal completion that needs the whole trace) — the user accepts or overrides the classification at the proposal checkpoint.

基于生产LLM Trace样本，分析输入/输出模式和质量维度，然后生成可直接使用的评估器套件。支持三种输出模式：

sdk_code
（默认）——使用Datadog Evals SDK（
```
BaseEvaluator
```
/
```
LLMJudge
```
）生成Python
```
.py
```
文件，用于离线实验。
data_only
——生成独立的JSON规范，与框架无关。
publish
——通过
```
create_or_update_llmobs_evaluator
```
直接将在线LLM-judge评估器写入Datadog。这些评估器会自动在匹配的生产Span或Trace上运行（无需数据集、无需任务函数）。技能会根据判断需求自动将每个拟议评估器分类为Span范围或Trace范围（例如，每个LLM调用的语气检查需要Span范围，而代理目标完成需要整个Trace则需要Trace范围）——用户会在提案检查点接受或覆盖该分类。

Usage

使用方法

/eval-bootstrap <ml_app> [--timeframe <window>] [--data-only | --publish]

Arguments: $ARGUMENTS

/eval-bootstrap <ml_app> [--timeframe <window>] [--data-only | --publish]

参数：$ARGUMENTS

Inputs

输入项

Input	Required	Default	Description
`ml_app`	Yes	—	ML application to scope traces
`timeframe`	No	`now-7d`	How far back to look
`rca_report`	No	—	Failure taxonomy from `eval-trace-rca` skill, or a free-text failure hypothesis
`--data-only`	No	off	Emit a self-contained JSON spec file instead of Python SDK code
`--publish`	No	off	Publish online LLM-judge evaluators to Datadog (mutually exclusive with `--data-only` )

ml_app

is missing, ask the user before proceeding. If both

--data-only

and

--publish

are supplied, error out and ask which mode the user wants.

输入项	是否必填	默认值	描述
`ml_app`	是	—	用于限定Trace范围的ML应用
`timeframe`	否	`now-7d`	回溯时间范围
`rca_report`	否	—	来自 `eval-trace-rca` 技能的故障分类，或自由文本形式的失败假设
`--data-only`	否	关闭	生成独立的JSON规范文件，而非Python SDK代码
`--publish`	否	关闭	将在线LLM-judge评估器发布到Datadog（与 `--data-only` 互斥）

如果缺少

ml_app

，在继续前询问用户。如果同时提供

--data-only

和

--publish

，抛出错误并询问用户想要使用哪种模式。

Available Tools

可用工具

Tool	Purpose
`search_llmobs_spans`	Find spans by eval presence, tags, span kind, query syntax. Paginate with cursor.
`get_llmobs_span_details`	Metadata, evaluations (scores, labels, reasoning), and `content_info` map showing available fields + sizes.
`get_llmobs_span_content`	Actual content for a span field. Supports JSONPath via `path` param for targeted extraction.
`get_llmobs_trace`	Full trace hierarchy as span tree with span counts by kind.
`get_llmobs_agent_loop`	Chronological agent execution timeline (LLM calls, tool invocations, decisions).
`list_llmobs_evals`	List every evaluator configured for the caller's org across all ml_apps, with `enabled` status and `ml_app` per result. Call once in Phase 0 to map existing coverage before proposing new evaluators — filter the result by `ml_app` client-side.
`get_llmobs_evaluator`	Fetch the full persisted evaluator config by name (target ml_app + sampling + filter, provider, prompt template, parsing type, output schema, assessment criteria). Use in Phase 0 to understand what each existing custom eval measures, and (in publish mode) before any update — `create_or_update_llmobs_evaluator` is full-replace, so you must round-trip the full config to avoid clobbering fields. Not all evaluators have a stored config (notably `source=ootb` ); a not-found error there is expected — skip those.
`create_or_update_llmobs_evaluator`	(publish mode) Write an LLM-judge evaluator config to Datadog. Full-replace semantics: any omitted optional field resets to its default. See "Publishing Conventions" for required fields and structured output → JSON schema mapping.
`delete_llmobs_evaluator`	(publish mode) Only used if the user explicitly asks to remove an evaluator. Never invoke speculatively.

工具	用途
`search_llmobs_spans`	根据评估存在性、标签、Span类型、查询语法查找Span。使用游标分页。
`get_llmobs_span_details`	获取元数据、评估结果（分数、标签、推理过程），以及显示可用字段和大小的 `content_info` 映射。
`get_llmobs_span_content`	获取Span字段的实际内容。支持通过 `path` 参数使用JSONPath进行定向提取。
`get_llmobs_trace`	获取完整的Trace层级结构，即包含各类型Span计数的Span树。
`get_llmobs_agent_loop`	获取按时间顺序排列的代理执行时间线（LLM调用、工具调用、决策）。
`list_llmobs_evals`	列出调用者组织下所有ml_app中配置的所有评估器，包含 `enabled` 状态和每个结果对应的 `ml_app` 。在阶段0调用一次，以在提出新评估器前映射现有覆盖范围——在客户端按 `ml_app` 过滤结果。
`get_llmobs_evaluator`	通过名称获取完整的持久化评估器配置（目标ml_app + 采样 + 过滤、提供商、提示模板、解析类型、输出 schema、评估标准）。在阶段0使用，以了解每个现有自定义评估器的测量内容；在发布模式下，任何更新前都要使用该工具—— `create_or_update_llmobs_evaluator` 是全替换模式，因此必须往返完整配置以避免覆盖字段。并非所有评估器都有存储的配置（尤其是 `source=ootb` 的评估器）；出现未找到错误是预期情况——跳过这些评估器。
`create_or_update_llmobs_evaluator`	（发布模式）将LLM-judge评估器配置写入Datadog。全替换语义：任何省略的可选字段都会重置为默认值。有关必填字段和结构化输出→JSON schema映射，请参阅“发布约定”。
`delete_llmobs_evaluator`	（发布模式）仅在用户明确要求移除评估器时使用。切勿推测性调用。

Key

get_llmobs_span_content

Patterns

get_llmobs_span_content

关键使用模式

Use the

path

parameter to extract targeted data without fetching full payloads:

Field	Path	What you get
`messages`	`$.messages[0]`	System prompt (first message, usually `system` role)
`messages`	`$.messages[-1]`	Last assistant response
`messages`	(no path)	Full conversation including tool calls
`input` / `output`	—	Span I/O
`documents`	—	Retrieved documents (RAG apps)
`metadata`	—	Custom metadata (prompt versions, feature flags, user segments)

使用

path

参数提取目标数据，无需获取完整负载：

字段	路径	获取内容
`messages`	`$.messages[0]`	系统提示（第一条消息，通常为 `system` 角色）
`messages`	`$.messages[-1]`	最后一条助手响应
`messages`	（无路径）	包含工具调用的完整对话
`input` / `output`	—	Span输入/输出
`documents`	—	检索到的文档（RAG应用）
`metadata`	—	自定义元数据（提示版本、功能标志、用户细分）

How to Use

search_llmobs_spans

search_llmobs_spans

使用方法

Additional filters combine with space (AND):

@status:error @ml_app:my-app

. Dedicated params (

span_kind

root_spans_only

ml_app

) work alongside

query

, but

query

takes precedence over

tags

To find spans with a specific eval:

@evaluations.custom.<eval_name>:*

— you can only query for eval presence, not specific results.

附加过滤器使用空格组合（AND逻辑）：

@status:error @ml_app:my-app

。专用参数（

span_kind

、

root_spans_only

、

ml_app

）可与

query

配合使用，但

query

优先级高于

tags

。

查找包含特定评估器的Span：

@evaluations.custom.<eval_name>:*

——只能查询评估器的存在性，无法查询特定结果。

Parallelization Rules

并行化规则

get_llmobs_span_details
: Group span_ids by trace_id. One call per trace_id with ALL its span_ids. Issue ALL calls for a page in a single message.
get_llmobs_span_content
: Each call is independent — always issue ALL in a single message.
get_llmobs_trace
/
get_llmobs_agent_loop
: Parallelize across different traces in a single message.
Pipeline parallelism: Start
```
get_llmobs_span_details
```
for page 1 results immediately — don't wait to collect all pages.

get_llmobs_span_details
：按trace_id对span_ids进行分组。每个trace_id对应一次调用，包含其所有span_ids。在单个消息中发起某一页的所有调用。
get_llmobs_span_content
：每次调用相互独立——始终在单个消息中发起所有调用。
get_llmobs_trace
/
get_llmobs_agent_loop
：在单个消息中对不同Trace进行并行调用。
流水线并行化：立即为第1页结果发起
```
get_llmobs_span_details
```
调用——无需等待收集所有页面。

Evaluator SDK Reference

评估器SDK参考

Applies to
sdk_code
mode only. In
data_only
mode, use this section as domain context when writing rubric prompts — no SDK classes are emitted.

仅适用于
sdk_code
模式。在
data_only
模式下，将本节作为领域上下文用于编写评估准则提示——不会生成SDK类。

Imports

导入

python

undefined

python

undefined

Core classes

核心类

from ddtrace.llmobs._experiment import BaseEvaluator, EvaluatorContext, EvaluatorResult

LLM-as-judge

LLM作为Judge

from ddtrace.llmobs._evaluators.llm_judge import ( LLMJudge, BooleanStructuredOutput, ScoreStructuredOutput, CategoricalStructuredOutput, )

Built-in evaluators (use only if needed)

内置评估器（仅在需要时使用）

from ddtrace.llmobs._evaluators.format import JSONEvaluator, LengthEvaluator from ddtrace.llmobs._evaluators.string_matching import StringCheckEvaluator, RegexMatchEvaluator


Only import what the generated file actually uses.

from ddtrace.llmobs._evaluators.format import JSONEvaluator, LengthEvaluator from ddtrace.llmobs._evaluators.string_matching import StringCheckEvaluator, RegexMatchEvaluator


仅导入生成文件实际使用的类。

EvaluatorContext (what

evaluate()

receives)

EvaluatorContext（

evaluate()

接收的参数）

python

@dataclass(frozen=True)
class EvaluatorContext:
    input_data: dict[str, Any]          # Task inputs (from dataset record, NOT from span)
    output_data: Any                     # Task output (from task function return, NOT from span)
    expected_output: Optional[JSONType] = None  # Ground truth (if available)
    metadata: dict[str, Any] = {}        # Additional metadata
    span_id: Optional[str] = None        # LLMObs span ID
    trace_id: Optional[str] = None       # LLMObs trace ID

Important — span data vs evaluator data: When exploring production traces, you see span I/O (e.g.,

input.value

output.messages

). But evaluators run in offline experiments where

input_data

and

output_data

come from the user's dataset records and task function, not from spans. The dataset schema is user-defined and may not match span structure. Write evaluator prompts with generic

{{input_data}}

{{output_data}}

placeholders and add comments describing what data the evaluator was designed for, so the user can adapt to their dataset shape.

python

@dataclass(frozen=True)
class EvaluatorContext:
    input_data: dict[str, Any]          # 任务输入（来自数据集记录，而非Span）
    output_data: Any                     # 任务输出（来自任务函数返回值，而非Span）
    expected_output: Optional[JSONType] = None  # 基准真值（如果可用）
    metadata: dict[str, Any] = {}        # 附加元数据
    span_id: Optional[str] = None        # LLMObs Span ID
    trace_id: Optional[str] = None       # LLMObs Trace ID

重要——Span数据与评估器数据的区别：在探索生产Trace时，看到的是Span输入/输出（例如

input.value

、

output.messages

）。但评估器在离线实验中运行，

input_data

和

output_data

来自用户的数据集记录和任务函数，而非Span。数据集schema由用户定义，可能与Span结构不匹配。在评估器提示中使用通用的

{{input_data}}

{{output_data}}

占位符，并添加注释说明评估器设计用于何种数据，以便用户适配其数据集结构。

EvaluatorResult (what

evaluate()

returns)

EvaluatorResult（

evaluate()

返回的结果）

python

EvaluatorResult(
    value=...,                    # Required. JSONType (str, int, float, bool, None, list, dict)
    reasoning="...",              # Optional. Explanation string
    assessment="pass" or "fail",  # Optional. Pass/fail assessment
    metadata={...},              # Optional. Evaluation metadata dict
    tags={...},                  # Optional. Tags dict
)

python

EvaluatorResult(
    value=...,                    # 必填。JSONType（字符串、整数、浮点数、布尔值、None、列表、字典）
    reasoning="...",              # 可选。解释字符串
    assessment="pass" or "fail",  # 可选。通过/失败评估结果
    metadata={...},              # 可选。评估元数据字典
    tags={...},                  # 可选。标签字典
)

LLMJudge — LLM-as-Judge Evaluator

LLMJudge——LLM作为Judge的评估器

python

judge = LLMJudge(
    user_prompt="...",              # Required. Supports {{template_vars}}
    system_prompt="...",            # Optional. Does NOT support template vars
    structured_output=...,          # Optional. Boolean/Score/Categorical output, or a dict for custom JSON schema
    provider="openai",              # "openai" | "anthropic" | "azure_openai" | "vertexai" | "bedrock"
    model="gpt-4o",                # Model identifier
    model_params={"temperature": 0.0},  # Optional. Passed to LLM API
    name="eval_name",              # Optional. Must match ^[a-zA-Z0-9_-]+$
)

Template variables in

user_prompt

{{input_data}}

{{output_data}}

{{expected_output}}

{{metadata.key}}

— resolved from

EvaluatorContext

fields via dot-path into nested dicts.

python

judge = LLMJudge(
    user_prompt="...",              # 必填。支持{{template_vars}}
    system_prompt="...",            # 可选。不支持模板变量
    structured_output=...,          # 可选。布尔值/分数/分类输出，或自定义JSON schema的字典
    provider="openai",              # "openai" | "anthropic" | "azure_openai" | "vertexai" | "bedrock"
    model="gpt-4o",                # 模型标识符
    model_params={"temperature": 0.0},  # 可选。传递给LLM API的参数
    name="eval_name",              # 可选。必须匹配^[a-zA-Z0-9_-]+$
)

user_prompt
中的模板变量：

{{input_data}}

、

{{output_data}}

、

{{expected_output}}

、

{{metadata.key}}

——通过点路径从

EvaluatorContext

字段解析嵌套字典。

Structured Output Types

结构化输出类型

Boolean — true/false with optional pass/fail:

python

BooleanStructuredOutput(
    description="Whether the response is factually accurate",
    reasoning=True,                    # Include reasoning field in LLM response
    reasoning_description=None,        # Optional custom description for reasoning field
    pass_when=True,                    # True → pass when true, False → pass when false, None → no assessment
)

Score — numeric within a range with optional thresholds:

python

ScoreStructuredOutput(
    description="Helpfulness score",
    min_score=1,                       # Minimum possible score
    max_score=10,                      # Maximum possible score
    reasoning=True,
    reasoning_description=None,
    min_threshold=7,                   # Scores >= 7 pass (optional)
    max_threshold=None,                # Scores <= N pass (optional)
)

Categorical — select from predefined categories:

python

CategoricalStructuredOutput(
    categories={
        "correct": "The response correctly answers the question",
        "partially_correct": "The response is partially correct but missing key information",
        "incorrect": "The response is factually wrong or irrelevant",
    },
    reasoning=True,
    reasoning_description=None,
    pass_values=["correct"],           # Which categories count as passing (optional)
)

Custom JSON schema — arbitrary structured responses for multi-dimensional evals:

python

undefined

布尔值——真/假，可选通过/失败标记：

python

BooleanStructuredOutput(
    description="Whether the response is factually accurate",
    reasoning=True,                    # 在LLM响应中包含推理字段
    reasoning_description=None,        # 推理字段的可选自定义描述
    pass_when=True,                    # True→为真时通过，False→为假时通过，None→无评估结果
)

分数——指定范围内的数值，可选阈值：

python

ScoreStructuredOutput(
    description="Helpfulness score",
    min_score=1,                       # 最小可能分数
    max_score=10,                      # 最大可能分数
    reasoning=True,
    reasoning_description=None,
    min_threshold=7,                   # 分数≥7时通过（可选）
    max_threshold=None,                # 分数≤N时通过（可选）
)

分类——从预定义类别中选择：

python

CategoricalStructuredOutput(
    categories={
        "correct": "The response correctly answers the question",
        "partially_correct": "The response is partially correct but missing key information",
        "incorrect": "The response is factually wrong or irrelevant",
    },
    reasoning=True,
    reasoning_description=None,
    pass_values=["correct"],           # 哪些类别算作通过（可选）
)

自定义JSON schema——用于多维评估的任意结构化响应：

python

undefined

Pass a raw dict as structured_output — used as the JSON schema directly

传递原始字典作为structured_output——直接用作JSON schema

structured_output={ "type": "object", "properties": { "relevance": {"type": "boolean", "description": "Whether the response addresses the question"}, "confidence": {"type": "number", "description": "Confidence score (0.0 to 1.0)"}, "reasoning": {"type": "string", "description": "Explanation for the evaluation"}, }, "required": ["relevance", "confidence", "reasoning"], "additionalProperties": False, }


Always write standard JSON schema — the SDK adapts it per provider automatically (e.g., Anthropic doesn't support `minimum`/`maximum` on number fields, so the SDK moves range constraints into the `description`; Vertex AI converts `const`/`anyOf` to `enum`). The full parsed JSON dict becomes the eval `value`; a `"reasoning"` key (if present) is automatically extracted. No automatic pass/fail assessment.


始终编写标准JSON schema——SDK会自动根据提供商进行适配（例如，Anthropic不支持数字字段的`minimum`/`maximum`，因此SDK会将范围约束移至`description`中；Vertex AI将`const`/`anyOf`转换为`enum`）。完整解析后的JSON字典成为评估的`value`；如果存在`"reasoning"`键，会自动提取。不会自动生成通过/失败评估结果。

LLMJudge Prompt Guidelines

LLMJudge提示准则

The

structured_output

parameter enforces the response format via JSON schema. Do not prescribe the format in the prompt (no "Answer YES/NO", "Rate 1-10", etc.). Instead, describe the evaluation criteria and let the structured output handle the format.

system_prompt: Set the judge's role and the app's domain context. Does NOT support template vars.
user_prompt: Present the data via
```
{{input_data}}
```
/
```
{{output_data}}
```
, then describe what good vs. bad looks like for this dimension.

structured_output

参数通过JSON schema强制响应格式。不要在提示中指定格式（例如“回答YES/NO”、“评分1-10”等）。相反，描述评估标准，让结构化输出处理格式问题。

system_prompt：设置Judge的角色和应用的领域上下文。不支持模板变量。
user_prompt：通过
```
{{input_data}}
```
/
```
{{output_data}}
```
呈现数据，然后描述该维度下好与坏的表现。

BaseEvaluator — Custom Code-Based Evaluator

BaseEvaluator——基于自定义代码的评估器

For deterministic checks that do not need LLM judgment:

python

class MyEvaluator(BaseEvaluator):
    def __init__(self, name=None, ...custom_params...):
        super().__init__(name=name)
        self._param = ...  # Store config as private attrs

    def evaluate(self, context: EvaluatorContext) -> EvaluatorResult:
        # Access: context.input_data, context.output_data, context.expected_output, context.metadata
        # Must NOT modify self attributes (thread safety)
        passed = ...  # Your logic here
        return EvaluatorResult(
            value=passed,
            reasoning="...",
            assessment="pass" if passed else "fail",
        )

用于无需LLM判断的确定性检查：

python

class MyEvaluator(BaseEvaluator):
    def __init__(self, name=None, ...custom_params...):
        super().__init__(name=name)
        self._param = ...  # 将配置存储为私有属性

    def evaluate(self, context: EvaluatorContext) -> EvaluatorResult:
        # 访问：context.input_data, context.output_data, context.expected_output, context.metadata
        # 不得修改self属性（线程安全）
        passed = ...  # 此处编写你的逻辑
        return EvaluatorResult(
            value=passed,
            reasoning="...",
            assessment="pass" if passed else "fail",
        )

Built-in Evaluators

内置评估器

python

undefined

python

undefined

Validate JSON syntax + optional required keys

验证JSON语法 + 可选必填键

JSONEvaluator(required_keys=["name", "age"], output_extractor=None, name=None)

Validate length (characters, words, or lines)

验证长度（字符、单词或行数）

LengthEvaluator(count_by="words", min_length=10, max_length=500, output_extractor=None, name=None)

count_by: "characters" | "words" | "lines"

String matching

字符串匹配

StringCheckEvaluator(operation="contains", expected="success", case_sensitive=False, name=None)

operation: "eq" | "ne" | "contains" | "icontains"

Regex matching

正则匹配

RegexMatchEvaluator(pattern=r"\d{4}-\d{2}-\d{2}", match_mode="search", name=None)

match_mode: "search" | "match" | "fullmatch"

undefined

undefined

Evaluator Type Decision Matrix

评估器类型决策矩阵

Signal	Evaluator Type
Output must be valid JSON	`JSONEvaluator`
Output must match a regex pattern	`RegexMatchEvaluator`
Output has length constraints	`LengthEvaluator`
Output must contain/not contain specific strings	`StringCheckEvaluator`
Semantic quality judgment (tone, accuracy, completeness)	`LLMJudge` + `BooleanStructuredOutput`
Graded quality on a scale	`LLMJudge` + `ScoreStructuredOutput`
Classification into categories	`LLMJudge` + `CategoricalStructuredOutput`
Multi-dimensional judgment (evaluate several aspects at once)	`LLMJudge` + custom JSON schema `dict`
Complex domain logic combining multiple checks	`BaseEvaluator` subclass

信号	评估器类型
输出必须是有效的JSON	`JSONEvaluator`
输出必须匹配正则模式	`RegexMatchEvaluator`
输出有长度限制	`LengthEvaluator`
输出必须包含/不包含特定字符串	`StringCheckEvaluator`
语义质量判断（语气、准确性、完整性）	`LLMJudge` + `BooleanStructuredOutput`
按比例评分的质量	`LLMJudge` + `ScoreStructuredOutput`
分类到类别中	`LLMJudge` + `CategoricalStructuredOutput`
多维判断（同时评估多个方面）	`LLMJudge` + 自定义JSON schema `dict`
结合多个检查的复杂领域逻辑	`BaseEvaluator` 子类

Source Verification

源码验证

If you have access to dd-trace-py locally, verify the API surface by reading the corresponding modules:

ddtrace.llmobs._evaluators.llm_judge

—

LLMJudge

BooleanStructuredOutput

ScoreStructuredOutput

CategoricalStructuredOutput

ddtrace.llmobs._experiment

—

BaseEvaluator

EvaluatorContext

EvaluatorResult

ddtrace.llmobs._evaluators.format

—

JSONEvaluator

LengthEvaluator

ddtrace.llmobs._evaluators.string_matching

—

StringCheckEvaluator

RegexMatchEvaluator

如果本地可以访问dd-trace-py，通过阅读相应模块验证API接口：

ddtrace.llmobs._evaluators.llm_judge

—

LLMJudge

、

BooleanStructuredOutput

、

ScoreStructuredOutput

、

CategoricalStructuredOutput

ddtrace.llmobs._experiment

—

BaseEvaluator

、

EvaluatorContext

、

EvaluatorResult

ddtrace.llmobs._evaluators.format

—

JSONEvaluator

、

LengthEvaluator

ddtrace.llmobs._evaluators.string_matching

—

StringCheckEvaluator

、

RegexMatchEvaluator

Workflow

工作流

Phase 0: Resolve Inputs & Entry Mode

阶段0：解析输入与确定进入模式

Entry mode detection:

Mode	Signal	Behavior
Cold Start	Only `ml_app` provided (no RCA, no hypothesis)	Full open discovery — understand what the app does, identify quality dimensions worth measuring, propose evals for coverage
From RCA	Conversation contains an RCA report or user provides a failure hypothesis	Skip open discovery — use existing failure taxonomy as eval targets

Parse arguments: Extract

ml_app

(first non-flag argument),

--timeframe

(default

now-7d

--data-only

, and

--publish

flags. Set

output_mode = publish

--publish

is set,

output_mode = data_only

--data-only

is set, otherwise

output_mode = sdk_code

. Error if both

--data-only

and

--publish

are present.

Resolution steps:

If
```
ml_app
```
not provided → ask the user.
Auto-detect entry mode:
- If the conversation contains an RCA report (look for "Failure Taxonomy" heading, structured failure modes, or severity ratings) →
```
from_rca
```
  . Extract the taxonomy.
- If the user provides a free-text failure hypothesis (e.g., "the system prompt lacks grounding") →
```
from_rca
```
  . Use the hypothesis as the starting eval target.
- Otherwise →
```
cold_start
```
  .
If
```
timeframe
```
not provided → default to
```
now-7d
```
.
Map existing eval coverage — skip if
output_mode = data_only
(there is no Datadog eval project to check coverage against): Call
```
list_llmobs_evals
```
(org-wide; filter the result client-side to entries where
```
ml_app == <ml_app>
```
). Then, for each eval with
```
source=custom
```
, call
```
get_llmobs_evaluator(eval_name=...)
```
to inspect its prompt template, target, sampling, and filter, and infer which quality dimension it covers. Issue all evaluator calls in a single message (parallelize). Skip
```
source=ootb
```
evals — their names are self-describing and they may not have a fetchable config.
By the end of this step you have a complete coverage map:
```
{eval_name → source, enabled, dimension}
```
. Carry this into Phase 2 for deduplication.
In
publish
mode, also note any template-variable convention the existing custom evaluators already use (so a new suite reads consistently). Online evaluator templates resolve against the full span JSON, not against
```
EvaluatorContext
```
. See the "Online Template Variables" section under "Publishing Conventions" for the supported syntax (
```
{{span_input}}
```
,
```
{{span_output}}
```
, dot-paths, array selectors, filter accessors).
Notebook context detection: Scan the current conversation for a Datadog notebook URL that was produced by
```
/eval-trace-rca
```
(pattern:
```
https://app.datadoghq.com/notebook/{numeric-id}
```
). If found, store it as
```
rca_notebook_url
```
and extract the numeric ID as
```
rca_notebook_id
```
. This is used after Phase 3 to offer appending the evaluator suite to that notebook instead of creating a new one.

进入模式检测：

模式	信号	行为
冷启动	仅提供 `ml_app` （无RCA、无假设）	全面开放探索——了解应用功能，确定值得测量的质量维度，提出评估器以覆盖这些维度
来自RCA	对话包含RCA报告或用户提供失败假设	跳过开放探索——使用现有故障分类作为评估目标

解析参数：提取

ml_app

（第一个非标志参数）、

--timeframe

（默认

now-7d

）、

--data-only

和

--publish

标志。如果设置

--publish

，则

output_mode = publish

；如果设置

--data-only

，则

output_mode = data_only

；否则

output_mode = sdk_code

。如果同时提供

--data-only

和

--publish

，抛出错误。

解析步骤：

如果未提供
```
ml_app
```
→ 询问用户。
自动检测进入模式：
- 如果对话包含RCA报告（查找“Failure Taxonomy”标题、结构化故障模式或严重性评级）→
```
from_rca
```
  。提取分类信息。
- 如果用户提供自由文本形式的失败假设（例如“系统提示缺乏基础信息”）→
```
from_rca
```
  。将该假设作为初始评估目标。
- 否则 →
```
cold_start
```
  。
如果未提供
```
timeframe
```
→ 默认使用
```
now-7d
```
。
映射现有评估覆盖范围 — 如果
output_mode = data_only
则跳过（无Datadog评估项目可检查覆盖范围）：调用
```
list_llmobs_evals
```
（全组织范围；在客户端过滤
```
ml_app == <ml_app>
```
的条目）。然后，对于每个
```
source=custom
```
的评估器，调用
```
get_llmobs_evaluator(eval_name=...)
```
以检查其提示模板、目标、采样和过滤规则，并推断其覆盖的质量维度。在单个消息中发起所有评估器调用（并行化）。跳过
```
source=ootb
```
的评估器——它们的名称自描述，且可能无法获取配置。
此步骤结束后，你将获得完整的覆盖范围映射：
```
{eval_name → source, enabled, dimension}
```
。将其带入阶段2以进行去重。
在发布模式下，还需注意现有自定义评估器已使用的模板变量约定（以便新套件保持一致）。在线评估器模板针对完整Span JSON解析，而非
```
EvaluatorContext
```
。有关支持的语法（
```
{{span_input}}
```
、
```
{{span_output}}
```
、点路径、数组选择器、过滤器访问器），请参阅“发布约定”下的“在线模板变量”部分。
Notebook上下文检测：扫描当前对话，查找由
```
/eval-trace-rca
```
生成的Datadog Notebook URL（模式：
```
https://app.datadoghq.com/notebook/{numeric-id}
```
）。如果找到，将其存储为
```
rca_notebook_url
```
，并提取数字ID作为
```
rca_notebook_id
```
。阶段3结束后，将使用此信息提供将评估器套件附加到该Notebook而非创建新Notebook的选项。

Phase 1: Explore Traces & Identify Eval Targets

阶段1：探索Trace并确定评估目标

Goal: Sample production traces, understand what the app does, and identify quality dimensions worth measuring.

目标：采样生产Trace，了解应用功能，确定值得测量的质量维度。

Cold Start Path

冷启动路径

Sample the app:

search_llmobs_spans(query="@ml_app:\"<ml_app>\" @status:ok", root_spans_only=true, limit=50, from=<timeframe>)

. Filter by

@status:ok

— error spans have no output to evaluate.

Profile the app and identify evaluation target spans: Call

get_llmobs_span_details

for span_ids grouped by trace_id. Inspect

content_info

to classify:

Signal	App Profile
`content_info` has `messages`	LLM/chat app
`content_info` has `documents`	RAG app
Spans include `agent` kind	Agent app
`content_info` has `metadata`	Has custom metadata
Multiple span kinds in one trace ( `agent` + `tool` / `retrieval` + `llm` from `get_llmobs_trace` )	Multi-step app — at least one trace-scope evaluator likely belongs in the suite ( `publish` mode)

For agent/multi-step apps, also call

get_llmobs_trace

on 2-3 traces to see the full span hierarchy. Compare

content_info

between the root span and its sub-spans. Then ask two questions for each candidate quality dimension, in this order:

Does the verdict depend on more than one span? (e.g., faithfulness depends on a
```
retrieval
```
span's documents AND an
```
llm
```
span's answer; goal completion depends on the chain of
```
tool
```
calls AND the final response.) If yes → trace scope in
```
publish
```
mode. Don't try to compress this into a single span.
Only if the answer to (1) is no: pick the single span with the richest signal for that dimension (root has the summary; LLM sub-spans have the full system prompt + tool call results + reasoning chain).

Record the span-kind histogram (agent + tool + llm + retrieval) — multiple kinds under one root is a strong signal you'll have at least one trace-scope evaluator in the suite. See Phase 2's "Span vs. Trace Scope Classification" for the mandatory walk-through of canonical trace-scope use cases.

Extract content and identify targets: Call

get_llmobs_span_content

for representative spans. Fetch fields based on app profile:

App Profile	Fields to Fetch
LLM/chat	`messages` ( `path=$.messages[0]` for system prompt), `output`
RAG	`documents` , `input` , `output`
Agent	`get_llmobs_agent_loop` for the agent span, then `messages` for detail
Any with metadata	`metadata`

Issue all calls in a single message. As you read, capture two streams of signal:

Generic quality signals — what does "success" look like? What variance exists across outputs? Each observed quality dimension becomes a candidate evaluator, with the traces you've just read as evidence. Also look for safety signals (scope violations, sensitive data in outputs, out-of-character responses) and add a safety evaluator if you find them.

Domain signals — these become the domain-specific evaluator category in Phase 2 (the highest-leverage category). For every 5–10 traces, write down:

Recurring intents / question categories — what classes of request does this app handle? (
```
applying for benefit X
```
,
```
comparing flight options
```
,
```
summarizing a policy
```
,
```
creating a widget
```
)
Entities the app emits in outputs — URLs, agency / company names, code identifiers, monetary amounts, dates, IDs, file paths, phone numbers. Note which ones the user acts on downstream (those are worth a correctness evaluator) versus which are passing references.
Tool argument shapes (for agent apps) — name each tool the agent calls and the rough schema of its inputs. Tools with non-trivial schemas (≥ 3 fields, structured types) are candidates for argument-correctness evaluators.
Persona / voice rules — does the app always cite a source, always refuse certain topics (medical, legal, financial advice), always speak in a particular tone? Extract the rules implicitly followed across observed outputs.
Failure modes specific to the domain — fabricated identifiers, outdated policy references, currency / locale mismatches, off-by-one errors in IDs, wrong units. One observed instance is enough to seed a candidate evaluator.

Don't try to enumerate domain signals exhaustively before reading traces — let the patterns surface as you read. The goal is breadth in the eventual proposal, not completeness in this exploration step.

采样应用：

search_llmobs_spans(query="@ml_app:\"<ml_app>\" @status:ok", root_spans_only=true, limit=50, from=<timeframe>)

。按

@status:ok

过滤——错误Span无输出可评估。

分析应用概况并确定评估目标Span：按trace_id分组调用

get_llmobs_span_details

获取span_ids。检查

content_info

进行分类：

信号	应用概况
`content_info` 包含 `messages`	LLM/聊天应用
`content_info` 包含 `documents`	RAG应用
Span包含 `agent` 类型	代理应用
`content_info` 包含 `metadata`	包含自定义元数据
单个Trace中包含多种Span类型（ `agent` + `tool` / `retrieval` + `llm` ，来自 `get_llmobs_trace` ）	多步骤应用——套件中可能至少包含一个Trace范围的评估器（发布模式）

对于代理/多步骤应用，还需调用

get_llmobs_trace

获取2-3个Trace的完整Span层级结构。比较根Span与其子Span的

content_info

。然后针对每个候选质量维度依次提出两个问题：

判断结果是否依赖多个Span？（例如，忠实度依赖
```
retrieval
```
Span的文档和
```
llm
```
Span的回答；目标完成依赖
```
tool
```
调用链和最终响应。）如果是 → 发布模式下使用Trace范围。不要尝试将其压缩到单个Span中。
仅当(1)的答案为否时：选择该维度信号最丰富的单个Span（根Span包含摘要；LLM子Span包含完整系统提示 + 工具调用结果 + 推理链）。

记录Span类型直方图（agent + tool + llm + retrieval）——单个根Span下包含多种类型是套件中至少包含一个Trace范围评估器的强烈信号。有关规范Trace范围用例的强制说明，请参阅阶段2的“Span vs Trace范围分类”。

提取内容并确定目标：调用

get_llmobs_span_content

获取代表性Span的内容。根据应用概况获取字段：

应用概况	要获取的字段
LLM/聊天	`messages` （ `path=$.messages[0]` 获取系统提示）、 `output`
RAG	`documents` 、 `input` 、 `output`
代理	获取代理Span的 `get_llmobs_agent_loop` ，然后获取 `messages` 详情
包含元数据的任何应用	`metadata`

在单个消息中发起所有调用。阅读时，捕获两类信号：

通用质量信号 — “成功”的表现是什么？输出存在哪些差异？每个观察到的质量维度都成为候选评估器，你刚刚读取的Trace作为证据。同时查找安全信号（范围违规、输出中的敏感数据、不符合角色的响应），如果发现则添加安全评估器。

领域信号 — 这些将成为阶段2中的领域特定评估器类别（价值最高的类别）。每读取5-10个Trace，记录：

重复意图/问题类别 — 应用处理哪些类型的请求？（
```
申请福利X
```
、
```
比较航班选项
```
、
```
总结政策
```
、
```
创建小部件
```
）
应用在输出中生成的实体 — URL、机构/公司名称、代码标识符、金额、日期、ID、文件路径、电话号码。注意哪些是用户后续会操作的（这些值得添加正确性评估器），哪些只是引用。
工具参数形状（针对代理应用） — 命名代理调用的每个工具及其输入的大致schema。具有非平凡schema（≥3个字段、结构化类型）的工具是参数正确性评估器的候选对象。
角色/语气规则 — 应用是否始终引用来源、始终拒绝某些主题（医疗、法律、财务建议）、始终使用特定语气？从观察到的输出中提取隐含遵循的规则。
特定领域的故障模式 — 虚构标识符、过时政策引用、货币/区域不匹配、ID中的差一错误、错误单位。只要观察到一个实例，就足以作为候选评估器的种子。

不要在读取Trace前尝试穷举领域信号——让模式在读取过程中自然浮现。目标是最终提案的广度，而非此探索步骤的完整性。

From RCA Path

来自RCA的路径

Extract the failure taxonomy from the RCA report. Each failure mode with High or Medium severity becomes an eval target.
Check root cause categories for infrastructure failures. Before proposing evaluators, scan the Root Cause column of the taxonomy for any of:
```
Instrumentation Deficiency
```
,
```
Harness Deficiency
```
,
```
Runtime Error
```
,
```
Upstream Data Issue
```
, or any other root cause that points to infrastructure/environment rather than model behavior. If any are present, pause and ask:
"Some failure modes were diagnosed as infrastructure or instrumentation issues rather than model behavior (e.g.,
```
{list the infra root causes}
```
). Evaluators can be designed two ways:
- Behavior-targeted (recommended for ongoing quality): measure whether the model produces correct, specific output — useful once the infrastructure is fixed and you want to track real quality
- Artifact-targeted (useful as regression guard): detect the specific broken output observed (e.g., generic placeholder responses) — catches regressions if the infrastructure breaks again
Which approach do you want, or both?"
- If behavior-targeted: design evaluators for what correct output looks like, not what the broken output looked like. Use the RCA's
```
expected_output
```
  / gold-standard examples as the quality bar.
- If artifact-targeted: design evaluators that detect the specific failure symptom (e.g.,
```
StringCheckEvaluator
```
  for a known bad string,
```
LLMJudge
```
  that checks for generic placeholders).
- If both: propose each category separately, clearly labelled.
If all root causes are behavioral (System Prompt Deficiency, Tool Gap, Tool Misuse, Retrieval Failure, etc.) → skip this step and proceed directly.
For each target: if the RCA includes trace IDs, use them directly; otherwise search for matching traces. Fetch 2-3 traces per target with
```
get_llmobs_span_content
```
to understand the concrete pattern.

从RCA报告中提取故障分类。每个具有高或中严重性的故障模式都成为评估目标。
检查根本原因类别是否包含基础设施故障。在提出评估器前，扫描分类的“Root Cause”列，查找以下任何一项：
```
Instrumentation Deficiency
```
、
```
Harness Deficiency
```
、
```
Runtime Error
```
、
```
Upstream Data Issue
```
，或任何其他指向基础设施/环境而非模型行为的根本原因。如果存在，暂停并询问：
"某些故障模式被诊断为基础设施或工具问题，而非模型行为（例如
```
{列出基础设施根本原因}
```
）。评估器可通过两种方式设计：
- 行为导向（推荐用于持续质量监控）：测量模型是否生成正确、特定的输出——在基础设施修复后，用于跟踪实际质量
- ** artifact导向**（用作回归防护）：检测观察到的特定故障输出（例如通用占位符响应）——如果基础设施再次故障，可捕获回归问题
你想要哪种方法，还是两者都要？"
- 如果选择行为导向：设计评估器以衡量正确输出的表现，而非故障输出的表现。使用RCA中的
```
expected_output
```
  /黄金标准示例作为质量标准。
- 如果选择artifact导向：设计评估器以检测特定故障症状（例如针对已知错误字符串的
```
StringCheckEvaluator
```
  、检查通用占位符的
```
LLMJudge
```
  ）。
- 如果选择两者都要：分别提出每个类别，明确标记。
如果所有根本原因都是行为相关的（System Prompt Deficiency、Tool Gap、Tool Misuse、Retrieval Failure等）→ 跳过此步骤，直接继续。
针对每个目标：如果RCA包含Trace ID，直接使用；否则搜索匹配的Trace。调用
```
get_llmobs_span_content
```
获取每个目标的2-3个Trace，以了解具体模式。

Phase 2: Propose Evaluator Suite

阶段2：提出评估器套件

Goal: Present a concrete evaluator proposal for user confirmation.

Each evaluator judges one data point — it receives input and output for a single record/span, not a full trace or batch. Design evaluators accordingly.

Targeting depends on
output_mode
:

```
sdk_code
```
/
```
data_only
```
→ offline experiments. Template variables use
```
EvaluatorContext
```
fields (
```
{{input_data}}
```
,
```
{{output_data}}
```
). The actual data shape depends on the user's dataset and task function (see EvaluatorContext note in SDK Reference).
```
publish
```
→ online evaluation on production spans. Template variables resolve against the full span JSON via dot-paths (
```
{{meta.input.value}}
```
,
```
{{meta.output.messages[*].content}}
```
, …) or the built-in span-kind-aware aliases (
```
{{span_input}}
```
,
```
{{span_output}}
```
). See "Online Template Variables" under Publishing Conventions for the full syntax. Each evaluator also needs
```
eval_scope
```
,
```
sampling_percentage
```
, and (optionally)
```
filter
```
— surface these in the proposal table so the user can confirm before publishing.

Order proposals from broadest signal to most granular. Propose broadly, let the user curate — see "How many evaluators to propose" below.

Domain-specific evaluators — What does "good" mean for this specific app? These are the highest-leverage proposals because they capture quality bars generic evaluators miss. Derive them from the domain signals Phase 1 captured:
- Recurring intents / question categories the app handles (e.g., "applying for a federal benefit", "comparing flight options", "explaining a policy"). Propose an
```
intent_classification
```
  or
```
intent_handling_correctness
```
  evaluator scoped to the dominant intents.
- Specific entities the app produces (URLs, agency names, code identifiers, monetary amounts, dates, IDs). Propose a per-entity correctness evaluator for the ones with real downstream cost when wrong (e.g.,
```
cited_url_is_real
```
  ,
```
agency_name_matches_request
```
  ,
```
monetary_amount_is_consistent_with_input
```
  ).
- Tool argument shapes observed across
```
tool
```
  spans. Propose a per-tool argument-correctness evaluator for the tools with non-trivial schemas (e.g.,
```
search_flights_args_match_user_request
```
  ,
```
update_dashboard_widget_targets_correct_widget
```
  ).
- Persona / voice expectations — does the app always cite sources, always refuse out-of-scope requests, always speak in a specific tone? Propose evaluators for the voice rules you can extract from observed outputs (
```
cites_a_source
```
  ,
```
refuses_medical_advice
```
  ,
```
tone_matches_brand
```
  ).
- Domain-specific failure modes seen across traces (fabricated identifiers, outdated policy references, unit mismatches, currency / locale mismatches). One evaluator per recurring failure mode.
Name each evaluator after the user-facing concern, not the technical check (
```
agency_url_is_real
```
over
```
regex_url_match
```
). Use the trace IDs you read in Phase 1 as evidence — at least one passing case and one failing case per evaluator if you saw both.
Outcome evaluators — Did this span / trace produce a good result for the request?
- Examples:
```
task_completion
```
  ,
```
answer_correctness
```
  ,
```
response_groundedness
```
Format evaluators — Does the output meet structural requirements?
- Examples:
```
valid_json_output
```
  ,
```
response_length
```
  ,
```
citation_format
```
Safety evaluators — Does the output stay within appropriate boundaries?
- Examples:
```
no_pii_leakage
```
  ,
```
scope_adherence
```
  ,
```
no_hallucination
```

目标：提出具体的评估器提案供用户确认。

每个评估器判断一个数据点——接收单个记录/Span的输入和输出，而非完整Trace或批量数据。据此设计评估器。

目标定位取决于
output_mode
：

```
sdk_code
```
/
```
data_only
```
→ 离线实验。模板变量使用
```
EvaluatorContext
```
字段（
```
{{input_data}}
```
、
```
{{output_data}}
```
）。实际数据形状取决于用户的数据集和任务函数（请参阅SDK参考中的EvaluatorContext说明）。
```
publish
```
→ 生产Span上的在线评估。模板变量通过点路径针对完整Span JSON解析（
```
{{meta.input.value}}
```
、
```
{{meta.output.messages[*].content}}
```
等），或使用内置的Span类型感知别名（
```
{{span_input}}
```
、
```
{{span_output}}
```
）。有关完整语法，请参阅发布约定下的“在线模板变量”。每个评估器还需要
```
eval_scope
```
、
```
sampling_percentage
```
和（可选）
```
filter
```
——在提案表格中显示这些内容，以便用户在发布前确认。

按从宽泛到精细的顺序排列提案。广泛提出，让用户筛选——请参阅下面的“要提出多少个评估器”。

领域特定评估器 — 对于此特定应用，“好”的定义是什么？这些是价值最高的提案，因为它们捕获了通用评估器无法覆盖的质量标准。从阶段1捕获的领域信号中推导：
- 应用处理的重复意图/问题类别（例如“申请联邦福利”、“比较航班选项”、“解释政策”）。针对主要意图提出
```
intent_classification
```
  或
```
intent_handling_correctness
```
  评估器。
- 应用生成的特定实体（URL、机构名称、代码标识符、金额、日期、ID）。针对错误会产生实际下游成本的实体提出每个实体的正确性评估器（例如
```
cited_url_is_real
```
  、
```
agency_name_matches_request
```
  、
```
monetary_amount_is_consistent_with_input
```
  ）。
- tool
  Span中观察到的工具参数形状。针对具有非平凡schema的工具提出每个工具的参数正确性评估器（例如
```
search_flights_args_match_user_request
```
  、
```
update_dashboard_widget_targets_correct_widget
```
  ）。
- 角色/语气期望 — 应用是否始终引用来源、始终拒绝超出范围的请求、始终使用特定语气？针对从观察到的输出中提取的语气规则提出评估器（
```
cites_a_source
```
  、
```
refuses_medical_advice
```
  、
```
tone_matches_brand
```
  ）。
- Trace中发现的特定领域故障模式（虚构标识符、过时政策引用、单位不匹配、货币/区域不匹配）。每个重复故障模式对应一个评估器。
每个评估器的名称以用户关注的问题命名，而非技术检查（例如使用
```
agency_url_is_real
```
而非
```
regex_url_match
```
）。使用阶段1中读取的Trace ID作为证据——如果同时看到通过和失败案例，每个评估器至少引用一个通过案例和一个失败案例。
结果评估器 — 此Span/Trace是否为请求生成了良好结果？
- 示例：
```
task_completion
```
  、
```
answer_correctness
```
  、
```
response_groundedness
```
格式评估器 — 输出是否符合结构要求？
- 示例：
```
valid_json_output
```
  、
```
response_length
```
  、
```
citation_format
```
安全评估器 — 输出是否保持在适当范围内？
- 示例：
```
no_pii_leakage
```
  、
```
scope_adherence
```
  、
```
no_hallucination
```

How many evaluators to propose

要提出多少个评估器

The default

4-6

cap from the older skill version was too tight — it pushed the skill toward generic evaluators only and left domain signals on the table. Updated guidance:

Aim for 8–15 evaluators in the proposal, distributed across all four categories (with domain-specific usually the largest bucket, outcome second, format and safety smaller). For very simple single-LLM-call apps, fewer is fine; for agent / RAG apps with rich domain signals, lean toward the upper end.
Quality > generic: every domain-specific proposal should be backed by at least one observed pattern in the sampled traces. Don't invent generic domain evaluators ("
```
response_quality
```
") if you don't have evidence for them.
Let the user curate: the MANDATORY CHECKPOINT below explicitly asks the user to remove what doesn't apply, not just to approve. Treat the proposal as a candidate set the user trims.

旧版技能的默认4-6个上限过于严格——它迫使技能仅提出通用评估器，而忽略领域信号。更新后的指南：

目标提出8-15个评估器，分布在所有四个类别中（领域特定通常是最大的类别，结果类其次，格式和安全类较小）。对于非常简单的单LLM调用应用，数量可以更少；对于具有丰富领域信号的代理/RAG应用，倾向于上限。
质量优先于通用：每个领域特定提案都必须至少有一个采样Trace中的观察模式作为支持。如果没有证据，不要发明通用领域评估器（例如
```
response_quality
```
）。
让用户筛选：下面的强制检查点明确要求用户移除不适用的评估器，而非仅仅批准。将提案视为用户会精简的候选集。

Deduplication Against Existing Coverage

与现有覆盖范围去重

In
data_only
mode: skip this section entirely (coverage map was not built in Phase 0). Proceed directly to the proposal table.

Before building the proposal, apply the coverage map from Phase 0. Coverage is keyed on
(dimension, scope)
— not on dimension alone: every OOTB evaluator runs at span scope, and an enabled OOTB eval does NOT preclude proposing a trace-scope evaluator for the same dimension. The two answer different questions.

Enabled span-scope eval (OOTB or custom) for dimension D:
- Do NOT propose a new span-scope evaluator for D — that dimension is already covered at span scope.
- DO propose a trace-scope evaluator for D when the trace shape calls for it (multi-step app, judgment depends on cross-span context). Note the relationship in the rationale: e.g., "OOTB
```
Goal Completeness
```
  evaluates each LLM span in isolation; this trace-scope
```
goal_completion
```
  checks whether the agent's full sequence of steps achieved the user's request — different question."
Enabled trace-scope custom eval for dimension D: do NOT propose another trace-scope evaluator for the same dimension; that's a real duplicate. Span-scope on the same dimension is still fair game if the data also fits a single span.
Disabled OOTB eval: Do NOT propose a new custom span-scope evaluator for that dimension. Instead, surface it in a short note within the proposal and suggest enabling it in the Datadog UI rather than creating a duplicate. Example:
```
hallucination
```
(ootb, disabled) — consider enabling in Datadog UI (Evaluations → Configure) instead of creating a custom span-scope eval. (A trace-scope
```
rag_faithfulness
```
is still in scope and covers a different question.)
Gap identification: Open the proposal with a coverage summary line: "Existing coverage: N evaluator(s) already configured ({names}, all span-scope unless noted). Proposing evaluators for uncovered dimensions and uncovered scopes."
All dimensions covered: A dimension is "fully covered" only when both scopes are present (or the scope doesn't apply to the app shape). If the coverage map accounts for every identified quality dimension at the appropriate scope(s), surface this explicitly and ask the user what they want: (a) review/improve existing eval prompts, (b) add coverage for additional dimensions, or (c) proceed anyway.

For each proposed evaluator:

Name: Must match
```
^[a-zA-Z0-9_-]+$
```
(alphanumeric, underscore, hyphen only)
Type:
```
LLMJudge
```
(Boolean/Score/Categorical/custom JSON schema), built-in (
```
JSONEvaluator
```
,
```
RegexMatchEvaluator
```
, etc.), or
```
BaseEvaluator
```
subclass. In
publish
mode, only LLM-judge evaluators are supported by the MCP tool — code-based checks must NOT be silently dropped. List them in the same proposal table with
Type
set to the code-based class, mark them under a "Not publishable in this mode" subsection of the proposal, and tell the user to run the skill again in default
sdk_code
mode (or
--data-only
) to capture them. Treat the code-based proposals as part of the suite for counting and coverage purposes.
What it measures: 1-2 sentence plain-language description
Target span: Which span's data the evaluator was designed for (e.g., "root agent span", "LLM sub-span
```
anthropic.request
```
", "all
```
llm
```
spans"). If the root span's I/O is too lossy for the quality dimension (e.g., tool call results aren't visible), note this and specify which sub-span has the signal. In
publish
mode this maps to a combination of
eval_scope
(
span
/
trace
/
session
),
root_spans_only
, and the EVP
filter
query (e.g.
@meta.span.kind:llm
or
service:web
).
Pass/fail criteria:
```
pass_when=True
```
,
```
min_threshold=7
```
,
```
pass_values=["correct"]
```
, or "no automatic assessment" for custom JSON schema

Template variables: Which of

input_data

output_data

expected_output

metadata.*

it uses (offline) — or which span paths / aliases it pulls from (publish mode:

{{span_input}}

{{span_output}}

{{meta.input.messages[*].content}}

{{meta.metadata.<key>}}

, etc.)

Evidence: At least one trace where it would have caught a failure (or confirmed correct behavior)
Publish-only fields (only in
publish
mode):
```
integration_provider
```
(default
```
openai
```
),
```
model_name
```
(default
```
gpt-5.4-mini
```
),
```
sampling_percentage
```
(default
```
10
```
),
```
eval_scope
```
(default
```
span
```
), and any
```
filter
```
query needed to scope to the right spans. Surface defaults in the proposal so the user can override before publishing.
integration_account_id
(only in
publish
mode): the integration account the judge LLM is called through. Auto-detected from existing evaluators in the same ml_app (Phase 0 coverage map). Never asked from the user as a raw UUID. If no existing evaluator has one, the field is omitted and the user picks an account in the UI before activating. All evaluators are published with
enabled: false
regardless — see "Always publish as draft" in Phase 3C for the full activation workflow.

在
data_only
模式下：完全跳过本节（阶段0未构建覆盖范围映射）。直接进入提案表格。

在构建提案前，应用阶段0的覆盖范围映射。覆盖范围以
(dimension, scope)
为键——而非仅维度：每个OOTB评估器都在Span范围运行，启用的OOTB评估器并不排除针对同一维度提出Trace范围评估器的可能性。两者回答的是不同的问题。

针对维度D的已启用Span范围评估器（OOTB或自定义）：
- 不要针对D提出新的Span范围评估器——该维度已在Span范围覆盖。
- 当Trace形状需要时（多步骤应用、判断依赖跨Span上下文），要针对D提出Trace范围评估器。在理由中说明关系：例如“OOTB
```
Goal Completeness
```
  单独评估每个LLM Span；此Trace范围的
```
goal_completion
```
  检查代理的完整步骤序列是否实现了用户请求——这是不同的问题。”
针对维度D的已启用Trace范围自定义评估器：不要针对同一维度提出另一个Trace范围评估器——这是真正的重复。如果数据也适合单个Span，同一维度的Span范围评估器仍然是可行的。
已禁用的OOTB评估器：不要针对该维度提出新的自定义Span范围评估器。相反，在提案中添加简短说明，建议在Datadog UI中启用它，而非创建重复项。示例：
```
hallucination
```
（ootb，已禁用）——考虑在Datadog UI（Evaluations → Configure）中启用，而非创建自定义Span范围评估器。（Trace范围的
```
rag_faithfulness
```
仍然适用，且覆盖不同的问题。）
差距识别：在提案开头添加覆盖范围摘要行：“现有覆盖范围：已配置N个评估器（{名称}，除非特别说明，否则均为Span范围）。针对未覆盖的维度和范围提出评估器。”
所有维度已覆盖：仅当两种范围都存在（或范围不适用于应用形状）时，维度才被“完全覆盖”。如果覆盖范围映射涵盖了所有已识别的质量维度及其适当范围，明确指出这一点并询问用户想要：(a) 审查/改进现有评估提示，(b) 添加对其他维度的覆盖，或(c) 继续执行。

针对每个拟议评估器：

名称：必须匹配
```
^[a-zA-Z0-9_-]+$
```
（仅字母数字、下划线、连字符）
类型：
```
LLMJudge
```
（布尔值/分数/分类/自定义JSON schema）、内置（
```
JSONEvaluator
```
、
```
RegexMatchEvaluator
```
等），或
```
BaseEvaluator
```
子类。在发布模式下，MCP工具仅支持LLM-judge评估器——基于代码的检查不得被静默丢弃。在同一提案表格中列出它们，将
Type
设置为基于代码的类，在提案的“此模式下不可发布”小节中标记，并告知用户重新运行技能的默认
sdk_code
模式（或
--data-only
）以捕获它们。将基于代码的提案视为套件的一部分，用于计数和覆盖范围目的。
测量内容：1-2句通俗易懂的描述
目标Span：评估器设计用于哪个Span的数据（例如“根代理Span”、“LLM子Span
```
anthropic.request
```
”、“所有
```
llm
```
Span”）。如果根Span的输入/输出对于质量维度来说信息损失过大（例如工具调用结果不可见），注明这一点并指定哪个子Span包含信号。在发布模式下，这对应于
eval_scope
（
span
/
trace
/
session
）、
root_spans_only
和EVP
filter
查询的组合（例如
@meta.span.kind:llm
或
service:web
）。
通过/失败标准：
```
pass_when=True
```
、
```
min_threshold=7
```
、
```
pass_values=["correct"]
```
，或自定义JSON schema的“无自动评估结果”

模板变量：使用

input_data

、

output_data

、

expected_output

、

metadata.*

中的哪些（离线）——或从哪些Span路径/别名提取（发布模式：

{{span_input}}

、

{{span_output}}

、

{{meta.input.messages[*].content}}

、

{{meta.metadata.<key>}}

等）

证据：至少一个它会捕获故障（或确认正确行为）的Trace
仅发布模式字段（仅在发布模式下）：
```
integration_provider
```
（默认
```
openai
```
）、
```
model_name
```
（默认
```
gpt-5.4-mini
```
）、
```
sampling_percentage
```
（默认
```
10
```
）、
```
eval_scope
```
（默认
```
span
```
），以及任何需要限定到正确Span的
```
filter
```
查询。在提案中显示默认值，以便用户在发布前覆盖。
integration_account_id
（仅在发布模式下）：Judge LLM调用所通过的集成账户。从同一ml_app中的现有评估器（阶段0覆盖范围映射）自动检测。切勿要求用户提供原始UUID。如果没有现有评估器包含该ID，省略该字段，用户在激活前在UI中选择账户。无论如何，所有评估器都以
enabled: false
发布——有关完整激活工作流，请参阅阶段3C中的“始终以草稿形式发布”。

Span vs. Trace Scope Classification (

publish

mode)

Span vs Trace范围分类（发布模式）

Don't ask the user; classify per evaluator and let them override at the checkpoint.

不要询问用户；针对每个评估器进行分类，让用户在检查点覆盖。

Mandatory: walk the four canonical trace-scope use cases first

强制：首先检查四个规范Trace范围用例

If Phase 1 found multi-step traces (≥ 2 span kinds, or any

tool

retrieval

workflow

span under an

agent

root), you MUST walk through the four canonical trace-scope use cases below before finalizing the suite. For each, decide explicitly: applies (include with

eval_scope: trace

) or does not apply (record a one-line reason in a "Skipped trace-scope candidates" subsection of the proposal). Skipping all four without per-item justification is a sign you've over-anchored on span scope — re-check.

Canonical use case	Triggers when
`goal_completion` — did the agent finish the user's request?	Any agent / multi-step app. Almost always applies.
`tool_use_correctness` — right tool with right arguments?	Trace contains `tool` kind spans.
`rag_faithfulness` — answer grounded in retrieved documents?	Trace contains `retrieval` kind spans.
`conversation_quality` — coherence across multi-turn LLM calls?	Trace contains ≥ 2 `llm` spans, or app instruments multi-turn sessions.

For other proposed evaluators (e.g. tone, format, safety), apply this two-question test:

Can the judgment be answered correctly from one span's
```
meta.input
```
+
```
meta.output
```
, where "correctly" means the verdict cannot change if you considered other spans in the trace? → eval_scope: span
.
Otherwise → eval_scope: trace
. In particular, default to trace when the evaluator name contains grounding, faithfulness, hallucination, completeness, correctness across steps, consistency, or workflow — these almost always need cross-span context.

如果阶段1发现多步骤Trace（≥2种Span类型，或

agent

根Span下的任何

tool

retrieval

workflow

Span），在最终确定套件前必须检查以下四个规范Trace范围用例。针对每个用例，明确决定：适用（包含

eval_scope: trace

）或不适用（在提案的“跳过的Trace范围候选”小节中记录一行理由）。如果没有逐项理由就跳过所有四个用例，表明你过度依赖Span范围——重新检查。

规范用例	触发条件
`goal_completion` — 代理是否完成了用户请求？	任何代理/多步骤应用。几乎总是适用。
`tool_use_correctness` — 是否使用了正确的工具和参数？	Trace包含 `tool` 类型Span。
`rag_faithfulness` — 回答是否基于检索到的文档？	Trace包含 `retrieval` 类型Span。
`conversation_quality` — 多轮LLM调用的连贯性？	Trace包含≥2个 `llm` Span，或应用工具支持多轮会话。

对于其他拟议评估器（例如语气、格式、安全），应用以下两个问题的测试：

判断结果是否可以仅从一个Span的
```
meta.input
```
+
```
meta.output
```
正确得出，其中“正确”意味着考虑Trace中的其他Span不会改变判断结果？ → eval_scope: span
。
否则 → eval_scope: trace
。特别是，当评估器名称包含grounding、faithfulness、hallucination、completeness、跨步骤正确性、consistency或workflow时，默认使用Trace范围——这些几乎总是需要跨Span上下文。

Trade-offs (don't let these dominate the choice)

权衡（不要让这些主导选择）

Trace scope costs more than span scope: one judgment per completed trace (vs. per matching span), larger prompt payloads, and a 3-minute trigger latency (Datadog waits 3 minutes of inactivity before considering a trace complete; later spans are excluded). These are cost-control levers — handle with

sampling_percentage

and

filter

, not by demoting scope. The correctness of the eval is what picks the scope.

Trace范围的成本高于Span范围：每个完成的Trace对应一次判断（vs每个匹配Span对应一次），提示负载更大，触发延迟为3分钟（Datadog等待3分钟无活动后才认为Trace完成；后续Span被排除）。这些是成本控制手段——通过

sampling_percentage

和

filter

处理，而非降低范围。评估的正确性才是选择范围的依据。

Surface the classification

显示分类结果

Add a Scope column to the proposal table and a one-sentence rationale per evaluator. If you skipped a canonical trace-scope use case, list it under a "Skipped trace-scope candidates" subsection with the reason — the user will see and can override.

Example rationales:
tone_check
— span. Judging "is this single response polite" needs only one LLM span's
meta.output.messages[*].content
; no other span in the trace can change that verdict.
goal_completion
— trace. Whether the agent finished the user's request depends on the sequence of tool calls and the final LLM response together —
meta.output
of any single span only shows that step's output.
tool_use_correctness
— trace. Comparing tool inputs against the request and the final response requires correlating ≥ 3 spans (root, tool, final LLM).
rag_faithfulness
— trace. Grounding pairs the
retrieval
span's documents with the LLM span's answer.
Example "Skipped trace-scope candidates" entry:
conversation_quality
— skipped: traces contain a single LLM call (no multi-turn signal in this app's instrumentation).

在提案表格中添加范围列，并为每个评估器添加一句理由。如果跳过了某个规范Trace范围用例，在“跳过的Trace范围候选”小节中列出并说明理由——用户会看到并可以覆盖。

示例理由：
tone_check
— Span范围。判断“此单个响应是否礼貌”仅需要一个LLM Span的
meta.output.messages[*].content
；Trace中的其他Span不会改变该判断结果。
goal_completion
— Trace范围。代理是否完成用户请求取决于工具调用序列和最终LLM响应的组合；任何单个Span的
meta.output
仅显示该步骤的输出。
tool_use_correctness
— Trace范围。将工具输入与请求和最终响应进行比较需要关联≥3个Span（根Span、工具Span、最终LLM Span）。
rag_faithfulness
— Trace范围。忠实度需要将
retrieval
Span的文档与LLM Span的回答配对。
示例“跳过的Trace范围候选”条目：
conversation_quality
— 跳过：Trace包含单个LLM调用（此应用的工具中无多轮信号）。

MANDATORY CHECKPOINT

强制检查点

You MUST output the proposal and wait for user confirmation before proceeding.

undefined

必须输出提案并等待用户确认后再继续。

undefined

Proposed Evaluator Suite

拟议评估器套件

App profile: {LLM | RAG | Agent | Multi-agent} Entry mode: {cold_start | from_rca}

#	Name	Type	Scope	Measures	Pass Criteria
1	task_completion	LLMJudge (Boolean)	span	Whether the task was completed on this span	pass_when=True
2	tool_use_correctness	LLMJudge (Categorical)	trace	Right tool with right arguments across the agent run	pass_values=["correct"]
3	...	...	...	...	...

(Drop the Scope column when not in

publish

mode.)

For each evaluator:

{name}: {what it measures}
- Target span: {which span's data it was designed for}
- Rationale: {which quality dimension it covers and why}
- {Only in publish mode:} Scope: {span | trace} — {one-sentence rationale}
- Evidence: Trace {id_short}

{Only in publish mode, for multi-step apps. Required if any of the four canonical trace-scope use cases was not included above:}

Skipped trace-scope candidates:

```
{canonical_use_case}
```
— {one-line reason it does not apply to this app}

{Only in publish mode, when the suite contains code-based evaluators (JSONEvaluator, RegexMatchEvaluator, LengthEvaluator, StringCheckEvaluator, BaseEvaluator). Required when any code-based proposal exists.}

Not publishable in this mode (code-based evaluators — the publish API is LLM-judge only):

```
{name}
```
({type}) — {what it would check}. Re-run
```
/eval-bootstrap {ml_app}
```
in default mode to emit as offline SDK code, or
```
/eval-bootstrap {ml_app} --data-only
```
for a framework-agnostic JSON spec.


**Which evaluators should I generate?** Treat the proposal as a candidate set — the suite below is intentionally broad so you can pick what matters for your team's quality bar. Reply with **which to keep, which to drop, and which to rename**; not every domain-specific proposal will fit your priorities. In `sdk_code` mode you may also add custom evaluators or change provider/model. In `publish` mode you may override `integration_provider`, `model_name`, `sampling_percentage`, `eval_scope`, `root_spans_only`, or `filter` per evaluator.

Do NOT proceed to code generation until the user confirms.

---

应用概况：{LLM | RAG | Agent | Multi-agent} 进入模式：{cold_start | from_rca}

#	名称	类型	范围	测量内容	通过标准
1	task_completion	LLMJudge (Boolean)	span	此Span上的任务是否完成	pass_when=True
2	tool_use_correctness	LLMJudge (Categorical)	trace	代理运行过程中是否使用了正确的工具和参数	pass_values=["correct"]
3	...	...	...	...	...

（非发布模式下删除范围列。）

针对每个评估器：

{name}：{测量内容}
- 目标Span：{设计用于哪个Span的数据}
- 理由：{覆盖的质量维度及原因}
- {仅发布模式下：} 范围：{span | trace} — {一句理由}
- 证据：Trace {id_short}

{仅发布模式下，针对多步骤应用。如果上述未包含四个规范Trace范围用例中的任何一个，必填：}

跳过的Trace范围候选：

```
{canonical_use_case}
```
— {不适用于此应用的一行理由}

{仅发布模式下，当套件包含基于代码的评估器（JSONEvaluator、RegexMatchEvaluator、LengthEvaluator、StringCheckEvaluator、BaseEvaluator）时。当存在任何基于代码的提案时必填：}

此模式下不可发布（基于代码的评估器——发布API仅支持LLM-judge）：

```
{name}
```
({type}) — {它会检查的内容}。重新运行
```
/eval-bootstrap {ml_app}
```
的默认模式以生成为离线SDK代码，或运行
```
/eval-bootstrap {ml_app} --data-only
```
以生成框架无关的JSON规范。


**我应该生成哪些评估器？** 将提案视为候选集——下面的套件故意设计得很宽泛，以便你选择符合团队质量标准的评估器。回复**保留哪些、删除哪些、重命名哪些**；并非所有领域特定提案都符合你的优先级。在`sdk_code`模式下，你还可以添加自定义评估器或更改提供商/模型。在发布模式下，你可以针对每个评估器覆盖`integration_provider`、`model_name`、`sampling_percentage`、`eval_scope`、`root_spans_only`或`filter`。

在用户确认前，不要继续生成代码。

---

Phase 3: Generate Output

阶段3：生成输出

Branch on

output_mode

```
sdk_code
```
→ Phase 3A below
```
data_only
```
→ skip to Phase 3B
```
publish
```
→ skip to Phase 3C

根据

output_mode

分支：

```
sdk_code
```
→ 下面的阶段3A
```
data_only
```
→ 跳至阶段3B
```
publish
```
→ 跳至阶段3C

Phase 3A: Generate & Write Evaluator Code

阶段3A：生成并写入评估器代码

Goal: Generate the final

.py

file and write it to disk.

For each confirmed evaluator, generate production-quality Python code following the SDK Reference patterns above.

目标：生成最终的

.py

文件并写入磁盘。

针对每个确认的评估器，按照上述SDK参考模式生成生产级Python代码。

Code Generation Rules

代码生成规则

Ground prompts in traces: LLMJudge system prompts and user prompts must reference patterns actually observed in production traces. Never write generic prompts like "evaluate whether the response is good" — ground them in the app's domain, observed failure patterns, and success criteria.
Keep template variables generic, add comments for context: Use
```
{{input_data}}
```
and
```
{{output_data}}
```
as top-level placeholders in prompts — do NOT reference nested span paths like
```
{{input_data.messages[-1].content}}
```
. The evaluator's data comes from the user's dataset and task function, not directly from spans. Instead, add a comment above each evaluator describing what data it was designed for and what the user should adapt:
python
```
# Designed for: input_data = user query, output_data = assistant response text
# Observed from: root agent span (input.value → output.value)
# If your dataset uses a different structure, adapt the prompt references below.
```
Use the narrowest evaluator type: If a check can be done with
```
JSONEvaluator
```
,
```
RegexMatchEvaluator
```
,
```
StringCheckEvaluator
```
, or
```
LengthEvaluator
```
, do NOT use an LLMJudge. Code-based evaluators are faster, cheaper, and deterministic.
BaseEvaluator subclasses:
- Call
```
super().__init__(name=name)
```
  in
```
__init__
```
- Return
```
EvaluatorResult
```
  from
```
evaluate()
```
- Do NOT modify instance attributes in
```
evaluate()
```
  (thread safety)
Names: Must match
```
^[a-zA-Z0-9_-]+$
```
. Use snake_case descriptive names.
Imports: Consolidate at the top of the file. Only import classes that are actually used.
Evaluator list: Collect all evaluators into an
```
evaluators
```
list at the bottom of the file.
Anonymize PII: Strip emails, names, and sensitive data from any trace content included in LLMJudge prompts or the header comment.

基于Trace编写提示：LLMJudge系统提示和用户提示必须引用生产Trace中实际观察到的模式。切勿编写通用提示，例如“评估响应是否良好”——要基于应用领域、观察到的故障模式和成功标准编写。
保持模板变量通用，添加注释说明上下文：在提示中使用
```
{{input_data}}
```
和
```
{{output_data}}
```
作为顶级占位符——不要引用嵌套Span路径，例如
```
{{input_data.messages[-1].content}}
```
。评估器的数据来自用户的数据集和任务函数，而非直接来自Span。相反，在每个评估器上方添加注释，说明其设计用于何种数据以及用户应如何适配：
python
```
# 设计用途：input_data = 用户查询，output_data = 助手响应文本
# 来自：根代理Span（input.value → output.value）
# 如果你的数据集使用不同结构，请适配下面的提示引用。
```
使用最窄的评估器类型：如果检查可以通过
```
JSONEvaluator
```
、
```
RegexMatchEvaluator
```
、
```
StringCheckEvaluator
```
或
```
LengthEvaluator
```
完成，不要使用LLMJudge。基于代码的评估器更快、更便宜且具有确定性。
BaseEvaluator子类：
- 在
```
__init__
```
  中调用
```
super().__init__(name=name)
```
- 从
```
evaluate()
```
  返回
```
EvaluatorResult
```
- 不要在
```
evaluate()
```
  中修改实例属性（线程安全）
名称：必须匹配
```
^[a-zA-Z0-9_-]+$
```
。使用蛇形命名法的描述性名称。
导入：在文件顶部合并导入。仅导入实际使用的类。
评估器列表：在文件底部将所有评估器收集到
```
evaluators
```
列表中。
匿名化PII：从LLMJudge提示或头部注释中包含的任何Trace内容中剥离电子邮件、姓名和敏感数据。

Output Format

输出格式

The generated

.py

file should follow this structure:

python

"""
Auto-generated evaluators for {ml_app}
Generated: {YYYY-MM-DD} by eval-bootstrap

App profile: {LLM | RAG | Agent | Multi-agent}

Quality dimensions covered:
  - {target_name}: {description}
    Evidence: https://app.datadoghq.com/llm/traces?query=trace_id:{full_id}
  ...

Usage:
    from ddtrace.llmobs import LLMObs

    experiment = LLMObs.experiment(
        name="my-experiment",
        task=my_task_fn,
        dataset=dataset,
        evaluators=evaluators,
    )
    experiment.run()
"""

{imports — only what is used}

生成的

.py

文件应遵循以下结构：

python

"""
为{ml_app}自动生成的评估器
生成时间：{YYYY-MM-DD} by eval-bootstrap

应用概况：{LLM | RAG | Agent | Multi-agent}

覆盖的质量维度：
  - {target_name}: {description}
    证据：https://app.datadoghq.com/llm/traces?query=trace_id:{full_id}
  ...

使用方法：
    from ddtrace.llmobs import LLMObs

    experiment = LLMObs.experiment(
        name="my-experiment",
        task=my_task_fn,
        dataset=dataset,
        evaluators=evaluators,
    )
    experiment.run()
"""

{导入——仅导入使用的类}

--- Outcome Evaluators ---

--- 结果评估器 ---

{evaluator code}

{评估器代码}

--- Format Evaluators ---

--- 格式评估器 ---

{evaluator code}

{评估器代码}

--- Safety Evaluators ---

--- 安全评估器 ---

{evaluator code}

{评估器代码}

--- Evaluator Suite ---

--- 评估器套件 ---

evaluators = [ {eval_1_variable_name}, {eval_2_variable_name}, ... ]


Only include section comments (Outcome/Format/Safety) for categories that have evaluators.

evaluators = [ {eval_1_variable_name}, {eval_2_variable_name}, ... ]


仅为包含评估器的类别添加节注释（结果/格式/安全）。

Write the file

写入文件

Write the generated code to the output path (suggest

./evals/{ml_app}_evaluators.py

if not specified), then display a summary:

undefined

将生成的代码写入输出路径（如果未指定，建议使用

./evals/{ml_app}_evaluators.py

），然后显示摘要：

undefined

Generated Evaluators

生成的评估器

Wrote {N} evaluators to

{output_path}

#	Name	Type	Covers
1	...	...	...

已将{N}个评估器写入

{output_path}

：

#	名称	类型	覆盖内容
1	...	...	...

Next Steps

后续步骤

Review: Check the generated prompts and criteria match your expectations
Test offline: Use
```
LLMObs.experiment(evaluators=evaluators)
```
to batch-evaluate against a labeled dataset and verify scores

undefined

审查：检查生成的提示和标准是否符合你的预期
离线测试：使用
```
LLMObs.experiment(evaluators=evaluators)
```
针对标记数据集批量评估并验证分数

undefined

Notebook export (after summary)

Notebook导出（摘要后）

After displaying the summary, offer notebook export.

If
rca_notebook_url
was detected in Phase 0:
An RCA notebook was created earlier in this session:
```
{rca_notebook_url}
```
Would you like to (a) append the evaluator suite summary to that notebook, or (b) create a new standalone notebook?
If append: use the notebook creation fallback pattern (see below) with
```
mcp__datadog-mcp__edit_datadog_notebook
```
(
```
id={rca_notebook_id}
```
,
```
append_only=true
```
, evaluator suite summary cell).
If new: use the notebook creation fallback pattern (see below) with
```
mcp__datadog-mcp__create_datadog_notebook
```
.
If no
rca_notebook_url
:
Would you like to export this evaluator suite summary to a Datadog notebook?
If yes: use the notebook creation fallback pattern (see below) with
```
mcp__datadog-mcp__create_datadog_notebook
```
:
- name
  :
```
Eval Bootstrap: {ml_app} — YYYY-MM-DD
```
- type
  :
```
report
```
- cells
  : single markdown cell with the evaluator suite summary
- time
  :
```
{ "live_span": "1h" }
```

Notebook creation fallback pattern (apply to every

create_datadog_notebook

edit_datadog_notebook

call):

Try the MCP tool first.
If the MCP call fails, inspect the error:
- Auth / permission error (401, 403) → stop and tell the user.
- Field validation error (error names a specific field) → fix that field and retry the MCP call once.
- Any other error (binding, serialization, unexpected response) → fall back to pup:
  - Write the payload to
```
/tmp/nb_bootstrap_{ml_app}.json
```
    as a full API envelope:
```
{"data": {"attributes": {"name": "...", "time": {...}, "cells": [...]}, "type": "notebooks"}}
```
  - Run
```
pup notebooks create --file /tmp/nb_bootstrap_{ml_app}.json
```
  - If pup is not available either, render the notebook content as markdown in chat.
After successful creation by either method, output the URL:
```
Evaluator suite exported to notebook: <url>
```

Notebook cell content — the markdown cell should contain:

markdown

undefined

显示摘要后，提供Notebook导出选项。

如果阶段0检测到
rca_notebook_url
：
本次会话中之前创建了一个RCA Notebook：
```
{rca_notebook_url}
```
你想要(a) 将评估器套件摘要附加到该Notebook，还是(b) 创建新的独立Notebook？
如果选择附加：使用Notebook创建回退模式（如下所述），调用
```
mcp__datadog-mcp__edit_datadog_notebook
```
（
```
id={rca_notebook_id}
```
，
```
append_only=true
```
，评估器套件摘要单元格）。
如果选择新建：使用Notebook创建回退模式（如下所述），调用
```
mcp__datadog-mcp__create_datadog_notebook
```
。
如果未检测到
rca_notebook_url
：
是否要将此评估器套件摘要导出到Datadog Notebook？
如果是：使用Notebook创建回退模式（如下所述），调用
```
mcp__datadog-mcp__create_datadog_notebook
```
：
- name
  ：
```
Eval Bootstrap: {ml_app} — YYYY-MM-DD
```
- type
  ：
```
report
```
- cells
  ：包含评估器套件摘要的单个markdown单元格
- time
  ：
```
{ "live_span": "1h" }
```

Notebook创建回退模式（适用于每个

create_datadog_notebook

edit_datadog_notebook

调用）：

首先尝试MCP工具。
如果MCP调用失败，检查错误：
- 认证/权限错误（401、403） → 停止并告知用户。
- 字段验证错误（错误指出特定字段）→ 修复该字段并重试MCP调用一次。
- 任何其他错误（绑定、序列化、意外响应）→ 回退到pup：
  - 将负载写入
```
/tmp/nb_bootstrap_{ml_app}.json
```
    ，作为完整API包：
```
{"data": {"attributes": {"name": "...", "time": {...}, "cells": [...]}, "type": "notebooks"}}
```
  - 运行
```
pup notebooks create --file /tmp/nb_bootstrap_{ml_app}.json
```
  - 如果pup也不可用，在聊天中渲染Notebook内容为markdown。
通过任一方法成功创建后，输出URL：
```
评估器套件已导出到Notebook：<url>
```

Notebook单元格内容 — markdown单元格应包含：

markdown

undefined

Eval Bootstrap: {ml_app}

Generated: YYYY-MM-DD | App profile: {LLM | RAG | Agent | Multi-agent} | Entry mode: {cold_start | from_rca} Generated code:

{output_path}

{One sentence: what does this app do?}

Coverage: {N} new evaluators ({comma-separated dimension names}) | {N} existing (unchanged: {names}) | {gaps if any: dimensions identified but not covered, and why}

生成时间：YYYY-MM-DD | 应用概况：{LLM | RAG | Agent | Multi-agent} | 进入模式：{cold_start | from_rca} 生成代码：

{output_path}

{一句话：此应用的功能是什么？}

覆盖范围：{N}个新评估器（{逗号分隔的维度名称}） | {N}个现有评估器（未更改：{名称}） | {如果有差距：已识别但未覆盖的维度及原因}

Evaluator Suite

评估器套件

#	Name	Type	Measures	Pass Criteria
1	...	...	...	...

#	名称	类型	测量内容	通过标准
1	...	...	...	...

Evidence

证据

{For each evaluator: name — 1-line description — [Trace link]}

{针对每个评估器：名称 — 一行描述 — [Trace链接]}

Next Steps

后续步骤

Review generated prompts in
```
{output_path}
```
Run against a labeled dataset to validate scores
Deploy to Datadog LLM Experiments

---

审查
```
{output_path}
```
中生成的提示
针对标记数据集运行以验证分数
部署到Datadog LLM Experiments

---

Phase 3B: Generate & Write Eval Spec JSON

阶段3B：生成并写入评估规范JSON

Goal: Serialize the confirmed evaluator suite and representative trace samples to a single self-contained JSON file — zero SDK dependencies.

Output path:

./evals/{ml_app}_eval_spec.json

目标：将确认的评估器套件和代表性Trace样本序列化为单个独立的JSON文件——无SDK依赖。

输出路径：

./evals/{ml_app}_eval_spec.json

JSON Schema

json

{
  "schema_version": "1",
  "generated_at": "<ISO 8601 UTC>",
  "generated_by": "eval-bootstrap",
  "app": {
    "ml_app": "<string>",
    "app_type": "LLM | RAG | Agent | Multi-agent",
    "trace_window": "<timeframe param, e.g. now-7d>",
    "trace_count": "<integer>"
  },
  "evaluators": [
    {
      "name": "snake_case_name",
      "category": "outcome | format | safety",
      "type": "llm_judge | code_check",
      "description": "<1-2 sentence plain-language description>",
      "target_span": "<which span: root, llm sub-span, etc.>",
      "scoring": {
        "scale": "boolean | score_1_10 | categorical",
        "categories": ["<only present when scale=categorical>"],
        "pass_criteria": "<human-readable: true, >= 7, in [correct], etc.>"
      },
      "rubric": "<full prompt text for llm_judge; null for code_check>",
      "implementation_hints": {
        "type_if_code_check": "json_valid | regex | contains | length_words | null",
        "pattern_if_code_check": "<pattern string or null>",
        "notes": "<optional framework-agnostic implementation guidance>"
      },
      "evidence": [
        {
          "trace_id": "<32-char hex>",
          "span_id": "<16-char hex>",
          "url": "https://app.datadoghq.com/llm/traces?query=trace_id:<trace_id>",
          "observation": "<why this trace illustrates the evaluator>"
        }
      ]
    }
  ],
  "sample_records": [
    {
      "trace_id": "<string>",
      "span_id": "<string>",
      "input": {},
      "output": "<string>",
      "suggested_labels": {
        "<evaluator_name>": "pass | fail | <score>"
      }
    }
  ]
}

json

{
  "schema_version": "1",
  "generated_at": "<ISO 8601 UTC>",
  "generated_by": "eval-bootstrap",
  "app": {
    "ml_app": "<string>",
    "app_type": "LLM | RAG | Agent | Multi-agent",
    "trace_window": "<timeframe参数，例如now-7d>",
    "trace_count": "<integer>"
  },
  "evaluators": [
    {
      "name": "snake_case_name",
      "category": "outcome | format | safety",
      "type": "llm_judge | code_check",
      "description": "<1-2句通俗易懂的描述>",
      "target_span": "<哪个Span：根Span、llm子Span等>",
      "scoring": {
        "scale": "boolean | score_1_10 | categorical",
        "categories": ["<仅当scale=categorical时存在>"],
        "pass_criteria": "<人类可读：true, >= 7, in [correct]等>"
      },
      "rubric": "<llm_judge的完整提示文本；code_check为null>",
      "implementation_hints": {
        "type_if_code_check": "json_valid | regex | contains | length_words | null",
        "pattern_if_code_check": "<模式字符串或null>",
        "notes": "<可选的框架无关实现指导>"
      },
      "evidence": [
        {
          "trace_id": "<32字符十六进制>",
          "span_id": "<16字符十六进制>",
          "url": "https://app.datadoghq.com/llm/traces?query=trace_id:<trace_id>",
          "observation": "<此Trace如何说明评估器>"
        }
      ]
    }
  ],
  "sample_records": [
    {
      "trace_id": "<string>",
      "span_id": "<string>",
      "input": {},
      "output": "<string>",
      "suggested_labels": {
        "<evaluator_name>": "pass | fail | <score>"
      }
    }
  ]
}

Field Notes

字段说明

evaluators[].type
:
```
"llm_judge"
```
for semantic evaluators;
```
"code_check"
```
for deterministic checks (regex, length, JSON validity, etc.).
evaluators[].rubric
: For
```
llm_judge
```
— full prompt text grounded in observed trace patterns. Use
```
{{input}}
```
and
```
{{output}}
```
as generic placeholders (not
```
{{input_data}}
```
— that's ddeval-specific). For
```
code_check
```
— null.
evaluators[].implementation_hints.notes
: Optional framework-agnostic guidance, e.g. "For OpenAI Evals, use
```
rubric
```
as a model-graded criterion. For Braintrust, use as an LLM scorer. For Promptfoo, use as an
```
llm-rubric
```
assertion."
sample_records
: 10–20 representative traces from Phase 1.
```
suggested_labels
```
are Claude's best-read from trace inspection — not ground truth. The field name communicates this explicitly.
PII rule: Strip emails, names, and sensitive data from all
```
input
```
,
```
output
```
, and
```
evidence[].observation
```
fields before writing (same as Phase 3A).

evaluators[].type
：
```
"llm_judge"
```
用于语义评估器；
```
"code_check"
```
用于确定性检查（正则、长度、JSON有效性等）。
evaluators[].rubric
：对于
```
llm_judge
```
——基于观察到的Trace模式的完整提示文本。使用
```
{{input}}
```
和
```
{{output}}
```
作为通用占位符（而非
```
{{input_data}}
```
——这是ddeval特定的）。对于
```
code_check
```
——null。
evaluators[].implementation_hints.notes
：可选的框架无关指导，例如“对于OpenAI Evals，使用
```
rubric
```
作为模型评分标准。对于Braintrust，将其用作LLM评分器。对于Promptfoo，将其用作
```
llm-rubric
```
断言。”
sample_records
：来自阶段1的10-20个代表性Trace。
```
suggested_labels
```
是Claude通过Trace检查得出的最佳猜测——并非基准真值。字段名称明确传达了这一点。
PII规则：在写入前，从所有
```
input
```
、
```
output
```
和
```
evidence[].observation
```
字段中剥离电子邮件、姓名和敏感数据（与阶段3A相同）。

Writing Instructions

写入说明

Assemble the JSON object in memory following the schema above.
Populate
```
sample_records
```
from traces already fetched in Phase 1. Fetch additional traces (up to 20 total) if fewer than 10 were read.
Anonymize PII in all
```
input
```
,
```
output
```
, and
```
evidence[].observation
```
fields.
Write the file with 2-space indentation using the Write tool.
Display a completion summary:

undefined

在内存中按照上述schema组装JSON对象。
从阶段1已获取的Trace中填充
```
sample_records
```
。如果读取的Trace少于10个，获取额外的Trace（最多20个）。
匿名化所有
```
input
```
、
```
output
```
和
```
evidence[].observation
```
字段中的PII。
使用Write工具写入文件，使用2空格缩进。
显示完成摘要：

undefined

Generated Eval Spec

生成的评估规范

Wrote

./evals/{ml_app}_eval_spec.json

{N} evaluators ({outcome_count} outcome, {format_count} format, {safety_count} safety)
{M} sample records with suggested labels

#	Name	Category	Type	Pass Criteria
1	...	...	...	...

已写入

./evals/{ml_app}_eval_spec.json

：

{N}个评估器（{outcome_count}个结果类，{format_count}个格式类，{safety_count}个安全类）
{M}个样本记录，包含建议标签

#	名称	类别	类型	通过标准
1	...	...	...	...

Next Steps

后续步骤

Review: Open
```
./evals/{ml_app}_eval_spec.json
```
and verify the rubrics match your expectations
Implement: Use the
```
rubric
```
field to configure evaluators in your framework of choice:
- OpenAI Evals: use
```
rubric
```
  as a model-graded criterion
- Braintrust: create an LLM scorer with the rubric text
- Promptfoo: use as an
```
llm-rubric
```
  assertion
- Custom code: call your LLM API with the rubric and parse the structured output
Label:
```
suggested_labels
```
are Claude's best guesses from trace inspection — verify against ground truth before using as training data

undefined

审查：打开
```
./evals/{ml_app}_eval_spec.json
```
并验证评估准则是否符合你的预期
实现：使用
```
rubric
```
字段在你选择的框架中配置评估器：
- OpenAI Evals：将
```
rubric
```
  用作模型评分标准
- Braintrust：使用评估准则文本创建LLM评分器
- Promptfoo：将其用作
```
llm-rubric
```
  断言
- 自定义代码：使用评估准则调用LLM API并解析结构化输出
标记：
```
suggested_labels
```
是Claude通过Trace检查得出的最佳猜测——在用作训练数据前，针对基准真值进行验证

undefined

Notebook export (after summary)

Notebook导出（摘要后）

Same logic as Phase 3A — offer to append to the RCA notebook if

rca_notebook_url

was detected, or create a new standalone notebook. Use the same notebook cell format as Phase 3A, substituting

output_path

with the JSON spec file path. In pup mode, use

pup notebooks create

pup notebooks edit

as described in Phase 3A.

与阶段3A逻辑相同——如果检测到

rca_notebook_url

，提供附加到RCA Notebook的选项，否则创建新的独立Notebook。使用与阶段3A相同的Notebook单元格格式，将

output_path

替换为JSON规范文件路径。在pup模式下，按照阶段3A中的描述使用

pup notebooks create

pup notebooks edit

。

Phase 3C: Publish Online Evaluators to Datadog

阶段3C：将在线评估器发布到Datadog

Goal: For each confirmed evaluator, write an LLM-judge configuration to Datadog via

create_or_update_llmobs_evaluator

so it runs automatically on matching production spans.

目标：针对每个确认的评估器，通过

create_or_update_llmobs_evaluator

将LLM-judge配置写入Datadog，使其自动在匹配的生产Span上运行。

Pre-publish checks (single message — parallelize)

发布前检查（单个消息——并行化）

For every proposed

eval_name

, call

get_llmobs_evaluator(eval_name=...)

Not found → safe to create.
Found → existing evaluator with the same name. Surface a diff to the user (existing dimension/prompt vs. proposed) and ask:
Evaluator
```
{name}
```
already exists. Overwrite, rename, or skip?
If overwrite: keep the fetched config as the base and merge your generated fields on top, then send the complete object back. The MCP tool is full-replace — any field you omit (e.g.
```
temperature
```
,
```
max_tokens
```
,
```
filter
```
,
```
sampling_percentage
```
) reverts to its default. Never re-publish without round-tripping the existing config.
If rename: append a suffix (e.g.
```
_v2
```
) and treat as new.
If skip: drop from the publish set.

针对每个拟议的

eval_name

，调用

get_llmobs_evaluator(eval_name=...)

：

未找到 → 可以安全创建。
找到 → 存在同名的现有评估器。向用户显示差异（现有维度/提示与拟议内容）并询问：
评估器
```
{name}
```
已存在。是否覆盖、重命名或跳过？
如果选择覆盖：将获取的配置作为基础，合并生成的字段，然后将完整对象返回。MCP工具是全替换模式——任何省略的字段（例如
```
temperature
```
、
```
max_tokens
```
、
```
filter
```
、
```
sampling_percentage
```
）都会重置为默认值。切勿不往返现有配置就重新发布。
如果选择重命名：添加后缀（例如
```
_v2
```
）并视为新评估器。
如果选择跳过：从发布集中移除。

Publishing Conventions

发布约定

Required parameters for each

create_or_update_llmobs_evaluator

call:

eval_name

application_name

ml_app

enabled

integration_provider

model_name

prompt_template

parsing_type

output_schema

, plus a

telemetry.intent

string.

Defaults to use unless the user overrides:

Field	Default
`enabled`	`false` (always — see "Always publish as draft")
`integration_provider`	`openai`
`model_name`	`gpt-5.4-mini`
`temperature`	`0`
`parsing_type`	`structured_output`
`sampling_percentage`	`10` for span scope, `5` for trace scope
`eval_scope`	`span` (auto-promoted to `trace` per the classification rule in Phase 2)

Prompt template: convert the LLMJudge prompt into the MCP shape — an ordered array of

{role, content}

messages. The system prompt becomes

{role: "system"}

, the user prompt becomes

{role: "user"}

. Use span-data placeholders (see below) — not the offline

{{input_data}}

{{output_data}}

form, which only exists in

EvaluatorContext

每个

create_or_update_llmobs_evaluator

调用的必填参数：

eval_name

、

application_name

（=

ml_app

）、

enabled

、

integration_provider

、

model_name

、

prompt_template

、

parsing_type

、

output_schema

，以及

telemetry.intent

字符串。

默认值，除非用户覆盖：

字段	默认值
`enabled`	`false` （始终如此——请参阅“始终以草稿形式发布”）
`integration_provider`	`openai`
`model_name`	`gpt-5.4-mini`
`temperature`	`0`
`parsing_type`	`structured_output`
`sampling_percentage`	Span范围为 `10` ，Trace范围为 `5`
`eval_scope`	`span` （根据阶段2的分类规则自动升级为 `trace` ）

提示模板：将LLMJudge提示转换为MCP格式——有序的

{role, content}

消息数组。系统提示变为

{role: "system"}

，用户提示变为

{role: "user"}

。使用Span数据占位符（如下所述）——不要使用离线的

{{input_data}}

{{output_data}}

形式，这仅存在于

EvaluatorContext

中。

Online Template Variables

在线模板变量

Online evaluator prompts run through the dd-source

template

library (

domains/ml-observability/shared/libs/template

). Missing paths → empty string. The data shape templates resolve against depends on
eval_scope
:

eval_scope: span
(default) — placeholders resolve against a single span's JSON (the
```
llmobs.Span
```
JSON-marshaled to a map). Use the span aliases / dot-paths below directly.
eval_scope: trace
— placeholders resolve against the trace payload
```
{ spans: [...] }
```
. Use
```
{{spans[N]...}}
```
,
```
{{spans[*]...}}
```
, or
```
{{spans[field.path:value]...}}
```
to select span(s) before applying field paths. The
```
{{span_input}}
```
/
```
{{span_output}}
```
aliases are not available in trace scope — reference span data through the
```
spans
```
array instead.
eval_scope: session
— not supported by this skill; classify as
```
span
```
and surface the limitation to the user.

在线评估器提示通过dd-source

template

库（

domains/ml-observability/shared/libs/template

）解析。路径不存在→空字符串。模板解析的数据形状取决于
eval_scope
：

eval_scope: span
（默认）——占位符针对单个Span的JSON（
```
llmobs.Span
```
序列化为映射）解析。直接使用下面的Span别名/点路径。
eval_scope: trace
——占位符针对Trace负载
```
{ spans: [...] }
```
解析。使用
```
{{spans[N]...}}
```
、
```
{{spans[*]...}}
```
或
```
{{spans[field.path:value]...}}
```
选择Span，然后应用字段路径。
```
{{span_input}}
```
/
```
{{span_output}}
```
别名在Trace范围中不可用——通过
```
spans
```
数组引用Span数据。
eval_scope: session
——本技能不支持；分类为
```
span
```
并向用户说明限制。

Span-scope (

eval_scope: span

)

Span范围（

eval_scope: span

）

Built-in span-kind-aware aliases (preferred when the evaluator is generic across span kinds):

Alias	LLM span ( `meta.span.kind = "llm"` )	Other spans (agent, workflow, task, …)
`{{span_input}}`	`meta.input.messages[*].content`	`meta.input.value`
`{{span_output}}`	`meta.output.messages[*].content`	`meta.output.value`

Common explicit dot-paths (use when the evaluator is purpose-built for one span kind):

Path	What you get
`{{meta.input.value}}` / `{{meta.output.value}}`	Plain string I/O on agent / workflow / task / tool spans
`{{meta.input.messages[*].content}}`	All input message contents on an LLM span (newline-joined)
`{{meta.input.messages[0].content}}`	First message (typically system prompt)
`{{meta.output.messages[*].content}}`	Assistant response(s)
`{{meta.input.documents}}`	Retrieved docs (RAG) — JSON-serialized
`{{meta.metadata.<key>}}`	Custom metadata fields
`{{meta.tool_definitions}}`	Available tools — JSON array
`{{*}}`	Entire span as compact JSON (debug / fall-back catch-all)

内置的Span类型感知别名（当评估器跨Span类型通用时优先使用）：

别名	LLM Span（ `meta.span.kind = "llm"` ）	其他Span（agent、workflow、task等）
`{{span_input}}`	`meta.input.messages[*].content`	`meta.input.value`
`{{span_output}}`	`meta.output.messages[*].content`	`meta.output.value`

常见显式点路径（当评估器专为一种Span类型设计时使用）：

路径	获取内容
`{{meta.input.value}}` / `{{meta.output.value}}`	agent/workflow/task/tool Span的纯字符串输入/输出
`{{meta.input.messages[*].content}}`	LLM Span的所有输入消息内容（换行连接）
`{{meta.input.messages[0].content}}`	第一条消息（通常为系统提示）
`{{meta.output.messages[*].content}}`	助手响应
`{{meta.input.documents}}`	检索到的文档（RAG）——JSON序列化
`{{meta.metadata.<key>}}`	自定义元数据字段
`{{meta.tool_definitions}}`	可用工具——JSON数组
`{{*}}`	整个Span的紧凑JSON（调试/回退兜底）

Trace-scope (

eval_scope: trace

)

Trace范围（

eval_scope: trace

）

Pattern	What you get
`{{spans}}`	JSON of every span in the trace
`{{spans[N].meta.input.value}}`	Single span by index — `spans[0]` is the trace root
`{{spans[*].name}}`	All span names in order, newline-joined
`{{spans[*].meta.output.value}}`	All spans' outputs, newline-joined (handy for "final answer = last output")
`{{spans[name:my-span].meta.input.value}}`	Filter by span name
`{{spans[meta.span.kind:llm].meta.output.value}}`	All LLM-kind span outputs
`{{spans[meta.span.kind:tool]}}`	Whole tool spans as JSON, paired in/out — useful for tool-use correctness
`{{spans[meta.span.kind:retrieval].meta.output.documents[*].text}}`	Text of every retrieved document — useful for RAG faithfulness
`{{*}}`	Entire trace payload as JSON (debug fallback)

模式	获取内容
`{{spans}}`	Trace中所有Span的JSON
`{{spans[N].meta.input.value}}`	通过索引选择单个Span—— `spans[0]` 是Trace根Span
`{{spans[*].name}}`	所有Span名称按顺序排列，换行连接
`{{spans[*].meta.output.value}}`	所有Span的输出，换行连接（适用于“最终答案=最后一个输出”）
`{{spans[name:my-span].meta.input.value}}`	按Span名称过滤
`{{spans[meta.span.kind:llm].meta.output.value}}`	所有LLM类型Span的输出
`{{spans[meta.span.kind:tool]}}`	完整的工具Span JSON，包含输入/输出——适用于工具使用正确性
`{{spans[meta.span.kind:retrieval].meta.output.documents[*].text}}`	所有检索到的文档文本——适用于RAG忠实度
`{{*}}`	整个Trace负载的JSON（调试回退）

Array selector syntax (applies to both scopes)

数组选择器语法（适用于两种范围）

```
[N]
```
— index (0-based)
```
[START,END]
```
— inclusive range,
```
END
```
is clamped to slice length
```
[*]
```
— wildcard (fan-out over all elements)

[field.path:value]

— filter array elements by a nested field equality, e.g.

messages[role:user]

spans[meta.span.kind:tool]

Resolution rules to keep in mind when writing prompts:

Arrays of strings → newline-joined
Arrays of objects / mixed values → compact JSON
Single empty slice → empty string
Implicit fan-out:
```
messages.content
```
behaves the same as
```
messages[*].content
```
Negative indices are not supported (parse error) — use
```
[N]
```
with a known index, or
```
[*]
```
for "last assistant turn" semantics

When to pick which form:

Generic span evaluator (e.g.
```
tone_check
```
,
```
output_format
```
) → use
```
{{span_input}}
```
/
```
{{span_output}}
```
so it works across span kinds.
LLM-span-specific evaluator (e.g.
```
system_prompt_adherence
```
) → reach for explicit
```
meta.input.messages[*].content
```
/
```
meta.output.messages[*].content
```
so you can split system vs. user vs. assistant turns.
Span-scope RAG evaluator (single retrieval+generation span) → combine
```
{{meta.input.documents}}
```
with
```
{{span_output}}
```
.
Trace-scope evaluator → see "Trace-scope evaluator examples" below for the four canonical patterns (goal completion, tool-use correctness, RAG faithfulness, conversation quality).
Metadata-aware evaluator → reference
```
{{meta.metadata.<key>}}
```
directly.

If the user has existing custom evaluators in the same ml_app (Phase 0 coverage map), match their convention when there is no strong reason to deviate.

```
[N]
```
— 索引（从0开始）
```
[START,END]
```
— 包含范围，
```
END
```
被钳制为切片长度
```
[*]
```
— 通配符（遍历所有元素）

[field.path:value]

— 按嵌套字段相等性过滤数组元素，例如

messages[role:user]

或

spans[meta.span.kind:tool]

编写提示时需记住的解析规则：

字符串数组→换行连接
对象数组/混合值→紧凑JSON
单个空切片→空字符串
隐式遍历：
```
messages.content
```
与
```
messages[*].content
```
行为相同
不支持负索引（解析错误）——使用
```
[N]
```
指定已知索引，或使用
```
[*]
```
实现“最后一个助手轮次”语义

何时选择哪种形式：

通用Span评估器（例如
```
tone_check
```
、
```
output_format
```
）→ 使用
```
{{span_input}}
```
/
```
{{span_output}}
```
，使其跨Span类型工作。
LLM Span特定评估器（例如
```
system_prompt_adherence
```
）→ 使用显式的
```
meta.input.messages[*].content
```
/
```
meta.output.messages[*].content
```
，以便区分系统/用户/助手轮次。
Span范围RAG评估器（单个检索+生成Span）→ 组合
```
{{meta.input.documents}}
```
和
```
{{span_output}}
```
。
Trace范围评估器→ 请参阅下面的“Trace范围评估器示例”，了解四个规范模式（目标完成、工具使用正确性、RAG忠实度、对话质量）。
元数据感知评估器→ 直接引用
```
{{meta.metadata.<key>}}
```
。

如果同一ml_app中存在现有自定义评估器（阶段0覆盖范围映射），在没有充分理由偏离时匹配其约定。

Trace-scope evaluator examples

Trace范围评估器示例

Concrete user-prompt bodies for the four canonical trace-scope use cases, drawn from the public docs (Trace-Level Evaluations). Each goes alongside a static System prompt that describes the rubric (no placeholders).

Use case	`filter`	User prompt body
Goal completion — agent finished the user's request	`@parent_id:undefined @meta.span.kind:agent`	`User goal:\n{{spans[0].meta.input.value}}\n\nAgent steps:\n{{spans}}`
Tool-use correctness — right tool with right arguments	`@parent_id:undefined @meta.span.kind:agent`	`User question:\n{{spans[0].meta.input.value}}\n\nTool calls:\n{{spans[meta.span.kind:tool].meta.input.parameters}}\n\nFinal response:\n{{spans[*].meta.output.value}}`
RAG faithfulness — answer grounded in retrieved docs	`@parent_id:undefined`	`Retrieved context:\n{{spans[meta.span.kind:retrieval].meta.output.documents[*].text}}\n\nFinal answer:\n{{spans[meta.span.kind:llm].meta.output.value}}`
Conversation quality — coherence and consistency across turns	`@parent_id:undefined`	`Conversation:\n{{spans[meta.span.kind:llm].meta.input.messages[].content}}\n\nAssistant responses:\n{{spans[meta.span.kind:llm].meta.output.messages[].content}}`

Use these as starting points. Adapt the

filter

and span paths to the actual span names / kinds the app emits (observed during Phase 1).

output_schema
wrapper format (required for all providers)

The

output_schema

field is NOT a bare JSON Schema. It must use the OpenAI

json_schema

object shape. name
is a fixed type discriminator, not the evaluator name — the UI validates it against a strict allowlist and rejects any other value:

LLMJudge type	`name` value	property key inside `schema`
Boolean	`"boolean_eval"`	`boolean_eval`
Score	`"score_eval"`	`score_eval`
Categorical	`"categorical_eval"`	`categorical_eval`

The property key inside

schema.properties

must match

name

exactly. The

required

array may only be

["<type_key>"]

["<type_key>", "reasoning"]

— any other value is rejected. Always include

"reasoning": {"type": "string"}

for UI display.

Boolean (

BooleanStructuredOutput(pass_when=True)

json

{
  "output_schema": {
    "name": "boolean_eval",
    "strict": true,
    "schema": {
      "type": "object",
      "properties": {
        "boolean_eval": {"type": "boolean", "description": "Whether the criterion is met"},
        "reasoning": {"type": "string", "description": "Explanation for the evaluation"}
      },
      "required": ["boolean_eval", "reasoning"],
      "additionalProperties": false
    }
  },
  "assessment_criteria": {"pass_when": true}
}

Score (

ScoreStructuredOutput(min_score=1, max_score=10, min_threshold=7)

json

{
  "output_schema": {
    "name": "score_eval",
    "strict": true,
    "schema": {
      "type": "object",
      "properties": {
        "score_eval": {"type": "number", "description": "Score from 1 to 10", "minimum": 1, "maximum": 10},
        "reasoning": {"type": "string", "description": "Explanation for the score"}
      },
      "required": ["score_eval", "reasoning"],
      "additionalProperties": false
    }
  },
  "assessment_criteria": {"min_threshold": 7}
}

Add

max_threshold

assessment_criteria

if set.

Categorical (

CategoricalStructuredOutput(categories={...}, pass_values=[...])

json

{
  "output_schema": {
    "name": "categorical_eval",
    "strict": true,
    "schema": {
      "type": "object",
      "properties": {
        "categorical_eval": {
          "type": "string",
          "anyOf": [
            {"const": "correct", "description": "The response correctly answers the question"},
            {"const": "partially_correct", "description": "Partially correct but missing information"},
            {"const": "incorrect", "description": "The response is wrong or irrelevant"}
          ]
        },
        "reasoning": {"type": "string", "description": "Explanation for the category chosen"}
      },
      "required": ["categorical_eval", "reasoning"],
      "additionalProperties": false
    }
  },
  "assessment_criteria": {"pass_values": ["correct"]}
}

Note: categorical uses

"type": "string"

alongside

anyOf

(each

const

is a string value), unlike the offline SDK which uses bare

anyOf

at the property root.

Custom / multi-dimensional: not directly supported via the fixed-name schema. Implement as a score or categorical evaluator where possible, or split into multiple evaluators. The

name

must be one of the three fixed values above.

Filter scoping: when the proposal targets a specific span kind (e.g. an LLM sub-span), translate it into an EVP

filter

query — e.g.

@meta.span.kind:llm

service:checkout-agent

, or a more specific tag. Combine with

root_spans_only:true

only when the target is the trace root.

For
eval_scope: trace
:

The evaluator triggers once per completed trace, after a 3-minute inactivity window. Late-arriving spans (>3 min after the prior span on the same trace) are excluded from the evaluation. Surface this in the proposal so the user knows about both the latency and the potential miss for sparse-activity agents (long-running agents whose steps are sparser than 3 minutes apart).
The
```
filter
```
query must match the trace's root span only — always include
```
@parent_id:undefined
```
(or
```
root_spans_only: true
```
) to avoid double-firing across descendants. Combine with
```
@meta.span.kind:agent
```
(or whatever kind the app uses for root spans, observed in Phase 1) for narrowing.
Sampling at trace scope is heavier than at span scope (one trace = many spans on the judge's side). Default
```
sampling_percentage
```
to 5
for trace-scope evaluators (instead of the span default
```
10
```
); the user can raise it after a manual review pass.

四个规范Trace范围用例的具体用户提示主体，来自公开文档（Trace-Level Evaluations）。每个提示主体都配有描述评估准则的静态系统提示（无占位符）。

用例	`filter`	用户提示主体
目标完成 — 代理是否完成了用户请求	`@parent_id:undefined @meta.span.kind:agent`	`用户目标:\n{{spans[0].meta.input.value}}\n\n代理步骤:\n{{spans}}`
工具使用正确性 — 是否使用了正确的工具和参数	`@parent_id:undefined @meta.span.kind:agent`	`用户问题:\n{{spans[0].meta.input.value}}\n\n工具调用:\n{{spans[meta.span.kind:tool].meta.input.parameters}}\n\n最终响应:\n{{spans[*].meta.output.value}}`
RAG忠实度 — 回答是否基于检索到的文档	`@parent_id:undefined`	`检索到的上下文:\n{{spans[meta.span.kind:retrieval].meta.output.documents[*].text}}\n\n最终答案:\n{{spans[meta.span.kind:llm].meta.output.value}}`
对话质量 — 多轮对话的连贯性和一致性	`@parent_id:undefined`	`对话:\n{{spans[meta.span.kind:llm].meta.input.messages[].content}}\n\n助手响应:\n{{spans[meta.span.kind:llm].meta.output.messages[].content}}`

将这些作为起点。根据阶段1中观察到的应用实际Span名称/类型，调整

filter

和Span路径。

output_schema
包装格式（所有提供商必填）

output_schema

字段不是裸JSON Schema。必须使用OpenAI

json_schema

对象格式。name
是固定类型判别符，而非评估器名称——UI会针对严格允许列表进行验证，拒绝任何其他值：

LLMJudge类型	`name` 值	`schema` 内的属性键
布尔值	`"boolean_eval"`	`boolean_eval`
分数	`"score_eval"`	`score_eval`
分类	`"categorical_eval"`	`categorical_eval`

schema.properties

内的属性键必须与

name

完全匹配。

required

数组只能是

["<type_key>"]

或

["<type_key>", "reasoning"]

——任何其他值都会被拒绝。始终包含

"reasoning": {"type": "string"}

用于UI显示。

布尔值（

BooleanStructuredOutput(pass_when=True)

）：

json

{
  "output_schema": {
    "name": "boolean_eval",
    "strict": true,
    "schema": {
      "type": "object",
      "properties": {
        "boolean_eval": {"type": "boolean", "description": "是否符合标准"},
        "reasoning": {"type": "string", "description": "评估解释"}
      },
      "required": ["boolean_eval", "reasoning"],
      "additionalProperties": false
    }
  },
  "assessment_criteria": {"pass_when": true}
}

分数（

ScoreStructuredOutput(min_score=1, max_score=10, min_threshold=7)

）：

json

{
  "output_schema": {
    "name": "score_eval",
    "strict": true,
    "schema": {
      "type": "object",
      "properties": {
        "score_eval": {"type": "number", "description": "1到10分", "minimum": 1, "maximum": 10},
        "reasoning": {"type": "string", "description": "评分解释"}
      },
      "required": ["score_eval", "reasoning"],
      "additionalProperties": false
    }
  },
  "assessment_criteria": {"min_threshold": 7}
}

如果设置了

max_threshold

，将其添加到

assessment_criteria

中。

分类（

CategoricalStructuredOutput(categories={...}, pass_values=[...])

）：

json

{
  "output_schema": {
    "name": "categorical_eval",
    "strict": true,
    "schema": {
      "type": "object",
      "properties": {
        "categorical_eval": {
          "type": "string",
          "anyOf": [
            {"const": "correct", "description": "响应正确回答了问题"},
            {"const": "partially_correct", "description": "部分正确但缺少信息"},
            {"const": "incorrect", "description": "响应错误或无关"}
          ]
        },
        "reasoning": {"type": "string", "description": "类别选择解释"}
      },
      "required": ["categorical_eval", "reasoning"],
      "additionalProperties": false
    }
  },
  "assessment_criteria": {"pass_values": ["correct"]}
}

注意：分类使用

"type": "string"

和

anyOf

（每个

const

是字符串值），与离线SDK不同，离线SDK在属性根级别使用裸

anyOf

。

自定义/多维：无法通过固定名称schema直接支持。尽可能实现为分数或分类评估器，或拆分为多个评估器。

name

必须是上述三个固定值之一。

过滤范围：当提案针对特定Span类型（例如LLM子Span）时，将其转换为EVP

filter

查询——例如

@meta.span.kind:llm

、

service:checkout-agent

或更具体的标签。仅当目标是Trace根Span时，才结合

root_spans_only:true

。

对于
eval_scope: trace
：

评估器在每个完成的Trace上触发一次，等待3分钟无活动窗口。同一Trace中前一个Span后超过3分钟到达的Span会被排除在评估之外。在提案中说明这一点，以便用户了解延迟和稀疏活动代理的潜在遗漏（步骤间隔超过3分钟的长运行代理）。
```
filter
```
查询必须仅匹配Trace的根Span——始终包含
```
@parent_id:undefined
```
（或
```
root_spans_only: true
```
），避免在后代Span上重复触发。结合
```
@meta.span.kind:agent
```
（或阶段1中观察到的应用根Span类型）进行缩小范围。
Trace范围的采样比Span范围更重（一个Trace=Judge侧的多个Span）。Trace范围评估器的默认
```
sampling_percentage
```
为**
```
5
```
**（而非Span范围的默认
```
10
```
）；用户在手动审查后可以提高该值。

Always publish as draft (

enabled: false

)

始终以草稿形式发布（

enabled: false

）

Always create / update evaluators with
enabled: false
— regardless of whether

integration_account_id

was auto-detected from existing evaluators. The UI is the source of truth for activation; the skill should never auto-enable evaluators on the user's behalf. The user reviews each draft in the UI, confirms the integration account is correct (the auto-detected ID may belong to a different judge LLM than the one they want for this app), and flips the toggle when they're satisfied.

This makes the workflow safe by default: a wrong

integration_account_id

, a mistuned prompt, or an over-broad filter never goes live without a human pass. Auto-detection of the account ID still helps because the draft renders with the right account pre-selected — review is faster.

始终以
enabled: false
创建/更新评估器——无论是否从现有评估器自动检测到

integration_account_id

。UI是激活的权威来源；技能绝不能代表用户自动启用评估器。用户在UI中审查每个草稿，确认集成账户正确（自动检测的ID可能属于与用户为此应用所需不同的Judge LLM），并在满意后切换开关。

这使工作流默认安全：错误的

integration_account_id

、调整不当的提示或过于宽泛的过滤器在没有人工检查的情况下永远不会生效。账户ID的自动检测仍然有用，因为草稿会预先选择正确的账户——审查更快。

integration_account_id resolution

integration_account_id解析

The

integration_account_id

is an opaque UUID that the UI matches against the org's integration accounts list to populate the account section dropdown. Users typically don't know this value, so never ask the user to supply a raw UUID.

Resolution order:

Inherit from existing evaluators — in Phase 0 you called
```
get_llmobs_evaluator
```
for each existing custom evaluator. Check the
```
llm_provider.integration_account_id
```
field on those responses. If any of them have a value, use that same ID on the published drafts. If multiple different IDs appear across existing evaluators, pick the most common one and note which you chose so the user can correct it during the UI review pass.
Omit if no existing evaluator has one — if no custom evaluator in the ml_app has an
```
integration_account_id
```
, omit the field from the publish payload. The draft will render without an account pre-selected; the user picks one during the UI review pass before activating.

Either way, the evaluator is published with

enabled: false

. The user is the gate — see "Always publish as draft" above.

integration_account_id

是不透明的UUID，UI会将其与组织的集成账户列表匹配，以填充账户部分的下拉菜单。用户通常不知道此值，因此切勿要求用户提供原始UUID。

解析顺序：

从现有评估器继承——在阶段0中，你调用了
```
get_llmobs_evaluator
```
获取每个现有自定义评估器。检查这些响应中的
```
llm_provider.integration_account_id
```
字段。如果其中任何一个有值，在发布的草稿中使用相同的ID。如果现有评估器中出现多个不同的ID，选择最常见的一个并说明你选择了哪个，以便用户在UI审查过程中更正。
如果没有现有评估器包含该ID则省略——如果ml_app中的自定义评估器都没有
```
integration_account_id
```
，从发布负载中省略该字段。草稿将在未预先选择账户的情况下呈现；用户在激活前的UI审查过程中选择一个账户。

无论哪种情况，评估器都以

enabled: false

发布。用户是把关人——请参阅上面的“始终以草稿形式发布”。

Publish (single message — parallelize)

发布（单个消息——并行化）

Issue all

create_or_update_llmobs_evaluator

calls in a single message (one per evaluator). Set

telemetry.intent

to a short English description like

"skill:llm-obs-eval-bootstrap — Bootstrap evaluator suite for ml_app=<ml_app> from production trace analysis."

If any call fails, capture the error and continue with the remaining evaluators — never silently abort the batch. Report failures explicitly in the summary.

在单个消息中发起所有

create_or_update_llmobs_evaluator

调用（每个评估器对应一次调用）。将

telemetry.intent

设置为简短的英文描述，例如

"skill:llm-obs-eval-bootstrap — Bootstrap evaluator suite for ml_app=<ml_app> from production trace analysis."

。

如果任何调用失败，捕获错误并继续处理剩余评估器——切勿静默中止批量操作。在摘要中明确报告失败情况。

Summary

摘要

undefined

undefined

Published Evaluators (drafts — pending UI review)

已发布评估器（草稿——待UI审查）

Wrote {N} online evaluators to ml_app

{ml_app}

. All published as drafts (
enabled: false
) — review and activate them in the UI before they start scoring spans.

#	Name	Action	Provider/Model	Sampling	Scope	Account auto-detected	Status
1	task_completion	created (draft)	openai/gpt-5.4-mini	10%	span	yes	ok
2	response_groundedness	overwrote (draft)	openai/gpt-5.4-mini	10%	span	yes	ok
3	scope_adherence	renamed ( `_v2` ) (draft)	openai/gpt-5.4-mini	10%	span	no — pick in UI	ok
4	citation_format	failed	openai/gpt-5.4-mini	10%	span	—	error

{If any failed:} Errors:

```
{name}
```
: {error message}

{If any code-based proposals were dropped:} Not published (code-based, not supported by online evaluator API):

```
{name}
```
({type}) — consider running offline via
```
/eval-bootstrap {ml_app}
```
(SDK mode).

已将{N}个在线评估器写入ml_app

{ml_app}

。所有评估器均以草稿形式发布（
enabled: false
）——在它们开始为Span评分前，在UI中审查并激活它们。

#	名称	操作	提供商/模型	采样率	范围	账户自动检测	状态
1	task_completion	创建（草稿）	openai/gpt-5.4-mini	10%	span	是	成功
2	response_groundedness	覆盖（草稿）	openai/gpt-5.4-mini	10%	span	是	成功
3	scope_adherence	重命名（ `_v2` ）（草稿）	openai/gpt-5.4-mini	10%	span	否——在UI中选择	成功
4	citation_format	失败	openai/gpt-5.4-mini	10%	span	—	错误

{如果有失败情况：} 错误:

```
{name}
```
: {错误消息}

{如果有基于代码的提案被丢弃：} 未发布（基于代码，在线评估器API不支持）：

```
{name}
```
({type}) — 考虑通过
```
/eval-bootstrap {ml_app}
```
（SDK模式）离线运行。

Next Steps — review and activate in the UI

后续步骤——在UI中审查并激活

The drafts are intentionally not running yet. Walk through each one in the Datadog UI before flipping the enable toggle:

Open the drafts: Datadog → LLM Observability → Evaluations → filter by ml_app
```
{ml_app}
```
(the new drafts appear with status
```
Disabled
```
).
For each draft:
- Verify the integration account in the Provider section. If the column above shows
```
auto-detected: yes
```
  , confirm it's the correct account for the judge LLM you want this evaluator to call through. If
```
no
```
  , pick an account from the dropdown.
- Skim the prompt template and the structured-output schema — make sure the spans-vs-trace scope, filter, and sampling match what you actually want to measure.
- Click into a sample span/trace and use the test pane to dry-run the prompt against real data. Confirm the result matches your expectation.
Enable: once each draft passes review, toggle it to enabled. Datadog starts scoring incoming spans immediately.
Wait for first scores: with
```
sampling_percentage=10
```
(span scope) or
```
5
```
(trace scope), expect first results within minutes for high-traffic apps.
Tune sampling/filter: if results are noisy or volume is too high, reduce
```
sampling_percentage
```
or tighten the
```
filter
```
from the UI. Re-running
```
/eval-bootstrap {ml_app} --publish
```
will round-trip the existing config before overwriting — your manual tweaks survive across reruns.

undefined

草稿目前未运行。在切换启用开关前，在Datadog UI中逐个检查：

打开草稿：Datadog → LLM Observability → Evaluations → 按ml_app
```
{ml_app}
```
过滤（新草稿显示为
```
Disabled
```
状态）。
针对每个草稿:
- 验证集成账户：在提供商部分。如果上面的列显示
```
自动检测：是
```
  ，确认它是你希望此评估器调用的Judge LLM的正确账户。如果显示
```
否
```
  ，从下拉菜单中选择一个账户。
- 浏览提示模板和结构化输出schema——确保Span/Trace范围、过滤器和采样率与你实际要测量的内容匹配。
- 点击示例Span/Trace并使用测试窗格针对真实数据试运行提示。确认结果符合你的预期。
启用：每个草稿通过审查后，切换为启用状态。Datadog立即开始为传入Span评分。
等待首次评分：对于
```
sampling_percentage=10
```
（Span范围）或
```
5
```
（Trace范围），高流量应用预计几分钟内会出现首次结果。
调整采样/过滤器：如果结果嘈杂或流量过高，从UI中降低
```
sampling_percentage
```
或收紧
```
filter
```
。重新运行
```
/eval-bootstrap {ml_app} --publish
```
会在覆盖前往返现有配置——你的手动调整会在重新运行后保留。

undefined

Notebook export (after summary)

Notebook导出（摘要后）

Same logic as Phase 3A — offer to append to the RCA notebook if

rca_notebook_url

was detected, or create a new standalone notebook. The notebook cell should list the published evaluators with their UI links and the

ml_app

they target. In pup mode, use

pup notebooks create

pup notebooks edit

as described in Phase 3A.

与阶段3A逻辑相同——如果检测到

rca_notebook_url

，提供附加到RCA Notebook的选项，否则创建新的独立Notebook。Notebook单元格应列出已发布的评估器及其UI链接和目标ml_app。在pup模式下，按照阶段3A中的描述使用

pup notebooks create

pup notebooks edit

。

Operating Rules

操作规则

Breadth over precision; let the user curate: Propose 8–15 evaluators distributed across domain-specific (largest bucket — derived from Phase 1 domain signals), outcome, format, and safety. Users can always remove what doesn't fit their quality bar; they cannot easily add what was not proposed. Anchor every domain-specific proposal in at least one observed trace pattern — don't invent generic domain evaluators without evidence.
Don't overfit: Write criteria that generalize beyond the specific sampled traces. Use examples as grounding, not as the sole criteria.
Show your work: Every proposed evaluator cites at least one trace as evidence with a clickable link:
```
[Trace {first_8}...](https://app.datadoghq.com/llm/traces?query=trace_id:{full_32_char_id})
```
.
New file only: Never modify existing evaluator code or experiment configurations.
Honest about uncertainty: If fewer than 5 traces support a proposed evaluator, flag it as tentative.

广度优先于精度，让用户筛选：提出8-15个评估器，分布在领域特定（最大类别——来自阶段1的领域信号）、结果、格式和安全类中。用户始终可以移除不符合其质量标准的评估器；但他们无法轻松添加未提出的评估器。每个领域特定提案都必须至少有一个观察到的Trace模式作为基础——不要在没有证据的情况下发明通用领域评估器。
不要过度拟合：编写的标准应超出特定采样Trace的范围。使用示例作为基础，而非唯一标准。
展示工作过程：每个拟议评估器至少引用一个Trace作为证据，并提供可点击链接：
```
[Trace {first_8}...](https://app.datadoghq.com/llm/traces?query=trace_id:{full_32_char_id})
```
。
仅创建新文件：切勿修改现有评估器代码或实验配置。
诚实面对不确定性：如果支持拟议评估器的Trace少于5个，标记为暂定。

Tool Reference

工具参考

This appendix applies only in pup mode. In MCP mode, use the tool names in the workflow sections directly.

本附录仅适用于pup模式。在MCP模式下，直接使用工作流章节中的工具名称。

Spans and traces

Span和Trace

MCP Tool	pup Command
`search_llmobs_spans(query, ml_app, from, to, limit, cursor, root_spans_only, span_kind, summary)`	`pup llm-obs spans search --query "@ml_app:A [other_filters]" [--from F] [--to T] [--limit N] [--cursor C] [--root-spans-only] [--span-kind K] [--summary]` — always use `--query "@ml_app:A"` to filter by ml_app; the `--ml-app A` flag is unreliable and silently returns spans from other apps.
`get_llmobs_span_details(trace_id, span_ids, from, to)`	`pup llm-obs spans get-details --trace-id T --span-ids S1,S2,...`
`get_llmobs_span_content(trace_id, span_id, field, path)`	`pup llm-obs spans get-content --trace-id T --span-id S --field F [--path P]`
`get_llmobs_trace(trace_id, include_tree)`	`pup llm-obs spans get-trace --trace-id T [--include-tree]`
`get_llmobs_agent_loop(trace_id, span_id)`	`pup llm-obs spans get-agent-loop --trace-id T [--span-id S]`
`find_llmobs_error_spans(trace_id)`	`pup llm-obs spans find-errors --trace-id T`
`expand_llmobs_spans(trace_id, span_ids, max_depth, filter_kind)`	`pup llm-obs spans expand --trace-id T --span-ids S1,S2,... [--max-depth N] [--filter-kind K]`

MCP工具	pup命令
`search_llmobs_spans(query, ml_app, from, to, limit, cursor, root_spans_only, span_kind, summary)`	`pup llm-obs spans search --query "@ml_app:A [other_filters]" [--from F] [--to T] [--limit N] [--cursor C] [--root-spans-only] [--span-kind K] [--summary]` — 始终使用 `--query "@ml_app:A"` 按ml_app过滤； `--ml-app A` 标志不可靠，会静默返回其他应用的Span。
`get_llmobs_span_details(trace_id, span_ids, from, to)`	`pup llm-obs spans get-details --trace-id T --span-ids S1,S2,...`
`get_llmobs_span_content(trace_id, span_id, field, path)`	`pup llm-obs spans get-content --trace-id T --span-id S --field F [--path P]`
`get_llmobs_trace(trace_id, include_tree)`	`pup llm-obs spans get-trace --trace-id T [--include-tree]`
`get_llmobs_agent_loop(trace_id, span_id)`	`pup llm-obs spans get-agent-loop --trace-id T [--span-id S]`
`find_llmobs_error_spans(trace_id)`	`pup llm-obs spans find-errors --trace-id T`
`expand_llmobs_spans(trace_id, span_ids, max_depth, filter_kind)`	`pup llm-obs spans expand --trace-id T --span-ids S1,S2,... [--max-depth N] [--filter-kind K]`

Evaluators

评估器

MCP Tool	pup Command
`list_llmobs_evals()`	`pup llm-obs evals list` (filter by `ml_app` client-side)
`list_llmobs_evals_by_ml_app(ml_app)`	`pup llm-obs evals list-by-ml-app --ml-app A`
`get_llmobs_evaluator(eval_name)`	`pup llm-obs evals get-evaluator EVAL_NAME`
`get_llmobs_eval_aggregate_stats(eval_name, ml_app, from, to)`	`pup llm-obs evals get-aggregate-stats EVAL_NAME [--ml-app A] [--from F] [--to T]`
`delete_llmobs_evaluator(eval_name)`	`pup llm-obs evals delete EVAL_NAME`
`create_or_update_llmobs_evaluator(...)`	`pup llm-obs evals create-or-update EVAL_NAME --file /tmp/eval_EVAL_NAME.json` — see flat schema note below

MCP工具	pup命令
`list_llmobs_evals()`	`pup llm-obs evals list` （在客户端按 `ml_app` 过滤）
`list_llmobs_evals_by_ml_app(ml_app)`	`pup llm-obs evals list-by-ml-app --ml-app A`
`get_llmobs_evaluator(eval_name)`	`pup llm-obs evals get-evaluator EVAL_NAME`
`get_llmobs_eval_aggregate_stats(eval_name, ml_app, from, to)`	`pup llm-obs evals get-aggregate-stats EVAL_NAME [--ml-app A] [--from F] [--to T]`
`delete_llmobs_evaluator(eval_name)`	`pup llm-obs evals delete EVAL_NAME`
`create_or_update_llmobs_evaluator(...)`	`pup llm-obs evals create-or-update EVAL_NAME --file /tmp/eval_EVAL_NAME.json` — 请参阅下面的扁平schema说明

create_or_update_llmobs_evaluator

in pup mode

pup模式下的

create_or_update_llmobs_evaluator

pup uses a flat JSON file (all fields top-level).

get-evaluator

returns a nested object. Transform as follows:

Round-trip check: Call
```
pup llm-obs evals get-evaluator EVAL_NAME
```
first. If it exists, start from its config.

Flatten
llm_provider
: hoist

integration_provider

model_name

integration_account_id

temperature

to top level, dropping the

llm_provider

key.

Merge and set
enabled: false
.
Write to temp file and call:
bash
```
pup llm-obs evals create-or-update EVAL_NAME --file /tmp/eval_EVAL_NAME.json
```
Use unique temp file names when publishing multiple evaluators in parallel (e.g.
```
/tmp/eval_toxicity.json
```
).

`get-evaluator` field	Flat JSON key
`llm_provider.integration_provider`	`integration_provider`
`llm_provider.model_name`	`model_name`
`llm_provider.integration_account_id`	`integration_account_id`
`llm_provider.temperature`	`temperature`
All other fields	Unchanged (already top-level)

pup使用扁平JSON文件（所有字段均为顶级）。

get-evaluator

返回嵌套对象。转换方式如下：

往返检查：首先调用
```
pup llm-obs evals get-evaluator EVAL_NAME
```
。如果存在，从其配置开始。

扁平化
llm_provider
：将

integration_provider

、

model_name

、

integration_account_id

、

temperature

提升到顶级，删除

llm_provider

键。

合并并设置
enabled: false
。
写入临时文件并调用:
bash
```
pup llm-obs evals create-or-update EVAL_NAME --file /tmp/eval_EVAL_NAME.json
```
并行发布多个评估器时使用唯一的临时文件名（例如
```
/tmp/eval_toxicity.json
```
）。

`get-evaluator` 字段	扁平JSON键
`llm_provider.integration_provider`	`integration_provider`
`llm_provider.model_name`	`model_name`
`llm_provider.integration_account_id`	`integration_account_id`
`llm_provider.temperature`	`temperature`
所有其他字段	不变（已为顶级）

Notebooks

Notebook

MCP Tool pup Command

MCP Tool	pup Command
`create_datadog_notebook(name, cells, ...)`	`pup notebooks create --title "TITLE" --file /tmp/nb_cells.json` — confirm exact flags with `pup notebooks create --help`
`edit_datadog_notebook(id, cells, append_only=true)`	`pup notebooks edit NOTEBOOK_ID --file /tmp/nb_cells.json` (fetches current notebook, appends provided cells, writes back)

create_datadog_notebook(name, cells, ...)

pup notebooks create --title "TITLE" --file /tmp/nb_cells.json

— confirm exact flags with

pup notebooks create --help

edit_datadog_notebook(id, cells, append_only=true)

pup notebooks edit NOTEBOOK_ID --file /tmp/nb_cells.json

(fetches current notebook, appends provided cells, writes back)

The cells file is a JSON array of cell objects:

json

[{"attributes": {"definition": {"type": "markdown", "text": "## Section\n\nContent."}}, "type": "notebook_cells"}]

MCP result parsing safety: Before writing any script (Python, jq, etc.) that iterates over or accesses fields in an MCP tool result, inspect the raw structure first — check
```
type(result)
```
, top-level keys, and whether the payload is nested inside a content block (e.g.
```
[{'type': 'text', 'text': '<json>'}]
```
). Extract and
```
json.loads()
```
the inner payload if needed before parsing. Never assume MCP results are bare dicts or lists.

MCP工具 pup命令

MCP工具	pup命令
`create_datadog_notebook(name, cells, ...)`	`pup notebooks create --title "TITLE" --file /tmp/nb_cells.json` — 使用 `pup notebooks create --help` 确认确切标志
`edit_datadog_notebook(id, cells, append_only=true)`	`pup notebooks edit NOTEBOOK_ID --file /tmp/nb_cells.json` （获取当前Notebook，附加提供的单元格，写回）

create_datadog_notebook(name, cells, ...)

pup notebooks create --title "TITLE" --file /tmp/nb_cells.json

— 使用

pup notebooks create --help

确认确切标志

edit_datadog_notebook(id, cells, append_only=true)

pup notebooks edit NOTEBOOK_ID --file /tmp/nb_cells.json

（获取当前Notebook，附加提供的单元格，写回）

单元格文件是单元格对象的JSON数组：

json

[{"attributes": {"definition": {"type": "markdown", "text": "## Section\n\nContent."}}, "type": "notebook_cells"}]

MCP结果解析安全：在编写任何迭代或访问MCP工具结果中字段的脚本（Python、jq等）之前，先检查原始结构——检查
```
type(result)
```
、顶级键，以及负载是否嵌套在内容块中（例如
```
[{'type': 'text', 'text': '<json>'}]
```
）。如果需要，提取并
```
json.loads()
```
内部负载后再解析。切勿假设MCP结果是裸字典或列表。

llm-obs-eval-bootstrap

Original

Translation

Backend

后端

Eval Bootstrap — Generate Evaluators from Production Traces

评估器引导——从生产Trace生成评估器

Usage

使用方法

Inputs

输入项

Available Tools

可用工具

Key get_llmobs_span_content Patterns

get_llmobs_span_content关键使用模式

How to Use search_llmobs_spans

search_llmobs_spans使用方法

Parallelization Rules

并行化规则

Evaluator SDK Reference

评估器SDK参考

Imports

导入

Core classes

核心类

LLM-as-judge

LLM作为Judge

Built-in evaluators (use only if needed)

内置评估器（仅在需要时使用）

EvaluatorContext (what evaluate() receives)

EvaluatorContext（evaluate()接收的参数）

EvaluatorResult (what evaluate() returns)

EvaluatorResult（evaluate()返回的结果）

LLMJudge — LLM-as-Judge Evaluator

LLMJudge——LLM作为Judge的评估器

Structured Output Types

结构化输出类型

Pass a raw dict as structured_output — used as the JSON schema directly

传递原始字典作为structured_output——直接用作JSON schema

LLMJudge Prompt Guidelines

LLMJudge提示准则

BaseEvaluator — Custom Code-Based Evaluator

BaseEvaluator——基于自定义代码的评估器

Built-in Evaluators

内置评估器

Validate JSON syntax + optional required keys

验证JSON语法 + 可选必填键

Validate length (characters, words, or lines)

验证长度（字符、单词或行数）

count_by: "characters" | "words" | "lines"

count_by: "characters" | "words" | "lines"

String matching

字符串匹配

operation: "eq" | "ne" | "contains" | "icontains"

operation: "eq" | "ne" | "contains" | "icontains"

Regex matching

正则匹配

match_mode: "search" | "match" | "fullmatch"

match_mode: "search" | "match" | "fullmatch"

Evaluator Type Decision Matrix

评估器类型决策矩阵

Source Verification

源码验证

Workflow

工作流

Phase 0: Resolve Inputs & Entry Mode

阶段0：解析输入与确定进入模式

Phase 1: Explore Traces & Identify Eval Targets

阶段1：探索Trace并确定评估目标

Cold Start Path

冷启动路径

From RCA Path

来自RCA的路径

Phase 2: Propose Evaluator Suite

阶段2：提出评估器套件

How many evaluators to propose

要提出多少个评估器

Deduplication Against Existing Coverage

与现有覆盖范围去重

Span vs. Trace Scope Classification (publish mode)

Key
`get_llmobs_span_content`
Patterns

`get_llmobs_span_content`
关键使用模式

How to Use
`search_llmobs_spans`

`search_llmobs_spans`
使用方法

EvaluatorContext (what
`evaluate()`
receives)

EvaluatorContext（
`evaluate()`
接收的参数）

EvaluatorResult (what
`evaluate()`
returns)

EvaluatorResult（
`evaluate()`
返回的结果）

Span vs. Trace Scope Classification (
`publish`
mode)

Span-scope (
`eval_scope: span`
)

Span范围（
`eval_scope: span`
）

Trace-scope (
`eval_scope: trace`
)

Trace范围（
`eval_scope: trace`
）