llm-obs-eval-bootstrap

Compare original and translation side by side

🇺🇸

Original

English
🇨🇳

Translation

Chinese

Backend

后端

Detection — At the start of every invocation, before taking any action, determine which backend to use:
  1. If the user passed
    --backend pup
    anywhere in their invocation → use pup mode immediately, regardless of whether MCP tools are present. Skip steps 2–4.
  2. Check whether MCP tools are present in your active tool list. The canonical signal is whether
    mcp__datadog-llmo-mcp__list_llmobs_evals
    appears in your available tools.
  3. If MCP tools are present → use MCP mode throughout. Call MCP tools exactly as named in this skill's workflow sections.
  4. If MCP tools are absent → check whether
    pup
    is executable: run
    pup --version
    via Bash. A JSON response containing
    "version"
    confirms pup is available.
  5. If pup responds → use pup mode throughout. Translate every MCP tool call to its pup equivalent using the Tool Reference appendix at the bottom of this file.
  6. If neither is available → stop and tell the user:
    "Neither the Datadog MCP server nor the pup CLI is available. Connect the MCP server (
    claude mcp add --scope user --transport http datadog-llmo-mcp 'https://mcp.datadoghq.com/api/unstable/mcp-server/mcp?toolsets=llmobs'
    ) or install pup."
--backend pup
is accepted anywhere in the invocation arguments and is stripped before passing remaining args to the skill logic.
pup invocation rules:
  • Invoke via Bash:
    pup llm-obs <subcommand> [flags]
  • pup always outputs JSON. Parse directly — no content-block unwrapping (unlike MCP results, which may wrap JSON in
    [{"type": "text", "text": "<json>"}]
    ).
  • If pup returns an auth error, tell the user to run
    pup auth login
    and stop.
  • Parallelization: issue multiple Bash tool calls in a single message (one pup command per call).
  • Time flags: pup accepts bare duration strings (
    1h
    ,
    7d
    ,
    30m
    ) and RFC3339 timestamps. Do not use
    now-
    -prefixed strings — strip the prefix when converting from a skill
    --timeframe
    argument:
    now-7d
    7d
    ,
    now-24h
    24h
    ,
    now-30d
    30d
    .
  • --summary
    on
    pup llm-obs spans search
    strips payload fields to essential metadata only. Use it in bulk/search phases where content is not needed.
Invocation ID: At the very start of each invocation, before any MCP tool call, generate an 8-character hex invocation ID (e.g.,
3a9f1c2b
). Keep it constant for the entire invocation.
Intent tagging: On every MCP tool call, prefix
telemetry.intent
with
skill:llm-obs-eval-bootstrap[<inv_id>] — 
followed by a description of why the tool is being called. On the first MCP tool call only, use
skill:llm-obs-eval-bootstrap:start[<inv_id>] — 
instead (note the
:start
suffix). Example first call:
skill:llm-obs-eval-bootstrap:start[3a9f1c2b] — Phase 0: map existing eval coverage for task-cruncher
检测逻辑 — 在每次调用开始、执行任何操作前,确定要使用的后端:
  1. 如果用户在调用中任何位置传入
    --backend pup
    → 立即使用pup模式,无论是否存在MCP工具。跳过步骤2-4。
  2. 检查活跃工具列表中是否存在MCP工具。标准判断信号是可用工具中是否包含
    mcp__datadog-llmo-mcp__list_llmobs_evals
  3. 如果存在MCP工具 → 全程使用MCP模式。严格按照本技能工作流章节中指定的名称调用MCP工具。
  4. 如果不存在MCP工具 → 检查
    pup
    是否可执行:通过Bash运行
    pup --version
    。返回包含
    "version"
    的JSON响应即确认pup可用。
  5. 如果pup响应正常 → 全程使用pup模式。使用本文件底部的工具参考附录,将每个MCP工具调用转换为对应的pup等效命令。
  6. 如果两者都不可用 → 停止操作并告知用户:
    "Datadog MCP服务器和pup CLI均不可用。请连接MCP服务器(
    claude mcp add --scope user --transport http datadog-llmo-mcp 'https://mcp.datadoghq.com/api/unstable/mcp-server/mcp?toolsets=llmobs'
    )或安装pup。"
--backend pup
可在调用参数的任何位置使用,在将剩余参数传递给技能逻辑前会被剥离。
pup调用规则
  • 通过Bash调用:
    pup llm-obs <subcommand> [flags]
  • pup始终输出JSON。直接解析即可——无需解包内容块(与MCP结果不同,MCP结果可能将JSON包裹在
    [{"type": "text", "text": "<json>"}]
    中)。
  • 如果pup返回认证错误,告知用户运行
    pup auth login
    并停止操作。
  • 并行化:在单个消息中发起多个Bash工具调用(每个pup命令对应一次调用)。
  • 时间参数:pup接受纯时长字符串(
    1h
    7d
    30m
    )和RFC3339时间戳。不要使用
    now-
    前缀的字符串——转换技能的
    --timeframe
    参数时需移除前缀:
    now-7d
    7d
    now-24h
    24h
    now-30d
    30d
  • pup llm-obs spans search
    中使用
    --summary
    会将负载字段精简为核心元数据。在批量/搜索阶段不需要内容时使用该参数。
调用ID:在每次调用的最开始、发起任何MCP工具调用前,生成一个8字符的十六进制调用ID(例如
3a9f1c2b
)。整个调用过程中保持该ID不变。
意图标记:在每个MCP工具调用中,将
telemetry.intent
前缀设置为
skill:llm-obs-eval-bootstrap[<inv_id>] — 
,后跟调用该工具的原因描述。仅在第一次MCP工具调用时,使用
skill:llm-obs-eval-bootstrap:start[<inv_id>] — 
(注意
:start
后缀)。示例首次调用:
skill:llm-obs-eval-bootstrap:start[3a9f1c2b] — Phase 0: map existing eval coverage for task-cruncher

Eval Bootstrap — Generate Evaluators from Production Traces

评估器引导——从生产Trace生成评估器

Given a sample of production LLM traces, analyze input/output patterns and quality dimensions, then emit a ready-to-use evaluator suite. Three output modes:
  • sdk_code
    (default) — Python
    .py
    file using the Datadog Evals SDK (
    BaseEvaluator
    /
    LLMJudge
    ) for offline experiments.
  • data_only
    — self-contained JSON spec, framework-agnostic.
  • publish
    — write online LLM-judge evaluators directly to Datadog via
    create_or_update_llmobs_evaluator
    . These run automatically on matching production spans or traces (no dataset, no task function). The skill auto-classifies each proposed evaluator as span-scoped or trace-scoped based on what the judgment requires (a per-LLM-call tone check vs. an agent goal completion that needs the whole trace) — the user accepts or overrides the classification at the proposal checkpoint.
基于生产LLM Trace样本,分析输入/输出模式和质量维度,然后生成可直接使用的评估器套件。支持三种输出模式:
  • sdk_code
    (默认)——使用Datadog Evals SDK(
    BaseEvaluator
    /
    LLMJudge
    )生成Python
    .py
    文件,用于离线实验。
  • data_only
    ——生成独立的JSON规范,与框架无关。
  • publish
    ——通过
    create_or_update_llmobs_evaluator
    直接将在线LLM-judge评估器写入Datadog。这些评估器会自动在匹配的生产Span或Trace上运行(无需数据集、无需任务函数)。技能会根据判断需求自动将每个拟议评估器分类为Span范围Trace范围(例如,每个LLM调用的语气检查需要Span范围,而代理目标完成需要整个Trace则需要Trace范围)——用户会在提案检查点接受或覆盖该分类。

Usage

使用方法

/eval-bootstrap <ml_app> [--timeframe <window>] [--data-only | --publish]
Arguments: $ARGUMENTS
/eval-bootstrap <ml_app> [--timeframe <window>] [--data-only | --publish]
参数:$ARGUMENTS

Inputs

输入项

InputRequiredDefaultDescription
ml_app
YesML application to scope traces
timeframe
No
now-7d
How far back to look
rca_report
NoFailure taxonomy from
eval-trace-rca
skill, or a free-text failure hypothesis
--data-only
NooffEmit a self-contained JSON spec file instead of Python SDK code
--publish
NooffPublish online LLM-judge evaluators to Datadog (mutually exclusive with
--data-only
)
If
ml_app
is missing, ask the user before proceeding. If both
--data-only
and
--publish
are supplied, error out and ask which mode the user wants.
输入项是否必填默认值描述
ml_app
用于限定Trace范围的ML应用
timeframe
now-7d
回溯时间范围
rca_report
来自
eval-trace-rca
技能的故障分类,或自由文本形式的失败假设
--data-only
关闭生成独立的JSON规范文件,而非Python SDK代码
--publish
关闭将在线LLM-judge评估器发布到Datadog(与
--data-only
互斥)
如果缺少
ml_app
,在继续前询问用户。如果同时提供
--data-only
--publish
,抛出错误并询问用户想要使用哪种模式。

Available Tools

可用工具

ToolPurpose
search_llmobs_spans
Find spans by eval presence, tags, span kind, query syntax. Paginate with cursor.
get_llmobs_span_details
Metadata, evaluations (scores, labels, reasoning), and
content_info
map showing available fields + sizes.
get_llmobs_span_content
Actual content for a span field. Supports JSONPath via
path
param for targeted extraction.
get_llmobs_trace
Full trace hierarchy as span tree with span counts by kind.
get_llmobs_agent_loop
Chronological agent execution timeline (LLM calls, tool invocations, decisions).
list_llmobs_evals
List every evaluator configured for the caller's org across all ml_apps, with
enabled
status and
ml_app
per result. Call once in Phase 0 to map existing coverage before proposing new evaluators — filter the result by
ml_app
client-side.
get_llmobs_evaluator
Fetch the full persisted evaluator config by name (target ml_app + sampling + filter, provider, prompt template, parsing type, output schema, assessment criteria). Use in Phase 0 to understand what each existing custom eval measures, and (in publish mode) before any update
create_or_update_llmobs_evaluator
is full-replace, so you must round-trip the full config to avoid clobbering fields. Not all evaluators have a stored config (notably
source=ootb
); a not-found error there is expected — skip those.
create_or_update_llmobs_evaluator
(publish mode) Write an LLM-judge evaluator config to Datadog. Full-replace semantics: any omitted optional field resets to its default. See "Publishing Conventions" for required fields and structured output → JSON schema mapping.
delete_llmobs_evaluator
(publish mode) Only used if the user explicitly asks to remove an evaluator. Never invoke speculatively.
工具用途
search_llmobs_spans
根据评估存在性、标签、Span类型、查询语法查找Span。使用游标分页。
get_llmobs_span_details
获取元数据、评估结果(分数、标签、推理过程),以及显示可用字段和大小的
content_info
映射。
get_llmobs_span_content
获取Span字段的实际内容。支持通过
path
参数使用JSONPath进行定向提取。
get_llmobs_trace
获取完整的Trace层级结构,即包含各类型Span计数的Span树。
get_llmobs_agent_loop
获取按时间顺序排列的代理执行时间线(LLM调用、工具调用、决策)。
list_llmobs_evals
列出调用者组织下所有ml_app中配置的所有评估器,包含
enabled
状态和每个结果对应的
ml_app
。在阶段0调用一次,以在提出新评估器前映射现有覆盖范围——在客户端按
ml_app
过滤结果。
get_llmobs_evaluator
通过名称获取完整的持久化评估器配置(目标ml_app + 采样 + 过滤、提供商、提示模板、解析类型、输出 schema、评估标准)。在阶段0使用,以了解每个现有自定义评估器的测量内容;在发布模式下,任何更新前都要使用该工具——
create_or_update_llmobs_evaluator
是全替换模式,因此必须往返完整配置以避免覆盖字段。并非所有评估器都有存储的配置(尤其是
source=ootb
的评估器);出现未找到错误是预期情况——跳过这些评估器。
create_or_update_llmobs_evaluator
(发布模式)将LLM-judge评估器配置写入Datadog。全替换语义:任何省略的可选字段都会重置为默认值。有关必填字段和结构化输出→JSON schema映射,请参阅“发布约定”。
delete_llmobs_evaluator
(发布模式)仅在用户明确要求移除评估器时使用。切勿推测性调用。

Key
get_llmobs_span_content
Patterns

get_llmobs_span_content
关键使用模式

Use the
path
parameter to extract targeted data without fetching full payloads:
FieldPathWhat you get
messages
$.messages[0]
System prompt (first message, usually
system
role)
messages
$.messages[-1]
Last assistant response
messages
(no path)Full conversation including tool calls
input
/
output
Span I/O
documents
Retrieved documents (RAG apps)
metadata
Custom metadata (prompt versions, feature flags, user segments)
使用
path
参数提取目标数据,无需获取完整负载:
字段路径获取内容
messages
$.messages[0]
系统提示(第一条消息,通常为
system
角色)
messages
$.messages[-1]
最后一条助手响应
messages
(无路径)包含工具调用的完整对话
input
/
output
Span输入/输出
documents
检索到的文档(RAG应用)
metadata
自定义元数据(提示版本、功能标志、用户细分)

How to Use
search_llmobs_spans

search_llmobs_spans
使用方法

Additional filters combine with space (AND):
@status:error @ml_app:my-app
. Dedicated params (
span_kind
,
root_spans_only
,
ml_app
) work alongside
query
, but
query
takes precedence over
tags
.
To find spans with a specific eval:
@evaluations.custom.<eval_name>:*
— you can only query for eval presence, not specific results.
附加过滤器使用空格组合(AND逻辑):
@status:error @ml_app:my-app
。专用参数(
span_kind
root_spans_only
ml_app
)可与
query
配合使用,但
query
优先级高于
tags
查找包含特定评估器的Span:
@evaluations.custom.<eval_name>:*
——只能查询评估器的存在性,无法查询特定结果。

Parallelization Rules

并行化规则

  1. get_llmobs_span_details
    : Group span_ids by trace_id. One call per trace_id with ALL its span_ids. Issue ALL calls for a page in a single message.
  2. get_llmobs_span_content
    : Each call is independent — always issue ALL in a single message.
  3. get_llmobs_trace
    /
    get_llmobs_agent_loop
    : Parallelize across different traces in a single message.
  4. Pipeline parallelism: Start
    get_llmobs_span_details
    for page 1 results immediately — don't wait to collect all pages.

  1. get_llmobs_span_details
    :按trace_id对span_ids进行分组。每个trace_id对应一次调用,包含其所有span_ids。在单个消息中发起某一页的所有调用。
  2. get_llmobs_span_content
    :每次调用相互独立——始终在单个消息中发起所有调用。
  3. get_llmobs_trace
    /
    get_llmobs_agent_loop
    :在单个消息中对不同Trace进行并行调用。
  4. 流水线并行化:立即为第1页结果发起
    get_llmobs_span_details
    调用——无需等待收集所有页面。

Evaluator SDK Reference

评估器SDK参考

Applies to
sdk_code
mode only.
In
data_only
mode, use this section as domain context when writing rubric prompts — no SDK classes are emitted.
仅适用于
sdk_code
模式
。在
data_only
模式下,将本节作为领域上下文用于编写评估准则提示——不会生成SDK类。

Imports

导入

python
undefined
python
undefined

Core classes

核心类

from ddtrace.llmobs._experiment import BaseEvaluator, EvaluatorContext, EvaluatorResult
from ddtrace.llmobs._experiment import BaseEvaluator, EvaluatorContext, EvaluatorResult

LLM-as-judge

LLM作为Judge

from ddtrace.llmobs._evaluators.llm_judge import ( LLMJudge, BooleanStructuredOutput, ScoreStructuredOutput, CategoricalStructuredOutput, )
from ddtrace.llmobs._evaluators.llm_judge import ( LLMJudge, BooleanStructuredOutput, ScoreStructuredOutput, CategoricalStructuredOutput, )

Built-in evaluators (use only if needed)

内置评估器(仅在需要时使用)

from ddtrace.llmobs._evaluators.format import JSONEvaluator, LengthEvaluator from ddtrace.llmobs._evaluators.string_matching import StringCheckEvaluator, RegexMatchEvaluator

Only import what the generated file actually uses.
from ddtrace.llmobs._evaluators.format import JSONEvaluator, LengthEvaluator from ddtrace.llmobs._evaluators.string_matching import StringCheckEvaluator, RegexMatchEvaluator

仅导入生成文件实际使用的类。

EvaluatorContext (what
evaluate()
receives)

EvaluatorContext(
evaluate()
接收的参数)

python
@dataclass(frozen=True)
class EvaluatorContext:
    input_data: dict[str, Any]          # Task inputs (from dataset record, NOT from span)
    output_data: Any                     # Task output (from task function return, NOT from span)
    expected_output: Optional[JSONType] = None  # Ground truth (if available)
    metadata: dict[str, Any] = {}        # Additional metadata
    span_id: Optional[str] = None        # LLMObs span ID
    trace_id: Optional[str] = None       # LLMObs trace ID
Important — span data vs evaluator data: When exploring production traces, you see span I/O (e.g.,
input.value
,
output.messages
). But evaluators run in offline experiments where
input_data
and
output_data
come from the user's dataset records and task function, not from spans. The dataset schema is user-defined and may not match span structure. Write evaluator prompts with generic
{{input_data}}
/
{{output_data}}
placeholders and add comments describing what data the evaluator was designed for, so the user can adapt to their dataset shape.
python
@dataclass(frozen=True)
class EvaluatorContext:
    input_data: dict[str, Any]          # 任务输入(来自数据集记录,而非Span)
    output_data: Any                     # 任务输出(来自任务函数返回值,而非Span)
    expected_output: Optional[JSONType] = None  # 基准真值(如果可用)
    metadata: dict[str, Any] = {}        # 附加元数据
    span_id: Optional[str] = None        # LLMObs Span ID
    trace_id: Optional[str] = None       # LLMObs Trace ID
重要——Span数据与评估器数据的区别:在探索生产Trace时,看到的是Span输入/输出(例如
input.value
output.messages
)。但评估器在离线实验中运行,
input_data
output_data
来自用户的数据集记录和任务函数,而非Span。数据集schema由用户定义,可能与Span结构不匹配。在评估器提示中使用通用的
{{input_data}}
/
{{output_data}}
占位符,并添加注释说明评估器设计用于何种数据,以便用户适配其数据集结构。

EvaluatorResult (what
evaluate()
returns)

EvaluatorResult(
evaluate()
返回的结果)

python
EvaluatorResult(
    value=...,                    # Required. JSONType (str, int, float, bool, None, list, dict)
    reasoning="...",              # Optional. Explanation string
    assessment="pass" or "fail",  # Optional. Pass/fail assessment
    metadata={...},              # Optional. Evaluation metadata dict
    tags={...},                  # Optional. Tags dict
)
python
EvaluatorResult(
    value=...,                    # 必填。JSONType(字符串、整数、浮点数、布尔值、None、列表、字典)
    reasoning="...",              # 可选。解释字符串
    assessment="pass" or "fail",  # 可选。通过/失败评估结果
    metadata={...},              # 可选。评估元数据字典
    tags={...},                  # 可选。标签字典
)

LLMJudge — LLM-as-Judge Evaluator

LLMJudge——LLM作为Judge的评估器

python
judge = LLMJudge(
    user_prompt="...",              # Required. Supports {{template_vars}}
    system_prompt="...",            # Optional. Does NOT support template vars
    structured_output=...,          # Optional. Boolean/Score/Categorical output, or a dict for custom JSON schema
    provider="openai",              # "openai" | "anthropic" | "azure_openai" | "vertexai" | "bedrock"
    model="gpt-4o",                # Model identifier
    model_params={"temperature": 0.0},  # Optional. Passed to LLM API
    name="eval_name",              # Optional. Must match ^[a-zA-Z0-9_-]+$
)
Template variables in
user_prompt
:
{{input_data}}
,
{{output_data}}
,
{{expected_output}}
,
{{metadata.key}}
— resolved from
EvaluatorContext
fields via dot-path into nested dicts.
python
judge = LLMJudge(
    user_prompt="...",              # 必填。支持{{template_vars}}
    system_prompt="...",            # 可选。不支持模板变量
    structured_output=...,          # 可选。布尔值/分数/分类输出,或自定义JSON schema的字典
    provider="openai",              # "openai" | "anthropic" | "azure_openai" | "vertexai" | "bedrock"
    model="gpt-4o",                # 模型标识符
    model_params={"temperature": 0.0},  # 可选。传递给LLM API的参数
    name="eval_name",              # 可选。必须匹配^[a-zA-Z0-9_-]+$
)
user_prompt
中的模板变量
{{input_data}}
{{output_data}}
{{expected_output}}
{{metadata.key}}
——通过点路径从
EvaluatorContext
字段解析嵌套字典。

Structured Output Types

结构化输出类型

Boolean — true/false with optional pass/fail:
python
BooleanStructuredOutput(
    description="Whether the response is factually accurate",
    reasoning=True,                    # Include reasoning field in LLM response
    reasoning_description=None,        # Optional custom description for reasoning field
    pass_when=True,                    # True → pass when true, False → pass when false, None → no assessment
)
Score — numeric within a range with optional thresholds:
python
ScoreStructuredOutput(
    description="Helpfulness score",
    min_score=1,                       # Minimum possible score
    max_score=10,                      # Maximum possible score
    reasoning=True,
    reasoning_description=None,
    min_threshold=7,                   # Scores >= 7 pass (optional)
    max_threshold=None,                # Scores <= N pass (optional)
)
Categorical — select from predefined categories:
python
CategoricalStructuredOutput(
    categories={
        "correct": "The response correctly answers the question",
        "partially_correct": "The response is partially correct but missing key information",
        "incorrect": "The response is factually wrong or irrelevant",
    },
    reasoning=True,
    reasoning_description=None,
    pass_values=["correct"],           # Which categories count as passing (optional)
)
Custom JSON schema — arbitrary structured responses for multi-dimensional evals:
python
undefined
布尔值——真/假,可选通过/失败标记:
python
BooleanStructuredOutput(
    description="Whether the response is factually accurate",
    reasoning=True,                    # 在LLM响应中包含推理字段
    reasoning_description=None,        # 推理字段的可选自定义描述
    pass_when=True,                    # True→为真时通过,False→为假时通过,None→无评估结果
)
分数——指定范围内的数值,可选阈值:
python
ScoreStructuredOutput(
    description="Helpfulness score",
    min_score=1,                       # 最小可能分数
    max_score=10,                      # 最大可能分数
    reasoning=True,
    reasoning_description=None,
    min_threshold=7,                   # 分数≥7时通过(可选)
    max_threshold=None,                # 分数≤N时通过(可选)
)
分类——从预定义类别中选择:
python
CategoricalStructuredOutput(
    categories={
        "correct": "The response correctly answers the question",
        "partially_correct": "The response is partially correct but missing key information",
        "incorrect": "The response is factually wrong or irrelevant",
    },
    reasoning=True,
    reasoning_description=None,
    pass_values=["correct"],           # 哪些类别算作通过(可选)
)
自定义JSON schema——用于多维评估的任意结构化响应:
python
undefined

Pass a raw dict as structured_output — used as the JSON schema directly

传递原始字典作为structured_output——直接用作JSON schema

structured_output={ "type": "object", "properties": { "relevance": {"type": "boolean", "description": "Whether the response addresses the question"}, "confidence": {"type": "number", "description": "Confidence score (0.0 to 1.0)"}, "reasoning": {"type": "string", "description": "Explanation for the evaluation"}, }, "required": ["relevance", "confidence", "reasoning"], "additionalProperties": False, }

Always write standard JSON schema — the SDK adapts it per provider automatically (e.g., Anthropic doesn't support `minimum`/`maximum` on number fields, so the SDK moves range constraints into the `description`; Vertex AI converts `const`/`anyOf` to `enum`). The full parsed JSON dict becomes the eval `value`; a `"reasoning"` key (if present) is automatically extracted. No automatic pass/fail assessment.
structured_output={ "type": "object", "properties": { "relevance": {"type": "boolean", "description": "Whether the response addresses the question"}, "confidence": {"type": "number", "description": "Confidence score (0.0 to 1.0)"}, "reasoning": {"type": "string", "description": "Explanation for the evaluation"}, }, "required": ["relevance", "confidence", "reasoning"], "additionalProperties": False, }

始终编写标准JSON schema——SDK会自动根据提供商进行适配(例如,Anthropic不支持数字字段的`minimum`/`maximum`,因此SDK会将范围约束移至`description`中;Vertex AI将`const`/`anyOf`转换为`enum`)。完整解析后的JSON字典成为评估的`value`;如果存在`"reasoning"`键,会自动提取。不会自动生成通过/失败评估结果。

LLMJudge Prompt Guidelines

LLMJudge提示准则

The
structured_output
parameter enforces the response format via JSON schema. Do not prescribe the format in the prompt (no "Answer YES/NO", "Rate 1-10", etc.). Instead, describe the evaluation criteria and let the structured output handle the format.
  • system_prompt: Set the judge's role and the app's domain context. Does NOT support template vars.
  • user_prompt: Present the data via
    {{input_data}}
    /
    {{output_data}}
    , then describe what good vs. bad looks like for this dimension.
structured_output
参数通过JSON schema强制响应格式。不要在提示中指定格式(例如“回答YES/NO”、“评分1-10”等)。相反,描述评估标准,让结构化输出处理格式问题。
  • system_prompt:设置Judge的角色和应用的领域上下文。不支持模板变量。
  • user_prompt:通过
    {{input_data}}
    /
    {{output_data}}
    呈现数据,然后描述该维度下好与坏的表现。

BaseEvaluator — Custom Code-Based Evaluator

BaseEvaluator——基于自定义代码的评估器

For deterministic checks that do not need LLM judgment:
python
class MyEvaluator(BaseEvaluator):
    def __init__(self, name=None, ...custom_params...):
        super().__init__(name=name)
        self._param = ...  # Store config as private attrs

    def evaluate(self, context: EvaluatorContext) -> EvaluatorResult:
        # Access: context.input_data, context.output_data, context.expected_output, context.metadata
        # Must NOT modify self attributes (thread safety)
        passed = ...  # Your logic here
        return EvaluatorResult(
            value=passed,
            reasoning="...",
            assessment="pass" if passed else "fail",
        )
用于无需LLM判断的确定性检查:
python
class MyEvaluator(BaseEvaluator):
    def __init__(self, name=None, ...custom_params...):
        super().__init__(name=name)
        self._param = ...  # 将配置存储为私有属性

    def evaluate(self, context: EvaluatorContext) -> EvaluatorResult:
        # 访问:context.input_data, context.output_data, context.expected_output, context.metadata
        # 不得修改self属性(线程安全)
        passed = ...  # 此处编写你的逻辑
        return EvaluatorResult(
            value=passed,
            reasoning="...",
            assessment="pass" if passed else "fail",
        )

Built-in Evaluators

内置评估器

python
undefined
python
undefined

Validate JSON syntax + optional required keys

验证JSON语法 + 可选必填键

JSONEvaluator(required_keys=["name", "age"], output_extractor=None, name=None)
JSONEvaluator(required_keys=["name", "age"], output_extractor=None, name=None)

Validate length (characters, words, or lines)

验证长度(字符、单词或行数)

LengthEvaluator(count_by="words", min_length=10, max_length=500, output_extractor=None, name=None)
LengthEvaluator(count_by="words", min_length=10, max_length=500, output_extractor=None, name=None)

count_by: "characters" | "words" | "lines"

count_by: "characters" | "words" | "lines"

String matching

字符串匹配

StringCheckEvaluator(operation="contains", expected="success", case_sensitive=False, name=None)
StringCheckEvaluator(operation="contains", expected="success", case_sensitive=False, name=None)

operation: "eq" | "ne" | "contains" | "icontains"

operation: "eq" | "ne" | "contains" | "icontains"

Regex matching

正则匹配

RegexMatchEvaluator(pattern=r"\d{4}-\d{2}-\d{2}", match_mode="search", name=None)
RegexMatchEvaluator(pattern=r"\d{4}-\d{2}-\d{2}", match_mode="search", name=None)

match_mode: "search" | "match" | "fullmatch"

match_mode: "search" | "match" | "fullmatch"

undefined
undefined

Evaluator Type Decision Matrix

评估器类型决策矩阵

SignalEvaluator Type
Output must be valid JSON
JSONEvaluator
Output must match a regex pattern
RegexMatchEvaluator
Output has length constraints
LengthEvaluator
Output must contain/not contain specific strings
StringCheckEvaluator
Semantic quality judgment (tone, accuracy, completeness)
LLMJudge
+
BooleanStructuredOutput
Graded quality on a scale
LLMJudge
+
ScoreStructuredOutput
Classification into categories
LLMJudge
+
CategoricalStructuredOutput
Multi-dimensional judgment (evaluate several aspects at once)
LLMJudge
+ custom JSON schema
dict
Complex domain logic combining multiple checks
BaseEvaluator
subclass
信号评估器类型
输出必须是有效的JSON
JSONEvaluator
输出必须匹配正则模式
RegexMatchEvaluator
输出有长度限制
LengthEvaluator
输出必须包含/不包含特定字符串
StringCheckEvaluator
语义质量判断(语气、准确性、完整性)
LLMJudge
+
BooleanStructuredOutput
按比例评分的质量
LLMJudge
+
ScoreStructuredOutput
分类到类别中
LLMJudge
+
CategoricalStructuredOutput
多维判断(同时评估多个方面)
LLMJudge
+ 自定义JSON schema
dict
结合多个检查的复杂领域逻辑
BaseEvaluator
子类

Source Verification

源码验证

If you have access to dd-trace-py locally, verify the API surface by reading the corresponding modules:
  • ddtrace.llmobs._evaluators.llm_judge
    LLMJudge
    ,
    BooleanStructuredOutput
    ,
    ScoreStructuredOutput
    ,
    CategoricalStructuredOutput
  • ddtrace.llmobs._experiment
    BaseEvaluator
    ,
    EvaluatorContext
    ,
    EvaluatorResult
  • ddtrace.llmobs._evaluators.format
    JSONEvaluator
    ,
    LengthEvaluator
  • ddtrace.llmobs._evaluators.string_matching
    StringCheckEvaluator
    ,
    RegexMatchEvaluator

如果本地可以访问dd-trace-py,通过阅读相应模块验证API接口:
  • ddtrace.llmobs._evaluators.llm_judge
    LLMJudge
    BooleanStructuredOutput
    ScoreStructuredOutput
    CategoricalStructuredOutput
  • ddtrace.llmobs._experiment
    BaseEvaluator
    EvaluatorContext
    EvaluatorResult
  • ddtrace.llmobs._evaluators.format
    JSONEvaluator
    LengthEvaluator
  • ddtrace.llmobs._evaluators.string_matching
    StringCheckEvaluator
    RegexMatchEvaluator

Workflow

工作流

Phase 0: Resolve Inputs & Entry Mode

阶段0:解析输入与确定进入模式

Entry mode detection:
ModeSignalBehavior
Cold StartOnly
ml_app
provided (no RCA, no hypothesis)
Full open discovery — understand what the app does, identify quality dimensions worth measuring, propose evals for coverage
From RCAConversation contains an RCA report or user provides a failure hypothesisSkip open discovery — use existing failure taxonomy as eval targets
Parse arguments: Extract
ml_app
(first non-flag argument),
--timeframe
(default
now-7d
),
--data-only
, and
--publish
flags. Set
output_mode = publish
if
--publish
is set,
output_mode = data_only
if
--data-only
is set, otherwise
output_mode = sdk_code
. Error if both
--data-only
and
--publish
are present.
Resolution steps:
  1. If
    ml_app
    not provided → ask the user.
  2. Auto-detect entry mode:
    • If the conversation contains an RCA report (look for "Failure Taxonomy" heading, structured failure modes, or severity ratings) →
      from_rca
      . Extract the taxonomy.
    • If the user provides a free-text failure hypothesis (e.g., "the system prompt lacks grounding") →
      from_rca
      . Use the hypothesis as the starting eval target.
    • Otherwise →
      cold_start
      .
  3. If
    timeframe
    not provided → default to
    now-7d
    .
  4. Map existing eval coverageskip if
    output_mode = data_only
    (there is no Datadog eval project to check coverage against): Call
    list_llmobs_evals
    (org-wide; filter the result client-side to entries where
    ml_app == <ml_app>
    ). Then, for each eval with
    source=custom
    , call
    get_llmobs_evaluator(eval_name=...)
    to inspect its prompt template, target, sampling, and filter, and infer which quality dimension it covers. Issue all evaluator calls in a single message (parallelize). Skip
    source=ootb
    evals — their names are self-describing and they may not have a fetchable config.
    By the end of this step you have a complete coverage map:
    {eval_name → source, enabled, dimension}
    . Carry this into Phase 2 for deduplication.
    In
    publish
    mode, also note any template-variable convention
    the existing custom evaluators already use (so a new suite reads consistently). Online evaluator templates resolve against the full span JSON, not against
    EvaluatorContext
    . See the "Online Template Variables" section under "Publishing Conventions" for the supported syntax (
    {{span_input}}
    ,
    {{span_output}}
    , dot-paths, array selectors, filter accessors).
  5. Notebook context detection: Scan the current conversation for a Datadog notebook URL that was produced by
    /eval-trace-rca
    (pattern:
    https://app.datadoghq.com/notebook/{numeric-id}
    ). If found, store it as
    rca_notebook_url
    and extract the numeric ID as
    rca_notebook_id
    . This is used after Phase 3 to offer appending the evaluator suite to that notebook instead of creating a new one.

进入模式检测
模式信号行为
冷启动仅提供
ml_app
(无RCA、无假设)
全面开放探索——了解应用功能,确定值得测量的质量维度,提出评估器以覆盖这些维度
来自RCA对话包含RCA报告或用户提供失败假设跳过开放探索——使用现有故障分类作为评估目标
解析参数:提取
ml_app
(第一个非标志参数)、
--timeframe
(默认
now-7d
)、
--data-only
--publish
标志。如果设置
--publish
,则
output_mode = publish
;如果设置
--data-only
,则
output_mode = data_only
;否则
output_mode = sdk_code
。如果同时提供
--data-only
--publish
,抛出错误。
解析步骤
  1. 如果未提供
    ml_app
    → 询问用户。
  2. 自动检测进入模式:
    • 如果对话包含RCA报告(查找“Failure Taxonomy”标题、结构化故障模式或严重性评级)→
      from_rca
      。提取分类信息。
    • 如果用户提供自由文本形式的失败假设(例如“系统提示缺乏基础信息”)→
      from_rca
      。将该假设作为初始评估目标。
    • 否则 →
      cold_start
  3. 如果未提供
    timeframe
    → 默认使用
    now-7d
  4. 映射现有评估覆盖范围如果
    output_mode = data_only
    则跳过
    (无Datadog评估项目可检查覆盖范围):调用
    list_llmobs_evals
    (全组织范围;在客户端过滤
    ml_app == <ml_app>
    的条目)。然后,对于每个
    source=custom
    的评估器,调用
    get_llmobs_evaluator(eval_name=...)
    以检查其提示模板、目标、采样和过滤规则,并推断其覆盖的质量维度。在单个消息中发起所有评估器调用(并行化)。跳过
    source=ootb
    的评估器——它们的名称自描述,且可能无法获取配置。
    此步骤结束后,你将获得完整的覆盖范围映射:
    {eval_name → source, enabled, dimension}
    。将其带入阶段2以进行去重。
    在发布模式下,还需注意现有自定义评估器已使用的模板变量约定(以便新套件保持一致)。在线评估器模板针对完整Span JSON解析,而非
    EvaluatorContext
    。有关支持的语法(
    {{span_input}}
    {{span_output}}
    、点路径、数组选择器、过滤器访问器),请参阅“发布约定”下的“在线模板变量”部分。
  5. Notebook上下文检测:扫描当前对话,查找由
    /eval-trace-rca
    生成的Datadog Notebook URL(模式:
    https://app.datadoghq.com/notebook/{numeric-id}
    )。如果找到,将其存储为
    rca_notebook_url
    ,并提取数字ID作为
    rca_notebook_id
    。阶段3结束后,将使用此信息提供将评估器套件附加到该Notebook而非创建新Notebook的选项。

Phase 1: Explore Traces & Identify Eval Targets

阶段1:探索Trace并确定评估目标

Goal: Sample production traces, understand what the app does, and identify quality dimensions worth measuring.
目标:采样生产Trace,了解应用功能,确定值得测量的质量维度。

Cold Start Path

冷启动路径

  1. Sample the app:
    search_llmobs_spans(query="@ml_app:\"<ml_app>\" @status:ok", root_spans_only=true, limit=50, from=<timeframe>)
    . Filter by
    @status:ok
    — error spans have no output to evaluate.
  2. Profile the app and identify evaluation target spans: Call
    get_llmobs_span_details
    for span_ids grouped by trace_id. Inspect
    content_info
    to classify:
    SignalApp Profile
    content_info
    has
    messages
    LLM/chat app
    content_info
    has
    documents
    RAG app
    Spans include
    agent
    kind
    Agent app
    content_info
    has
    metadata
    Has custom metadata
    Multiple span kinds in one trace (
    agent
    +
    tool
    /
    retrieval
    +
    llm
    from
    get_llmobs_trace
    )
    Multi-step app — at least one trace-scope evaluator likely belongs in the suite (
    publish
    mode)
    For agent/multi-step apps, also call
    get_llmobs_trace
    on 2-3 traces to see the full span hierarchy. Compare
    content_info
    between the root span and its sub-spans. Then ask two questions for each candidate quality dimension, in this order:
    1. Does the verdict depend on more than one span? (e.g., faithfulness depends on a
      retrieval
      span's documents AND an
      llm
      span's answer; goal completion depends on the chain of
      tool
      calls AND the final response.) If yes → trace scope in
      publish
      mode. Don't try to compress this into a single span.
    2. Only if the answer to (1) is no: pick the single span with the richest signal for that dimension (root has the summary; LLM sub-spans have the full system prompt + tool call results + reasoning chain).
    Record the span-kind histogram (agent + tool + llm + retrieval) — multiple kinds under one root is a strong signal you'll have at least one trace-scope evaluator in the suite. See Phase 2's "Span vs. Trace Scope Classification" for the mandatory walk-through of canonical trace-scope use cases.
  3. Extract content and identify targets: Call
    get_llmobs_span_content
    for representative spans. Fetch fields based on app profile:
    App ProfileFields to Fetch
    LLM/chat
    messages
    (
    path=$.messages[0]
    for system prompt),
    output
    RAG
    documents
    ,
    input
    ,
    output
    Agent
    get_llmobs_agent_loop
    for the agent span, then
    messages
    for detail
    Any with metadata
    metadata
    Issue all calls in a single message. As you read, capture two streams of signal:
    Generic quality signals — what does "success" look like? What variance exists across outputs? Each observed quality dimension becomes a candidate evaluator, with the traces you've just read as evidence. Also look for safety signals (scope violations, sensitive data in outputs, out-of-character responses) and add a safety evaluator if you find them.
    Domain signals — these become the domain-specific evaluator category in Phase 2 (the highest-leverage category). For every 5–10 traces, write down:
    • Recurring intents / question categories — what classes of request does this app handle? (
      applying for benefit X
      ,
      comparing flight options
      ,
      summarizing a policy
      ,
      creating a widget
      )
    • Entities the app emits in outputs — URLs, agency / company names, code identifiers, monetary amounts, dates, IDs, file paths, phone numbers. Note which ones the user acts on downstream (those are worth a correctness evaluator) versus which are passing references.
    • Tool argument shapes (for agent apps) — name each tool the agent calls and the rough schema of its inputs. Tools with non-trivial schemas (≥ 3 fields, structured types) are candidates for argument-correctness evaluators.
    • Persona / voice rules — does the app always cite a source, always refuse certain topics (medical, legal, financial advice), always speak in a particular tone? Extract the rules implicitly followed across observed outputs.
    • Failure modes specific to the domain — fabricated identifiers, outdated policy references, currency / locale mismatches, off-by-one errors in IDs, wrong units. One observed instance is enough to seed a candidate evaluator.
    Don't try to enumerate domain signals exhaustively before reading traces — let the patterns surface as you read. The goal is breadth in the eventual proposal, not completeness in this exploration step.
  1. 采样应用
    search_llmobs_spans(query="@ml_app:\"<ml_app>\" @status:ok", root_spans_only=true, limit=50, from=<timeframe>)
    。按
    @status:ok
    过滤——错误Span无输出可评估。
  2. 分析应用概况并确定评估目标Span:按trace_id分组调用
    get_llmobs_span_details
    获取span_ids。检查
    content_info
    进行分类:
    信号应用概况
    content_info
    包含
    messages
    LLM/聊天应用
    content_info
    包含
    documents
    RAG应用
    Span包含
    agent
    类型
    代理应用
    content_info
    包含
    metadata
    包含自定义元数据
    单个Trace中包含多种Span类型(
    agent
    +
    tool
    /
    retrieval
    +
    llm
    ,来自
    get_llmobs_trace
    多步骤应用——套件中可能至少包含一个Trace范围的评估器(发布模式)
    对于代理/多步骤应用,还需调用
    get_llmobs_trace
    获取2-3个Trace的完整Span层级结构。比较根Span与其子Span的
    content_info
    。然后针对每个候选质量维度依次提出两个问题:
    1. 判断结果是否依赖多个Span?(例如,忠实度依赖
      retrieval
      Span的文档和
      llm
      Span的回答;目标完成依赖
      tool
      调用链和最终响应。)如果是 → 发布模式下使用Trace范围。不要尝试将其压缩到单个Span中。
    2. 仅当(1)的答案为否时:选择该维度信号最丰富的单个Span(根Span包含摘要;LLM子Span包含完整系统提示 + 工具调用结果 + 推理链)。
    记录Span类型直方图(agent + tool + llm + retrieval)——单个根Span下包含多种类型是套件中至少包含一个Trace范围评估器的强烈信号。有关规范Trace范围用例的强制说明,请参阅阶段2的“Span vs Trace范围分类”。
  3. 提取内容并确定目标:调用
    get_llmobs_span_content
    获取代表性Span的内容。根据应用概况获取字段:
    应用概况要获取的字段
    LLM/聊天
    messages
    path=$.messages[0]
    获取系统提示)、
    output
    RAG
    documents
    input
    output
    代理获取代理Span的
    get_llmobs_agent_loop
    ,然后获取
    messages
    详情
    包含元数据的任何应用
    metadata
    在单个消息中发起所有调用。阅读时,捕获两类信号:
    通用质量信号 — “成功”的表现是什么?输出存在哪些差异?每个观察到的质量维度都成为候选评估器,你刚刚读取的Trace作为证据。同时查找安全信号(范围违规、输出中的敏感数据、不符合角色的响应),如果发现则添加安全评估器。
    领域信号 — 这些将成为阶段2中的领域特定评估器类别(价值最高的类别)。每读取5-10个Trace,记录:
    • 重复意图/问题类别 — 应用处理哪些类型的请求?(
      申请福利X
      比较航班选项
      总结政策
      创建小部件
    • 应用在输出中生成的实体 — URL、机构/公司名称、代码标识符、金额、日期、ID、文件路径、电话号码。注意哪些是用户后续会操作的(这些值得添加正确性评估器),哪些只是引用。
    • 工具参数形状(针对代理应用) — 命名代理调用的每个工具及其输入的大致schema。具有非平凡schema(≥3个字段、结构化类型)的工具是参数正确性评估器的候选对象。
    • 角色/语气规则 — 应用是否始终引用来源、始终拒绝某些主题(医疗、法律、财务建议)、始终使用特定语气?从观察到的输出中提取隐含遵循的规则。
    • 特定领域的故障模式 — 虚构标识符、过时政策引用、货币/区域不匹配、ID中的差一错误、错误单位。只要观察到一个实例,就足以作为候选评估器的种子。
    不要在读取Trace前尝试穷举领域信号——让模式在读取过程中自然浮现。目标是最终提案的广度,而非此探索步骤的完整性。

From RCA Path

来自RCA的路径

  1. Extract the failure taxonomy from the RCA report. Each failure mode with High or Medium severity becomes an eval target.
  2. Check root cause categories for infrastructure failures. Before proposing evaluators, scan the Root Cause column of the taxonomy for any of:
    Instrumentation Deficiency
    ,
    Harness Deficiency
    ,
    Runtime Error
    ,
    Upstream Data Issue
    , or any other root cause that points to infrastructure/environment rather than model behavior. If any are present, pause and ask:
    "Some failure modes were diagnosed as infrastructure or instrumentation issues rather than model behavior (e.g.,
    {list the infra root causes}
    ). Evaluators can be designed two ways:
    • Behavior-targeted (recommended for ongoing quality): measure whether the model produces correct, specific output — useful once the infrastructure is fixed and you want to track real quality
    • Artifact-targeted (useful as regression guard): detect the specific broken output observed (e.g., generic placeholder responses) — catches regressions if the infrastructure breaks again
    Which approach do you want, or both?"
    • If behavior-targeted: design evaluators for what correct output looks like, not what the broken output looked like. Use the RCA's
      expected_output
      / gold-standard examples as the quality bar.
    • If artifact-targeted: design evaluators that detect the specific failure symptom (e.g.,
      StringCheckEvaluator
      for a known bad string,
      LLMJudge
      that checks for generic placeholders).
    • If both: propose each category separately, clearly labelled.
    If all root causes are behavioral (System Prompt Deficiency, Tool Gap, Tool Misuse, Retrieval Failure, etc.) → skip this step and proceed directly.
  3. For each target: if the RCA includes trace IDs, use them directly; otherwise search for matching traces. Fetch 2-3 traces per target with
    get_llmobs_span_content
    to understand the concrete pattern.

  1. 从RCA报告中提取故障分类。每个具有高或中严重性的故障模式都成为评估目标。
  2. 检查根本原因类别是否包含基础设施故障。在提出评估器前,扫描分类的“Root Cause”列,查找以下任何一项:
    Instrumentation Deficiency
    Harness Deficiency
    Runtime Error
    Upstream Data Issue
    ,或任何其他指向基础设施/环境而非模型行为的根本原因。如果存在,暂停并询问:
    "某些故障模式被诊断为基础设施或工具问题,而非模型行为(例如
    {列出基础设施根本原因}
    )。评估器可通过两种方式设计:
    • 行为导向(推荐用于持续质量监控):测量模型是否生成正确、特定的输出——在基础设施修复后,用于跟踪实际质量
    • ** artifact导向**(用作回归防护):检测观察到的特定故障输出(例如通用占位符响应)——如果基础设施再次故障,可捕获回归问题
    你想要哪种方法,还是两者都要?"
    • 如果选择行为导向:设计评估器以衡量正确输出的表现,而非故障输出的表现。使用RCA中的
      expected_output
      /黄金标准示例作为质量标准。
    • 如果选择artifact导向:设计评估器以检测特定故障症状(例如针对已知错误字符串的
      StringCheckEvaluator
      、检查通用占位符的
      LLMJudge
      )。
    • 如果选择两者都要:分别提出每个类别,明确标记。
    如果所有根本原因都是行为相关的(System Prompt Deficiency、Tool Gap、Tool Misuse、Retrieval Failure等)→ 跳过此步骤,直接继续。
  3. 针对每个目标:如果RCA包含Trace ID,直接使用;否则搜索匹配的Trace。调用
    get_llmobs_span_content
    获取每个目标的2-3个Trace,以了解具体模式。

Phase 2: Propose Evaluator Suite

阶段2:提出评估器套件

Goal: Present a concrete evaluator proposal for user confirmation.
Each evaluator judges one data point — it receives input and output for a single record/span, not a full trace or batch. Design evaluators accordingly.
Targeting depends on
output_mode
:
  • sdk_code
    /
    data_only
    offline experiments. Template variables use
    EvaluatorContext
    fields (
    {{input_data}}
    ,
    {{output_data}}
    ). The actual data shape depends on the user's dataset and task function (see EvaluatorContext note in SDK Reference).
  • publish
    online evaluation on production spans. Template variables resolve against the full span JSON via dot-paths (
    {{meta.input.value}}
    ,
    {{meta.output.messages[*].content}}
    , …) or the built-in span-kind-aware aliases (
    {{span_input}}
    ,
    {{span_output}}
    ). See "Online Template Variables" under Publishing Conventions for the full syntax. Each evaluator also needs
    eval_scope
    ,
    sampling_percentage
    , and (optionally)
    filter
    — surface these in the proposal table so the user can confirm before publishing.
Order proposals from broadest signal to most granular. Propose broadly, let the user curate — see "How many evaluators to propose" below.
  1. Domain-specific evaluators — What does "good" mean for this specific app? These are the highest-leverage proposals because they capture quality bars generic evaluators miss. Derive them from the domain signals Phase 1 captured:
    • Recurring intents / question categories the app handles (e.g., "applying for a federal benefit", "comparing flight options", "explaining a policy"). Propose an
      intent_classification
      or
      intent_handling_correctness
      evaluator scoped to the dominant intents.
    • Specific entities the app produces (URLs, agency names, code identifiers, monetary amounts, dates, IDs). Propose a per-entity correctness evaluator for the ones with real downstream cost when wrong (e.g.,
      cited_url_is_real
      ,
      agency_name_matches_request
      ,
      monetary_amount_is_consistent_with_input
      ).
    • Tool argument shapes observed across
      tool
      spans. Propose a per-tool argument-correctness evaluator for the tools with non-trivial schemas (e.g.,
      search_flights_args_match_user_request
      ,
      update_dashboard_widget_targets_correct_widget
      ).
    • Persona / voice expectations — does the app always cite sources, always refuse out-of-scope requests, always speak in a specific tone? Propose evaluators for the voice rules you can extract from observed outputs (
      cites_a_source
      ,
      refuses_medical_advice
      ,
      tone_matches_brand
      ).
    • Domain-specific failure modes seen across traces (fabricated identifiers, outdated policy references, unit mismatches, currency / locale mismatches). One evaluator per recurring failure mode.
    Name each evaluator after the user-facing concern, not the technical check (
    agency_url_is_real
    over
    regex_url_match
    ). Use the trace IDs you read in Phase 1 as evidence — at least one passing case and one failing case per evaluator if you saw both.
  2. Outcome evaluators — Did this span / trace produce a good result for the request?
    • Examples:
      task_completion
      ,
      answer_correctness
      ,
      response_groundedness
  3. Format evaluators — Does the output meet structural requirements?
    • Examples:
      valid_json_output
      ,
      response_length
      ,
      citation_format
  4. Safety evaluators — Does the output stay within appropriate boundaries?
    • Examples:
      no_pii_leakage
      ,
      scope_adherence
      ,
      no_hallucination
目标:提出具体的评估器提案供用户确认。
每个评估器判断一个数据点——接收单个记录/Span的输入和输出,而非完整Trace或批量数据。据此设计评估器。
目标定位取决于
output_mode
  • sdk_code
    /
    data_only
    离线实验。模板变量使用
    EvaluatorContext
    字段(
    {{input_data}}
    {{output_data}}
    )。实际数据形状取决于用户的数据集和任务函数(请参阅SDK参考中的EvaluatorContext说明)。
  • publish
    生产Span上的在线评估。模板变量通过点路径针对完整Span JSON解析(
    {{meta.input.value}}
    {{meta.output.messages[*].content}}
    等),或使用内置的Span类型感知别名(
    {{span_input}}
    {{span_output}}
    )。有关完整语法,请参阅发布约定下的“在线模板变量”。每个评估器还需要
    eval_scope
    sampling_percentage
    和(可选)
    filter
    ——在提案表格中显示这些内容,以便用户在发布前确认。
按从宽泛到精细的顺序排列提案。广泛提出,让用户筛选——请参阅下面的“要提出多少个评估器”。
  1. 领域特定评估器 — 对于此特定应用,“好”的定义是什么?这些是价值最高的提案,因为它们捕获了通用评估器无法覆盖的质量标准。从阶段1捕获的领域信号中推导:
    • 应用处理的重复意图/问题类别(例如“申请联邦福利”、“比较航班选项”、“解释政策”)。针对主要意图提出
      intent_classification
      intent_handling_correctness
      评估器。
    • 应用生成的特定实体(URL、机构名称、代码标识符、金额、日期、ID)。针对错误会产生实际下游成本的实体提出每个实体的正确性评估器(例如
      cited_url_is_real
      agency_name_matches_request
      monetary_amount_is_consistent_with_input
      )。
    • tool
      Span中观察到的工具参数形状
      。针对具有非平凡schema的工具提出每个工具的参数正确性评估器(例如
      search_flights_args_match_user_request
      update_dashboard_widget_targets_correct_widget
      )。
    • 角色/语气期望 — 应用是否始终引用来源、始终拒绝超出范围的请求、始终使用特定语气?针对从观察到的输出中提取的语气规则提出评估器(
      cites_a_source
      refuses_medical_advice
      tone_matches_brand
      )。
    • Trace中发现的特定领域故障模式(虚构标识符、过时政策引用、单位不匹配、货币/区域不匹配)。每个重复故障模式对应一个评估器。
    每个评估器的名称以用户关注的问题命名,而非技术检查(例如使用
    agency_url_is_real
    而非
    regex_url_match
    )。使用阶段1中读取的Trace ID作为证据——如果同时看到通过和失败案例,每个评估器至少引用一个通过案例和一个失败案例。
  2. 结果评估器 — 此Span/Trace是否为请求生成了良好结果?
    • 示例:
      task_completion
      answer_correctness
      response_groundedness
  3. 格式评估器 — 输出是否符合结构要求?
    • 示例:
      valid_json_output
      response_length
      citation_format
  4. 安全评估器 — 输出是否保持在适当范围内?
    • 示例:
      no_pii_leakage
      scope_adherence
      no_hallucination
How many evaluators to propose
要提出多少个评估器
The default
4-6
cap from the older skill version was too tight — it pushed the skill toward generic evaluators only and left domain signals on the table. Updated guidance:
  • Aim for 8–15 evaluators in the proposal, distributed across all four categories (with domain-specific usually the largest bucket, outcome second, format and safety smaller). For very simple single-LLM-call apps, fewer is fine; for agent / RAG apps with rich domain signals, lean toward the upper end.
  • Quality > generic: every domain-specific proposal should be backed by at least one observed pattern in the sampled traces. Don't invent generic domain evaluators ("
    response_quality
    ") if you don't have evidence for them.
  • Let the user curate: the MANDATORY CHECKPOINT below explicitly asks the user to remove what doesn't apply, not just to approve. Treat the proposal as a candidate set the user trims.
旧版技能的默认4-6个上限过于严格——它迫使技能仅提出通用评估器,而忽略领域信号。更新后的指南:
  • 目标提出8-15个评估器,分布在所有四个类别中(领域特定通常是最大的类别,结果类其次,格式和安全类较小)。对于非常简单的单LLM调用应用,数量可以更少;对于具有丰富领域信号的代理/RAG应用,倾向于上限。
  • 质量优先于通用:每个领域特定提案都必须至少有一个采样Trace中的观察模式作为支持。如果没有证据,不要发明通用领域评估器(例如
    response_quality
    )。
  • 让用户筛选:下面的强制检查点明确要求用户移除不适用的评估器,而非仅仅批准。将提案视为用户会精简的候选集。

Deduplication Against Existing Coverage

与现有覆盖范围去重

In
data_only
mode
: skip this section entirely (coverage map was not built in Phase 0). Proceed directly to the proposal table.
Before building the proposal, apply the coverage map from Phase 0. Coverage is keyed on
(dimension, scope)
— not on dimension alone
: every OOTB evaluator runs at span scope, and an enabled OOTB eval does NOT preclude proposing a trace-scope evaluator for the same dimension. The two answer different questions.
  1. Enabled span-scope eval (OOTB or custom) for dimension D:
    • Do NOT propose a new span-scope evaluator for D — that dimension is already covered at span scope.
    • DO propose a trace-scope evaluator for D when the trace shape calls for it (multi-step app, judgment depends on cross-span context). Note the relationship in the rationale: e.g., "OOTB
      Goal Completeness
      evaluates each LLM span in isolation; this trace-scope
      goal_completion
      checks whether the agent's full sequence of steps achieved the user's request — different question."
  2. Enabled trace-scope custom eval for dimension D: do NOT propose another trace-scope evaluator for the same dimension; that's a real duplicate. Span-scope on the same dimension is still fair game if the data also fits a single span.
  3. Disabled OOTB eval: Do NOT propose a new custom span-scope evaluator for that dimension. Instead, surface it in a short note within the proposal and suggest enabling it in the Datadog UI rather than creating a duplicate. Example:
    hallucination
    (ootb, disabled) — consider enabling in Datadog UI (Evaluations → Configure) instead of creating a custom span-scope eval. (A trace-scope
    rag_faithfulness
    is still in scope and covers a different question.)
  4. Gap identification: Open the proposal with a coverage summary line: "Existing coverage: N evaluator(s) already configured ({names}, all span-scope unless noted). Proposing evaluators for uncovered dimensions and uncovered scopes."
  5. All dimensions covered: A dimension is "fully covered" only when both scopes are present (or the scope doesn't apply to the app shape). If the coverage map accounts for every identified quality dimension at the appropriate scope(s), surface this explicitly and ask the user what they want: (a) review/improve existing eval prompts, (b) add coverage for additional dimensions, or (c) proceed anyway.
For each proposed evaluator:
  • Name: Must match
    ^[a-zA-Z0-9_-]+$
    (alphanumeric, underscore, hyphen only)
  • Type:
    LLMJudge
    (Boolean/Score/Categorical/custom JSON schema), built-in (
    JSONEvaluator
    ,
    RegexMatchEvaluator
    , etc.), or
    BaseEvaluator
    subclass. In
    publish
    mode, only LLM-judge evaluators are supported by the MCP tool — code-based checks must NOT be silently dropped. List them in the same proposal table with
    Type
    set to the code-based class, mark them under a "Not publishable in this mode" subsection of the proposal, and tell the user to run the skill again in default
    sdk_code
    mode (or
    --data-only
    ) to capture them. Treat the code-based proposals as part of the suite for counting and coverage purposes.
  • What it measures: 1-2 sentence plain-language description
  • Target span: Which span's data the evaluator was designed for (e.g., "root agent span", "LLM sub-span
    anthropic.request
    ", "all
    llm
    spans"). If the root span's I/O is too lossy for the quality dimension (e.g., tool call results aren't visible), note this and specify which sub-span has the signal. In
    publish
    mode this maps to a combination of
    eval_scope
    (
    span
    /
    trace
    /
    session
    ),
    root_spans_only
    , and the EVP
    filter
    query (e.g.
    @meta.span.kind:llm
    or
    service:web
    ).
  • Pass/fail criteria:
    pass_when=True
    ,
    min_threshold=7
    ,
    pass_values=["correct"]
    , or "no automatic assessment" for custom JSON schema
  • Template variables: Which of
    input_data
    ,
    output_data
    ,
    expected_output
    ,
    metadata.*
    it uses (offline) — or which span paths / aliases it pulls from (publish mode:
    {{span_input}}
    ,
    {{span_output}}
    ,
    {{meta.input.messages[*].content}}
    ,
    {{meta.metadata.<key>}}
    , etc.)
  • Evidence: At least one trace where it would have caught a failure (or confirmed correct behavior)
  • Publish-only fields (only in
    publish
    mode)
    :
    integration_provider
    (default
    openai
    ),
    model_name
    (default
    gpt-5.4-mini
    ),
    sampling_percentage
    (default
    10
    ),
    eval_scope
    (default
    span
    ), and any
    filter
    query needed to scope to the right spans. Surface defaults in the proposal so the user can override before publishing.
  • integration_account_id
    (only in
    publish
    mode)
    : the integration account the judge LLM is called through. Auto-detected from existing evaluators in the same ml_app (Phase 0 coverage map). Never asked from the user as a raw UUID. If no existing evaluator has one, the field is omitted and the user picks an account in the UI before activating. All evaluators are published with
    enabled: false
    regardless
    — see "Always publish as draft" in Phase 3C for the full activation workflow.
data_only
模式下
:完全跳过本节(阶段0未构建覆盖范围映射)。直接进入提案表格。
在构建提案前,应用阶段0的覆盖范围映射。覆盖范围以
(dimension, scope)
为键——而非仅维度
:每个OOTB评估器都在Span范围运行,启用的OOTB评估器并不排除针对同一维度提出Trace范围评估器的可能性。两者回答的是不同的问题。
  1. 针对维度D的已启用Span范围评估器(OOTB或自定义)
    • 不要针对D提出新的Span范围评估器——该维度已在Span范围覆盖。
    • 当Trace形状需要时(多步骤应用、判断依赖跨Span上下文),针对D提出Trace范围评估器。在理由中说明关系:例如“OOTB
      Goal Completeness
      单独评估每个LLM Span;此Trace范围的
      goal_completion
      检查代理的完整步骤序列是否实现了用户请求——这是不同的问题。”
  2. 针对维度D的已启用Trace范围自定义评估器:不要针对同一维度提出另一个Trace范围评估器——这是真正的重复。如果数据也适合单个Span,同一维度的Span范围评估器仍然是可行的。
  3. 已禁用的OOTB评估器:不要针对该维度提出新的自定义Span范围评估器。相反,在提案中添加简短说明,建议在Datadog UI中启用它,而非创建重复项。示例:
    hallucination
    (ootb,已禁用)——考虑在Datadog UI(Evaluations → Configure)中启用,而非创建自定义Span范围评估器。(Trace范围的
    rag_faithfulness
    仍然适用,且覆盖不同的问题。)
  4. 差距识别:在提案开头添加覆盖范围摘要行:“现有覆盖范围:已配置N个评估器({名称},除非特别说明,否则均为Span范围)。针对未覆盖的维度和范围提出评估器。”
  5. 所有维度已覆盖:仅当两种范围都存在(或范围不适用于应用形状)时,维度才被“完全覆盖”。如果覆盖范围映射涵盖了所有已识别的质量维度及其适当范围,明确指出这一点并询问用户想要:(a) 审查/改进现有评估提示,(b) 添加对其他维度的覆盖,或(c) 继续执行。
针对每个拟议评估器:
  • 名称:必须匹配
    ^[a-zA-Z0-9_-]+$
    (仅字母数字、下划线、连字符)
  • 类型
    LLMJudge
    (布尔值/分数/分类/自定义JSON schema)、内置(
    JSONEvaluator
    RegexMatchEvaluator
    等),或
    BaseEvaluator
    子类。在发布模式下,MCP工具仅支持LLM-judge评估器——基于代码的检查不得被静默丢弃。在同一提案表格中列出它们,将
    Type
    设置为基于代码的类,在提案的“此模式下不可发布”小节中标记,并告知用户重新运行技能的默认
    sdk_code
    模式(或
    --data-only
    )以捕获它们。将基于代码的提案视为套件的一部分,用于计数和覆盖范围目的。
  • 测量内容:1-2句通俗易懂的描述
  • 目标Span:评估器设计用于哪个Span的数据(例如“根代理Span”、“LLM子Span
    anthropic.request
    ”、“所有
    llm
    Span”)。如果根Span的输入/输出对于质量维度来说信息损失过大(例如工具调用结果不可见),注明这一点并指定哪个子Span包含信号。在发布模式下,这对应于
    eval_scope
    span
    /
    trace
    /
    session
    )、
    root_spans_only
    和EVP
    filter
    查询的组合(例如
    @meta.span.kind:llm
    service:web
    )。
  • 通过/失败标准
    pass_when=True
    min_threshold=7
    pass_values=["correct"]
    ,或自定义JSON schema的“无自动评估结果”
  • 模板变量:使用
    input_data
    output_data
    expected_output
    metadata.*
    中的哪些(离线)——或从哪些Span路径/别名提取(发布模式:
    {{span_input}}
    {{span_output}}
    {{meta.input.messages[*].content}}
    {{meta.metadata.<key>}}
    等)
  • 证据:至少一个它会捕获故障(或确认正确行为)的Trace
  • 仅发布模式字段(仅在发布模式下):
    integration_provider
    (默认
    openai
    )、
    model_name
    (默认
    gpt-5.4-mini
    )、
    sampling_percentage
    (默认
    10
    )、
    eval_scope
    (默认
    span
    ),以及任何需要限定到正确Span的
    filter
    查询。在提案中显示默认值,以便用户在发布前覆盖。
  • integration_account_id
    (仅在发布模式下):Judge LLM调用所通过的集成账户。从同一ml_app中的现有评估器(阶段0覆盖范围映射)自动检测。切勿要求用户提供原始UUID。如果没有现有评估器包含该ID,省略该字段,用户在激活前在UI中选择账户。无论如何,所有评估器都以
    enabled: false
    发布
    ——有关完整激活工作流,请参阅阶段3C中的“始终以草稿形式发布”。

Span vs. Trace Scope Classification (
publish
mode)

Span vs Trace范围分类(发布模式)

Don't ask the user; classify per evaluator and let them override at the checkpoint.
不要询问用户;针对每个评估器进行分类,让用户在检查点覆盖
Mandatory: walk the four canonical trace-scope use cases first
强制:首先检查四个规范Trace范围用例
If Phase 1 found multi-step traces (≥ 2 span kinds, or any
tool
/
retrieval
/
workflow
span under an
agent
root), you MUST walk through the four canonical trace-scope use cases below before finalizing the suite. For each, decide explicitly: applies (include with
eval_scope: trace
) or does not apply (record a one-line reason in a "Skipped trace-scope candidates" subsection of the proposal). Skipping all four without per-item justification is a sign you've over-anchored on span scope — re-check.
Canonical use caseTriggers when
goal_completion
— did the agent finish the user's request?
Any agent / multi-step app. Almost always applies.
tool_use_correctness
— right tool with right arguments?
Trace contains
tool
kind spans.
rag_faithfulness
— answer grounded in retrieved documents?
Trace contains
retrieval
kind spans.
conversation_quality
— coherence across multi-turn LLM calls?
Trace contains ≥ 2
llm
spans, or app instruments multi-turn sessions.
For other proposed evaluators (e.g. tone, format, safety), apply this two-question test:
  1. Can the judgment be answered correctly from one span's
    meta.input
    +
    meta.output
    , where "correctly" means the verdict cannot change if you considered other spans in the trace? →
    eval_scope: span
    .
  2. Otherwise →
    eval_scope: trace
    . In particular, default to trace when the evaluator name contains grounding, faithfulness, hallucination, completeness, correctness across steps, consistency, or workflow — these almost always need cross-span context.
如果阶段1发现多步骤Trace(≥2种Span类型,或
agent
根Span下的任何
tool
/
retrieval
/
workflow
Span),在最终确定套件前必须检查以下四个规范Trace范围用例。针对每个用例,明确决定:适用(包含
eval_scope: trace
)或不适用(在提案的“跳过的Trace范围候选”小节中记录一行理由)。如果没有逐项理由就跳过所有四个用例,表明你过度依赖Span范围——重新检查。
规范用例触发条件
goal_completion
— 代理是否完成了用户请求?
任何代理/多步骤应用。几乎总是适用。
tool_use_correctness
— 是否使用了正确的工具和参数?
Trace包含
tool
类型Span。
rag_faithfulness
— 回答是否基于检索到的文档?
Trace包含
retrieval
类型Span。
conversation_quality
— 多轮LLM调用的连贯性?
Trace包含≥2个
llm
Span,或应用工具支持多轮会话。
对于其他拟议评估器(例如语气、格式、安全),应用以下两个问题的测试:
  1. 判断结果是否可以仅从一个Span的
    meta.input
    +
    meta.output
    正确得出,其中“正确”意味着考虑Trace中的其他Span不会改变判断结果? →
    eval_scope: span
  2. 否则 →
    eval_scope: trace
    。特别是,当评估器名称包含groundingfaithfulnesshallucinationcompleteness跨步骤正确性consistencyworkflow时,默认使用Trace范围——这些几乎总是需要跨Span上下文。
Trade-offs (don't let these dominate the choice)
权衡(不要让这些主导选择)
Trace scope costs more than span scope: one judgment per completed trace (vs. per matching span), larger prompt payloads, and a 3-minute trigger latency (Datadog waits 3 minutes of inactivity before considering a trace complete; later spans are excluded). These are cost-control levers — handle with
sampling_percentage
and
filter
, not by demoting scope. The correctness of the eval is what picks the scope.
Trace范围的成本高于Span范围:每个完成的Trace对应一次判断(vs每个匹配Span对应一次),提示负载更大,触发延迟为3分钟(Datadog等待3分钟无活动后才认为Trace完成;后续Span被排除)。这些是成本控制手段——通过
sampling_percentage
filter
处理,而非降低范围。评估的正确性才是选择范围的依据。
Surface the classification
显示分类结果
Add a Scope column to the proposal table and a one-sentence rationale per evaluator. If you skipped a canonical trace-scope use case, list it under a "Skipped trace-scope candidates" subsection with the reason — the user will see and can override.
Example rationales:
  • tone_check
    span. Judging "is this single response polite" needs only one LLM span's
    meta.output.messages[*].content
    ; no other span in the trace can change that verdict.
  • goal_completion
    trace. Whether the agent finished the user's request depends on the sequence of tool calls and the final LLM response together —
    meta.output
    of any single span only shows that step's output.
  • tool_use_correctness
    trace. Comparing tool inputs against the request and the final response requires correlating ≥ 3 spans (root, tool, final LLM).
  • rag_faithfulness
    trace. Grounding pairs the
    retrieval
    span's documents with the LLM span's answer.
Example "Skipped trace-scope candidates" entry:
  • conversation_quality
    — skipped: traces contain a single LLM call (no multi-turn signal in this app's instrumentation).
在提案表格中添加范围列,并为每个评估器添加一句理由。如果跳过了某个规范Trace范围用例,在“跳过的Trace范围候选”小节中列出并说明理由——用户会看到并可以覆盖。
示例理由:
  • tone_check
    Span范围。判断“此单个响应是否礼貌”仅需要一个LLM Span的
    meta.output.messages[*].content
    ;Trace中的其他Span不会改变该判断结果。
  • goal_completion
    Trace范围。代理是否完成用户请求取决于工具调用序列和最终LLM响应的组合;任何单个Span的
    meta.output
    仅显示该步骤的输出。
  • tool_use_correctness
    Trace范围。将工具输入与请求和最终响应进行比较需要关联≥3个Span(根Span、工具Span、最终LLM Span)。
  • rag_faithfulness
    Trace范围。忠实度需要将
    retrieval
    Span的文档与LLM Span的回答配对。
示例“跳过的Trace范围候选”条目:
  • conversation_quality
    — 跳过:Trace包含单个LLM调用(此应用的工具中无多轮信号)。

MANDATORY CHECKPOINT

强制检查点

You MUST output the proposal and wait for user confirmation before proceeding.
undefined
必须输出提案并等待用户确认后再继续
undefined

Proposed Evaluator Suite

拟议评估器套件

App profile: {LLM | RAG | Agent | Multi-agent} Entry mode: {cold_start | from_rca}
#NameTypeScopeMeasuresPass Criteria
1task_completionLLMJudge (Boolean)spanWhether the task was completed on this spanpass_when=True
2tool_use_correctnessLLMJudge (Categorical)traceRight tool with right arguments across the agent runpass_values=["correct"]
3...............
(Drop the Scope column when not in
publish
mode.)
For each evaluator:
  • {name}: {what it measures}
    • Target span: {which span's data it was designed for}
    • Rationale: {which quality dimension it covers and why}
    • {Only in publish mode:} Scope: {span | trace} — {one-sentence rationale}
    • Evidence: Trace {id_short}
{Only in publish mode, for multi-step apps. Required if any of the four canonical trace-scope use cases was not included above:}
Skipped trace-scope candidates:
  • {canonical_use_case}
    — {one-line reason it does not apply to this app}
{Only in publish mode, when the suite contains code-based evaluators (JSONEvaluator, RegexMatchEvaluator, LengthEvaluator, StringCheckEvaluator, BaseEvaluator). Required when any code-based proposal exists.}
Not publishable in this mode (code-based evaluators — the publish API is LLM-judge only):
  • {name}
    ({type}) — {what it would check}. Re-run
    /eval-bootstrap {ml_app}
    in default mode to emit as offline SDK code, or
    /eval-bootstrap {ml_app} --data-only
    for a framework-agnostic JSON spec.

**Which evaluators should I generate?** Treat the proposal as a candidate set — the suite below is intentionally broad so you can pick what matters for your team's quality bar. Reply with **which to keep, which to drop, and which to rename**; not every domain-specific proposal will fit your priorities. In `sdk_code` mode you may also add custom evaluators or change provider/model. In `publish` mode you may override `integration_provider`, `model_name`, `sampling_percentage`, `eval_scope`, `root_spans_only`, or `filter` per evaluator.

Do NOT proceed to code generation until the user confirms.

---
应用概况:{LLM | RAG | Agent | Multi-agent} 进入模式:{cold_start | from_rca}
#名称类型范围测量内容通过标准
1task_completionLLMJudge (Boolean)span此Span上的任务是否完成pass_when=True
2tool_use_correctnessLLMJudge (Categorical)trace代理运行过程中是否使用了正确的工具和参数pass_values=["correct"]
3...............
(非发布模式下删除范围列。)
针对每个评估器:
  • {name}:{测量内容}
    • 目标Span:{设计用于哪个Span的数据}
    • 理由:{覆盖的质量维度及原因}
    • {仅发布模式下:} 范围:{span | trace} — {一句理由}
    • 证据:Trace {id_short}
{仅发布模式下,针对多步骤应用。如果上述未包含四个规范Trace范围用例中的任何一个,必填:}
跳过的Trace范围候选
  • {canonical_use_case}
    — {不适用于此应用的一行理由}
{仅发布模式下,当套件包含基于代码的评估器(JSONEvaluator、RegexMatchEvaluator、LengthEvaluator、StringCheckEvaluator、BaseEvaluator)时。当存在任何基于代码的提案时必填:}
此模式下不可发布(基于代码的评估器——发布API仅支持LLM-judge):
  • {name}
    ({type}) — {它会检查的内容}。重新运行
    /eval-bootstrap {ml_app}
    的默认模式以生成为离线SDK代码,或运行
    /eval-bootstrap {ml_app} --data-only
    以生成框架无关的JSON规范。

**我应该生成哪些评估器?** 将提案视为候选集——下面的套件故意设计得很宽泛,以便你选择符合团队质量标准的评估器。回复**保留哪些、删除哪些、重命名哪些**;并非所有领域特定提案都符合你的优先级。在`sdk_code`模式下,你还可以添加自定义评估器或更改提供商/模型。在发布模式下,你可以针对每个评估器覆盖`integration_provider`、`model_name`、`sampling_percentage`、`eval_scope`、`root_spans_only`或`filter`。

在用户确认前,不要继续生成代码。

---

Phase 3: Generate Output

阶段3:生成输出

Branch on
output_mode
:
  • sdk_code
    Phase 3A below
  • data_only
    → skip to Phase 3B
  • publish
    → skip to Phase 3C

根据
output_mode
分支:
  • sdk_code
    → 下面的阶段3A
  • data_only
    → 跳至阶段3B
  • publish
    → 跳至阶段3C

Phase 3A: Generate & Write Evaluator Code

阶段3A:生成并写入评估器代码

Goal: Generate the final
.py
file and write it to disk.
For each confirmed evaluator, generate production-quality Python code following the SDK Reference patterns above.
目标:生成最终的
.py
文件并写入磁盘。
针对每个确认的评估器,按照上述SDK参考模式生成生产级Python代码。

Code Generation Rules

代码生成规则

  1. Ground prompts in traces: LLMJudge system prompts and user prompts must reference patterns actually observed in production traces. Never write generic prompts like "evaluate whether the response is good" — ground them in the app's domain, observed failure patterns, and success criteria.
  2. Keep template variables generic, add comments for context: Use
    {{input_data}}
    and
    {{output_data}}
    as top-level placeholders in prompts — do NOT reference nested span paths like
    {{input_data.messages[-1].content}}
    . The evaluator's data comes from the user's dataset and task function, not directly from spans. Instead, add a comment above each evaluator describing what data it was designed for and what the user should adapt:
    python
    # Designed for: input_data = user query, output_data = assistant response text
    # Observed from: root agent span (input.value → output.value)
    # If your dataset uses a different structure, adapt the prompt references below.
  3. Use the narrowest evaluator type: If a check can be done with
    JSONEvaluator
    ,
    RegexMatchEvaluator
    ,
    StringCheckEvaluator
    , or
    LengthEvaluator
    , do NOT use an LLMJudge. Code-based evaluators are faster, cheaper, and deterministic.
  4. BaseEvaluator subclasses:
    • Call
      super().__init__(name=name)
      in
      __init__
    • Return
      EvaluatorResult
      from
      evaluate()
    • Do NOT modify instance attributes in
      evaluate()
      (thread safety)
  5. Names: Must match
    ^[a-zA-Z0-9_-]+$
    . Use snake_case descriptive names.
  6. Imports: Consolidate at the top of the file. Only import classes that are actually used.
  7. Evaluator list: Collect all evaluators into an
    evaluators
    list at the bottom of the file.
  8. Anonymize PII: Strip emails, names, and sensitive data from any trace content included in LLMJudge prompts or the header comment.
  1. 基于Trace编写提示:LLMJudge系统提示和用户提示必须引用生产Trace中实际观察到的模式。切勿编写通用提示,例如“评估响应是否良好”——要基于应用领域、观察到的故障模式和成功标准编写。
  2. 保持模板变量通用,添加注释说明上下文:在提示中使用
    {{input_data}}
    {{output_data}}
    作为顶级占位符——不要引用嵌套Span路径,例如
    {{input_data.messages[-1].content}}
    。评估器的数据来自用户的数据集和任务函数,而非直接来自Span。相反,在每个评估器上方添加注释,说明其设计用于何种数据以及用户应如何适配:
    python
    # 设计用途:input_data = 用户查询,output_data = 助手响应文本
    # 来自:根代理Span(input.value → output.value)
    # 如果你的数据集使用不同结构,请适配下面的提示引用。
  3. 使用最窄的评估器类型:如果检查可以通过
    JSONEvaluator
    RegexMatchEvaluator
    StringCheckEvaluator
    LengthEvaluator
    完成,不要使用LLMJudge。基于代码的评估器更快、更便宜且具有确定性。
  4. BaseEvaluator子类
    • __init__
      中调用
      super().__init__(name=name)
    • evaluate()
      返回
      EvaluatorResult
    • 不要在
      evaluate()
      中修改实例属性(线程安全)
  5. 名称:必须匹配
    ^[a-zA-Z0-9_-]+$
    。使用蛇形命名法的描述性名称。
  6. 导入:在文件顶部合并导入。仅导入实际使用的类。
  7. 评估器列表:在文件底部将所有评估器收集到
    evaluators
    列表中。
  8. 匿名化PII:从LLMJudge提示或头部注释中包含的任何Trace内容中剥离电子邮件、姓名和敏感数据。

Output Format

输出格式

The generated
.py
file should follow this structure:
python
"""
Auto-generated evaluators for {ml_app}
Generated: {YYYY-MM-DD} by eval-bootstrap

App profile: {LLM | RAG | Agent | Multi-agent}

Quality dimensions covered:
  - {target_name}: {description}
    Evidence: https://app.datadoghq.com/llm/traces?query=trace_id:{full_id}
  ...

Usage:
    from ddtrace.llmobs import LLMObs

    experiment = LLMObs.experiment(
        name="my-experiment",
        task=my_task_fn,
        dataset=dataset,
        evaluators=evaluators,
    )
    experiment.run()
"""

{imports — only what is used}
生成的
.py
文件应遵循以下结构:
python
"""
为{ml_app}自动生成的评估器
生成时间:{YYYY-MM-DD} by eval-bootstrap

应用概况:{LLM | RAG | Agent | Multi-agent}

覆盖的质量维度:
  - {target_name}: {description}
    证据:https://app.datadoghq.com/llm/traces?query=trace_id:{full_id}
  ...

使用方法:
    from ddtrace.llmobs import LLMObs

    experiment = LLMObs.experiment(
        name="my-experiment",
        task=my_task_fn,
        dataset=dataset,
        evaluators=evaluators,
    )
    experiment.run()
"""

{导入——仅导入使用的类}

--- Outcome Evaluators ---

--- 结果评估器 ---

{evaluator code}
{评估器代码}

--- Format Evaluators ---

--- 格式评估器 ---

{evaluator code}
{评估器代码}

--- Safety Evaluators ---

--- 安全评估器 ---

{evaluator code}
{评估器代码}

--- Evaluator Suite ---

--- 评估器套件 ---

evaluators = [ {eval_1_variable_name}, {eval_2_variable_name}, ... ]

Only include section comments (Outcome/Format/Safety) for categories that have evaluators.
evaluators = [ {eval_1_variable_name}, {eval_2_variable_name}, ... ]

仅为包含评估器的类别添加节注释(结果/格式/安全)。

Write the file

写入文件

Write the generated code to the output path (suggest
./evals/{ml_app}_evaluators.py
if not specified), then display a summary:
undefined
将生成的代码写入输出路径(如果未指定,建议使用
./evals/{ml_app}_evaluators.py
),然后显示摘要:
undefined

Generated Evaluators

生成的评估器

Wrote {N} evaluators to
{output_path}
:
#NameTypeCovers
1.........
已将{N}个评估器写入
{output_path}
#名称类型覆盖内容
1.........

Next Steps

后续步骤

  1. Review: Check the generated prompts and criteria match your expectations
  2. Test offline: Use
    LLMObs.experiment(evaluators=evaluators)
    to batch-evaluate against a labeled dataset and verify scores
undefined
  1. 审查:检查生成的提示和标准是否符合你的预期
  2. 离线测试:使用
    LLMObs.experiment(evaluators=evaluators)
    针对标记数据集批量评估并验证分数
undefined

Notebook export (after summary)

Notebook导出(摘要后)

After displaying the summary, offer notebook export.
  • If
    rca_notebook_url
    was detected in Phase 0
    :
    An RCA notebook was created earlier in this session:
    {rca_notebook_url}
    Would you like to (a) append the evaluator suite summary to that notebook, or (b) create a new standalone notebook?
    If append: use the notebook creation fallback pattern (see below) with
    mcp__datadog-mcp__edit_datadog_notebook
    (
    id={rca_notebook_id}
    ,
    append_only=true
    , evaluator suite summary cell).
    If new: use the notebook creation fallback pattern (see below) with
    mcp__datadog-mcp__create_datadog_notebook
    .
  • If no
    rca_notebook_url
    :
    Would you like to export this evaluator suite summary to a Datadog notebook?
    If yes: use the notebook creation fallback pattern (see below) with
    mcp__datadog-mcp__create_datadog_notebook
    :
    • name
      :
      Eval Bootstrap: {ml_app} — YYYY-MM-DD
    • type
      :
      report
    • cells
      : single markdown cell with the evaluator suite summary
    • time
      :
      { "live_span": "1h" }
Notebook creation fallback pattern (apply to every
create_datadog_notebook
/
edit_datadog_notebook
call):
  1. Try the MCP tool first.
  2. If the MCP call fails, inspect the error:
    • Auth / permission error (401, 403) → stop and tell the user.
    • Field validation error (error names a specific field) → fix that field and retry the MCP call once.
    • Any other error (binding, serialization, unexpected response) → fall back to pup:
      • Write the payload to
        /tmp/nb_bootstrap_{ml_app}.json
        as a full API envelope:
        {"data": {"attributes": {"name": "...", "time": {...}, "cells": [...]}, "type": "notebooks"}}
      • Run
        pup notebooks create --file /tmp/nb_bootstrap_{ml_app}.json
      • If pup is not available either, render the notebook content as markdown in chat.
  3. After successful creation by either method, output the URL:
    Evaluator suite exported to notebook: <url>
Notebook cell content — the markdown cell should contain:
markdown
undefined
显示摘要后,提供Notebook导出选项。
  • 如果阶段0检测到
    rca_notebook_url
    本次会话中之前创建了一个RCA Notebook:
    {rca_notebook_url}
    你想要(a) 将评估器套件摘要附加到该Notebook,还是(b) 创建新的独立Notebook?
    如果选择附加:使用Notebook创建回退模式(如下所述),调用
    mcp__datadog-mcp__edit_datadog_notebook
    id={rca_notebook_id}
    append_only=true
    ,评估器套件摘要单元格)。
    如果选择新建:使用Notebook创建回退模式(如下所述),调用
    mcp__datadog-mcp__create_datadog_notebook
  • 如果未检测到
    rca_notebook_url
    是否要将此评估器套件摘要导出到Datadog Notebook?
    如果是:使用Notebook创建回退模式(如下所述),调用
    mcp__datadog-mcp__create_datadog_notebook
    • name
      Eval Bootstrap: {ml_app} — YYYY-MM-DD
    • type
      report
    • cells
      :包含评估器套件摘要的单个markdown单元格
    • time
      { "live_span": "1h" }
Notebook创建回退模式(适用于每个
create_datadog_notebook
/
edit_datadog_notebook
调用):
  1. 首先尝试MCP工具。
  2. 如果MCP调用失败,检查错误:
    • 认证/权限错误(401、403) → 停止并告知用户。
    • 字段验证错误(错误指出特定字段)→ 修复该字段并重试MCP调用一次。
    • 任何其他错误(绑定、序列化、意外响应)→ 回退到pup:
      • 将负载写入
        /tmp/nb_bootstrap_{ml_app}.json
        ,作为完整API包:
        {"data": {"attributes": {"name": "...", "time": {...}, "cells": [...]}, "type": "notebooks"}}
      • 运行
        pup notebooks create --file /tmp/nb_bootstrap_{ml_app}.json
      • 如果pup也不可用,在聊天中渲染Notebook内容为markdown。
  3. 通过任一方法成功创建后,输出URL:
    评估器套件已导出到Notebook:<url>
Notebook单元格内容 — markdown单元格应包含:
markdown
undefined

Eval Bootstrap: {ml_app}

Eval Bootstrap: {ml_app}

Generated: YYYY-MM-DD | App profile: {LLM | RAG | Agent | Multi-agent} | Entry mode: {cold_start | from_rca} Generated code:
{output_path}
{One sentence: what does this app do?}
Coverage: {N} new evaluators ({comma-separated dimension names}) | {N} existing (unchanged: {names}) | {gaps if any: dimensions identified but not covered, and why}
生成时间:YYYY-MM-DD | 应用概况:{LLM | RAG | Agent | Multi-agent} | 进入模式:{cold_start | from_rca} 生成代码
{output_path}
{一句话:此应用的功能是什么?}
覆盖范围:{N}个新评估器({逗号分隔的维度名称}) | {N}个现有评估器(未更改:{名称}) | {如果有差距:已识别但未覆盖的维度及原因}

Evaluator Suite

评估器套件

#NameTypeMeasuresPass Criteria
1............
#名称类型测量内容通过标准
1............

Evidence

证据

{For each evaluator: name — 1-line description — [Trace link]}
{针对每个评估器:名称 — 一行描述 — [Trace链接]}

Next Steps

后续步骤

  1. Review generated prompts in
    {output_path}
  2. Run against a labeled dataset to validate scores
  3. Deploy to Datadog LLM Experiments

---
  1. 审查
    {output_path}
    中生成的提示
  2. 针对标记数据集运行以验证分数
  3. 部署到Datadog LLM Experiments

---

Phase 3B: Generate & Write Eval Spec JSON

阶段3B:生成并写入评估规范JSON

Goal: Serialize the confirmed evaluator suite and representative trace samples to a single self-contained JSON file — zero SDK dependencies.
Output path:
./evals/{ml_app}_eval_spec.json
目标:将确认的评估器套件和代表性Trace样本序列化为单个独立的JSON文件——无SDK依赖。
输出路径
./evals/{ml_app}_eval_spec.json

JSON Schema

JSON Schema

json
{
  "schema_version": "1",
  "generated_at": "<ISO 8601 UTC>",
  "generated_by": "eval-bootstrap",
  "app": {
    "ml_app": "<string>",
    "app_type": "LLM | RAG | Agent | Multi-agent",
    "trace_window": "<timeframe param, e.g. now-7d>",
    "trace_count": "<integer>"
  },
  "evaluators": [
    {
      "name": "snake_case_name",
      "category": "outcome | format | safety",
      "type": "llm_judge | code_check",
      "description": "<1-2 sentence plain-language description>",
      "target_span": "<which span: root, llm sub-span, etc.>",
      "scoring": {
        "scale": "boolean | score_1_10 | categorical",
        "categories": ["<only present when scale=categorical>"],
        "pass_criteria": "<human-readable: true, >= 7, in [correct], etc.>"
      },
      "rubric": "<full prompt text for llm_judge; null for code_check>",
      "implementation_hints": {
        "type_if_code_check": "json_valid | regex | contains | length_words | null",
        "pattern_if_code_check": "<pattern string or null>",
        "notes": "<optional framework-agnostic implementation guidance>"
      },
      "evidence": [
        {
          "trace_id": "<32-char hex>",
          "span_id": "<16-char hex>",
          "url": "https://app.datadoghq.com/llm/traces?query=trace_id:<trace_id>",
          "observation": "<why this trace illustrates the evaluator>"
        }
      ]
    }
  ],
  "sample_records": [
    {
      "trace_id": "<string>",
      "span_id": "<string>",
      "input": {},
      "output": "<string>",
      "suggested_labels": {
        "<evaluator_name>": "pass | fail | <score>"
      }
    }
  ]
}
json
{
  "schema_version": "1",
  "generated_at": "<ISO 8601 UTC>",
  "generated_by": "eval-bootstrap",
  "app": {
    "ml_app": "<string>",
    "app_type": "LLM | RAG | Agent | Multi-agent",
    "trace_window": "<timeframe参数,例如now-7d>",
    "trace_count": "<integer>"
  },
  "evaluators": [
    {
      "name": "snake_case_name",
      "category": "outcome | format | safety",
      "type": "llm_judge | code_check",
      "description": "<1-2句通俗易懂的描述>",
      "target_span": "<哪个Span:根Span、llm子Span等>",
      "scoring": {
        "scale": "boolean | score_1_10 | categorical",
        "categories": ["<仅当scale=categorical时存在>"],
        "pass_criteria": "<人类可读:true, >= 7, in [correct]等>"
      },
      "rubric": "<llm_judge的完整提示文本;code_check为null>",
      "implementation_hints": {
        "type_if_code_check": "json_valid | regex | contains | length_words | null",
        "pattern_if_code_check": "<模式字符串或null>",
        "notes": "<可选的框架无关实现指导>"
      },
      "evidence": [
        {
          "trace_id": "<32字符十六进制>",
          "span_id": "<16字符十六进制>",
          "url": "https://app.datadoghq.com/llm/traces?query=trace_id:<trace_id>",
          "observation": "<此Trace如何说明评估器>"
        }
      ]
    }
  ],
  "sample_records": [
    {
      "trace_id": "<string>",
      "span_id": "<string>",
      "input": {},
      "output": "<string>",
      "suggested_labels": {
        "<evaluator_name>": "pass | fail | <score>"
      }
    }
  ]
}

Field Notes

字段说明

  • evaluators[].type
    :
    "llm_judge"
    for semantic evaluators;
    "code_check"
    for deterministic checks (regex, length, JSON validity, etc.).
  • evaluators[].rubric
    : For
    llm_judge
    — full prompt text grounded in observed trace patterns. Use
    {{input}}
    and
    {{output}}
    as generic placeholders (not
    {{input_data}}
    — that's ddeval-specific). For
    code_check
    — null.
  • evaluators[].implementation_hints.notes
    : Optional framework-agnostic guidance, e.g. "For OpenAI Evals, use
    rubric
    as a model-graded criterion. For Braintrust, use as an LLM scorer. For Promptfoo, use as an
    llm-rubric
    assertion."
  • sample_records
    : 10–20 representative traces from Phase 1.
    suggested_labels
    are Claude's best-read from trace inspection — not ground truth. The field name communicates this explicitly.
  • PII rule: Strip emails, names, and sensitive data from all
    input
    ,
    output
    , and
    evidence[].observation
    fields before writing (same as Phase 3A).
  • evaluators[].type
    "llm_judge"
    用于语义评估器;
    "code_check"
    用于确定性检查(正则、长度、JSON有效性等)。
  • evaluators[].rubric
    :对于
    llm_judge
    ——基于观察到的Trace模式的完整提示文本。使用
    {{input}}
    {{output}}
    作为通用占位符(而非
    {{input_data}}
    ——这是ddeval特定的)。对于
    code_check
    ——null。
  • evaluators[].implementation_hints.notes
    :可选的框架无关指导,例如“对于OpenAI Evals,使用
    rubric
    作为模型评分标准。对于Braintrust,将其用作LLM评分器。对于Promptfoo,将其用作
    llm-rubric
    断言。”
  • sample_records
    :来自阶段1的10-20个代表性Trace。
    suggested_labels
    是Claude通过Trace检查得出的最佳猜测——并非基准真值。字段名称明确传达了这一点。
  • PII规则:在写入前,从所有
    input
    output
    evidence[].observation
    字段中剥离电子邮件、姓名和敏感数据(与阶段3A相同)。

Writing Instructions

写入说明

  1. Assemble the JSON object in memory following the schema above.
  2. Populate
    sample_records
    from traces already fetched in Phase 1. Fetch additional traces (up to 20 total) if fewer than 10 were read.
  3. Anonymize PII in all
    input
    ,
    output
    , and
    evidence[].observation
    fields.
  4. Write the file with 2-space indentation using the Write tool.
  5. Display a completion summary:
undefined
  1. 在内存中按照上述schema组装JSON对象。
  2. 从阶段1已获取的Trace中填充
    sample_records
    。如果读取的Trace少于10个,获取额外的Trace(最多20个)。
  3. 匿名化所有
    input
    output
    evidence[].observation
    字段中的PII。
  4. 使用Write工具写入文件,使用2空格缩进。
  5. 显示完成摘要:
undefined

Generated Eval Spec

生成的评估规范

Wrote
./evals/{ml_app}_eval_spec.json
:
  • {N} evaluators ({outcome_count} outcome, {format_count} format, {safety_count} safety)
  • {M} sample records with suggested labels
#NameCategoryTypePass Criteria
1............
已写入
./evals/{ml_app}_eval_spec.json
  • {N}个评估器({outcome_count}个结果类,{format_count}个格式类,{safety_count}个安全类)
  • {M}个样本记录,包含建议标签
#名称类别类型通过标准
1............

Next Steps

后续步骤

  1. Review: Open
    ./evals/{ml_app}_eval_spec.json
    and verify the rubrics match your expectations
  2. Implement: Use the
    rubric
    field to configure evaluators in your framework of choice:
    • OpenAI Evals: use
      rubric
      as a model-graded criterion
    • Braintrust: create an LLM scorer with the rubric text
    • Promptfoo: use as an
      llm-rubric
      assertion
    • Custom code: call your LLM API with the rubric and parse the structured output
  3. Label:
    suggested_labels
    are Claude's best guesses from trace inspection — verify against ground truth before using as training data
undefined
  1. 审查:打开
    ./evals/{ml_app}_eval_spec.json
    并验证评估准则是否符合你的预期
  2. 实现:使用
    rubric
    字段在你选择的框架中配置评估器:
    • OpenAI Evals:将
      rubric
      用作模型评分标准
    • Braintrust:使用评估准则文本创建LLM评分器
    • Promptfoo:将其用作
      llm-rubric
      断言
    • 自定义代码:使用评估准则调用LLM API并解析结构化输出
  3. 标记
    suggested_labels
    是Claude通过Trace检查得出的最佳猜测——在用作训练数据前,针对基准真值进行验证
undefined

Notebook export (after summary)

Notebook导出(摘要后)

Same logic as Phase 3A — offer to append to the RCA notebook if
rca_notebook_url
was detected, or create a new standalone notebook. Use the same notebook cell format as Phase 3A, substituting
output_path
with the JSON spec file path. In pup mode, use
pup notebooks create
/
pup notebooks edit
as described in Phase 3A.

与阶段3A逻辑相同——如果检测到
rca_notebook_url
,提供附加到RCA Notebook的选项,否则创建新的独立Notebook。使用与阶段3A相同的Notebook单元格格式,将
output_path
替换为JSON规范文件路径。在pup模式下,按照阶段3A中的描述使用
pup notebooks create
/
pup notebooks edit

Phase 3C: Publish Online Evaluators to Datadog

阶段3C:将在线评估器发布到Datadog

Goal: For each confirmed evaluator, write an LLM-judge configuration to Datadog via
create_or_update_llmobs_evaluator
so it runs automatically on matching production spans.
目标:针对每个确认的评估器,通过
create_or_update_llmobs_evaluator
将LLM-judge配置写入Datadog,使其自动在匹配的生产Span上运行。

Pre-publish checks (single message — parallelize)

发布前检查(单个消息——并行化)

For every proposed
eval_name
, call
get_llmobs_evaluator(eval_name=...)
:
  • Not found → safe to create.
  • Found → existing evaluator with the same name. Surface a diff to the user (existing dimension/prompt vs. proposed) and ask:
    Evaluator
    {name}
    already exists. Overwrite, rename, or skip?
    If overwrite: keep the fetched config as the base and merge your generated fields on top, then send the complete object back. The MCP tool is full-replace — any field you omit (e.g.
    temperature
    ,
    max_tokens
    ,
    filter
    ,
    sampling_percentage
    ) reverts to its default. Never re-publish without round-tripping the existing config.
    If rename: append a suffix (e.g.
    _v2
    ) and treat as new.
    If skip: drop from the publish set.
针对每个拟议的
eval_name
,调用
get_llmobs_evaluator(eval_name=...)
  • 未找到 → 可以安全创建。
  • 找到 → 存在同名的现有评估器。向用户显示差异(现有维度/提示与拟议内容)并询问:
    评估器
    {name}
    已存在。是否覆盖、重命名或跳过?
    如果选择覆盖:将获取的配置作为基础,合并生成的字段,然后将完整对象返回。MCP工具是全替换模式——任何省略的字段(例如
    temperature
    max_tokens
    filter
    sampling_percentage
    )都会重置为默认值。切勿不往返现有配置就重新发布。
    如果选择重命名:添加后缀(例如
    _v2
    )并视为新评估器。
    如果选择跳过:从发布集中移除。

Publishing Conventions

发布约定

Required parameters for each
create_or_update_llmobs_evaluator
call:
eval_name
,
application_name
(=
ml_app
),
enabled
,
integration_provider
,
model_name
,
prompt_template
,
parsing_type
,
output_schema
, plus a
telemetry.intent
string.
Defaults to use unless the user overrides:
FieldDefault
enabled
false
(always — see "Always publish as draft")
integration_provider
openai
model_name
gpt-5.4-mini
temperature
0
parsing_type
structured_output
sampling_percentage
10
for span scope,
5
for trace scope
eval_scope
span
(auto-promoted to
trace
per the classification rule in Phase 2)
Prompt template: convert the LLMJudge prompt into the MCP shape — an ordered array of
{role, content}
messages. The system prompt becomes
{role: "system"}
, the user prompt becomes
{role: "user"}
. Use span-data placeholders (see below) — not the offline
{{input_data}}
/
{{output_data}}
form, which only exists in
EvaluatorContext
.
每个
create_or_update_llmobs_evaluator
调用的必填参数
eval_name
application_name
(=
ml_app
)、
enabled
integration_provider
model_name
prompt_template
parsing_type
output_schema
,以及
telemetry.intent
字符串。
默认值,除非用户覆盖:
字段默认值
enabled
false
(始终如此——请参阅“始终以草稿形式发布”)
integration_provider
openai
model_name
gpt-5.4-mini
temperature
0
parsing_type
structured_output
sampling_percentage
Span范围为
10
,Trace范围为
5
eval_scope
span
(根据阶段2的分类规则自动升级为
trace
提示模板:将LLMJudge提示转换为MCP格式——有序的
{role, content}
消息数组。系统提示变为
{role: "system"}
,用户提示变为
{role: "user"}
。使用Span数据占位符(如下所述)——不要使用离线的
{{input_data}}
/
{{output_data}}
形式,这仅存在于
EvaluatorContext
中。
Online Template Variables
在线模板变量
Online evaluator prompts run through the dd-source
template
library (
domains/ml-observability/shared/libs/template
). Missing paths → empty string. The data shape templates resolve against depends on
eval_scope
:
  • eval_scope: span
    (default) — placeholders resolve against a single span's JSON (the
    llmobs.Span
    JSON-marshaled to a map). Use the span aliases / dot-paths below directly.
  • eval_scope: trace
    — placeholders resolve against the trace payload
    { spans: [...] }
    . Use
    {{spans[N]...}}
    ,
    {{spans[*]...}}
    , or
    {{spans[field.path:value]...}}
    to select span(s) before applying field paths. The
    {{span_input}}
    /
    {{span_output}}
    aliases are not available in trace scope — reference span data through the
    spans
    array instead.
  • eval_scope: session
    — not supported by this skill; classify as
    span
    and surface the limitation to the user.
在线评估器提示通过dd-source
template
库(
domains/ml-observability/shared/libs/template
)解析。路径不存在→空字符串。模板解析的数据形状取决于
eval_scope
  • eval_scope: span
    (默认)——占位符针对单个Span的JSON
    llmobs.Span
    序列化为映射)解析。直接使用下面的Span别名/点路径。
  • eval_scope: trace
    ——占位符针对Trace负载
    { spans: [...] }
    解析。使用
    {{spans[N]...}}
    {{spans[*]...}}
    {{spans[field.path:value]...}}
    选择Span,然后应用字段路径。
    {{span_input}}
    /
    {{span_output}}
    别名在Trace范围中不可用——通过
    spans
    数组引用Span数据。
  • eval_scope: session
    ——本技能不支持;分类为
    span
    并向用户说明限制。
Span-scope (
eval_scope: span
)
Span范围(
eval_scope: span
Built-in span-kind-aware aliases (preferred when the evaluator is generic across span kinds):
AliasLLM span (
meta.span.kind = "llm"
)
Other spans (agent, workflow, task, …)
{{span_input}}
meta.input.messages[*].content
meta.input.value
{{span_output}}
meta.output.messages[*].content
meta.output.value
Common explicit dot-paths (use when the evaluator is purpose-built for one span kind):
PathWhat you get
{{meta.input.value}}
/
{{meta.output.value}}
Plain string I/O on agent / workflow / task / tool spans
{{meta.input.messages[*].content}}
All input message contents on an LLM span (newline-joined)
{{meta.input.messages[0].content}}
First message (typically system prompt)
{{meta.output.messages[*].content}}
Assistant response(s)
{{meta.input.documents}}
Retrieved docs (RAG) — JSON-serialized
{{meta.metadata.<key>}}
Custom metadata fields
{{meta.tool_definitions}}
Available tools — JSON array
{{*}}
Entire span as compact JSON (debug / fall-back catch-all)
内置的Span类型感知别名(当评估器跨Span类型通用时优先使用):
别名LLM Span(
meta.span.kind = "llm"
其他Span(agent、workflow、task等)
{{span_input}}
meta.input.messages[*].content
meta.input.value
{{span_output}}
meta.output.messages[*].content
meta.output.value
常见显式点路径(当评估器专为一种Span类型设计时使用):
路径获取内容
{{meta.input.value}}
/
{{meta.output.value}}
agent/workflow/task/tool Span的纯字符串输入/输出
{{meta.input.messages[*].content}}
LLM Span的所有输入消息内容(换行连接)
{{meta.input.messages[0].content}}
第一条消息(通常为系统提示)
{{meta.output.messages[*].content}}
助手响应
{{meta.input.documents}}
检索到的文档(RAG)——JSON序列化
{{meta.metadata.<key>}}
自定义元数据字段
{{meta.tool_definitions}}
可用工具——JSON数组
{{*}}
整个Span的紧凑JSON(调试/回退兜底)
Trace-scope (
eval_scope: trace
)
Trace范围(
eval_scope: trace
PatternWhat you get
{{spans}}
JSON of every span in the trace
{{spans[N].meta.input.value}}
Single span by index —
spans[0]
is the trace root
{{spans[*].name}}
All span names in order, newline-joined
{{spans[*].meta.output.value}}
All spans' outputs, newline-joined (handy for "final answer = last output")
{{spans[name:my-span].meta.input.value}}
Filter by span name
{{spans[meta.span.kind:llm].meta.output.value}}
All LLM-kind span outputs
{{spans[meta.span.kind:tool]}}
Whole tool spans as JSON, paired in/out — useful for tool-use correctness
{{spans[meta.span.kind:retrieval].meta.output.documents[*].text}}
Text of every retrieved document — useful for RAG faithfulness
{{*}}
Entire trace payload as JSON (debug fallback)
模式获取内容
{{spans}}
Trace中所有Span的JSON
{{spans[N].meta.input.value}}
通过索引选择单个Span——
spans[0]
是Trace根Span
{{spans[*].name}}
所有Span名称按顺序排列,换行连接
{{spans[*].meta.output.value}}
所有Span的输出,换行连接(适用于“最终答案=最后一个输出”)
{{spans[name:my-span].meta.input.value}}
按Span名称过滤
{{spans[meta.span.kind:llm].meta.output.value}}
所有LLM类型Span的输出
{{spans[meta.span.kind:tool]}}
完整的工具Span JSON,包含输入/输出——适用于工具使用正确性
{{spans[meta.span.kind:retrieval].meta.output.documents[*].text}}
所有检索到的文档文本——适用于RAG忠实度
{{*}}
整个Trace负载的JSON(调试回退)
Array selector syntax (applies to both scopes)
数组选择器语法(适用于两种范围)
  • [N]
    — index (0-based)
  • [START,END]
    — inclusive range,
    END
    is clamped to slice length
  • [*]
    — wildcard (fan-out over all elements)
  • [field.path:value]
    — filter array elements by a nested field equality, e.g.
    messages[role:user]
    or
    spans[meta.span.kind:tool]
Resolution rules to keep in mind when writing prompts:
  • Arrays of strings → newline-joined
  • Arrays of objects / mixed values → compact JSON
  • Single empty slice → empty string
  • Implicit fan-out:
    messages.content
    behaves the same as
    messages[*].content
  • Negative indices are not supported (parse error) — use
    [N]
    with a known index, or
    [*]
    for "last assistant turn" semantics
When to pick which form:
  • Generic span evaluator (e.g.
    tone_check
    ,
    output_format
    ) → use
    {{span_input}}
    /
    {{span_output}}
    so it works across span kinds.
  • LLM-span-specific evaluator (e.g.
    system_prompt_adherence
    ) → reach for explicit
    meta.input.messages[*].content
    /
    meta.output.messages[*].content
    so you can split system vs. user vs. assistant turns.
  • Span-scope RAG evaluator (single retrieval+generation span) → combine
    {{meta.input.documents}}
    with
    {{span_output}}
    .
  • Trace-scope evaluator → see "Trace-scope evaluator examples" below for the four canonical patterns (goal completion, tool-use correctness, RAG faithfulness, conversation quality).
  • Metadata-aware evaluator → reference
    {{meta.metadata.<key>}}
    directly.
If the user has existing custom evaluators in the same ml_app (Phase 0 coverage map), match their convention when there is no strong reason to deviate.
  • [N]
    — 索引(从0开始)
  • [START,END]
    — 包含范围,
    END
    被钳制为切片长度
  • [*]
    — 通配符(遍历所有元素)
  • [field.path:value]
    — 按嵌套字段相等性过滤数组元素,例如
    messages[role:user]
    spans[meta.span.kind:tool]
编写提示时需记住的解析规则
  • 字符串数组→换行连接
  • 对象数组/混合值→紧凑JSON
  • 单个空切片→空字符串
  • 隐式遍历:
    messages.content
    messages[*].content
    行为相同
  • 不支持负索引(解析错误)——使用
    [N]
    指定已知索引,或使用
    [*]
    实现“最后一个助手轮次”语义
何时选择哪种形式
  • 通用Span评估器(例如
    tone_check
    output_format
    )→ 使用
    {{span_input}}
    /
    {{span_output}}
    ,使其跨Span类型工作。
  • LLM Span特定评估器(例如
    system_prompt_adherence
    )→ 使用显式的
    meta.input.messages[*].content
    /
    meta.output.messages[*].content
    ,以便区分系统/用户/助手轮次。
  • Span范围RAG评估器(单个检索+生成Span)→ 组合
    {{meta.input.documents}}
    {{span_output}}
  • Trace范围评估器→ 请参阅下面的“Trace范围评估器示例”,了解四个规范模式(目标完成、工具使用正确性、RAG忠实度、对话质量)。
  • 元数据感知评估器→ 直接引用
    {{meta.metadata.<key>}}
如果同一ml_app中存在现有自定义评估器(阶段0覆盖范围映射),在没有充分理由偏离时匹配其约定。
Trace-scope evaluator examples
Trace范围评估器示例
Concrete user-prompt bodies for the four canonical trace-scope use cases, drawn from the public docs (Trace-Level Evaluations). Each goes alongside a static System prompt that describes the rubric (no placeholders).
Use case
filter
User prompt body
Goal completion — agent finished the user's request
@parent_id:undefined @meta.span.kind:agent
User goal:\n{{spans[0].meta.input.value}}\n\nAgent steps:\n{{spans}}
Tool-use correctness — right tool with right arguments
@parent_id:undefined @meta.span.kind:agent
User question:\n{{spans[0].meta.input.value}}\n\nTool calls:\n{{spans[meta.span.kind:tool].meta.input.parameters}}\n\nFinal response:\n{{spans[*].meta.output.value}}
RAG faithfulness — answer grounded in retrieved docs
@parent_id:undefined
Retrieved context:\n{{spans[meta.span.kind:retrieval].meta.output.documents[*].text}}\n\nFinal answer:\n{{spans[meta.span.kind:llm].meta.output.value}}
Conversation quality — coherence and consistency across turns
@parent_id:undefined
Conversation:\n{{spans[meta.span.kind:llm].meta.input.messages[*].content}}\n\nAssistant responses:\n{{spans[meta.span.kind:llm].meta.output.messages[*].content}}
Use these as starting points. Adapt the
filter
and span paths to the actual span names / kinds the app emits (observed during Phase 1).
output_schema
wrapper format (required for all providers)
The
output_schema
field is NOT a bare JSON Schema. It must use the OpenAI
json_schema
object shape.
name
is a fixed type discriminator
, not the evaluator name — the UI validates it against a strict allowlist and rejects any other value:
LLMJudge type
name
value
property key inside
schema
Boolean
"boolean_eval"
boolean_eval
Score
"score_eval"
score_eval
Categorical
"categorical_eval"
categorical_eval
The property key inside
schema.properties
must match
name
exactly. The
required
array may only be
["<type_key>"]
or
["<type_key>", "reasoning"]
— any other value is rejected. Always include
"reasoning": {"type": "string"}
for UI display.
Boolean (
BooleanStructuredOutput(pass_when=True)
):
json
{
  "output_schema": {
    "name": "boolean_eval",
    "strict": true,
    "schema": {
      "type": "object",
      "properties": {
        "boolean_eval": {"type": "boolean", "description": "Whether the criterion is met"},
        "reasoning": {"type": "string", "description": "Explanation for the evaluation"}
      },
      "required": ["boolean_eval", "reasoning"],
      "additionalProperties": false
    }
  },
  "assessment_criteria": {"pass_when": true}
}
Score (
ScoreStructuredOutput(min_score=1, max_score=10, min_threshold=7)
):
json
{
  "output_schema": {
    "name": "score_eval",
    "strict": true,
    "schema": {
      "type": "object",
      "properties": {
        "score_eval": {"type": "number", "description": "Score from 1 to 10", "minimum": 1, "maximum": 10},
        "reasoning": {"type": "string", "description": "Explanation for the score"}
      },
      "required": ["score_eval", "reasoning"],
      "additionalProperties": false
    }
  },
  "assessment_criteria": {"min_threshold": 7}
}
Add
max_threshold
to
assessment_criteria
if set.
Categorical (
CategoricalStructuredOutput(categories={...}, pass_values=[...])
):
json
{
  "output_schema": {
    "name": "categorical_eval",
    "strict": true,
    "schema": {
      "type": "object",
      "properties": {
        "categorical_eval": {
          "type": "string",
          "anyOf": [
            {"const": "correct", "description": "The response correctly answers the question"},
            {"const": "partially_correct", "description": "Partially correct but missing information"},
            {"const": "incorrect", "description": "The response is wrong or irrelevant"}
          ]
        },
        "reasoning": {"type": "string", "description": "Explanation for the category chosen"}
      },
      "required": ["categorical_eval", "reasoning"],
      "additionalProperties": false
    }
  },
  "assessment_criteria": {"pass_values": ["correct"]}
}
Note: categorical uses
"type": "string"
alongside
anyOf
(each
const
is a string value), unlike the offline SDK which uses bare
anyOf
at the property root.
Custom / multi-dimensional: not directly supported via the fixed-name schema. Implement as a score or categorical evaluator where possible, or split into multiple evaluators. The
name
must be one of the three fixed values above.
Filter scoping: when the proposal targets a specific span kind (e.g. an LLM sub-span), translate it into an EVP
filter
query — e.g.
@meta.span.kind:llm
,
service:checkout-agent
, or a more specific tag. Combine with
root_spans_only:true
only when the target is the trace root.
For
eval_scope: trace
:
  • The evaluator triggers once per completed trace, after a 3-minute inactivity window. Late-arriving spans (>3 min after the prior span on the same trace) are excluded from the evaluation. Surface this in the proposal so the user knows about both the latency and the potential miss for sparse-activity agents (long-running agents whose steps are sparser than 3 minutes apart).
  • The
    filter
    query must match the trace's root span only — always include
    @parent_id:undefined
    (or
    root_spans_only: true
    ) to avoid double-firing across descendants. Combine with
    @meta.span.kind:agent
    (or whatever kind the app uses for root spans, observed in Phase 1) for narrowing.
  • Sampling at trace scope is heavier than at span scope (one trace = many spans on the judge's side). Default
    sampling_percentage
    to
    5
    for trace-scope evaluators (instead of the span default
    10
    ); the user can raise it after a manual review pass.
四个规范Trace范围用例的具体用户提示主体,来自公开文档(Trace-Level Evaluations)。每个提示主体都配有描述评估准则的静态系统提示(无占位符)。
用例
filter
用户提示主体
目标完成 — 代理是否完成了用户请求
@parent_id:undefined @meta.span.kind:agent
用户目标:\n{{spans[0].meta.input.value}}\n\n代理步骤:\n{{spans}}
工具使用正确性 — 是否使用了正确的工具和参数
@parent_id:undefined @meta.span.kind:agent
用户问题:\n{{spans[0].meta.input.value}}\n\n工具调用:\n{{spans[meta.span.kind:tool].meta.input.parameters}}\n\n最终响应:\n{{spans[*].meta.output.value}}
RAG忠实度 — 回答是否基于检索到的文档
@parent_id:undefined
检索到的上下文:\n{{spans[meta.span.kind:retrieval].meta.output.documents[*].text}}\n\n最终答案:\n{{spans[meta.span.kind:llm].meta.output.value}}
对话质量 — 多轮对话的连贯性和一致性
@parent_id:undefined
对话:\n{{spans[meta.span.kind:llm].meta.input.messages[*].content}}\n\n助手响应:\n{{spans[meta.span.kind:llm].meta.output.messages[*].content}}
将这些作为起点。根据阶段1中观察到的应用实际Span名称/类型,调整
filter
和Span路径。
output_schema
包装格式(所有提供商必填)
output_schema
字段不是裸JSON Schema。必须使用OpenAI
json_schema
对象格式。
name
是固定类型判别符
,而非评估器名称——UI会针对严格允许列表进行验证,拒绝任何其他值:
LLMJudge类型
name
schema
内的属性键
布尔值
"boolean_eval"
boolean_eval
分数
"score_eval"
score_eval
分类
"categorical_eval"
categorical_eval
schema.properties
内的属性键必须与
name
完全匹配。
required
数组只能是
["<type_key>"]
["<type_key>", "reasoning"]
——任何其他值都会被拒绝。始终包含
"reasoning": {"type": "string"}
用于UI显示。
布尔值
BooleanStructuredOutput(pass_when=True)
):
json
{
  "output_schema": {
    "name": "boolean_eval",
    "strict": true,
    "schema": {
      "type": "object",
      "properties": {
        "boolean_eval": {"type": "boolean", "description": "是否符合标准"},
        "reasoning": {"type": "string", "description": "评估解释"}
      },
      "required": ["boolean_eval", "reasoning"],
      "additionalProperties": false
    }
  },
  "assessment_criteria": {"pass_when": true}
}
分数
ScoreStructuredOutput(min_score=1, max_score=10, min_threshold=7)
):
json
{
  "output_schema": {
    "name": "score_eval",
    "strict": true,
    "schema": {
      "type": "object",
      "properties": {
        "score_eval": {"type": "number", "description": "1到10分", "minimum": 1, "maximum": 10},
        "reasoning": {"type": "string", "description": "评分解释"}
      },
      "required": ["score_eval", "reasoning"],
      "additionalProperties": false
    }
  },
  "assessment_criteria": {"min_threshold": 7}
}
如果设置了
max_threshold
,将其添加到
assessment_criteria
中。
分类
CategoricalStructuredOutput(categories={...}, pass_values=[...])
):
json
{
  "output_schema": {
    "name": "categorical_eval",
    "strict": true,
    "schema": {
      "type": "object",
      "properties": {
        "categorical_eval": {
          "type": "string",
          "anyOf": [
            {"const": "correct", "description": "响应正确回答了问题"},
            {"const": "partially_correct", "description": "部分正确但缺少信息"},
            {"const": "incorrect", "description": "响应错误或无关"}
          ]
        },
        "reasoning": {"type": "string", "description": "类别选择解释"}
      },
      "required": ["categorical_eval", "reasoning"],
      "additionalProperties": false
    }
  },
  "assessment_criteria": {"pass_values": ["correct"]}
}
注意:分类使用
"type": "string"
anyOf
(每个
const
是字符串值),与离线SDK不同,离线SDK在属性根级别使用裸
anyOf
自定义/多维:无法通过固定名称schema直接支持。尽可能实现为分数或分类评估器,或拆分为多个评估器。
name
必须是上述三个固定值之一。
过滤范围:当提案针对特定Span类型(例如LLM子Span)时,将其转换为EVP
filter
查询——例如
@meta.span.kind:llm
service:checkout-agent
或更具体的标签。仅当目标是Trace根Span时,才结合
root_spans_only:true
对于
eval_scope: trace
  • 评估器在每个完成的Trace上触发一次,等待3分钟无活动窗口。同一Trace中前一个Span后超过3分钟到达的Span会被排除在评估之外。在提案中说明这一点,以便用户了解延迟和稀疏活动代理的潜在遗漏(步骤间隔超过3分钟的长运行代理)。
  • filter
    查询必须仅匹配Trace的根Span——始终包含
    @parent_id:undefined
    (或
    root_spans_only: true
    ),避免在后代Span上重复触发。结合
    @meta.span.kind:agent
    (或阶段1中观察到的应用根Span类型)进行缩小范围。
  • Trace范围的采样比Span范围更重(一个Trace=Judge侧的多个Span)。Trace范围评估器的默认
    sampling_percentage
    为**
    5
    **(而非Span范围的默认
    10
    );用户在手动审查后可以提高该值。

Always publish as draft (
enabled: false
)

始终以草稿形式发布(
enabled: false

Always create / update evaluators with
enabled: false
— regardless of whether
integration_account_id
was auto-detected from existing evaluators. The UI is the source of truth for activation; the skill should never auto-enable evaluators on the user's behalf. The user reviews each draft in the UI, confirms the integration account is correct (the auto-detected ID may belong to a different judge LLM than the one they want for this app), and flips the toggle when they're satisfied.
This makes the workflow safe by default: a wrong
integration_account_id
, a mistuned prompt, or an over-broad filter never goes live without a human pass. Auto-detection of the account ID still helps because the draft renders with the right account pre-selected — review is faster.
始终以
enabled: false
创建/更新评估器
——无论是否从现有评估器自动检测到
integration_account_id
。UI是激活的权威来源;技能绝不能代表用户自动启用评估器。用户在UI中审查每个草稿,确认集成账户正确(自动检测的ID可能属于与用户为此应用所需不同的Judge LLM),并在满意后切换开关。
这使工作流默认安全:错误的
integration_account_id
、调整不当的提示或过于宽泛的过滤器在没有人工检查的情况下永远不会生效。账户ID的自动检测仍然有用,因为草稿会预先选择正确的账户——审查更快。

integration_account_id resolution

integration_account_id解析

The
integration_account_id
is an opaque UUID that the UI matches against the org's integration accounts list to populate the account section dropdown. Users typically don't know this value, so never ask the user to supply a raw UUID.
Resolution order:
  1. Inherit from existing evaluators — in Phase 0 you called
    get_llmobs_evaluator
    for each existing custom evaluator. Check the
    llm_provider.integration_account_id
    field on those responses. If any of them have a value, use that same ID on the published drafts. If multiple different IDs appear across existing evaluators, pick the most common one and note which you chose so the user can correct it during the UI review pass.
  2. Omit if no existing evaluator has one — if no custom evaluator in the ml_app has an
    integration_account_id
    , omit the field from the publish payload. The draft will render without an account pre-selected; the user picks one during the UI review pass before activating.
Either way, the evaluator is published with
enabled: false
. The user is the gate — see "Always publish as draft" above.
integration_account_id
是不透明的UUID,UI会将其与组织的集成账户列表匹配,以填充账户部分的下拉菜单。用户通常不知道此值,因此切勿要求用户提供原始UUID
解析顺序
  1. 从现有评估器继承——在阶段0中,你调用了
    get_llmobs_evaluator
    获取每个现有自定义评估器。检查这些响应中的
    llm_provider.integration_account_id
    字段。如果其中任何一个有值,在发布的草稿中使用相同的ID。如果现有评估器中出现多个不同的ID,选择最常见的一个并说明你选择了哪个,以便用户在UI审查过程中更正。
  2. 如果没有现有评估器包含该ID则省略——如果ml_app中的自定义评估器都没有
    integration_account_id
    ,从发布负载中省略该字段。草稿将在未预先选择账户的情况下呈现;用户在激活前的UI审查过程中选择一个账户。
无论哪种情况,评估器都以
enabled: false
发布。用户是把关人——请参阅上面的“始终以草稿形式发布”。

Publish (single message — parallelize)

发布(单个消息——并行化)

Issue all
create_or_update_llmobs_evaluator
calls in a single message (one per evaluator). Set
telemetry.intent
to a short English description like
"skill:llm-obs-eval-bootstrap — Bootstrap evaluator suite for ml_app=<ml_app> from production trace analysis."
.
If any call fails, capture the error and continue with the remaining evaluators — never silently abort the batch. Report failures explicitly in the summary.
单个消息中发起所有
create_or_update_llmobs_evaluator
调用(每个评估器对应一次调用)。将
telemetry.intent
设置为简短的英文描述,例如
"skill:llm-obs-eval-bootstrap — Bootstrap evaluator suite for ml_app=<ml_app> from production trace analysis."
如果任何调用失败,捕获错误并继续处理剩余评估器——切勿静默中止批量操作。在摘要中明确报告失败情况。

Summary

摘要

undefined
undefined

Published Evaluators (drafts — pending UI review)

已发布评估器(草稿——待UI审查)

Wrote {N} online evaluators to ml_app
{ml_app}
. All published as drafts (
enabled: false
)
— review and activate them in the UI before they start scoring spans.
#NameActionProvider/ModelSamplingScopeAccount auto-detectedStatus
1task_completioncreated (draft)openai/gpt-5.4-mini10%spanyesok
2response_groundednessoverwrote (draft)openai/gpt-5.4-mini10%spanyesok
3scope_adherencerenamed (
_v2
) (draft)
openai/gpt-5.4-mini10%spanno — pick in UIok
4citation_formatfailedopenai/gpt-5.4-mini10%spanerror
{If any failed:} Errors:
  • {name}
    : {error message}
{If any code-based proposals were dropped:} Not published (code-based, not supported by online evaluator API):
  • {name}
    ({type}) — consider running offline via
    /eval-bootstrap {ml_app}
    (SDK mode).
已将{N}个在线评估器写入ml_app
{ml_app}
所有评估器均以草稿形式发布(
enabled: false
——在它们开始为Span评分前,在UI中审查并激活它们。
#名称操作提供商/模型采样率范围账户自动检测状态
1task_completion创建(草稿)openai/gpt-5.4-mini10%span成功
2response_groundedness覆盖(草稿)openai/gpt-5.4-mini10%span成功
3scope_adherence重命名(
_v2
)(草稿)
openai/gpt-5.4-mini10%span否——在UI中选择成功
4citation_format失败openai/gpt-5.4-mini10%span错误
{如果有失败情况:} 错误:
  • {name}
    : {错误消息}
{如果有基于代码的提案被丢弃:} 未发布(基于代码,在线评估器API不支持):
  • {name}
    ({type}) — 考虑通过
    /eval-bootstrap {ml_app}
    (SDK模式)离线运行。

Next Steps — review and activate in the UI

后续步骤——在UI中审查并激活

The drafts are intentionally not running yet. Walk through each one in the Datadog UI before flipping the enable toggle:
  1. Open the drafts: Datadog → LLM Observability → Evaluations → filter by ml_app
    {ml_app}
    (the new drafts appear with status
    Disabled
    ).
  2. For each draft:
    • Verify the integration account in the Provider section. If the column above shows
      auto-detected: yes
      , confirm it's the correct account for the judge LLM you want this evaluator to call through. If
      no
      , pick an account from the dropdown.
    • Skim the prompt template and the structured-output schema — make sure the spans-vs-trace scope, filter, and sampling match what you actually want to measure.
    • Click into a sample span/trace and use the test pane to dry-run the prompt against real data. Confirm the result matches your expectation.
  3. Enable: once each draft passes review, toggle it to enabled. Datadog starts scoring incoming spans immediately.
  4. Wait for first scores: with
    sampling_percentage=10
    (span scope) or
    5
    (trace scope), expect first results within minutes for high-traffic apps.
  5. Tune sampling/filter: if results are noisy or volume is too high, reduce
    sampling_percentage
    or tighten the
    filter
    from the UI. Re-running
    /eval-bootstrap {ml_app} --publish
    will round-trip the existing config before overwriting — your manual tweaks survive across reruns.
undefined
草稿目前未运行。在切换启用开关前,在Datadog UI中逐个检查:
  1. 打开草稿:Datadog → LLM Observability → Evaluations → 按ml_app
    {ml_app}
    过滤(新草稿显示为
    Disabled
    状态)。
  2. 针对每个草稿:
    • 验证集成账户:在提供商部分。如果上面的列显示
      自动检测:是
      ,确认它是你希望此评估器调用的Judge LLM的正确账户。如果显示
      ,从下拉菜单中选择一个账户。
    • 浏览提示模板和结构化输出schema——确保Span/Trace范围、过滤器和采样率与你实际要测量的内容匹配。
    • 点击示例Span/Trace并使用测试窗格针对真实数据试运行提示。确认结果符合你的预期。
  3. 启用:每个草稿通过审查后,切换为启用状态。Datadog立即开始为传入Span评分。
  4. 等待首次评分:对于
    sampling_percentage=10
    (Span范围)或
    5
    (Trace范围),高流量应用预计几分钟内会出现首次结果。
  5. 调整采样/过滤器:如果结果嘈杂或流量过高,从UI中降低
    sampling_percentage
    或收紧
    filter
    。重新运行
    /eval-bootstrap {ml_app} --publish
    会在覆盖前往返现有配置——你的手动调整会在重新运行后保留。
undefined

Notebook export (after summary)

Notebook导出(摘要后)

Same logic as Phase 3A — offer to append to the RCA notebook if
rca_notebook_url
was detected, or create a new standalone notebook. The notebook cell should list the published evaluators with their UI links and the
ml_app
they target. In pup mode, use
pup notebooks create
/
pup notebooks edit
as described in Phase 3A.

与阶段3A逻辑相同——如果检测到
rca_notebook_url
,提供附加到RCA Notebook的选项,否则创建新的独立Notebook。Notebook单元格应列出已发布的评估器及其UI链接和目标ml_app。在pup模式下,按照阶段3A中的描述使用
pup notebooks create
/
pup notebooks edit

Operating Rules

操作规则

  • Breadth over precision; let the user curate: Propose 8–15 evaluators distributed across domain-specific (largest bucket — derived from Phase 1 domain signals), outcome, format, and safety. Users can always remove what doesn't fit their quality bar; they cannot easily add what was not proposed. Anchor every domain-specific proposal in at least one observed trace pattern — don't invent generic domain evaluators without evidence.
  • Don't overfit: Write criteria that generalize beyond the specific sampled traces. Use examples as grounding, not as the sole criteria.
  • Show your work: Every proposed evaluator cites at least one trace as evidence with a clickable link:
    [Trace {first_8}...](https://app.datadoghq.com/llm/traces?query=trace_id:{full_32_char_id})
    .
  • New file only: Never modify existing evaluator code or experiment configurations.
  • Honest about uncertainty: If fewer than 5 traces support a proposed evaluator, flag it as tentative.

  • 广度优先于精度,让用户筛选:提出8-15个评估器,分布在领域特定(最大类别——来自阶段1的领域信号)、结果、格式和安全类中。用户始终可以移除不符合其质量标准的评估器;但他们无法轻松添加未提出的评估器。每个领域特定提案都必须至少有一个观察到的Trace模式作为基础——不要在没有证据的情况下发明通用领域评估器。
  • 不要过度拟合:编写的标准应超出特定采样Trace的范围。使用示例作为基础,而非唯一标准。
  • 展示工作过程:每个拟议评估器至少引用一个Trace作为证据,并提供可点击链接:
    [Trace {first_8}...](https://app.datadoghq.com/llm/traces?query=trace_id:{full_32_char_id})
  • 仅创建新文件:切勿修改现有评估器代码或实验配置。
  • 诚实面对不确定性:如果支持拟议评估器的Trace少于5个,标记为暂定。

Tool Reference

工具参考

This appendix applies only in pup mode. In MCP mode, use the tool names in the workflow sections directly.
本附录仅适用于pup模式。在MCP模式下,直接使用工作流章节中的工具名称。

Spans and traces

Span和Trace

MCP Toolpup Command
search_llmobs_spans(query, ml_app, from, to, limit, cursor, root_spans_only, span_kind, summary)
pup llm-obs spans search --query "@ml_app:A [other_filters]" [--from F] [--to T] [--limit N] [--cursor C] [--root-spans-only] [--span-kind K] [--summary]
always use
--query "@ml_app:A"
to filter by ml_app
; the
--ml-app A
flag is unreliable and silently returns spans from other apps.
get_llmobs_span_details(trace_id, span_ids, from, to)
pup llm-obs spans get-details --trace-id T --span-ids S1,S2,...
get_llmobs_span_content(trace_id, span_id, field, path)
pup llm-obs spans get-content --trace-id T --span-id S --field F [--path P]
get_llmobs_trace(trace_id, include_tree)
pup llm-obs spans get-trace --trace-id T [--include-tree]
get_llmobs_agent_loop(trace_id, span_id)
pup llm-obs spans get-agent-loop --trace-id T [--span-id S]
find_llmobs_error_spans(trace_id)
pup llm-obs spans find-errors --trace-id T
expand_llmobs_spans(trace_id, span_ids, max_depth, filter_kind)
pup llm-obs spans expand --trace-id T --span-ids S1,S2,... [--max-depth N] [--filter-kind K]
MCP工具pup命令
search_llmobs_spans(query, ml_app, from, to, limit, cursor, root_spans_only, span_kind, summary)
pup llm-obs spans search --query "@ml_app:A [other_filters]" [--from F] [--to T] [--limit N] [--cursor C] [--root-spans-only] [--span-kind K] [--summary]
始终使用
--query "@ml_app:A"
按ml_app过滤
--ml-app A
标志不可靠,会静默返回其他应用的Span。
get_llmobs_span_details(trace_id, span_ids, from, to)
pup llm-obs spans get-details --trace-id T --span-ids S1,S2,...
get_llmobs_span_content(trace_id, span_id, field, path)
pup llm-obs spans get-content --trace-id T --span-id S --field F [--path P]
get_llmobs_trace(trace_id, include_tree)
pup llm-obs spans get-trace --trace-id T [--include-tree]
get_llmobs_agent_loop(trace_id, span_id)
pup llm-obs spans get-agent-loop --trace-id T [--span-id S]
find_llmobs_error_spans(trace_id)
pup llm-obs spans find-errors --trace-id T
expand_llmobs_spans(trace_id, span_ids, max_depth, filter_kind)
pup llm-obs spans expand --trace-id T --span-ids S1,S2,... [--max-depth N] [--filter-kind K]

Evaluators

评估器

MCP Toolpup Command
list_llmobs_evals()
pup llm-obs evals list
(filter by
ml_app
client-side)
list_llmobs_evals_by_ml_app(ml_app)
pup llm-obs evals list-by-ml-app --ml-app A
get_llmobs_evaluator(eval_name)
pup llm-obs evals get-evaluator EVAL_NAME
get_llmobs_eval_aggregate_stats(eval_name, ml_app, from, to)
pup llm-obs evals get-aggregate-stats EVAL_NAME [--ml-app A] [--from F] [--to T]
delete_llmobs_evaluator(eval_name)
pup llm-obs evals delete EVAL_NAME
create_or_update_llmobs_evaluator(...)
pup llm-obs evals create-or-update EVAL_NAME --file /tmp/eval_EVAL_NAME.json
— see flat schema note below
MCP工具pup命令
list_llmobs_evals()
pup llm-obs evals list
(在客户端按
ml_app
过滤)
list_llmobs_evals_by_ml_app(ml_app)
pup llm-obs evals list-by-ml-app --ml-app A
get_llmobs_evaluator(eval_name)
pup llm-obs evals get-evaluator EVAL_NAME
get_llmobs_eval_aggregate_stats(eval_name, ml_app, from, to)
pup llm-obs evals get-aggregate-stats EVAL_NAME [--ml-app A] [--from F] [--to T]
delete_llmobs_evaluator(eval_name)
pup llm-obs evals delete EVAL_NAME
create_or_update_llmobs_evaluator(...)
pup llm-obs evals create-or-update EVAL_NAME --file /tmp/eval_EVAL_NAME.json
— 请参阅下面的扁平schema说明

create_or_update_llmobs_evaluator
in pup mode

pup模式下的
create_or_update_llmobs_evaluator

pup uses a flat JSON file (all fields top-level).
get-evaluator
returns a nested object. Transform as follows:
  1. Round-trip check: Call
    pup llm-obs evals get-evaluator EVAL_NAME
    first. If it exists, start from its config.
  2. Flatten
    llm_provider
    : hoist
    integration_provider
    ,
    model_name
    ,
    integration_account_id
    ,
    temperature
    to top level, dropping the
    llm_provider
    key.
  3. Merge and set
    enabled: false
    .
  4. Write to temp file and call:
    bash
    pup llm-obs evals create-or-update EVAL_NAME --file /tmp/eval_EVAL_NAME.json
    Use unique temp file names when publishing multiple evaluators in parallel (e.g.
    /tmp/eval_toxicity.json
    ).
get-evaluator
field
Flat JSON key
llm_provider.integration_provider
integration_provider
llm_provider.model_name
model_name
llm_provider.integration_account_id
integration_account_id
llm_provider.temperature
temperature
All other fieldsUnchanged (already top-level)
pup使用扁平JSON文件(所有字段均为顶级)。
get-evaluator
返回嵌套对象。转换方式如下:
  1. 往返检查:首先调用
    pup llm-obs evals get-evaluator EVAL_NAME
    。如果存在,从其配置开始。
  2. 扁平化
    llm_provider
    :将
    integration_provider
    model_name
    integration_account_id
    temperature
    提升到顶级,删除
    llm_provider
    键。
  3. 合并并设置
    enabled: false
  4. 写入临时文件并调用:
    bash
    pup llm-obs evals create-or-update EVAL_NAME --file /tmp/eval_EVAL_NAME.json
    并行发布多个评估器时使用唯一的临时文件名(例如
    /tmp/eval_toxicity.json
    )。
get-evaluator
字段
扁平JSON键
llm_provider.integration_provider
integration_provider
llm_provider.model_name
model_name
llm_provider.integration_account_id
integration_account_id
llm_provider.temperature
temperature
所有其他字段不变(已为顶级)

Notebooks

Notebook

MCP Toolpup Command
create_datadog_notebook(name, cells, ...)
pup notebooks create --title "TITLE" --file /tmp/nb_cells.json
— confirm exact flags with
pup notebooks create --help
edit_datadog_notebook(id, cells, append_only=true)
pup notebooks edit NOTEBOOK_ID --file /tmp/nb_cells.json
(fetches current notebook, appends provided cells, writes back)
The cells file is a JSON array of cell objects:
json
[{"attributes": {"definition": {"type": "markdown", "text": "## Section\n\nContent."}}, "type": "notebook_cells"}]
  • MCP result parsing safety: Before writing any script (Python, jq, etc.) that iterates over or accesses fields in an MCP tool result, inspect the raw structure first — check
    type(result)
    , top-level keys, and whether the payload is nested inside a content block (e.g.
    [{'type': 'text', 'text': '<json>'}]
    ). Extract and
    json.loads()
    the inner payload if needed before parsing. Never assume MCP results are bare dicts or lists.
MCP工具pup命令
create_datadog_notebook(name, cells, ...)
pup notebooks create --title "TITLE" --file /tmp/nb_cells.json
— 使用
pup notebooks create --help
确认确切标志
edit_datadog_notebook(id, cells, append_only=true)
pup notebooks edit NOTEBOOK_ID --file /tmp/nb_cells.json
(获取当前Notebook,附加提供的单元格,写回)
单元格文件是单元格对象的JSON数组:
json
[{"attributes": {"definition": {"type": "markdown", "text": "## Section\n\nContent."}}, "type": "notebook_cells"}]
  • MCP结果解析安全:在编写任何迭代或访问MCP工具结果中字段的脚本(Python、jq等)之前,先检查原始结构——检查
    type(result)
    、顶级键,以及负载是否嵌套在内容块中(例如
    [{'type': 'text', 'text': '<json>'}]
    )。如果需要,提取并
    json.loads()
    内部负载后再解析。切勿假设MCP结果是裸字典或列表。