llm-obs-eval-bootstrap
Compare original and translation side by side
🇺🇸
Original
English🇨🇳
Translation
ChineseBackend
后端
Detection — At the start of every invocation, before taking any action, determine which backend to use:
- If the user passed anywhere in their invocation → use pup mode immediately, regardless of whether MCP tools are present. Skip steps 2–4.
--backend pup - Check whether MCP tools are present in your active tool list. The canonical signal is whether appears in your available tools.
mcp__datadog-llmo-mcp__list_llmobs_evals - If MCP tools are present → use MCP mode throughout. Call MCP tools exactly as named in this skill's workflow sections.
- If MCP tools are absent → check whether is executable: run
pupvia Bash. A JSON response containingpup --versionconfirms pup is available."version" - If pup responds → use pup mode throughout. Translate every MCP tool call to its pup equivalent using the Tool Reference appendix at the bottom of this file.
- If neither is available → stop and tell the user:
"Neither the Datadog MCP server nor the pup CLI is available. Connect the MCP server () or install pup."
claude mcp add --scope user --transport http datadog-llmo-mcp 'https://mcp.datadoghq.com/api/unstable/mcp-server/mcp?toolsets=llmobs'
--backend puppup invocation rules:
- Invoke via Bash:
pup llm-obs <subcommand> [flags] - pup always outputs JSON. Parse directly — no content-block unwrapping (unlike MCP results, which may wrap JSON in ).
[{"type": "text", "text": "<json>"}] - If pup returns an auth error, tell the user to run and stop.
pup auth login - Parallelization: issue multiple Bash tool calls in a single message (one pup command per call).
- Time flags: pup accepts bare duration strings (,
1h,7d) and RFC3339 timestamps. Do not use30m-prefixed strings — strip the prefix when converting from a skillnow-argument:--timeframe→now-7d,7d→now-24h,24h→now-30d.30d - on
--summarystrips payload fields to essential metadata only. Use it in bulk/search phases where content is not needed.pup llm-obs spans search
Invocation ID: At the very start of each invocation, before any MCP tool call, generate an 8-character hex invocation ID (e.g., ). Keep it constant for the entire invocation.
3a9f1c2bIntent tagging: On every MCP tool call, prefix with followed by a description of why the tool is being called. On the first MCP tool call only, use instead (note the suffix). Example first call:
telemetry.intentskill:llm-obs-eval-bootstrap[<inv_id>] — skill:llm-obs-eval-bootstrap:start[<inv_id>] — :startskill:llm-obs-eval-bootstrap:start[3a9f1c2b] — Phase 0: map existing eval coverage for task-cruncher检测逻辑 — 在每次调用开始、执行任何操作前,确定要使用的后端:
- 如果用户在调用中任何位置传入→ 立即使用pup模式,无论是否存在MCP工具。跳过步骤2-4。
--backend pup - 检查活跃工具列表中是否存在MCP工具。标准判断信号是可用工具中是否包含。
mcp__datadog-llmo-mcp__list_llmobs_evals - 如果存在MCP工具 → 全程使用MCP模式。严格按照本技能工作流章节中指定的名称调用MCP工具。
- 如果不存在MCP工具 → 检查是否可执行:通过Bash运行
pup。返回包含pup --version的JSON响应即确认pup可用。"version" - 如果pup响应正常 → 全程使用pup模式。使用本文件底部的工具参考附录,将每个MCP工具调用转换为对应的pup等效命令。
- 如果两者都不可用 → 停止操作并告知用户:
"Datadog MCP服务器和pup CLI均不可用。请连接MCP服务器()或安装pup。"
claude mcp add --scope user --transport http datadog-llmo-mcp 'https://mcp.datadoghq.com/api/unstable/mcp-server/mcp?toolsets=llmobs'
--backend puppup调用规则:
- 通过Bash调用:
pup llm-obs <subcommand> [flags] - pup始终输出JSON。直接解析即可——无需解包内容块(与MCP结果不同,MCP结果可能将JSON包裹在中)。
[{"type": "text", "text": "<json>"}] - 如果pup返回认证错误,告知用户运行并停止操作。
pup auth login - 并行化:在单个消息中发起多个Bash工具调用(每个pup命令对应一次调用)。
- 时间参数:pup接受纯时长字符串(、
1h、7d)和RFC3339时间戳。不要使用30m前缀的字符串——转换技能的now-参数时需移除前缀:--timeframe→now-7d,7d→now-24h,24h→now-30d。30d - 在中使用
pup llm-obs spans search会将负载字段精简为核心元数据。在批量/搜索阶段不需要内容时使用该参数。--summary
调用ID:在每次调用的最开始、发起任何MCP工具调用前,生成一个8字符的十六进制调用ID(例如)。整个调用过程中保持该ID不变。
3a9f1c2b意图标记:在每个MCP工具调用中,将前缀设置为,后跟调用该工具的原因描述。仅在第一次MCP工具调用时,使用(注意后缀)。示例首次调用:
telemetry.intentskill:llm-obs-eval-bootstrap[<inv_id>] — skill:llm-obs-eval-bootstrap:start[<inv_id>] — :startskill:llm-obs-eval-bootstrap:start[3a9f1c2b] — Phase 0: map existing eval coverage for task-cruncherEval Bootstrap — Generate Evaluators from Production Traces
评估器引导——从生产Trace生成评估器
Given a sample of production LLM traces, analyze input/output patterns and quality dimensions, then emit a ready-to-use evaluator suite. Three output modes:
- (default) — Python
sdk_codefile using the Datadog Evals SDK (.py/BaseEvaluator) for offline experiments.LLMJudge - — self-contained JSON spec, framework-agnostic.
data_only - — write online LLM-judge evaluators directly to Datadog via
publish. These run automatically on matching production spans or traces (no dataset, no task function). The skill auto-classifies each proposed evaluator as span-scoped or trace-scoped based on what the judgment requires (a per-LLM-call tone check vs. an agent goal completion that needs the whole trace) — the user accepts or overrides the classification at the proposal checkpoint.create_or_update_llmobs_evaluator
基于生产LLM Trace样本,分析输入/输出模式和质量维度,然后生成可直接使用的评估器套件。支持三种输出模式:
- (默认)——使用Datadog Evals SDK(
sdk_code/BaseEvaluator)生成PythonLLMJudge文件,用于离线实验。.py - ——生成独立的JSON规范,与框架无关。
data_only - ——通过
publish直接将在线LLM-judge评估器写入Datadog。这些评估器会自动在匹配的生产Span或Trace上运行(无需数据集、无需任务函数)。技能会根据判断需求自动将每个拟议评估器分类为Span范围或Trace范围(例如,每个LLM调用的语气检查需要Span范围,而代理目标完成需要整个Trace则需要Trace范围)——用户会在提案检查点接受或覆盖该分类。create_or_update_llmobs_evaluator
Usage
使用方法
/eval-bootstrap <ml_app> [--timeframe <window>] [--data-only | --publish]Arguments: $ARGUMENTS
/eval-bootstrap <ml_app> [--timeframe <window>] [--data-only | --publish]参数:$ARGUMENTS
Inputs
输入项
| Input | Required | Default | Description |
|---|---|---|---|
| Yes | — | ML application to scope traces |
| No | | How far back to look |
| No | — | Failure taxonomy from |
| No | off | Emit a self-contained JSON spec file instead of Python SDK code |
| No | off | Publish online LLM-judge evaluators to Datadog (mutually exclusive with |
If is missing, ask the user before proceeding. If both and are supplied, error out and ask which mode the user wants.
ml_app--data-only--publish| 输入项 | 是否必填 | 默认值 | 描述 |
|---|---|---|---|
| 是 | — | 用于限定Trace范围的ML应用 |
| 否 | | 回溯时间范围 |
| 否 | — | 来自 |
| 否 | 关闭 | 生成独立的JSON规范文件,而非Python SDK代码 |
| 否 | 关闭 | 将在线LLM-judge评估器发布到Datadog(与 |
如果缺少,在继续前询问用户。如果同时提供和,抛出错误并询问用户想要使用哪种模式。
ml_app--data-only--publishAvailable Tools
可用工具
| Tool | Purpose |
|---|---|
| Find spans by eval presence, tags, span kind, query syntax. Paginate with cursor. |
| Metadata, evaluations (scores, labels, reasoning), and |
| Actual content for a span field. Supports JSONPath via |
| Full trace hierarchy as span tree with span counts by kind. |
| Chronological agent execution timeline (LLM calls, tool invocations, decisions). |
| List every evaluator configured for the caller's org across all ml_apps, with |
| Fetch the full persisted evaluator config by name (target ml_app + sampling + filter, provider, prompt template, parsing type, output schema, assessment criteria). Use in Phase 0 to understand what each existing custom eval measures, and (in publish mode) before any update — |
| (publish mode) Write an LLM-judge evaluator config to Datadog. Full-replace semantics: any omitted optional field resets to its default. See "Publishing Conventions" for required fields and structured output → JSON schema mapping. |
| (publish mode) Only used if the user explicitly asks to remove an evaluator. Never invoke speculatively. |
| 工具 | 用途 |
|---|---|
| 根据评估存在性、标签、Span类型、查询语法查找Span。使用游标分页。 |
| 获取元数据、评估结果(分数、标签、推理过程),以及显示可用字段和大小的 |
| 获取Span字段的实际内容。支持通过 |
| 获取完整的Trace层级结构,即包含各类型Span计数的Span树。 |
| 获取按时间顺序排列的代理执行时间线(LLM调用、工具调用、决策)。 |
| 列出调用者组织下所有ml_app中配置的所有评估器,包含 |
| 通过名称获取完整的持久化评估器配置(目标ml_app + 采样 + 过滤、提供商、提示模板、解析类型、输出 schema、评估标准)。在阶段0使用,以了解每个现有自定义评估器的测量内容;在发布模式下,任何更新前都要使用该工具—— |
| (发布模式)将LLM-judge评估器配置写入Datadog。全替换语义:任何省略的可选字段都会重置为默认值。有关必填字段和结构化输出→JSON schema映射,请参阅“发布约定”。 |
| (发布模式)仅在用户明确要求移除评估器时使用。切勿推测性调用。 |
Key get_llmobs_span_content
Patterns
get_llmobs_span_contentget_llmobs_span_content
关键使用模式
get_llmobs_span_contentUse the parameter to extract targeted data without fetching full payloads:
path| Field | Path | What you get |
|---|---|---|
| | System prompt (first message, usually |
| | Last assistant response |
| (no path) | Full conversation including tool calls |
| — | Span I/O |
| — | Retrieved documents (RAG apps) |
| — | Custom metadata (prompt versions, feature flags, user segments) |
使用参数提取目标数据,无需获取完整负载:
path| 字段 | 路径 | 获取内容 |
|---|---|---|
| | 系统提示(第一条消息,通常为 |
| | 最后一条助手响应 |
| (无路径) | 包含工具调用的完整对话 |
| — | Span输入/输出 |
| — | 检索到的文档(RAG应用) |
| — | 自定义元数据(提示版本、功能标志、用户细分) |
How to Use search_llmobs_spans
search_llmobs_spanssearch_llmobs_spans
使用方法
search_llmobs_spansAdditional filters combine with space (AND): . Dedicated params (, , ) work alongside , but takes precedence over .
@status:error @ml_app:my-appspan_kindroot_spans_onlyml_appqueryquerytagsTo find spans with a specific eval: — you can only query for eval presence, not specific results.
@evaluations.custom.<eval_name>:*附加过滤器使用空格组合(AND逻辑):。专用参数(、、)可与配合使用,但优先级高于。
@status:error @ml_app:my-appspan_kindroot_spans_onlyml_appqueryquerytags查找包含特定评估器的Span:——只能查询评估器的存在性,无法查询特定结果。
@evaluations.custom.<eval_name>:*Parallelization Rules
并行化规则
- : Group span_ids by trace_id. One call per trace_id with ALL its span_ids. Issue ALL calls for a page in a single message.
get_llmobs_span_details - : Each call is independent — always issue ALL in a single message.
get_llmobs_span_content - /
get_llmobs_trace: Parallelize across different traces in a single message.get_llmobs_agent_loop - Pipeline parallelism: Start for page 1 results immediately — don't wait to collect all pages.
get_llmobs_span_details
- :按trace_id对span_ids进行分组。每个trace_id对应一次调用,包含其所有span_ids。在单个消息中发起某一页的所有调用。
get_llmobs_span_details - :每次调用相互独立——始终在单个消息中发起所有调用。
get_llmobs_span_content - /
get_llmobs_trace:在单个消息中对不同Trace进行并行调用。get_llmobs_agent_loop - 流水线并行化:立即为第1页结果发起调用——无需等待收集所有页面。
get_llmobs_span_details
Evaluator SDK Reference
评估器SDK参考
Applies tomode only. Insdk_codemode, use this section as domain context when writing rubric prompts — no SDK classes are emitted.data_only
仅适用于模式。在sdk_code模式下,将本节作为领域上下文用于编写评估准则提示——不会生成SDK类。data_only
Imports
导入
python
undefinedpython
undefinedCore classes
核心类
from ddtrace.llmobs._experiment import BaseEvaluator, EvaluatorContext, EvaluatorResult
from ddtrace.llmobs._experiment import BaseEvaluator, EvaluatorContext, EvaluatorResult
LLM-as-judge
LLM作为Judge
from ddtrace.llmobs._evaluators.llm_judge import (
LLMJudge,
BooleanStructuredOutput,
ScoreStructuredOutput,
CategoricalStructuredOutput,
)
from ddtrace.llmobs._evaluators.llm_judge import (
LLMJudge,
BooleanStructuredOutput,
ScoreStructuredOutput,
CategoricalStructuredOutput,
)
Built-in evaluators (use only if needed)
内置评估器(仅在需要时使用)
from ddtrace.llmobs._evaluators.format import JSONEvaluator, LengthEvaluator
from ddtrace.llmobs._evaluators.string_matching import StringCheckEvaluator, RegexMatchEvaluator
Only import what the generated file actually uses.from ddtrace.llmobs._evaluators.format import JSONEvaluator, LengthEvaluator
from ddtrace.llmobs._evaluators.string_matching import StringCheckEvaluator, RegexMatchEvaluator
仅导入生成文件实际使用的类。EvaluatorContext (what evaluate()
receives)
evaluate()EvaluatorContext(evaluate()
接收的参数)
evaluate()python
@dataclass(frozen=True)
class EvaluatorContext:
input_data: dict[str, Any] # Task inputs (from dataset record, NOT from span)
output_data: Any # Task output (from task function return, NOT from span)
expected_output: Optional[JSONType] = None # Ground truth (if available)
metadata: dict[str, Any] = {} # Additional metadata
span_id: Optional[str] = None # LLMObs span ID
trace_id: Optional[str] = None # LLMObs trace IDImportant — span data vs evaluator data: When exploring production traces, you see span I/O (e.g., , ). But evaluators run in offline experiments where and come from the user's dataset records and task function, not from spans. The dataset schema is user-defined and may not match span structure. Write evaluator prompts with generic / placeholders and add comments describing what data the evaluator was designed for, so the user can adapt to their dataset shape.
input.valueoutput.messagesinput_dataoutput_data{{input_data}}{{output_data}}python
@dataclass(frozen=True)
class EvaluatorContext:
input_data: dict[str, Any] # 任务输入(来自数据集记录,而非Span)
output_data: Any # 任务输出(来自任务函数返回值,而非Span)
expected_output: Optional[JSONType] = None # 基准真值(如果可用)
metadata: dict[str, Any] = {} # 附加元数据
span_id: Optional[str] = None # LLMObs Span ID
trace_id: Optional[str] = None # LLMObs Trace ID重要——Span数据与评估器数据的区别:在探索生产Trace时,看到的是Span输入/输出(例如、)。但评估器在离线实验中运行,和来自用户的数据集记录和任务函数,而非Span。数据集schema由用户定义,可能与Span结构不匹配。在评估器提示中使用通用的 / 占位符,并添加注释说明评估器设计用于何种数据,以便用户适配其数据集结构。
input.valueoutput.messagesinput_dataoutput_data{{input_data}}{{output_data}}EvaluatorResult (what evaluate()
returns)
evaluate()EvaluatorResult(evaluate()
返回的结果)
evaluate()python
EvaluatorResult(
value=..., # Required. JSONType (str, int, float, bool, None, list, dict)
reasoning="...", # Optional. Explanation string
assessment="pass" or "fail", # Optional. Pass/fail assessment
metadata={...}, # Optional. Evaluation metadata dict
tags={...}, # Optional. Tags dict
)python
EvaluatorResult(
value=..., # 必填。JSONType(字符串、整数、浮点数、布尔值、None、列表、字典)
reasoning="...", # 可选。解释字符串
assessment="pass" or "fail", # 可选。通过/失败评估结果
metadata={...}, # 可选。评估元数据字典
tags={...}, # 可选。标签字典
)LLMJudge — LLM-as-Judge Evaluator
LLMJudge——LLM作为Judge的评估器
python
judge = LLMJudge(
user_prompt="...", # Required. Supports {{template_vars}}
system_prompt="...", # Optional. Does NOT support template vars
structured_output=..., # Optional. Boolean/Score/Categorical output, or a dict for custom JSON schema
provider="openai", # "openai" | "anthropic" | "azure_openai" | "vertexai" | "bedrock"
model="gpt-4o", # Model identifier
model_params={"temperature": 0.0}, # Optional. Passed to LLM API
name="eval_name", # Optional. Must match ^[a-zA-Z0-9_-]+$
)Template variables in : , , , — resolved from fields via dot-path into nested dicts.
user_prompt{{input_data}}{{output_data}}{{expected_output}}{{metadata.key}}EvaluatorContextpython
judge = LLMJudge(
user_prompt="...", # 必填。支持{{template_vars}}
system_prompt="...", # 可选。不支持模板变量
structured_output=..., # 可选。布尔值/分数/分类输出,或自定义JSON schema的字典
provider="openai", # "openai" | "anthropic" | "azure_openai" | "vertexai" | "bedrock"
model="gpt-4o", # 模型标识符
model_params={"temperature": 0.0}, # 可选。传递给LLM API的参数
name="eval_name", # 可选。必须匹配^[a-zA-Z0-9_-]+$
)user_prompt{{input_data}}{{output_data}}{{expected_output}}{{metadata.key}}EvaluatorContextStructured Output Types
结构化输出类型
Boolean — true/false with optional pass/fail:
python
BooleanStructuredOutput(
description="Whether the response is factually accurate",
reasoning=True, # Include reasoning field in LLM response
reasoning_description=None, # Optional custom description for reasoning field
pass_when=True, # True → pass when true, False → pass when false, None → no assessment
)Score — numeric within a range with optional thresholds:
python
ScoreStructuredOutput(
description="Helpfulness score",
min_score=1, # Minimum possible score
max_score=10, # Maximum possible score
reasoning=True,
reasoning_description=None,
min_threshold=7, # Scores >= 7 pass (optional)
max_threshold=None, # Scores <= N pass (optional)
)Categorical — select from predefined categories:
python
CategoricalStructuredOutput(
categories={
"correct": "The response correctly answers the question",
"partially_correct": "The response is partially correct but missing key information",
"incorrect": "The response is factually wrong or irrelevant",
},
reasoning=True,
reasoning_description=None,
pass_values=["correct"], # Which categories count as passing (optional)
)Custom JSON schema — arbitrary structured responses for multi-dimensional evals:
python
undefined布尔值——真/假,可选通过/失败标记:
python
BooleanStructuredOutput(
description="Whether the response is factually accurate",
reasoning=True, # 在LLM响应中包含推理字段
reasoning_description=None, # 推理字段的可选自定义描述
pass_when=True, # True→为真时通过,False→为假时通过,None→无评估结果
)分数——指定范围内的数值,可选阈值:
python
ScoreStructuredOutput(
description="Helpfulness score",
min_score=1, # 最小可能分数
max_score=10, # 最大可能分数
reasoning=True,
reasoning_description=None,
min_threshold=7, # 分数≥7时通过(可选)
max_threshold=None, # 分数≤N时通过(可选)
)分类——从预定义类别中选择:
python
CategoricalStructuredOutput(
categories={
"correct": "The response correctly answers the question",
"partially_correct": "The response is partially correct but missing key information",
"incorrect": "The response is factually wrong or irrelevant",
},
reasoning=True,
reasoning_description=None,
pass_values=["correct"], # 哪些类别算作通过(可选)
)自定义JSON schema——用于多维评估的任意结构化响应:
python
undefinedPass a raw dict as structured_output — used as the JSON schema directly
传递原始字典作为structured_output——直接用作JSON schema
structured_output={
"type": "object",
"properties": {
"relevance": {"type": "boolean", "description": "Whether the response addresses the question"},
"confidence": {"type": "number", "description": "Confidence score (0.0 to 1.0)"},
"reasoning": {"type": "string", "description": "Explanation for the evaluation"},
},
"required": ["relevance", "confidence", "reasoning"],
"additionalProperties": False,
}
Always write standard JSON schema — the SDK adapts it per provider automatically (e.g., Anthropic doesn't support `minimum`/`maximum` on number fields, so the SDK moves range constraints into the `description`; Vertex AI converts `const`/`anyOf` to `enum`). The full parsed JSON dict becomes the eval `value`; a `"reasoning"` key (if present) is automatically extracted. No automatic pass/fail assessment.structured_output={
"type": "object",
"properties": {
"relevance": {"type": "boolean", "description": "Whether the response addresses the question"},
"confidence": {"type": "number", "description": "Confidence score (0.0 to 1.0)"},
"reasoning": {"type": "string", "description": "Explanation for the evaluation"},
},
"required": ["relevance", "confidence", "reasoning"],
"additionalProperties": False,
}
始终编写标准JSON schema——SDK会自动根据提供商进行适配(例如,Anthropic不支持数字字段的`minimum`/`maximum`,因此SDK会将范围约束移至`description`中;Vertex AI将`const`/`anyOf`转换为`enum`)。完整解析后的JSON字典成为评估的`value`;如果存在`"reasoning"`键,会自动提取。不会自动生成通过/失败评估结果。LLMJudge Prompt Guidelines
LLMJudge提示准则
The parameter enforces the response format via JSON schema. Do not prescribe the format in the prompt (no "Answer YES/NO", "Rate 1-10", etc.). Instead, describe the evaluation criteria and let the structured output handle the format.
structured_output- system_prompt: Set the judge's role and the app's domain context. Does NOT support template vars.
- user_prompt: Present the data via /
{{input_data}}, then describe what good vs. bad looks like for this dimension.{{output_data}}
structured_output- system_prompt:设置Judge的角色和应用的领域上下文。不支持模板变量。
- user_prompt:通过/
{{input_data}}呈现数据,然后描述该维度下好与坏的表现。{{output_data}}
BaseEvaluator — Custom Code-Based Evaluator
BaseEvaluator——基于自定义代码的评估器
For deterministic checks that do not need LLM judgment:
python
class MyEvaluator(BaseEvaluator):
def __init__(self, name=None, ...custom_params...):
super().__init__(name=name)
self._param = ... # Store config as private attrs
def evaluate(self, context: EvaluatorContext) -> EvaluatorResult:
# Access: context.input_data, context.output_data, context.expected_output, context.metadata
# Must NOT modify self attributes (thread safety)
passed = ... # Your logic here
return EvaluatorResult(
value=passed,
reasoning="...",
assessment="pass" if passed else "fail",
)用于无需LLM判断的确定性检查:
python
class MyEvaluator(BaseEvaluator):
def __init__(self, name=None, ...custom_params...):
super().__init__(name=name)
self._param = ... # 将配置存储为私有属性
def evaluate(self, context: EvaluatorContext) -> EvaluatorResult:
# 访问:context.input_data, context.output_data, context.expected_output, context.metadata
# 不得修改self属性(线程安全)
passed = ... # 此处编写你的逻辑
return EvaluatorResult(
value=passed,
reasoning="...",
assessment="pass" if passed else "fail",
)Built-in Evaluators
内置评估器
python
undefinedpython
undefinedValidate JSON syntax + optional required keys
验证JSON语法 + 可选必填键
JSONEvaluator(required_keys=["name", "age"], output_extractor=None, name=None)
JSONEvaluator(required_keys=["name", "age"], output_extractor=None, name=None)
Validate length (characters, words, or lines)
验证长度(字符、单词或行数)
LengthEvaluator(count_by="words", min_length=10, max_length=500, output_extractor=None, name=None)
LengthEvaluator(count_by="words", min_length=10, max_length=500, output_extractor=None, name=None)
count_by: "characters" | "words" | "lines"
count_by: "characters" | "words" | "lines"
String matching
字符串匹配
StringCheckEvaluator(operation="contains", expected="success", case_sensitive=False, name=None)
StringCheckEvaluator(operation="contains", expected="success", case_sensitive=False, name=None)
operation: "eq" | "ne" | "contains" | "icontains"
operation: "eq" | "ne" | "contains" | "icontains"
Regex matching
正则匹配
RegexMatchEvaluator(pattern=r"\d{4}-\d{2}-\d{2}", match_mode="search", name=None)
RegexMatchEvaluator(pattern=r"\d{4}-\d{2}-\d{2}", match_mode="search", name=None)
match_mode: "search" | "match" | "fullmatch"
match_mode: "search" | "match" | "fullmatch"
undefinedundefinedEvaluator Type Decision Matrix
评估器类型决策矩阵
| Signal | Evaluator Type |
|---|---|
| Output must be valid JSON | |
| Output must match a regex pattern | |
| Output has length constraints | |
| Output must contain/not contain specific strings | |
| Semantic quality judgment (tone, accuracy, completeness) | |
| Graded quality on a scale | |
| Classification into categories | |
| Multi-dimensional judgment (evaluate several aspects at once) | |
| Complex domain logic combining multiple checks | |
| 信号 | 评估器类型 |
|---|---|
| 输出必须是有效的JSON | |
| 输出必须匹配正则模式 | |
| 输出有长度限制 | |
| 输出必须包含/不包含特定字符串 | |
| 语义质量判断(语气、准确性、完整性) | |
| 按比例评分的质量 | |
| 分类到类别中 | |
| 多维判断(同时评估多个方面) | |
| 结合多个检查的复杂领域逻辑 | |
Source Verification
源码验证
If you have access to dd-trace-py locally, verify the API surface by reading the corresponding modules:
- —
ddtrace.llmobs._evaluators.llm_judge,LLMJudge,BooleanStructuredOutput,ScoreStructuredOutputCategoricalStructuredOutput - —
ddtrace.llmobs._experiment,BaseEvaluator,EvaluatorContextEvaluatorResult - —
ddtrace.llmobs._evaluators.format,JSONEvaluatorLengthEvaluator - —
ddtrace.llmobs._evaluators.string_matching,StringCheckEvaluatorRegexMatchEvaluator
如果本地可以访问dd-trace-py,通过阅读相应模块验证API接口:
- —
ddtrace.llmobs._evaluators.llm_judge、LLMJudge、BooleanStructuredOutput、ScoreStructuredOutputCategoricalStructuredOutput - —
ddtrace.llmobs._experiment、BaseEvaluator、EvaluatorContextEvaluatorResult - —
ddtrace.llmobs._evaluators.format、JSONEvaluatorLengthEvaluator - —
ddtrace.llmobs._evaluators.string_matching、StringCheckEvaluatorRegexMatchEvaluator
Workflow
工作流
Phase 0: Resolve Inputs & Entry Mode
阶段0:解析输入与确定进入模式
Entry mode detection:
| Mode | Signal | Behavior |
|---|---|---|
| Cold Start | Only | Full open discovery — understand what the app does, identify quality dimensions worth measuring, propose evals for coverage |
| From RCA | Conversation contains an RCA report or user provides a failure hypothesis | Skip open discovery — use existing failure taxonomy as eval targets |
Parse arguments: Extract (first non-flag argument), (default ), , and flags. Set if is set, if is set, otherwise . Error if both and are present.
ml_app--timeframenow-7d--data-only--publishoutput_mode = publish--publishoutput_mode = data_only--data-onlyoutput_mode = sdk_code--data-only--publishResolution steps:
-
Ifnot provided → ask the user.
ml_app -
Auto-detect entry mode:
- If the conversation contains an RCA report (look for "Failure Taxonomy" heading, structured failure modes, or severity ratings) → . Extract the taxonomy.
from_rca - If the user provides a free-text failure hypothesis (e.g., "the system prompt lacks grounding") → . Use the hypothesis as the starting eval target.
from_rca - Otherwise → .
cold_start
- If the conversation contains an RCA report (look for "Failure Taxonomy" heading, structured failure modes, or severity ratings) →
-
Ifnot provided → default to
timeframe.now-7d -
Map existing eval coverage — skip if(there is no Datadog eval project to check coverage against): Call
output_mode = data_only(org-wide; filter the result client-side to entries wherelist_llmobs_evals). Then, for each eval withml_app == <ml_app>, callsource=customto inspect its prompt template, target, sampling, and filter, and infer which quality dimension it covers. Issue all evaluator calls in a single message (parallelize). Skipget_llmobs_evaluator(eval_name=...)evals — their names are self-describing and they may not have a fetchable config.source=ootbBy the end of this step you have a complete coverage map:. Carry this into Phase 2 for deduplication.{eval_name → source, enabled, dimension}Inmode, also note any template-variable convention the existing custom evaluators already use (so a new suite reads consistently). Online evaluator templates resolve against the full span JSON, not againstpublish. See the "Online Template Variables" section under "Publishing Conventions" for the supported syntax (EvaluatorContext,{{span_input}}, dot-paths, array selectors, filter accessors).{{span_output}} -
Notebook context detection: Scan the current conversation for a Datadog notebook URL that was produced by(pattern:
/eval-trace-rca). If found, store it ashttps://app.datadoghq.com/notebook/{numeric-id}and extract the numeric ID asrca_notebook_url. This is used after Phase 3 to offer appending the evaluator suite to that notebook instead of creating a new one.rca_notebook_id
进入模式检测:
| 模式 | 信号 | 行为 |
|---|---|---|
| 冷启动 | 仅提供 | 全面开放探索——了解应用功能,确定值得测量的质量维度,提出评估器以覆盖这些维度 |
| 来自RCA | 对话包含RCA报告或用户提供失败假设 | 跳过开放探索——使用现有故障分类作为评估目标 |
解析参数:提取(第一个非标志参数)、(默认)、和标志。如果设置,则;如果设置,则;否则。如果同时提供和,抛出错误。
ml_app--timeframenow-7d--data-only--publish--publishoutput_mode = publish--data-onlyoutput_mode = data_onlyoutput_mode = sdk_code--data-only--publish解析步骤:
-
如果未提供→ 询问用户。
ml_app -
自动检测进入模式:
- 如果对话包含RCA报告(查找“Failure Taxonomy”标题、结构化故障模式或严重性评级)→ 。提取分类信息。
from_rca - 如果用户提供自由文本形式的失败假设(例如“系统提示缺乏基础信息”)→ 。将该假设作为初始评估目标。
from_rca - 否则 → 。
cold_start
- 如果对话包含RCA报告(查找“Failure Taxonomy”标题、结构化故障模式或严重性评级)→
-
如果未提供→ 默认使用
timeframe。now-7d -
映射现有评估覆盖范围 — 如果则跳过(无Datadog评估项目可检查覆盖范围):调用
output_mode = data_only(全组织范围;在客户端过滤list_llmobs_evals的条目)。然后,对于每个ml_app == <ml_app>的评估器,调用source=custom以检查其提示模板、目标、采样和过滤规则,并推断其覆盖的质量维度。在单个消息中发起所有评估器调用(并行化)。跳过get_llmobs_evaluator(eval_name=...)的评估器——它们的名称自描述,且可能无法获取配置。source=ootb此步骤结束后,你将获得完整的覆盖范围映射:。将其带入阶段2以进行去重。{eval_name → source, enabled, dimension}在发布模式下,还需注意现有自定义评估器已使用的模板变量约定(以便新套件保持一致)。在线评估器模板针对完整Span JSON解析,而非。有关支持的语法(EvaluatorContext、{{span_input}}、点路径、数组选择器、过滤器访问器),请参阅“发布约定”下的“在线模板变量”部分。{{span_output}} -
Notebook上下文检测:扫描当前对话,查找由生成的Datadog Notebook URL(模式:
/eval-trace-rca)。如果找到,将其存储为https://app.datadoghq.com/notebook/{numeric-id},并提取数字ID作为rca_notebook_url。阶段3结束后,将使用此信息提供将评估器套件附加到该Notebook而非创建新Notebook的选项。rca_notebook_id
Phase 1: Explore Traces & Identify Eval Targets
阶段1:探索Trace并确定评估目标
Goal: Sample production traces, understand what the app does, and identify quality dimensions worth measuring.
目标:采样生产Trace,了解应用功能,确定值得测量的质量维度。
Cold Start Path
冷启动路径
-
Sample the app:. Filter by
search_llmobs_spans(query="@ml_app:\"<ml_app>\" @status:ok", root_spans_only=true, limit=50, from=<timeframe>)— error spans have no output to evaluate.@status:ok -
Profile the app and identify evaluation target spans: Callfor span_ids grouped by trace_id. Inspect
get_llmobs_span_detailsto classify:content_infoSignal App Profile hascontent_infomessagesLLM/chat app hascontent_infodocumentsRAG app Spans include kindagentAgent app hascontent_infometadataHas custom metadata Multiple span kinds in one trace ( +agent/tool+retrievalfromllm)get_llmobs_traceMulti-step app — at least one trace-scope evaluator likely belongs in the suite ( mode)publishFor agent/multi-step apps, also callon 2-3 traces to see the full span hierarchy. Compareget_llmobs_tracebetween the root span and its sub-spans. Then ask two questions for each candidate quality dimension, in this order:content_info- Does the verdict depend on more than one span? (e.g., faithfulness depends on a span's documents AND an
retrievalspan's answer; goal completion depends on the chain ofllmcalls AND the final response.) If yes → trace scope intoolmode. Don't try to compress this into a single span.publish - Only if the answer to (1) is no: pick the single span with the richest signal for that dimension (root has the summary; LLM sub-spans have the full system prompt + tool call results + reasoning chain).
Record the span-kind histogram (agent + tool + llm + retrieval) — multiple kinds under one root is a strong signal you'll have at least one trace-scope evaluator in the suite. See Phase 2's "Span vs. Trace Scope Classification" for the mandatory walk-through of canonical trace-scope use cases. - Does the verdict depend on more than one span? (e.g., faithfulness depends on a
-
Extract content and identify targets: Callfor representative spans. Fetch fields based on app profile:
get_llmobs_span_contentApp Profile Fields to Fetch LLM/chat (messagesfor system prompt),path=$.messages[0]outputRAG ,documents,inputoutputAgent for the agent span, thenget_llmobs_agent_loopfor detailmessagesAny with metadata metadataIssue all calls in a single message. As you read, capture two streams of signal:Generic quality signals — what does "success" look like? What variance exists across outputs? Each observed quality dimension becomes a candidate evaluator, with the traces you've just read as evidence. Also look for safety signals (scope violations, sensitive data in outputs, out-of-character responses) and add a safety evaluator if you find them.Domain signals — these become the domain-specific evaluator category in Phase 2 (the highest-leverage category). For every 5–10 traces, write down:- Recurring intents / question categories — what classes of request does this app handle? (,
applying for benefit X,comparing flight options,summarizing a policy)creating a widget - Entities the app emits in outputs — URLs, agency / company names, code identifiers, monetary amounts, dates, IDs, file paths, phone numbers. Note which ones the user acts on downstream (those are worth a correctness evaluator) versus which are passing references.
- Tool argument shapes (for agent apps) — name each tool the agent calls and the rough schema of its inputs. Tools with non-trivial schemas (≥ 3 fields, structured types) are candidates for argument-correctness evaluators.
- Persona / voice rules — does the app always cite a source, always refuse certain topics (medical, legal, financial advice), always speak in a particular tone? Extract the rules implicitly followed across observed outputs.
- Failure modes specific to the domain — fabricated identifiers, outdated policy references, currency / locale mismatches, off-by-one errors in IDs, wrong units. One observed instance is enough to seed a candidate evaluator.
Don't try to enumerate domain signals exhaustively before reading traces — let the patterns surface as you read. The goal is breadth in the eventual proposal, not completeness in this exploration step. - Recurring intents / question categories — what classes of request does this app handle? (
-
采样应用:。按
search_llmobs_spans(query="@ml_app:\"<ml_app>\" @status:ok", root_spans_only=true, limit=50, from=<timeframe>)过滤——错误Span无输出可评估。@status:ok -
分析应用概况并确定评估目标Span:按trace_id分组调用获取span_ids。检查
get_llmobs_span_details进行分类:content_info信号 应用概况 包含content_infomessagesLLM/聊天应用 包含content_infodocumentsRAG应用 Span包含 类型agent代理应用 包含content_infometadata包含自定义元数据 单个Trace中包含多种Span类型( +agent/tool+retrieval,来自llm)get_llmobs_trace多步骤应用——套件中可能至少包含一个Trace范围的评估器(发布模式) 对于代理/多步骤应用,还需调用获取2-3个Trace的完整Span层级结构。比较根Span与其子Span的get_llmobs_trace。然后针对每个候选质量维度依次提出两个问题:content_info- 判断结果是否依赖多个Span?(例如,忠实度依赖Span的文档和
retrievalSpan的回答;目标完成依赖llm调用链和最终响应。)如果是 → 发布模式下使用Trace范围。不要尝试将其压缩到单个Span中。tool - 仅当(1)的答案为否时:选择该维度信号最丰富的单个Span(根Span包含摘要;LLM子Span包含完整系统提示 + 工具调用结果 + 推理链)。
记录Span类型直方图(agent + tool + llm + retrieval)——单个根Span下包含多种类型是套件中至少包含一个Trace范围评估器的强烈信号。有关规范Trace范围用例的强制说明,请参阅阶段2的“Span vs Trace范围分类”。 - 判断结果是否依赖多个Span?(例如,忠实度依赖
-
提取内容并确定目标:调用获取代表性Span的内容。根据应用概况获取字段:
get_llmobs_span_content应用概况 要获取的字段 LLM/聊天 (messages获取系统提示)、path=$.messages[0]outputRAG 、documents、inputoutput代理 获取代理Span的 ,然后获取get_llmobs_agent_loop详情messages包含元数据的任何应用 metadata在单个消息中发起所有调用。阅读时,捕获两类信号:通用质量信号 — “成功”的表现是什么?输出存在哪些差异?每个观察到的质量维度都成为候选评估器,你刚刚读取的Trace作为证据。同时查找安全信号(范围违规、输出中的敏感数据、不符合角色的响应),如果发现则添加安全评估器。领域信号 — 这些将成为阶段2中的领域特定评估器类别(价值最高的类别)。每读取5-10个Trace,记录:- 重复意图/问题类别 — 应用处理哪些类型的请求?(、
申请福利X、比较航班选项、总结政策)创建小部件 - 应用在输出中生成的实体 — URL、机构/公司名称、代码标识符、金额、日期、ID、文件路径、电话号码。注意哪些是用户后续会操作的(这些值得添加正确性评估器),哪些只是引用。
- 工具参数形状(针对代理应用) — 命名代理调用的每个工具及其输入的大致schema。具有非平凡schema(≥3个字段、结构化类型)的工具是参数正确性评估器的候选对象。
- 角色/语气规则 — 应用是否始终引用来源、始终拒绝某些主题(医疗、法律、财务建议)、始终使用特定语气?从观察到的输出中提取隐含遵循的规则。
- 特定领域的故障模式 — 虚构标识符、过时政策引用、货币/区域不匹配、ID中的差一错误、错误单位。只要观察到一个实例,就足以作为候选评估器的种子。
不要在读取Trace前尝试穷举领域信号——让模式在读取过程中自然浮现。目标是最终提案的广度,而非此探索步骤的完整性。 - 重复意图/问题类别 — 应用处理哪些类型的请求?(
From RCA Path
来自RCA的路径
-
Extract the failure taxonomy from the RCA report. Each failure mode with High or Medium severity becomes an eval target.
-
Check root cause categories for infrastructure failures. Before proposing evaluators, scan the Root Cause column of the taxonomy for any of:,
Instrumentation Deficiency,Harness Deficiency,Runtime Error, or any other root cause that points to infrastructure/environment rather than model behavior. If any are present, pause and ask:Upstream Data Issue"Some failure modes were diagnosed as infrastructure or instrumentation issues rather than model behavior (e.g.,). Evaluators can be designed two ways:{list the infra root causes}- Behavior-targeted (recommended for ongoing quality): measure whether the model produces correct, specific output — useful once the infrastructure is fixed and you want to track real quality
- Artifact-targeted (useful as regression guard): detect the specific broken output observed (e.g., generic placeholder responses) — catches regressions if the infrastructure breaks again
Which approach do you want, or both?"- If behavior-targeted: design evaluators for what correct output looks like, not what the broken output looked like. Use the RCA's / gold-standard examples as the quality bar.
expected_output - If artifact-targeted: design evaluators that detect the specific failure symptom (e.g., for a known bad string,
StringCheckEvaluatorthat checks for generic placeholders).LLMJudge - If both: propose each category separately, clearly labelled.
If all root causes are behavioral (System Prompt Deficiency, Tool Gap, Tool Misuse, Retrieval Failure, etc.) → skip this step and proceed directly. -
For each target: if the RCA includes trace IDs, use them directly; otherwise search for matching traces. Fetch 2-3 traces per target withto understand the concrete pattern.
get_llmobs_span_content
-
从RCA报告中提取故障分类。每个具有高或中严重性的故障模式都成为评估目标。
-
检查根本原因类别是否包含基础设施故障。在提出评估器前,扫描分类的“Root Cause”列,查找以下任何一项:、
Instrumentation Deficiency、Harness Deficiency、Runtime Error,或任何其他指向基础设施/环境而非模型行为的根本原因。如果存在,暂停并询问:Upstream Data Issue"某些故障模式被诊断为基础设施或工具问题,而非模型行为(例如)。评估器可通过两种方式设计:{列出基础设施根本原因}- 行为导向(推荐用于持续质量监控):测量模型是否生成正确、特定的输出——在基础设施修复后,用于跟踪实际质量
- ** artifact导向**(用作回归防护):检测观察到的特定故障输出(例如通用占位符响应)——如果基础设施再次故障,可捕获回归问题
你想要哪种方法,还是两者都要?"- 如果选择行为导向:设计评估器以衡量正确输出的表现,而非故障输出的表现。使用RCA中的/黄金标准示例作为质量标准。
expected_output - 如果选择artifact导向:设计评估器以检测特定故障症状(例如针对已知错误字符串的、检查通用占位符的
StringCheckEvaluator)。LLMJudge - 如果选择两者都要:分别提出每个类别,明确标记。
如果所有根本原因都是行为相关的(System Prompt Deficiency、Tool Gap、Tool Misuse、Retrieval Failure等)→ 跳过此步骤,直接继续。 -
针对每个目标:如果RCA包含Trace ID,直接使用;否则搜索匹配的Trace。调用获取每个目标的2-3个Trace,以了解具体模式。
get_llmobs_span_content
Phase 2: Propose Evaluator Suite
阶段2:提出评估器套件
Goal: Present a concrete evaluator proposal for user confirmation.
Each evaluator judges one data point — it receives input and output for a single record/span, not a full trace or batch. Design evaluators accordingly.
Targeting depends on :
output_mode- /
sdk_code→ offline experiments. Template variables usedata_onlyfields (EvaluatorContext,{{input_data}}). The actual data shape depends on the user's dataset and task function (see EvaluatorContext note in SDK Reference).{{output_data}} - → online evaluation on production spans. Template variables resolve against the full span JSON via dot-paths (
publish,{{meta.input.value}}, …) or the built-in span-kind-aware aliases ({{meta.output.messages[*].content}},{{span_input}}). See "Online Template Variables" under Publishing Conventions for the full syntax. Each evaluator also needs{{span_output}},eval_scope, and (optionally)sampling_percentage— surface these in the proposal table so the user can confirm before publishing.filter
Order proposals from broadest signal to most granular. Propose broadly, let the user curate — see "How many evaluators to propose" below.
-
Domain-specific evaluators — What does "good" mean for this specific app? These are the highest-leverage proposals because they capture quality bars generic evaluators miss. Derive them from the domain signals Phase 1 captured:
- Recurring intents / question categories the app handles (e.g., "applying for a federal benefit", "comparing flight options", "explaining a policy"). Propose an or
intent_classificationevaluator scoped to the dominant intents.intent_handling_correctness - Specific entities the app produces (URLs, agency names, code identifiers, monetary amounts, dates, IDs). Propose a per-entity correctness evaluator for the ones with real downstream cost when wrong (e.g., ,
cited_url_is_real,agency_name_matches_request).monetary_amount_is_consistent_with_input - Tool argument shapes observed across spans. Propose a per-tool argument-correctness evaluator for the tools with non-trivial schemas (e.g.,
tool,search_flights_args_match_user_request).update_dashboard_widget_targets_correct_widget - Persona / voice expectations — does the app always cite sources, always refuse out-of-scope requests, always speak in a specific tone? Propose evaluators for the voice rules you can extract from observed outputs (,
cites_a_source,refuses_medical_advice).tone_matches_brand - Domain-specific failure modes seen across traces (fabricated identifiers, outdated policy references, unit mismatches, currency / locale mismatches). One evaluator per recurring failure mode.
Name each evaluator after the user-facing concern, not the technical check (overagency_url_is_real). Use the trace IDs you read in Phase 1 as evidence — at least one passing case and one failing case per evaluator if you saw both.regex_url_match - Recurring intents / question categories the app handles (e.g., "applying for a federal benefit", "comparing flight options", "explaining a policy"). Propose an
-
Outcome evaluators — Did this span / trace produce a good result for the request?
- Examples: ,
task_completion,answer_correctnessresponse_groundedness
- Examples:
-
Format evaluators — Does the output meet structural requirements?
- Examples: ,
valid_json_output,response_lengthcitation_format
- Examples:
-
Safety evaluators — Does the output stay within appropriate boundaries?
- Examples: ,
no_pii_leakage,scope_adherenceno_hallucination
- Examples:
目标:提出具体的评估器提案供用户确认。
每个评估器判断一个数据点——接收单个记录/Span的输入和输出,而非完整Trace或批量数据。据此设计评估器。
目标定位取决于:
output_mode- /
sdk_code→ 离线实验。模板变量使用data_only字段(EvaluatorContext、{{input_data}})。实际数据形状取决于用户的数据集和任务函数(请参阅SDK参考中的EvaluatorContext说明)。{{output_data}} - → 生产Span上的在线评估。模板变量通过点路径针对完整Span JSON解析(
publish、{{meta.input.value}}等),或使用内置的Span类型感知别名({{meta.output.messages[*].content}}、{{span_input}})。有关完整语法,请参阅发布约定下的“在线模板变量”。每个评估器还需要{{span_output}}、eval_scope和(可选)sampling_percentage——在提案表格中显示这些内容,以便用户在发布前确认。filter
按从宽泛到精细的顺序排列提案。广泛提出,让用户筛选——请参阅下面的“要提出多少个评估器”。
-
领域特定评估器 — 对于此特定应用,“好”的定义是什么?这些是价值最高的提案,因为它们捕获了通用评估器无法覆盖的质量标准。从阶段1捕获的领域信号中推导:
- 应用处理的重复意图/问题类别(例如“申请联邦福利”、“比较航班选项”、“解释政策”)。针对主要意图提出或
intent_classification评估器。intent_handling_correctness - 应用生成的特定实体(URL、机构名称、代码标识符、金额、日期、ID)。针对错误会产生实际下游成本的实体提出每个实体的正确性评估器(例如、
cited_url_is_real、agency_name_matches_request)。monetary_amount_is_consistent_with_input - Span中观察到的工具参数形状。针对具有非平凡schema的工具提出每个工具的参数正确性评估器(例如
tool、search_flights_args_match_user_request)。update_dashboard_widget_targets_correct_widget - 角色/语气期望 — 应用是否始终引用来源、始终拒绝超出范围的请求、始终使用特定语气?针对从观察到的输出中提取的语气规则提出评估器(、
cites_a_source、refuses_medical_advice)。tone_matches_brand - Trace中发现的特定领域故障模式(虚构标识符、过时政策引用、单位不匹配、货币/区域不匹配)。每个重复故障模式对应一个评估器。
每个评估器的名称以用户关注的问题命名,而非技术检查(例如使用而非agency_url_is_real)。使用阶段1中读取的Trace ID作为证据——如果同时看到通过和失败案例,每个评估器至少引用一个通过案例和一个失败案例。regex_url_match - 应用处理的重复意图/问题类别(例如“申请联邦福利”、“比较航班选项”、“解释政策”)。针对主要意图提出
-
结果评估器 — 此Span/Trace是否为请求生成了良好结果?
- 示例:、
task_completion、answer_correctnessresponse_groundedness
- 示例:
-
格式评估器 — 输出是否符合结构要求?
- 示例:、
valid_json_output、response_lengthcitation_format
- 示例:
-
安全评估器 — 输出是否保持在适当范围内?
- 示例:、
no_pii_leakage、scope_adherenceno_hallucination
- 示例:
How many evaluators to propose
要提出多少个评估器
The default cap from the older skill version was too tight — it pushed the skill toward generic evaluators only and left domain signals on the table. Updated guidance:
4-6- Aim for 8–15 evaluators in the proposal, distributed across all four categories (with domain-specific usually the largest bucket, outcome second, format and safety smaller). For very simple single-LLM-call apps, fewer is fine; for agent / RAG apps with rich domain signals, lean toward the upper end.
- Quality > generic: every domain-specific proposal should be backed by at least one observed pattern in the sampled traces. Don't invent generic domain evaluators ("") if you don't have evidence for them.
response_quality - Let the user curate: the MANDATORY CHECKPOINT below explicitly asks the user to remove what doesn't apply, not just to approve. Treat the proposal as a candidate set the user trims.
旧版技能的默认4-6个上限过于严格——它迫使技能仅提出通用评估器,而忽略领域信号。更新后的指南:
- 目标提出8-15个评估器,分布在所有四个类别中(领域特定通常是最大的类别,结果类其次,格式和安全类较小)。对于非常简单的单LLM调用应用,数量可以更少;对于具有丰富领域信号的代理/RAG应用,倾向于上限。
- 质量优先于通用:每个领域特定提案都必须至少有一个采样Trace中的观察模式作为支持。如果没有证据,不要发明通用领域评估器(例如)。
response_quality - 让用户筛选:下面的强制检查点明确要求用户移除不适用的评估器,而非仅仅批准。将提案视为用户会精简的候选集。
Deduplication Against Existing Coverage
与现有覆盖范围去重
In mode: skip this section entirely (coverage map was not built in Phase 0). Proceed directly to the proposal table.
data_onlyBefore building the proposal, apply the coverage map from Phase 0. Coverage is keyed on — not on dimension alone: every OOTB evaluator runs at span scope, and an enabled OOTB eval does NOT preclude proposing a trace-scope evaluator for the same dimension. The two answer different questions.
(dimension, scope)-
Enabled span-scope eval (OOTB or custom) for dimension D:
- Do NOT propose a new span-scope evaluator for D — that dimension is already covered at span scope.
- DO propose a trace-scope evaluator for D when the trace shape calls for it (multi-step app, judgment depends on cross-span context). Note the relationship in the rationale: e.g., "OOTB evaluates each LLM span in isolation; this trace-scope
Goal Completenesschecks whether the agent's full sequence of steps achieved the user's request — different question."goal_completion
-
Enabled trace-scope custom eval for dimension D: do NOT propose another trace-scope evaluator for the same dimension; that's a real duplicate. Span-scope on the same dimension is still fair game if the data also fits a single span.
-
Disabled OOTB eval: Do NOT propose a new custom span-scope evaluator for that dimension. Instead, surface it in a short note within the proposal and suggest enabling it in the Datadog UI rather than creating a duplicate. Example:(ootb, disabled) — consider enabling in Datadog UI (Evaluations → Configure) instead of creating a custom span-scope eval. (A trace-scope
hallucinationis still in scope and covers a different question.)rag_faithfulness -
Gap identification: Open the proposal with a coverage summary line: "Existing coverage: N evaluator(s) already configured ({names}, all span-scope unless noted). Proposing evaluators for uncovered dimensions and uncovered scopes."
-
All dimensions covered: A dimension is "fully covered" only when both scopes are present (or the scope doesn't apply to the app shape). If the coverage map accounts for every identified quality dimension at the appropriate scope(s), surface this explicitly and ask the user what they want: (a) review/improve existing eval prompts, (b) add coverage for additional dimensions, or (c) proceed anyway.
For each proposed evaluator:
- Name: Must match (alphanumeric, underscore, hyphen only)
^[a-zA-Z0-9_-]+$ - Type: (Boolean/Score/Categorical/custom JSON schema), built-in (
LLMJudge,JSONEvaluator, etc.), orRegexMatchEvaluatorsubclass. InBaseEvaluatormode, only LLM-judge evaluators are supported by the MCP tool — code-based checks must NOT be silently dropped. List them in the same proposal table withpublishset to the code-based class, mark them under a "Not publishable in this mode" subsection of the proposal, and tell the user to run the skill again in defaultTypemode (orsdk_code) to capture them. Treat the code-based proposals as part of the suite for counting and coverage purposes.--data-only - What it measures: 1-2 sentence plain-language description
- Target span: Which span's data the evaluator was designed for (e.g., "root agent span", "LLM sub-span ", "all
anthropic.requestspans"). If the root span's I/O is too lossy for the quality dimension (e.g., tool call results aren't visible), note this and specify which sub-span has the signal. Inllmmode this maps to a combination ofpublish(eval_scope/span/trace),session, and the EVProot_spans_onlyquery (e.g.filteror@meta.span.kind:llm).service:web - Pass/fail criteria: ,
pass_when=True,min_threshold=7, or "no automatic assessment" for custom JSON schemapass_values=["correct"] - Template variables: Which of ,
input_data,output_data,expected_outputit uses (offline) — or which span paths / aliases it pulls from (publish mode:metadata.*,{{span_input}},{{span_output}},{{meta.input.messages[*].content}}, etc.){{meta.metadata.<key>}} - Evidence: At least one trace where it would have caught a failure (or confirmed correct behavior)
- Publish-only fields (only in mode):
publish(defaultintegration_provider),openai(defaultmodel_name),gpt-5.4-mini(defaultsampling_percentage),10(defaulteval_scope), and anyspanquery needed to scope to the right spans. Surface defaults in the proposal so the user can override before publishing.filter - (only in
integration_account_idmode): the integration account the judge LLM is called through. Auto-detected from existing evaluators in the same ml_app (Phase 0 coverage map). Never asked from the user as a raw UUID. If no existing evaluator has one, the field is omitted and the user picks an account in the UI before activating. All evaluators are published withpublishregardless — see "Always publish as draft" in Phase 3C for the full activation workflow.enabled: false
在模式下:完全跳过本节(阶段0未构建覆盖范围映射)。直接进入提案表格。
data_only在构建提案前,应用阶段0的覆盖范围映射。覆盖范围以为键——而非仅维度:每个OOTB评估器都在Span范围运行,启用的OOTB评估器并不排除针对同一维度提出Trace范围评估器的可能性。两者回答的是不同的问题。
(dimension, scope)-
针对维度D的已启用Span范围评估器(OOTB或自定义):
- 不要针对D提出新的Span范围评估器——该维度已在Span范围覆盖。
- 当Trace形状需要时(多步骤应用、判断依赖跨Span上下文),要针对D提出Trace范围评估器。在理由中说明关系:例如“OOTB 单独评估每个LLM Span;此Trace范围的
Goal Completeness检查代理的完整步骤序列是否实现了用户请求——这是不同的问题。”goal_completion
-
针对维度D的已启用Trace范围自定义评估器:不要针对同一维度提出另一个Trace范围评估器——这是真正的重复。如果数据也适合单个Span,同一维度的Span范围评估器仍然是可行的。
-
已禁用的OOTB评估器:不要针对该维度提出新的自定义Span范围评估器。相反,在提案中添加简短说明,建议在Datadog UI中启用它,而非创建重复项。示例:(ootb,已禁用)——考虑在Datadog UI(Evaluations → Configure)中启用,而非创建自定义Span范围评估器。(Trace范围的
hallucination仍然适用,且覆盖不同的问题。)rag_faithfulness -
差距识别:在提案开头添加覆盖范围摘要行:“现有覆盖范围:已配置N个评估器({名称},除非特别说明,否则均为Span范围)。针对未覆盖的维度和范围提出评估器。”
-
所有维度已覆盖:仅当两种范围都存在(或范围不适用于应用形状)时,维度才被“完全覆盖”。如果覆盖范围映射涵盖了所有已识别的质量维度及其适当范围,明确指出这一点并询问用户想要:(a) 审查/改进现有评估提示,(b) 添加对其他维度的覆盖,或(c) 继续执行。
针对每个拟议评估器:
- 名称:必须匹配(仅字母数字、下划线、连字符)
^[a-zA-Z0-9_-]+$ - 类型:(布尔值/分数/分类/自定义JSON schema)、内置(
LLMJudge、JSONEvaluator等),或RegexMatchEvaluator子类。在发布模式下,MCP工具仅支持LLM-judge评估器——基于代码的检查不得被静默丢弃。在同一提案表格中列出它们,将BaseEvaluator设置为基于代码的类,在提案的“此模式下不可发布”小节中标记,并告知用户重新运行技能的默认Type模式(或sdk_code)以捕获它们。将基于代码的提案视为套件的一部分,用于计数和覆盖范围目的。--data-only - 测量内容:1-2句通俗易懂的描述
- 目标Span:评估器设计用于哪个Span的数据(例如“根代理Span”、“LLM子Span ”、“所有
anthropic.requestSpan”)。如果根Span的输入/输出对于质量维度来说信息损失过大(例如工具调用结果不可见),注明这一点并指定哪个子Span包含信号。在发布模式下,这对应于llm(eval_scope/span/trace)、session和EVProot_spans_only查询的组合(例如filter或@meta.span.kind:llm)。service:web - 通过/失败标准:、
pass_when=True、min_threshold=7,或自定义JSON schema的“无自动评估结果”pass_values=["correct"] - 模板变量:使用、
input_data、output_data、expected_output中的哪些(离线)——或从哪些Span路径/别名提取(发布模式:metadata.*、{{span_input}}、{{span_output}}、{{meta.input.messages[*].content}}等){{meta.metadata.<key>}} - 证据:至少一个它会捕获故障(或确认正确行为)的Trace
- 仅发布模式字段(仅在发布模式下):(默认
integration_provider)、openai(默认model_name)、gpt-5.4-mini(默认sampling_percentage)、10(默认eval_scope),以及任何需要限定到正确Span的span查询。在提案中显示默认值,以便用户在发布前覆盖。filter - (仅在发布模式下):Judge LLM调用所通过的集成账户。从同一ml_app中的现有评估器(阶段0覆盖范围映射)自动检测。切勿要求用户提供原始UUID。如果没有现有评估器包含该ID,省略该字段,用户在激活前在UI中选择账户。无论如何,所有评估器都以
integration_account_id发布——有关完整激活工作流,请参阅阶段3C中的“始终以草稿形式发布”。enabled: false
Span vs. Trace Scope Classification (publish
mode)
publishSpan vs Trace范围分类(发布模式)
Don't ask the user; classify per evaluator and let them override at the checkpoint.
不要询问用户;针对每个评估器进行分类,让用户在检查点覆盖。
Mandatory: walk the four canonical trace-scope use cases first
强制:首先检查四个规范Trace范围用例
If Phase 1 found multi-step traces (≥ 2 span kinds, or any / / span under an root), you MUST walk through the four canonical trace-scope use cases below before finalizing the suite. For each, decide explicitly: applies (include with ) or does not apply (record a one-line reason in a "Skipped trace-scope candidates" subsection of the proposal). Skipping all four without per-item justification is a sign you've over-anchored on span scope — re-check.
toolretrievalworkflowagenteval_scope: trace| Canonical use case | Triggers when |
|---|---|
| Any agent / multi-step app. Almost always applies. |
| Trace contains |
| Trace contains |
| Trace contains ≥ 2 |
For other proposed evaluators (e.g. tone, format, safety), apply this two-question test:
- Can the judgment be answered correctly from one span's +
meta.input, where "correctly" means the verdict cannot change if you considered other spans in the trace? →meta.output.eval_scope: span - Otherwise → . In particular, default to trace when the evaluator name contains grounding, faithfulness, hallucination, completeness, correctness across steps, consistency, or workflow — these almost always need cross-span context.
eval_scope: trace
如果阶段1发现多步骤Trace(≥2种Span类型,或根Span下的任何//Span),在最终确定套件前必须检查以下四个规范Trace范围用例。针对每个用例,明确决定:适用(包含)或不适用(在提案的“跳过的Trace范围候选”小节中记录一行理由)。如果没有逐项理由就跳过所有四个用例,表明你过度依赖Span范围——重新检查。
agenttoolretrievalworkfloweval_scope: trace| 规范用例 | 触发条件 |
|---|---|
| 任何代理/多步骤应用。几乎总是适用。 |
| Trace包含 |
| Trace包含 |
| Trace包含≥2个 |
对于其他拟议评估器(例如语气、格式、安全),应用以下两个问题的测试:
- 判断结果是否可以仅从一个Span的+
meta.input正确得出,其中“正确”意味着考虑Trace中的其他Span不会改变判断结果? →meta.output。eval_scope: span - 否则 → 。特别是,当评估器名称包含grounding、faithfulness、hallucination、completeness、跨步骤正确性、consistency或workflow时,默认使用Trace范围——这些几乎总是需要跨Span上下文。
eval_scope: trace
Trade-offs (don't let these dominate the choice)
权衡(不要让这些主导选择)
Trace scope costs more than span scope: one judgment per completed trace (vs. per matching span), larger prompt payloads, and a 3-minute trigger latency (Datadog waits 3 minutes of inactivity before considering a trace complete; later spans are excluded). These are cost-control levers — handle with and , not by demoting scope. The correctness of the eval is what picks the scope.
sampling_percentagefilterTrace范围的成本高于Span范围:每个完成的Trace对应一次判断(vs每个匹配Span对应一次),提示负载更大,触发延迟为3分钟(Datadog等待3分钟无活动后才认为Trace完成;后续Span被排除)。这些是成本控制手段——通过和处理,而非降低范围。评估的正确性才是选择范围的依据。
sampling_percentagefilterSurface the classification
显示分类结果
Add a Scope column to the proposal table and a one-sentence rationale per evaluator. If you skipped a canonical trace-scope use case, list it under a "Skipped trace-scope candidates" subsection with the reason — the user will see and can override.
Example rationales:
— span. Judging "is this single response polite" needs only one LLM span'stone_check; no other span in the trace can change that verdict.meta.output.messages[*].content — trace. Whether the agent finished the user's request depends on the sequence of tool calls and the final LLM response together —goal_completionof any single span only shows that step's output.meta.output — trace. Comparing tool inputs against the request and the final response requires correlating ≥ 3 spans (root, tool, final LLM).tool_use_correctness — trace. Grounding pairs therag_faithfulnessspan's documents with the LLM span's answer.retrievalExample "Skipped trace-scope candidates" entry:
— skipped: traces contain a single LLM call (no multi-turn signal in this app's instrumentation).conversation_quality
在提案表格中添加范围列,并为每个评估器添加一句理由。如果跳过了某个规范Trace范围用例,在“跳过的Trace范围候选”小节中列出并说明理由——用户会看到并可以覆盖。
示例理由:
— Span范围。判断“此单个响应是否礼貌”仅需要一个LLM Span的tone_check;Trace中的其他Span不会改变该判断结果。meta.output.messages[*].content — Trace范围。代理是否完成用户请求取决于工具调用序列和最终LLM响应的组合;任何单个Span的goal_completion仅显示该步骤的输出。meta.output — Trace范围。将工具输入与请求和最终响应进行比较需要关联≥3个Span(根Span、工具Span、最终LLM Span)。tool_use_correctness — Trace范围。忠实度需要将rag_faithfulnessSpan的文档与LLM Span的回答配对。retrieval示例“跳过的Trace范围候选”条目:
— 跳过:Trace包含单个LLM调用(此应用的工具中无多轮信号)。conversation_quality
MANDATORY CHECKPOINT
强制检查点
You MUST output the proposal and wait for user confirmation before proceeding.
undefined必须输出提案并等待用户确认后再继续。
undefinedProposed Evaluator Suite
拟议评估器套件
App profile: {LLM | RAG | Agent | Multi-agent}
Entry mode: {cold_start | from_rca}
| # | Name | Type | Scope | Measures | Pass Criteria |
|---|---|---|---|---|---|
| 1 | task_completion | LLMJudge (Boolean) | span | Whether the task was completed on this span | pass_when=True |
| 2 | tool_use_correctness | LLMJudge (Categorical) | trace | Right tool with right arguments across the agent run | pass_values=["correct"] |
| 3 | ... | ... | ... | ... | ... |
(Drop the Scope column when not in mode.)
publishFor each evaluator:
- {name}: {what it measures}
- Target span: {which span's data it was designed for}
- Rationale: {which quality dimension it covers and why}
- {Only in publish mode:} Scope: {span | trace} — {one-sentence rationale}
- Evidence: Trace {id_short}
{Only in publish mode, for multi-step apps. Required if any of the four canonical trace-scope use cases was not included above:}
Skipped trace-scope candidates:
- — {one-line reason it does not apply to this app}
{canonical_use_case}
{Only in publish mode, when the suite contains code-based evaluators (JSONEvaluator, RegexMatchEvaluator, LengthEvaluator, StringCheckEvaluator, BaseEvaluator). Required when any code-based proposal exists.}
Not publishable in this mode (code-based evaluators — the publish API is LLM-judge only):
- ({type}) — {what it would check}. Re-run
{name}in default mode to emit as offline SDK code, or/eval-bootstrap {ml_app}for a framework-agnostic JSON spec./eval-bootstrap {ml_app} --data-only
**Which evaluators should I generate?** Treat the proposal as a candidate set — the suite below is intentionally broad so you can pick what matters for your team's quality bar. Reply with **which to keep, which to drop, and which to rename**; not every domain-specific proposal will fit your priorities. In `sdk_code` mode you may also add custom evaluators or change provider/model. In `publish` mode you may override `integration_provider`, `model_name`, `sampling_percentage`, `eval_scope`, `root_spans_only`, or `filter` per evaluator.
Do NOT proceed to code generation until the user confirms.
---应用概况:{LLM | RAG | Agent | Multi-agent}
进入模式:{cold_start | from_rca}
| # | 名称 | 类型 | 范围 | 测量内容 | 通过标准 |
|---|---|---|---|---|---|
| 1 | task_completion | LLMJudge (Boolean) | span | 此Span上的任务是否完成 | pass_when=True |
| 2 | tool_use_correctness | LLMJudge (Categorical) | trace | 代理运行过程中是否使用了正确的工具和参数 | pass_values=["correct"] |
| 3 | ... | ... | ... | ... | ... |
(非发布模式下删除范围列。)
针对每个评估器:
- {name}:{测量内容}
- 目标Span:{设计用于哪个Span的数据}
- 理由:{覆盖的质量维度及原因}
- {仅发布模式下:} 范围:{span | trace} — {一句理由}
- 证据:Trace {id_short}
{仅发布模式下,针对多步骤应用。如果上述未包含四个规范Trace范围用例中的任何一个,必填:}
跳过的Trace范围候选:
- — {不适用于此应用的一行理由}
{canonical_use_case}
{仅发布模式下,当套件包含基于代码的评估器(JSONEvaluator、RegexMatchEvaluator、LengthEvaluator、StringCheckEvaluator、BaseEvaluator)时。当存在任何基于代码的提案时必填:}
此模式下不可发布(基于代码的评估器——发布API仅支持LLM-judge):
- ({type}) — {它会检查的内容}。重新运行
{name}的默认模式以生成为离线SDK代码,或运行/eval-bootstrap {ml_app}以生成框架无关的JSON规范。/eval-bootstrap {ml_app} --data-only
**我应该生成哪些评估器?** 将提案视为候选集——下面的套件故意设计得很宽泛,以便你选择符合团队质量标准的评估器。回复**保留哪些、删除哪些、重命名哪些**;并非所有领域特定提案都符合你的优先级。在`sdk_code`模式下,你还可以添加自定义评估器或更改提供商/模型。在发布模式下,你可以针对每个评估器覆盖`integration_provider`、`model_name`、`sampling_percentage`、`eval_scope`、`root_spans_only`或`filter`。
在用户确认前,不要继续生成代码。
---Phase 3: Generate Output
阶段3:生成输出
Branch on :
output_mode- → Phase 3A below
sdk_code - → skip to Phase 3B
data_only - → skip to Phase 3C
publish
根据分支:
output_mode- → 下面的阶段3A
sdk_code - → 跳至阶段3B
data_only - → 跳至阶段3C
publish
Phase 3A: Generate & Write Evaluator Code
阶段3A:生成并写入评估器代码
Goal: Generate the final file and write it to disk.
.pyFor each confirmed evaluator, generate production-quality Python code following the SDK Reference patterns above.
目标:生成最终的文件并写入磁盘。
.py针对每个确认的评估器,按照上述SDK参考模式生成生产级Python代码。
Code Generation Rules
代码生成规则
-
Ground prompts in traces: LLMJudge system prompts and user prompts must reference patterns actually observed in production traces. Never write generic prompts like "evaluate whether the response is good" — ground them in the app's domain, observed failure patterns, and success criteria.
-
Keep template variables generic, add comments for context: Useand
{{input_data}}as top-level placeholders in prompts — do NOT reference nested span paths like{{output_data}}. The evaluator's data comes from the user's dataset and task function, not directly from spans. Instead, add a comment above each evaluator describing what data it was designed for and what the user should adapt:{{input_data.messages[-1].content}}python# Designed for: input_data = user query, output_data = assistant response text # Observed from: root agent span (input.value → output.value) # If your dataset uses a different structure, adapt the prompt references below. -
Use the narrowest evaluator type: If a check can be done with,
JSONEvaluator,RegexMatchEvaluator, orStringCheckEvaluator, do NOT use an LLMJudge. Code-based evaluators are faster, cheaper, and deterministic.LengthEvaluator -
BaseEvaluator subclasses:
- Call in
super().__init__(name=name)__init__ - Return from
EvaluatorResultevaluate() - Do NOT modify instance attributes in (thread safety)
evaluate()
- Call
-
Names: Must match. Use snake_case descriptive names.
^[a-zA-Z0-9_-]+$ -
Imports: Consolidate at the top of the file. Only import classes that are actually used.
-
Evaluator list: Collect all evaluators into anlist at the bottom of the file.
evaluators -
Anonymize PII: Strip emails, names, and sensitive data from any trace content included in LLMJudge prompts or the header comment.
-
基于Trace编写提示:LLMJudge系统提示和用户提示必须引用生产Trace中实际观察到的模式。切勿编写通用提示,例如“评估响应是否良好”——要基于应用领域、观察到的故障模式和成功标准编写。
-
保持模板变量通用,添加注释说明上下文:在提示中使用和
{{input_data}}作为顶级占位符——不要引用嵌套Span路径,例如{{output_data}}。评估器的数据来自用户的数据集和任务函数,而非直接来自Span。相反,在每个评估器上方添加注释,说明其设计用于何种数据以及用户应如何适配:{{input_data.messages[-1].content}}python# 设计用途:input_data = 用户查询,output_data = 助手响应文本 # 来自:根代理Span(input.value → output.value) # 如果你的数据集使用不同结构,请适配下面的提示引用。 -
使用最窄的评估器类型:如果检查可以通过、
JSONEvaluator、RegexMatchEvaluator或StringCheckEvaluator完成,不要使用LLMJudge。基于代码的评估器更快、更便宜且具有确定性。LengthEvaluator -
BaseEvaluator子类:
- 在中调用
__init__super().__init__(name=name) - 从返回
evaluate()EvaluatorResult - 不要在中修改实例属性(线程安全)
evaluate()
- 在
-
名称:必须匹配。使用蛇形命名法的描述性名称。
^[a-zA-Z0-9_-]+$ -
导入:在文件顶部合并导入。仅导入实际使用的类。
-
评估器列表:在文件底部将所有评估器收集到列表中。
evaluators -
匿名化PII:从LLMJudge提示或头部注释中包含的任何Trace内容中剥离电子邮件、姓名和敏感数据。
Output Format
输出格式
The generated file should follow this structure:
.pypython
"""
Auto-generated evaluators for {ml_app}
Generated: {YYYY-MM-DD} by eval-bootstrap
App profile: {LLM | RAG | Agent | Multi-agent}
Quality dimensions covered:
- {target_name}: {description}
Evidence: https://app.datadoghq.com/llm/traces?query=trace_id:{full_id}
...
Usage:
from ddtrace.llmobs import LLMObs
experiment = LLMObs.experiment(
name="my-experiment",
task=my_task_fn,
dataset=dataset,
evaluators=evaluators,
)
experiment.run()
"""
{imports — only what is used}生成的文件应遵循以下结构:
.pypython
"""
为{ml_app}自动生成的评估器
生成时间:{YYYY-MM-DD} by eval-bootstrap
应用概况:{LLM | RAG | Agent | Multi-agent}
覆盖的质量维度:
- {target_name}: {description}
证据:https://app.datadoghq.com/llm/traces?query=trace_id:{full_id}
...
使用方法:
from ddtrace.llmobs import LLMObs
experiment = LLMObs.experiment(
name="my-experiment",
task=my_task_fn,
dataset=dataset,
evaluators=evaluators,
)
experiment.run()
"""
{导入——仅导入使用的类}--- Outcome Evaluators ---
--- 结果评估器 ---
{evaluator code}
{评估器代码}
--- Format Evaluators ---
--- 格式评估器 ---
{evaluator code}
{评估器代码}
--- Safety Evaluators ---
--- 安全评估器 ---
{evaluator code}
{评估器代码}
--- Evaluator Suite ---
--- 评估器套件 ---
evaluators = [
{eval_1_variable_name},
{eval_2_variable_name},
...
]
Only include section comments (Outcome/Format/Safety) for categories that have evaluators.evaluators = [
{eval_1_variable_name},
{eval_2_variable_name},
...
]
仅为包含评估器的类别添加节注释(结果/格式/安全)。Write the file
写入文件
Write the generated code to the output path (suggest if not specified), then display a summary:
./evals/{ml_app}_evaluators.pyundefined将生成的代码写入输出路径(如果未指定,建议使用),然后显示摘要:
./evals/{ml_app}_evaluators.pyundefinedGenerated Evaluators
生成的评估器
Wrote {N} evaluators to :
{output_path}| # | Name | Type | Covers |
|---|---|---|---|
| 1 | ... | ... | ... |
已将{N}个评估器写入:
{output_path}| # | 名称 | 类型 | 覆盖内容 |
|---|---|---|---|
| 1 | ... | ... | ... |
Next Steps
后续步骤
- Review: Check the generated prompts and criteria match your expectations
- Test offline: Use to batch-evaluate against a labeled dataset and verify scores
LLMObs.experiment(evaluators=evaluators)
undefined- 审查:检查生成的提示和标准是否符合你的预期
- 离线测试:使用针对标记数据集批量评估并验证分数
LLMObs.experiment(evaluators=evaluators)
undefinedNotebook export (after summary)
Notebook导出(摘要后)
After displaying the summary, offer notebook export.
-
Ifwas detected in Phase 0:
rca_notebook_urlAn RCA notebook was created earlier in this session:Would you like to (a) append the evaluator suite summary to that notebook, or (b) create a new standalone notebook?{rca_notebook_url}If append: use the notebook creation fallback pattern (see below) with(mcp__datadog-mcp__edit_datadog_notebook,id={rca_notebook_id}, evaluator suite summary cell).append_only=trueIf new: use the notebook creation fallback pattern (see below) with.mcp__datadog-mcp__create_datadog_notebook -
If no:
rca_notebook_urlWould you like to export this evaluator suite summary to a Datadog notebook?If yes: use the notebook creation fallback pattern (see below) with:mcp__datadog-mcp__create_datadog_notebook- :
nameEval Bootstrap: {ml_app} — YYYY-MM-DD - :
typereport - : single markdown cell with the evaluator suite summary
cells - :
time{ "live_span": "1h" }
Notebook creation fallback pattern (apply to every / call):
create_datadog_notebookedit_datadog_notebook- Try the MCP tool first.
- If the MCP call fails, inspect the error:
- Auth / permission error (401, 403) → stop and tell the user.
- Field validation error (error names a specific field) → fix that field and retry the MCP call once.
- Any other error (binding, serialization, unexpected response) → fall back to pup:
- Write the payload to as a full API envelope:
/tmp/nb_bootstrap_{ml_app}.json{"data": {"attributes": {"name": "...", "time": {...}, "cells": [...]}, "type": "notebooks"}} - Run
pup notebooks create --file /tmp/nb_bootstrap_{ml_app}.json - If pup is not available either, render the notebook content as markdown in chat.
- Write the payload to
- After successful creation by either method, output the URL:
Evaluator suite exported to notebook: <url>
Notebook cell content — the markdown cell should contain:
markdown
undefined显示摘要后,提供Notebook导出选项。
-
如果阶段0检测到:
rca_notebook_url本次会话中之前创建了一个RCA Notebook:你想要(a) 将评估器套件摘要附加到该Notebook,还是(b) 创建新的独立Notebook?{rca_notebook_url}如果选择附加:使用Notebook创建回退模式(如下所述),调用(mcp__datadog-mcp__edit_datadog_notebook,id={rca_notebook_id},评估器套件摘要单元格)。append_only=true如果选择新建:使用Notebook创建回退模式(如下所述),调用。mcp__datadog-mcp__create_datadog_notebook -
如果未检测到:
rca_notebook_url是否要将此评估器套件摘要导出到Datadog Notebook?如果是:使用Notebook创建回退模式(如下所述),调用:mcp__datadog-mcp__create_datadog_notebook- :
nameEval Bootstrap: {ml_app} — YYYY-MM-DD - :
typereport - :包含评估器套件摘要的单个markdown单元格
cells - :
time{ "live_span": "1h" }
Notebook创建回退模式(适用于每个 / 调用):
create_datadog_notebookedit_datadog_notebook- 首先尝试MCP工具。
- 如果MCP调用失败,检查错误:
- 认证/权限错误(401、403) → 停止并告知用户。
- 字段验证错误(错误指出特定字段)→ 修复该字段并重试MCP调用一次。
- 任何其他错误(绑定、序列化、意外响应)→ 回退到pup:
- 将负载写入,作为完整API包:
/tmp/nb_bootstrap_{ml_app}.json{"data": {"attributes": {"name": "...", "time": {...}, "cells": [...]}, "type": "notebooks"}} - 运行
pup notebooks create --file /tmp/nb_bootstrap_{ml_app}.json - 如果pup也不可用,在聊天中渲染Notebook内容为markdown。
- 将负载写入
- 通过任一方法成功创建后,输出URL:
评估器套件已导出到Notebook:<url>
Notebook单元格内容 — markdown单元格应包含:
markdown
undefinedEval Bootstrap: {ml_app}
Eval Bootstrap: {ml_app}
Generated: YYYY-MM-DD | App profile: {LLM | RAG | Agent | Multi-agent} | Entry mode: {cold_start | from_rca}
Generated code:
{output_path}{One sentence: what does this app do?}
Coverage: {N} new evaluators ({comma-separated dimension names}) | {N} existing (unchanged: {names}) | {gaps if any: dimensions identified but not covered, and why}
生成时间:YYYY-MM-DD | 应用概况:{LLM | RAG | Agent | Multi-agent} | 进入模式:{cold_start | from_rca}
生成代码:
{output_path}{一句话:此应用的功能是什么?}
覆盖范围:{N}个新评估器({逗号分隔的维度名称}) | {N}个现有评估器(未更改:{名称}) | {如果有差距:已识别但未覆盖的维度及原因}
Evaluator Suite
评估器套件
| # | Name | Type | Measures | Pass Criteria |
|---|---|---|---|---|
| 1 | ... | ... | ... | ... |
| # | 名称 | 类型 | 测量内容 | 通过标准 |
|---|---|---|---|---|
| 1 | ... | ... | ... | ... |
Evidence
证据
{For each evaluator: name — 1-line description — [Trace link]}
{针对每个评估器:名称 — 一行描述 — [Trace链接]}
Next Steps
后续步骤
- Review generated prompts in
{output_path} - Run against a labeled dataset to validate scores
- Deploy to Datadog LLM Experiments
---- 审查中生成的提示
{output_path} - 针对标记数据集运行以验证分数
- 部署到Datadog LLM Experiments
---Phase 3B: Generate & Write Eval Spec JSON
阶段3B:生成并写入评估规范JSON
Goal: Serialize the confirmed evaluator suite and representative trace samples to a single self-contained JSON file — zero SDK dependencies.
Output path:
./evals/{ml_app}_eval_spec.json目标:将确认的评估器套件和代表性Trace样本序列化为单个独立的JSON文件——无SDK依赖。
输出路径:
./evals/{ml_app}_eval_spec.jsonJSON Schema
JSON Schema
json
{
"schema_version": "1",
"generated_at": "<ISO 8601 UTC>",
"generated_by": "eval-bootstrap",
"app": {
"ml_app": "<string>",
"app_type": "LLM | RAG | Agent | Multi-agent",
"trace_window": "<timeframe param, e.g. now-7d>",
"trace_count": "<integer>"
},
"evaluators": [
{
"name": "snake_case_name",
"category": "outcome | format | safety",
"type": "llm_judge | code_check",
"description": "<1-2 sentence plain-language description>",
"target_span": "<which span: root, llm sub-span, etc.>",
"scoring": {
"scale": "boolean | score_1_10 | categorical",
"categories": ["<only present when scale=categorical>"],
"pass_criteria": "<human-readable: true, >= 7, in [correct], etc.>"
},
"rubric": "<full prompt text for llm_judge; null for code_check>",
"implementation_hints": {
"type_if_code_check": "json_valid | regex | contains | length_words | null",
"pattern_if_code_check": "<pattern string or null>",
"notes": "<optional framework-agnostic implementation guidance>"
},
"evidence": [
{
"trace_id": "<32-char hex>",
"span_id": "<16-char hex>",
"url": "https://app.datadoghq.com/llm/traces?query=trace_id:<trace_id>",
"observation": "<why this trace illustrates the evaluator>"
}
]
}
],
"sample_records": [
{
"trace_id": "<string>",
"span_id": "<string>",
"input": {},
"output": "<string>",
"suggested_labels": {
"<evaluator_name>": "pass | fail | <score>"
}
}
]
}json
{
"schema_version": "1",
"generated_at": "<ISO 8601 UTC>",
"generated_by": "eval-bootstrap",
"app": {
"ml_app": "<string>",
"app_type": "LLM | RAG | Agent | Multi-agent",
"trace_window": "<timeframe参数,例如now-7d>",
"trace_count": "<integer>"
},
"evaluators": [
{
"name": "snake_case_name",
"category": "outcome | format | safety",
"type": "llm_judge | code_check",
"description": "<1-2句通俗易懂的描述>",
"target_span": "<哪个Span:根Span、llm子Span等>",
"scoring": {
"scale": "boolean | score_1_10 | categorical",
"categories": ["<仅当scale=categorical时存在>"],
"pass_criteria": "<人类可读:true, >= 7, in [correct]等>"
},
"rubric": "<llm_judge的完整提示文本;code_check为null>",
"implementation_hints": {
"type_if_code_check": "json_valid | regex | contains | length_words | null",
"pattern_if_code_check": "<模式字符串或null>",
"notes": "<可选的框架无关实现指导>"
},
"evidence": [
{
"trace_id": "<32字符十六进制>",
"span_id": "<16字符十六进制>",
"url": "https://app.datadoghq.com/llm/traces?query=trace_id:<trace_id>",
"observation": "<此Trace如何说明评估器>"
}
]
}
],
"sample_records": [
{
"trace_id": "<string>",
"span_id": "<string>",
"input": {},
"output": "<string>",
"suggested_labels": {
"<evaluator_name>": "pass | fail | <score>"
}
}
]
}Field Notes
字段说明
- :
evaluators[].typefor semantic evaluators;"llm_judge"for deterministic checks (regex, length, JSON validity, etc.)."code_check" - : For
evaluators[].rubric— full prompt text grounded in observed trace patterns. Usellm_judgeand{{input}}as generic placeholders (not{{output}}— that's ddeval-specific). For{{input_data}}— null.code_check - : Optional framework-agnostic guidance, e.g. "For OpenAI Evals, use
evaluators[].implementation_hints.notesas a model-graded criterion. For Braintrust, use as an LLM scorer. For Promptfoo, use as anrubricassertion."llm-rubric - : 10–20 representative traces from Phase 1.
sample_recordsare Claude's best-read from trace inspection — not ground truth. The field name communicates this explicitly.suggested_labels - PII rule: Strip emails, names, and sensitive data from all ,
input, andoutputfields before writing (same as Phase 3A).evidence[].observation
- :
evaluators[].type用于语义评估器;"llm_judge"用于确定性检查(正则、长度、JSON有效性等)。"code_check" - :对于
evaluators[].rubric——基于观察到的Trace模式的完整提示文本。使用llm_judge和{{input}}作为通用占位符(而非{{output}}——这是ddeval特定的)。对于{{input_data}}——null。code_check - :可选的框架无关指导,例如“对于OpenAI Evals,使用
evaluators[].implementation_hints.notes作为模型评分标准。对于Braintrust,将其用作LLM评分器。对于Promptfoo,将其用作rubric断言。”llm-rubric - :来自阶段1的10-20个代表性Trace。
sample_records是Claude通过Trace检查得出的最佳猜测——并非基准真值。字段名称明确传达了这一点。suggested_labels - PII规则:在写入前,从所有、
input和output字段中剥离电子邮件、姓名和敏感数据(与阶段3A相同)。evidence[].observation
Writing Instructions
写入说明
- Assemble the JSON object in memory following the schema above.
- Populate from traces already fetched in Phase 1. Fetch additional traces (up to 20 total) if fewer than 10 were read.
sample_records - Anonymize PII in all ,
input, andoutputfields.evidence[].observation - Write the file with 2-space indentation using the Write tool.
- Display a completion summary:
undefined- 在内存中按照上述schema组装JSON对象。
- 从阶段1已获取的Trace中填充。如果读取的Trace少于10个,获取额外的Trace(最多20个)。
sample_records - 匿名化所有、
input和output字段中的PII。evidence[].observation - 使用Write工具写入文件,使用2空格缩进。
- 显示完成摘要:
undefinedGenerated Eval Spec
生成的评估规范
Wrote :
./evals/{ml_app}_eval_spec.json- {N} evaluators ({outcome_count} outcome, {format_count} format, {safety_count} safety)
- {M} sample records with suggested labels
| # | Name | Category | Type | Pass Criteria |
|---|---|---|---|---|
| 1 | ... | ... | ... | ... |
已写入:
./evals/{ml_app}_eval_spec.json- {N}个评估器({outcome_count}个结果类,{format_count}个格式类,{safety_count}个安全类)
- {M}个样本记录,包含建议标签
| # | 名称 | 类别 | 类型 | 通过标准 |
|---|---|---|---|---|
| 1 | ... | ... | ... | ... |
Next Steps
后续步骤
- Review: Open and verify the rubrics match your expectations
./evals/{ml_app}_eval_spec.json - Implement: Use the field to configure evaluators in your framework of choice:
rubric- OpenAI Evals: use as a model-graded criterion
rubric - Braintrust: create an LLM scorer with the rubric text
- Promptfoo: use as an assertion
llm-rubric - Custom code: call your LLM API with the rubric and parse the structured output
- OpenAI Evals: use
- Label: are Claude's best guesses from trace inspection — verify against ground truth before using as training data
suggested_labels
undefined- 审查:打开并验证评估准则是否符合你的预期
./evals/{ml_app}_eval_spec.json - 实现:使用字段在你选择的框架中配置评估器:
rubric- OpenAI Evals:将用作模型评分标准
rubric - Braintrust:使用评估准则文本创建LLM评分器
- Promptfoo:将其用作断言
llm-rubric - 自定义代码:使用评估准则调用LLM API并解析结构化输出
- OpenAI Evals:将
- 标记:是Claude通过Trace检查得出的最佳猜测——在用作训练数据前,针对基准真值进行验证
suggested_labels
undefinedNotebook export (after summary)
Notebook导出(摘要后)
Same logic as Phase 3A — offer to append to the RCA notebook if was detected, or create a new standalone notebook. Use the same notebook cell format as Phase 3A, substituting with the JSON spec file path. In pup mode, use / as described in Phase 3A.
rca_notebook_urloutput_pathpup notebooks createpup notebooks edit与阶段3A逻辑相同——如果检测到,提供附加到RCA Notebook的选项,否则创建新的独立Notebook。使用与阶段3A相同的Notebook单元格格式,将替换为JSON规范文件路径。在pup模式下,按照阶段3A中的描述使用 / 。
rca_notebook_urloutput_pathpup notebooks createpup notebooks editPhase 3C: Publish Online Evaluators to Datadog
阶段3C:将在线评估器发布到Datadog
Goal: For each confirmed evaluator, write an LLM-judge configuration to Datadog via so it runs automatically on matching production spans.
create_or_update_llmobs_evaluator目标:针对每个确认的评估器,通过将LLM-judge配置写入Datadog,使其自动在匹配的生产Span上运行。
create_or_update_llmobs_evaluatorPre-publish checks (single message — parallelize)
发布前检查(单个消息——并行化)
For every proposed , call :
eval_nameget_llmobs_evaluator(eval_name=...)-
Not found → safe to create.
-
Found → existing evaluator with the same name. Surface a diff to the user (existing dimension/prompt vs. proposed) and ask:Evaluatoralready exists. Overwrite, rename, or skip?
{name}If overwrite: keep the fetched config as the base and merge your generated fields on top, then send the complete object back. The MCP tool is full-replace — any field you omit (e.g.,temperature,max_tokens,filter) reverts to its default. Never re-publish without round-tripping the existing config.sampling_percentageIf rename: append a suffix (e.g.) and treat as new._v2If skip: drop from the publish set.
针对每个拟议的,调用:
eval_nameget_llmobs_evaluator(eval_name=...)-
未找到 → 可以安全创建。
-
找到 → 存在同名的现有评估器。向用户显示差异(现有维度/提示与拟议内容)并询问:评估器已存在。是否覆盖、重命名或跳过?
{name}如果选择覆盖:将获取的配置作为基础,合并生成的字段,然后将完整对象返回。MCP工具是全替换模式——任何省略的字段(例如、temperature、max_tokens、filter)都会重置为默认值。切勿不往返现有配置就重新发布。sampling_percentage如果选择重命名:添加后缀(例如)并视为新评估器。_v2如果选择跳过:从发布集中移除。
Publishing Conventions
发布约定
Required parameters for each call: , (= ), , , , , , , plus a string.
create_or_update_llmobs_evaluatoreval_nameapplication_nameml_appenabledintegration_providermodel_nameprompt_templateparsing_typeoutput_schematelemetry.intentDefaults to use unless the user overrides:
| Field | Default |
|---|---|
| |
| |
| |
| |
| |
| |
| |
Prompt template: convert the LLMJudge prompt into the MCP shape — an ordered array of messages. The system prompt becomes , the user prompt becomes . Use span-data placeholders (see below) — not the offline / form, which only exists in .
{role, content}{role: "system"}{role: "user"}{{input_data}}{{output_data}}EvaluatorContext每个调用的必填参数:、(=)、、、、、、,以及字符串。
create_or_update_llmobs_evaluatoreval_nameapplication_nameml_appenabledintegration_providermodel_nameprompt_templateparsing_typeoutput_schematelemetry.intent默认值,除非用户覆盖:
| 字段 | 默认值 |
|---|---|
| |
| |
| |
| |
| |
| Span范围为 |
| |
提示模板:将LLMJudge提示转换为MCP格式——有序的消息数组。系统提示变为,用户提示变为。使用Span数据占位符(如下所述)——不要使用离线的 / 形式,这仅存在于中。
{role, content}{role: "system"}{role: "user"}{{input_data}}{{output_data}}EvaluatorContextOnline Template Variables
在线模板变量
Online evaluator prompts run through the dd-source library (). Missing paths → empty string. The data shape templates resolve against depends on :
templatedomains/ml-observability/shared/libs/templateeval_scope- (default) — placeholders resolve against a single span's JSON (the
eval_scope: spanJSON-marshaled to a map). Use the span aliases / dot-paths below directly.llmobs.Span - — placeholders resolve against the trace payload
eval_scope: trace. Use{ spans: [...] },{{spans[N]...}}, or{{spans[*]...}}to select span(s) before applying field paths. The{{spans[field.path:value]...}}/{{span_input}}aliases are not available in trace scope — reference span data through the{{span_output}}array instead.spans - — not supported by this skill; classify as
eval_scope: sessionand surface the limitation to the user.span
在线评估器提示通过dd-source 库()解析。路径不存在→空字符串。模板解析的数据形状取决于:
templatedomains/ml-observability/shared/libs/templateeval_scope- (默认)——占位符针对单个Span的JSON(
eval_scope: span序列化为映射)解析。直接使用下面的Span别名/点路径。llmobs.Span - ——占位符针对Trace负载
eval_scope: trace解析。使用{ spans: [...] }、{{spans[N]...}}或{{spans[*]...}}选择Span,然后应用字段路径。{{spans[field.path:value]...}}/{{span_input}}别名在Trace范围中不可用——通过{{span_output}}数组引用Span数据。spans - ——本技能不支持;分类为
eval_scope: session并向用户说明限制。span
Span-scope (eval_scope: span
)
eval_scope: spanSpan范围(eval_scope: span
)
eval_scope: spanBuilt-in span-kind-aware aliases (preferred when the evaluator is generic across span kinds):
| Alias | LLM span ( | Other spans (agent, workflow, task, …) |
|---|---|---|
| | |
| | |
Common explicit dot-paths (use when the evaluator is purpose-built for one span kind):
| Path | What you get |
|---|---|
| Plain string I/O on agent / workflow / task / tool spans |
| All input message contents on an LLM span (newline-joined) |
| First message (typically system prompt) |
| Assistant response(s) |
| Retrieved docs (RAG) — JSON-serialized |
| Custom metadata fields |
| Available tools — JSON array |
| Entire span as compact JSON (debug / fall-back catch-all) |
内置的Span类型感知别名(当评估器跨Span类型通用时优先使用):
| 别名 | LLM Span( | 其他Span(agent、workflow、task等) |
|---|---|---|
| | |
| | |
常见显式点路径(当评估器专为一种Span类型设计时使用):
| 路径 | 获取内容 |
|---|---|
| agent/workflow/task/tool Span的纯字符串输入/输出 |
| LLM Span的所有输入消息内容(换行连接) |
| 第一条消息(通常为系统提示) |
| 助手响应 |
| 检索到的文档(RAG)——JSON序列化 |
| 自定义元数据字段 |
| 可用工具——JSON数组 |
| 整个Span的紧凑JSON(调试/回退兜底) |
Trace-scope (eval_scope: trace
)
eval_scope: traceTrace范围(eval_scope: trace
)
eval_scope: trace| Pattern | What you get |
|---|---|
| JSON of every span in the trace |
| Single span by index — |
| All span names in order, newline-joined |
| All spans' outputs, newline-joined (handy for "final answer = last output") |
| Filter by span name |
| All LLM-kind span outputs |
| Whole tool spans as JSON, paired in/out — useful for tool-use correctness |
| Text of every retrieved document — useful for RAG faithfulness |
| Entire trace payload as JSON (debug fallback) |
| 模式 | 获取内容 |
|---|---|
| Trace中所有Span的JSON |
| 通过索引选择单个Span—— |
| 所有Span名称按顺序排列,换行连接 |
| 所有Span的输出,换行连接(适用于“最终答案=最后一个输出”) |
| 按Span名称过滤 |
| 所有LLM类型Span的输出 |
| 完整的工具Span JSON,包含输入/输出——适用于工具使用正确性 |
| 所有检索到的文档文本——适用于RAG忠实度 |
| 整个Trace负载的JSON(调试回退) |
Array selector syntax (applies to both scopes)
数组选择器语法(适用于两种范围)
- — index (0-based)
[N] - — inclusive range,
[START,END]is clamped to slice lengthEND - — wildcard (fan-out over all elements)
[*] - — filter array elements by a nested field equality, e.g.
[field.path:value]ormessages[role:user]spans[meta.span.kind:tool]
Resolution rules to keep in mind when writing prompts:
- Arrays of strings → newline-joined
- Arrays of objects / mixed values → compact JSON
- Single empty slice → empty string
- Implicit fan-out: behaves the same as
messages.contentmessages[*].content - Negative indices are not supported (parse error) — use with a known index, or
[N]for "last assistant turn" semantics[*]
When to pick which form:
- Generic span evaluator (e.g. ,
tone_check) → useoutput_format/{{span_input}}so it works across span kinds.{{span_output}} - LLM-span-specific evaluator (e.g. ) → reach for explicit
system_prompt_adherence/meta.input.messages[*].contentso you can split system vs. user vs. assistant turns.meta.output.messages[*].content - Span-scope RAG evaluator (single retrieval+generation span) → combine with
{{meta.input.documents}}.{{span_output}} - Trace-scope evaluator → see "Trace-scope evaluator examples" below for the four canonical patterns (goal completion, tool-use correctness, RAG faithfulness, conversation quality).
- Metadata-aware evaluator → reference directly.
{{meta.metadata.<key>}}
If the user has existing custom evaluators in the same ml_app (Phase 0 coverage map), match their convention when there is no strong reason to deviate.
- — 索引(从0开始)
[N] - — 包含范围,
[START,END]被钳制为切片长度END - — 通配符(遍历所有元素)
[*] - — 按嵌套字段相等性过滤数组元素,例如
[field.path:value]或messages[role:user]spans[meta.span.kind:tool]
编写提示时需记住的解析规则:
- 字符串数组→换行连接
- 对象数组/混合值→紧凑JSON
- 单个空切片→空字符串
- 隐式遍历:与
messages.content行为相同messages[*].content - 不支持负索引(解析错误)——使用指定已知索引,或使用
[N]实现“最后一个助手轮次”语义[*]
何时选择哪种形式:
- 通用Span评估器(例如、
tone_check)→ 使用output_format/{{span_input}},使其跨Span类型工作。{{span_output}} - LLM Span特定评估器(例如)→ 使用显式的
system_prompt_adherence/meta.input.messages[*].content,以便区分系统/用户/助手轮次。meta.output.messages[*].content - Span范围RAG评估器(单个检索+生成Span)→ 组合和
{{meta.input.documents}}。{{span_output}} - Trace范围评估器→ 请参阅下面的“Trace范围评估器示例”,了解四个规范模式(目标完成、工具使用正确性、RAG忠实度、对话质量)。
- 元数据感知评估器→ 直接引用。
{{meta.metadata.<key>}}
如果同一ml_app中存在现有自定义评估器(阶段0覆盖范围映射),在没有充分理由偏离时匹配其约定。
Trace-scope evaluator examples
Trace范围评估器示例
Concrete user-prompt bodies for the four canonical trace-scope use cases, drawn from the public docs (Trace-Level Evaluations). Each goes alongside a static System prompt that describes the rubric (no placeholders).
| Use case | | User prompt body |
|---|---|---|
| Goal completion — agent finished the user's request | | |
| Tool-use correctness — right tool with right arguments | | |
| RAG faithfulness — answer grounded in retrieved docs | | |
| Conversation quality — coherence and consistency across turns | | |
Use these as starting points. Adapt the and span paths to the actual span names / kinds the app emits (observed during Phase 1).
filteroutput_schemaThe field is NOT a bare JSON Schema. It must use the OpenAI object shape. is a fixed type discriminator, not the evaluator name — the UI validates it against a strict allowlist and rejects any other value:
output_schemajson_schemaname| LLMJudge type | | property key inside |
|---|---|---|
| Boolean | | |
| Score | | |
| Categorical | | |
The property key inside must match exactly. The array may only be or — any other value is rejected. Always include for UI display.
schema.propertiesnamerequired["<type_key>"]["<type_key>", "reasoning"]"reasoning": {"type": "string"}Boolean ():
BooleanStructuredOutput(pass_when=True)json
{
"output_schema": {
"name": "boolean_eval",
"strict": true,
"schema": {
"type": "object",
"properties": {
"boolean_eval": {"type": "boolean", "description": "Whether the criterion is met"},
"reasoning": {"type": "string", "description": "Explanation for the evaluation"}
},
"required": ["boolean_eval", "reasoning"],
"additionalProperties": false
}
},
"assessment_criteria": {"pass_when": true}
}Score ():
ScoreStructuredOutput(min_score=1, max_score=10, min_threshold=7)json
{
"output_schema": {
"name": "score_eval",
"strict": true,
"schema": {
"type": "object",
"properties": {
"score_eval": {"type": "number", "description": "Score from 1 to 10", "minimum": 1, "maximum": 10},
"reasoning": {"type": "string", "description": "Explanation for the score"}
},
"required": ["score_eval", "reasoning"],
"additionalProperties": false
}
},
"assessment_criteria": {"min_threshold": 7}
}Add to if set.
max_thresholdassessment_criteriaCategorical ():
CategoricalStructuredOutput(categories={...}, pass_values=[...])json
{
"output_schema": {
"name": "categorical_eval",
"strict": true,
"schema": {
"type": "object",
"properties": {
"categorical_eval": {
"type": "string",
"anyOf": [
{"const": "correct", "description": "The response correctly answers the question"},
{"const": "partially_correct", "description": "Partially correct but missing information"},
{"const": "incorrect", "description": "The response is wrong or irrelevant"}
]
},
"reasoning": {"type": "string", "description": "Explanation for the category chosen"}
},
"required": ["categorical_eval", "reasoning"],
"additionalProperties": false
}
},
"assessment_criteria": {"pass_values": ["correct"]}
}Note: categorical uses alongside (each is a string value), unlike the offline SDK which uses bare at the property root.
"type": "string"anyOfconstanyOfCustom / multi-dimensional: not directly supported via the fixed-name schema. Implement as a score or categorical evaluator where possible, or split into multiple evaluators. The must be one of the three fixed values above.
nameFilter scoping: when the proposal targets a specific span kind (e.g. an LLM sub-span), translate it into an EVP query — e.g. , , or a more specific tag. Combine with only when the target is the trace root.
filter@meta.span.kind:llmservice:checkout-agentroot_spans_only:trueFor :
eval_scope: trace- The evaluator triggers once per completed trace, after a 3-minute inactivity window. Late-arriving spans (>3 min after the prior span on the same trace) are excluded from the evaluation. Surface this in the proposal so the user knows about both the latency and the potential miss for sparse-activity agents (long-running agents whose steps are sparser than 3 minutes apart).
- The query must match the trace's root span only — always include
filter(or@parent_id:undefined) to avoid double-firing across descendants. Combine withroot_spans_only: true(or whatever kind the app uses for root spans, observed in Phase 1) for narrowing.@meta.span.kind:agent - Sampling at trace scope is heavier than at span scope (one trace = many spans on the judge's side). Default to
sampling_percentagefor trace-scope evaluators (instead of the span default5); the user can raise it after a manual review pass.10
四个规范Trace范围用例的具体用户提示主体,来自公开文档(Trace-Level Evaluations)。每个提示主体都配有描述评估准则的静态系统提示(无占位符)。
| 用例 | | 用户提示主体 |
|---|---|---|
| 目标完成 — 代理是否完成了用户请求 | | |
| 工具使用正确性 — 是否使用了正确的工具和参数 | | |
| RAG忠实度 — 回答是否基于检索到的文档 | | |
| 对话质量 — 多轮对话的连贯性和一致性 | | |
将这些作为起点。根据阶段1中观察到的应用实际Span名称/类型,调整和Span路径。
filteroutput_schemaoutput_schemajson_schemaname| LLMJudge类型 | | |
|---|---|---|
| 布尔值 | | |
| 分数 | | |
| 分类 | | |
schema.propertiesnamerequired["<type_key>"]["<type_key>", "reasoning"]"reasoning": {"type": "string"}布尔值():
BooleanStructuredOutput(pass_when=True)json
{
"output_schema": {
"name": "boolean_eval",
"strict": true,
"schema": {
"type": "object",
"properties": {
"boolean_eval": {"type": "boolean", "description": "是否符合标准"},
"reasoning": {"type": "string", "description": "评估解释"}
},
"required": ["boolean_eval", "reasoning"],
"additionalProperties": false
}
},
"assessment_criteria": {"pass_when": true}
}分数():
ScoreStructuredOutput(min_score=1, max_score=10, min_threshold=7)json
{
"output_schema": {
"name": "score_eval",
"strict": true,
"schema": {
"type": "object",
"properties": {
"score_eval": {"type": "number", "description": "1到10分", "minimum": 1, "maximum": 10},
"reasoning": {"type": "string", "description": "评分解释"}
},
"required": ["score_eval", "reasoning"],
"additionalProperties": false
}
},
"assessment_criteria": {"min_threshold": 7}
}如果设置了,将其添加到中。
max_thresholdassessment_criteria分类():
CategoricalStructuredOutput(categories={...}, pass_values=[...])json
{
"output_schema": {
"name": "categorical_eval",
"strict": true,
"schema": {
"type": "object",
"properties": {
"categorical_eval": {
"type": "string",
"anyOf": [
{"const": "correct", "description": "响应正确回答了问题"},
{"const": "partially_correct", "description": "部分正确但缺少信息"},
{"const": "incorrect", "description": "响应错误或无关"}
]
},
"reasoning": {"type": "string", "description": "类别选择解释"}
},
"required": ["categorical_eval", "reasoning"],
"additionalProperties": false
}
},
"assessment_criteria": {"pass_values": ["correct"]}
}注意:分类使用和(每个是字符串值),与离线SDK不同,离线SDK在属性根级别使用裸。
"type": "string"anyOfconstanyOf自定义/多维:无法通过固定名称schema直接支持。尽可能实现为分数或分类评估器,或拆分为多个评估器。必须是上述三个固定值之一。
name过滤范围:当提案针对特定Span类型(例如LLM子Span)时,将其转换为EVP 查询——例如、或更具体的标签。仅当目标是Trace根Span时,才结合。
filter@meta.span.kind:llmservice:checkout-agentroot_spans_only:true对于:
eval_scope: trace- 评估器在每个完成的Trace上触发一次,等待3分钟无活动窗口。同一Trace中前一个Span后超过3分钟到达的Span会被排除在评估之外。在提案中说明这一点,以便用户了解延迟和稀疏活动代理的潜在遗漏(步骤间隔超过3分钟的长运行代理)。
- 查询必须仅匹配Trace的根Span——始终包含
filter(或@parent_id:undefined),避免在后代Span上重复触发。结合root_spans_only: true(或阶段1中观察到的应用根Span类型)进行缩小范围。@meta.span.kind:agent - Trace范围的采样比Span范围更重(一个Trace=Judge侧的多个Span)。Trace范围评估器的默认为**
sampling_percentage**(而非Span范围的默认5);用户在手动审查后可以提高该值。10
Always publish as draft (enabled: false
)
enabled: false始终以草稿形式发布(enabled: false
)
enabled: falseAlways create / update evaluators with — regardless of whether was auto-detected from existing evaluators. The UI is the source of truth for activation; the skill should never auto-enable evaluators on the user's behalf. The user reviews each draft in the UI, confirms the integration account is correct (the auto-detected ID may belong to a different judge LLM than the one they want for this app), and flips the toggle when they're satisfied.
enabled: falseintegration_account_idThis makes the workflow safe by default: a wrong , a mistuned prompt, or an over-broad filter never goes live without a human pass. Auto-detection of the account ID still helps because the draft renders with the right account pre-selected — review is faster.
integration_account_id始终以创建/更新评估器——无论是否从现有评估器自动检测到。UI是激活的权威来源;技能绝不能代表用户自动启用评估器。用户在UI中审查每个草稿,确认集成账户正确(自动检测的ID可能属于与用户为此应用所需不同的Judge LLM),并在满意后切换开关。
enabled: falseintegration_account_id这使工作流默认安全:错误的、调整不当的提示或过于宽泛的过滤器在没有人工检查的情况下永远不会生效。账户ID的自动检测仍然有用,因为草稿会预先选择正确的账户——审查更快。
integration_account_idintegration_account_id resolution
integration_account_id解析
The is an opaque UUID that the UI matches against the org's integration accounts list to populate the account section dropdown. Users typically don't know this value, so never ask the user to supply a raw UUID.
integration_account_idResolution order:
-
Inherit from existing evaluators — in Phase 0 you calledfor each existing custom evaluator. Check the
get_llmobs_evaluatorfield on those responses. If any of them have a value, use that same ID on the published drafts. If multiple different IDs appear across existing evaluators, pick the most common one and note which you chose so the user can correct it during the UI review pass.llm_provider.integration_account_id -
Omit if no existing evaluator has one — if no custom evaluator in the ml_app has an, omit the field from the publish payload. The draft will render without an account pre-selected; the user picks one during the UI review pass before activating.
integration_account_id
Either way, the evaluator is published with . The user is the gate — see "Always publish as draft" above.
enabled: falseintegration_account_id解析顺序:
-
从现有评估器继承——在阶段0中,你调用了获取每个现有自定义评估器。检查这些响应中的
get_llmobs_evaluator字段。如果其中任何一个有值,在发布的草稿中使用相同的ID。如果现有评估器中出现多个不同的ID,选择最常见的一个并说明你选择了哪个,以便用户在UI审查过程中更正。llm_provider.integration_account_id -
如果没有现有评估器包含该ID则省略——如果ml_app中的自定义评估器都没有,从发布负载中省略该字段。草稿将在未预先选择账户的情况下呈现;用户在激活前的UI审查过程中选择一个账户。
integration_account_id
无论哪种情况,评估器都以发布。用户是把关人——请参阅上面的“始终以草稿形式发布”。
enabled: falsePublish (single message — parallelize)
发布(单个消息——并行化)
Issue all calls in a single message (one per evaluator). Set to a short English description like .
create_or_update_llmobs_evaluatortelemetry.intent"skill:llm-obs-eval-bootstrap — Bootstrap evaluator suite for ml_app=<ml_app> from production trace analysis."If any call fails, capture the error and continue with the remaining evaluators — never silently abort the batch. Report failures explicitly in the summary.
在单个消息中发起所有调用(每个评估器对应一次调用)。将设置为简短的英文描述,例如。
create_or_update_llmobs_evaluatortelemetry.intent"skill:llm-obs-eval-bootstrap — Bootstrap evaluator suite for ml_app=<ml_app> from production trace analysis."如果任何调用失败,捕获错误并继续处理剩余评估器——切勿静默中止批量操作。在摘要中明确报告失败情况。
Summary
摘要
undefinedundefinedPublished Evaluators (drafts — pending UI review)
已发布评估器(草稿——待UI审查)
Wrote {N} online evaluators to ml_app . All published as drafts () — review and activate them in the UI before they start scoring spans.
{ml_app}enabled: false| # | Name | Action | Provider/Model | Sampling | Scope | Account auto-detected | Status |
|---|---|---|---|---|---|---|---|
| 1 | task_completion | created (draft) | openai/gpt-5.4-mini | 10% | span | yes | ok |
| 2 | response_groundedness | overwrote (draft) | openai/gpt-5.4-mini | 10% | span | yes | ok |
| 3 | scope_adherence | renamed ( | openai/gpt-5.4-mini | 10% | span | no — pick in UI | ok |
| 4 | citation_format | failed | openai/gpt-5.4-mini | 10% | span | — | error |
{If any failed:}
Errors:
- : {error message}
{name}
{If any code-based proposals were dropped:}
Not published (code-based, not supported by online evaluator API):
- ({type}) — consider running offline via
{name}(SDK mode)./eval-bootstrap {ml_app}
已将{N}个在线评估器写入ml_app 。所有评估器均以草稿形式发布()——在它们开始为Span评分前,在UI中审查并激活它们。
{ml_app}enabled: false| # | 名称 | 操作 | 提供商/模型 | 采样率 | 范围 | 账户自动检测 | 状态 |
|---|---|---|---|---|---|---|---|
| 1 | task_completion | 创建(草稿) | openai/gpt-5.4-mini | 10% | span | 是 | 成功 |
| 2 | response_groundedness | 覆盖(草稿) | openai/gpt-5.4-mini | 10% | span | 是 | 成功 |
| 3 | scope_adherence | 重命名( | openai/gpt-5.4-mini | 10% | span | 否——在UI中选择 | 成功 |
| 4 | citation_format | 失败 | openai/gpt-5.4-mini | 10% | span | — | 错误 |
{如果有失败情况:}
错误:
- : {错误消息}
{name}
{如果有基于代码的提案被丢弃:}
未发布(基于代码,在线评估器API不支持):
- ({type}) — 考虑通过
{name}(SDK模式)离线运行。/eval-bootstrap {ml_app}
Next Steps — review and activate in the UI
后续步骤——在UI中审查并激活
The drafts are intentionally not running yet. Walk through each one in the Datadog UI before flipping the enable toggle:
- Open the drafts: Datadog → LLM Observability → Evaluations → filter by ml_app (the new drafts appear with status
{ml_app}).Disabled - For each draft:
- Verify the integration account in the Provider section. If the column above shows , confirm it's the correct account for the judge LLM you want this evaluator to call through. If
auto-detected: yes, pick an account from the dropdown.no - Skim the prompt template and the structured-output schema — make sure the spans-vs-trace scope, filter, and sampling match what you actually want to measure.
- Click into a sample span/trace and use the test pane to dry-run the prompt against real data. Confirm the result matches your expectation.
- Verify the integration account in the Provider section. If the column above shows
- Enable: once each draft passes review, toggle it to enabled. Datadog starts scoring incoming spans immediately.
- Wait for first scores: with (span scope) or
sampling_percentage=10(trace scope), expect first results within minutes for high-traffic apps.5 - Tune sampling/filter: if results are noisy or volume is too high, reduce or tighten the
sampling_percentagefrom the UI. Re-runningfilterwill round-trip the existing config before overwriting — your manual tweaks survive across reruns./eval-bootstrap {ml_app} --publish
undefined草稿目前未运行。在切换启用开关前,在Datadog UI中逐个检查:
- 打开草稿:Datadog → LLM Observability → Evaluations → 按ml_app 过滤(新草稿显示为
{ml_app}状态)。Disabled - 针对每个草稿:
- 验证集成账户:在提供商部分。如果上面的列显示,确认它是你希望此评估器调用的Judge LLM的正确账户。如果显示
自动检测:是,从下拉菜单中选择一个账户。否 - 浏览提示模板和结构化输出schema——确保Span/Trace范围、过滤器和采样率与你实际要测量的内容匹配。
- 点击示例Span/Trace并使用测试窗格针对真实数据试运行提示。确认结果符合你的预期。
- 验证集成账户:在提供商部分。如果上面的列显示
- 启用:每个草稿通过审查后,切换为启用状态。Datadog立即开始为传入Span评分。
- 等待首次评分:对于(Span范围)或
sampling_percentage=10(Trace范围),高流量应用预计几分钟内会出现首次结果。5 - 调整采样/过滤器:如果结果嘈杂或流量过高,从UI中降低或收紧
sampling_percentage。重新运行filter会在覆盖前往返现有配置——你的手动调整会在重新运行后保留。/eval-bootstrap {ml_app} --publish
undefinedNotebook export (after summary)
Notebook导出(摘要后)
Same logic as Phase 3A — offer to append to the RCA notebook if was detected, or create a new standalone notebook. The notebook cell should list the published evaluators with their UI links and the they target. In pup mode, use / as described in Phase 3A.
rca_notebook_urlml_apppup notebooks createpup notebooks edit与阶段3A逻辑相同——如果检测到,提供附加到RCA Notebook的选项,否则创建新的独立Notebook。Notebook单元格应列出已发布的评估器及其UI链接和目标ml_app。在pup模式下,按照阶段3A中的描述使用 / 。
rca_notebook_urlpup notebooks createpup notebooks editOperating Rules
操作规则
- Breadth over precision; let the user curate: Propose 8–15 evaluators distributed across domain-specific (largest bucket — derived from Phase 1 domain signals), outcome, format, and safety. Users can always remove what doesn't fit their quality bar; they cannot easily add what was not proposed. Anchor every domain-specific proposal in at least one observed trace pattern — don't invent generic domain evaluators without evidence.
- Don't overfit: Write criteria that generalize beyond the specific sampled traces. Use examples as grounding, not as the sole criteria.
- Show your work: Every proposed evaluator cites at least one trace as evidence with a clickable link: .
[Trace {first_8}...](https://app.datadoghq.com/llm/traces?query=trace_id:{full_32_char_id}) - New file only: Never modify existing evaluator code or experiment configurations.
- Honest about uncertainty: If fewer than 5 traces support a proposed evaluator, flag it as tentative.
- 广度优先于精度,让用户筛选:提出8-15个评估器,分布在领域特定(最大类别——来自阶段1的领域信号)、结果、格式和安全类中。用户始终可以移除不符合其质量标准的评估器;但他们无法轻松添加未提出的评估器。每个领域特定提案都必须至少有一个观察到的Trace模式作为基础——不要在没有证据的情况下发明通用领域评估器。
- 不要过度拟合:编写的标准应超出特定采样Trace的范围。使用示例作为基础,而非唯一标准。
- 展示工作过程:每个拟议评估器至少引用一个Trace作为证据,并提供可点击链接:。
[Trace {first_8}...](https://app.datadoghq.com/llm/traces?query=trace_id:{full_32_char_id}) - 仅创建新文件:切勿修改现有评估器代码或实验配置。
- 诚实面对不确定性:如果支持拟议评估器的Trace少于5个,标记为暂定。
Tool Reference
工具参考
This appendix applies only in pup mode. In MCP mode, use the tool names in the workflow sections directly.
本附录仅适用于pup模式。在MCP模式下,直接使用工作流章节中的工具名称。
Spans and traces
Span和Trace
| MCP Tool | pup Command |
|---|---|
| |
| |
| |
| |
| |
| |
| |
| MCP工具 | pup命令 |
|---|---|
| |
| |
| |
| |
| |
| |
| |
Evaluators
评估器
| MCP Tool | pup Command |
|---|---|
| |
| |
| |
| |
| |
| |
| MCP工具 | pup命令 |
|---|---|
| |
| |
| |
| |
| |
| |
create_or_update_llmobs_evaluator
in pup mode
create_or_update_llmobs_evaluatorpup模式下的create_or_update_llmobs_evaluator
create_or_update_llmobs_evaluatorpup uses a flat JSON file (all fields top-level). returns a nested object. Transform as follows:
get-evaluator- Round-trip check: Call first. If it exists, start from its config.
pup llm-obs evals get-evaluator EVAL_NAME - Flatten : hoist
llm_provider,integration_provider,model_name,integration_account_idto top level, dropping thetemperaturekey.llm_provider - Merge and set .
enabled: false - Write to temp file and call:
Use unique temp file names when publishing multiple evaluators in parallel (e.g.bash
pup llm-obs evals create-or-update EVAL_NAME --file /tmp/eval_EVAL_NAME.json)./tmp/eval_toxicity.json
| Flat JSON key |
|---|---|
| |
| |
| |
| |
| All other fields | Unchanged (already top-level) |
pup使用扁平JSON文件(所有字段均为顶级)。返回嵌套对象。转换方式如下:
get-evaluator- 往返检查:首先调用。如果存在,从其配置开始。
pup llm-obs evals get-evaluator EVAL_NAME - 扁平化:将
llm_provider、integration_provider、model_name、integration_account_id提升到顶级,删除temperature键。llm_provider - 合并并设置。
enabled: false - 写入临时文件并调用:
并行发布多个评估器时使用唯一的临时文件名(例如bash
pup llm-obs evals create-or-update EVAL_NAME --file /tmp/eval_EVAL_NAME.json)。/tmp/eval_toxicity.json
| 扁平JSON键 |
|---|---|
| |
| |
| |
| |
| 所有其他字段 | 不变(已为顶级) |
Notebooks
Notebook
| MCP Tool | pup Command |
|---|---|
| |
| |
The cells file is a JSON array of cell objects:
json
[{"attributes": {"definition": {"type": "markdown", "text": "## Section\n\nContent."}}, "type": "notebook_cells"}]- MCP result parsing safety: Before writing any script (Python, jq, etc.) that iterates over or accesses fields in an MCP tool result, inspect the raw structure first — check , top-level keys, and whether the payload is nested inside a content block (e.g.
type(result)). Extract and[{'type': 'text', 'text': '<json>'}]the inner payload if needed before parsing. Never assume MCP results are bare dicts or lists.json.loads()
| MCP工具 | pup命令 |
|---|---|
| |
| |
单元格文件是单元格对象的JSON数组:
json
[{"attributes": {"definition": {"type": "markdown", "text": "## Section\n\nContent."}}, "type": "notebook_cells"}]- MCP结果解析安全:在编写任何迭代或访问MCP工具结果中字段的脚本(Python、jq等)之前,先检查原始结构——检查、顶级键,以及负载是否嵌套在内容块中(例如
type(result))。如果需要,提取并[{'type': 'text', 'text': '<json>'}]内部负载后再解析。切勿假设MCP结果是裸字典或列表。json.loads()