aws-bedrock-evals

AWS Bedrock Evaluation Jobs

Overview


Amazon Bedrock Evaluation Jobs measure how well your Bedrock-powered application performs by using a separate evaluator model (the "judge") to score prompt-response pairs against a set of metrics. The judge reads each pair with metric-specific instructions and produces a numeric score plus written reasoning.
Pre-computed Inference vs Live Inference
| Mode | How it works | Use when |
|------|--------------|----------|
| Live Inference | Bedrock generates responses during the eval job | Simple prompt-in/text-out, no tool calling |
| Pre-computed Inference | You pre-collect responses and supply them in a JSONL dataset | Tool calling, multi-turn conversations, custom orchestration, models outside Bedrock |
Use pre-computed inference when your application involves tool use, agent loops, multi-turn state, or external orchestration.
Pipeline
Design Scenarios → Collect Responses → Upload to S3 → Run Eval Job → Parse Results → Act on Findings
       |                  |                  |               |               |               |
  scenarios.json    Your app's API     s3://bucket/     create-         s3 sync +       Fix prompt,
  (multi-turn)      → dataset JSONL    datasets/        evaluation-job  parse JSONL      retune metrics


Agent Behavior: Gather Inputs and Show Cost Estimate


Before generating any configs, scripts, or artifacts, you MUST gather the following from the user:
  1. AWS Region — Which region to use (default: `us-east-1`). Affects model availability and pricing.
  2. Target model — The model their application uses (e.g., `amazon.nova-lite-v1:0`, `anthropic.claude-3-haiku`).
  3. Evaluator (judge) model — The model to score responses (e.g., `amazon.nova-pro-v1:0`). Should be at least as capable as the target.
  4. Application type — Brief description of what the app does. Used to design test scenarios and derive custom metrics.
  5. Number of test scenarios — How many they plan to test (recommend 13-20 for first run).
  6. Estimated JSONL entries — Derived from scenarios x avg turns per scenario.
  7. Number of metrics — Total (built-in + custom). Recommend starting with 6 built-in + 3-5 custom.
  8. S3 bucket — Existing bucket name or confirm creation of a new one.
  9. IAM role — Existing role ARN or confirm creation of a new one.

Cost Estimate


After gathering inputs, you MUST display a cost estimate before proceeding:

Estimated Cost Summary


| Item | Details | Est. Cost |
|------|---------|-----------|
| Response collection | {N} prompts x ~{T} tokens x {target_model_price} | ${X.XX} |
| Evaluation job | {N} prompts x {M} metrics x ~1,700 tokens x {judge_model_price} | ${X.XX} |
| S3 storage | < 1 MB | < $0.01 |
| Total per run | | ~${X.XX} |
Scaling: Each additional run costs ~${X.XX}. Adding 1 custom metric adds ~${Y.YY}/run.

**Cost formulas:**
- **Response collection**: `num_prompts x avg_input_tokens x input_price + num_prompts x avg_output_tokens x output_price`
- **Evaluation job**: `num_prompts x num_metrics x ~1,500 input_tokens x judge_input_price + num_prompts x num_metrics x ~200 output_tokens x judge_output_price`

**Model pricing reference:**

| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|-------|----------------------|------------------------|
| amazon.nova-lite-v1:0 | $0.06 | $0.24 |
| amazon.nova-pro-v1:0 | $0.80 | $3.20 |
| anthropic.claude-3-haiku | $0.25 | $1.25 |
| anthropic.claude-3-sonnet | $3.00 | $15.00 |
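The cost formulas above can be sketched as a small TypeScript helper. The function names are illustrative (not part of any AWS API); prices are copied from the table, and the ~1,500-input/~200-output judge-token figures are the assumptions stated in the formulas:

```typescript
// Prices in USD per 1M tokens, from the pricing table above.
interface ModelPrice { input: number; output: number }

const PRICES: Record<string, ModelPrice> = {
  "amazon.nova-lite-v1:0": { input: 0.06, output: 0.24 },
  "amazon.nova-pro-v1:0": { input: 0.8, output: 3.2 },
  "anthropic.claude-3-haiku": { input: 0.25, output: 1.25 },
  "anthropic.claude-3-sonnet": { input: 3.0, output: 15.0 },
};

const PER_M = 1_000_000;

// Response collection: num_prompts x avg tokens, priced at the target model's rates.
function collectionCost(
  numPrompts: number,
  avgInputTokens: number,
  avgOutputTokens: number,
  target: ModelPrice,
): number {
  return (
    (numPrompts * avgInputTokens * target.input +
      numPrompts * avgOutputTokens * target.output) / PER_M
  );
}

// Evaluation job: ~1,500 input + ~200 output judge tokens per prompt-metric pair.
function evalJobCost(numPrompts: number, numMetrics: number, judge: ModelPrice): number {
  const pairs = numPrompts * numMetrics;
  return (pairs * 1500 * judge.input + pairs * 200 * judge.output) / PER_M;
}
```

For 40 prompts and 9 metrics judged by `amazon.nova-pro-v1:0`, the judge cost dominates the target-model collection cost by two orders of magnitude, which is why adding metrics is the main scaling lever.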

---

Prerequisites



AWS CLI 2.33+ required (older versions silently drop `customMetricConfig`/`precomputedInferenceSource` fields)


```bash
aws --version
```

Verify target model access


```bash
aws bedrock get-foundation-model --model-identifier "TARGET_MODEL_ID" --region REGION
```

Verify evaluator model access


```bash
aws bedrock get-foundation-model --model-identifier "EVALUATOR_MODEL_ID" --region REGION
```

Good evaluator model choices: `amazon.nova-pro-v1:0`, `anthropic.claude-3-sonnet`, `anthropic.claude-3-haiku`. The evaluator should be at least as capable as your target model.

---

Step 1: Design Test Scenarios


List the application's functional areas (e.g., greeting, booking-flow, error-handling, etc.). Each category should have 2-4 scenarios covering happy path and edge cases.
Scenario JSON format:
```json
[
  {
    "id": "greeting-known-user",
    "category": "greeting",
    "context": { "userId": "user-123" },
    "turns": ["hello"]
  },
  {
    "id": "multi-step-flow",
    "category": "core-flow",
    "context": { "userId": "user-456" },
    "turns": [
      "hello",
      "I need help with X",
      "yes, proceed with that",
      "thanks"
    ]
  }
]
```
The `context` field holds any session/user data your app needs. Each turn in the array is one user message; the collection step handles the multi-turn conversation loop.
Edge case coverage dimensions:
  • Happy path: standard usage that should work perfectly
  • Missing information: user omits required fields
  • Unavailable resources: requested item doesn't exist
  • Out-of-scope requests: user asks something the app shouldn't handle
  • Error recovery: bad input, invalid data
  • Tone stress tests: complaints, frustration
Recommended count: 13-20 scenarios producing 30-50 JSONL entries (multi-turn scenarios produce one entry per turn).
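As a sanity check before collecting responses, the scenario file can be validated with a few lines of TypeScript. `expectedEntryCount` and `underCoveredCategories` are hypothetical helper names, not Bedrock tooling; they just encode the one-entry-per-turn rule and the 2-4-scenarios-per-category guideline above:

```typescript
interface Scenario {
  id: string;
  category: string;
  context: Record<string, unknown>;
  turns: string[];
}

// Each turn of each scenario yields one JSONL entry, so the expected
// dataset size is the total number of turns across all scenarios.
function expectedEntryCount(scenarios: Scenario[]): number {
  return scenarios.reduce((sum, s) => sum + s.turns.length, 0);
}

// Coverage check: which categories have fewer than `min` scenarios?
function underCoveredCategories(scenarios: Scenario[], min = 2): string[] {
  const counts = new Map<string, number>();
  for (const s of scenarios) counts.set(s.category, (counts.get(s.category) ?? 0) + 1);
  return [...counts.entries()].filter(([, n]) => n < min).map(([c]) => c);
}
```

Running this against the two-scenario example above reports 5 expected entries and flags both categories as under-covered, which is the signal to add more scenarios before moving on.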


Step 2: Collect Responses


Collect responses from your application however it runs. The goal is to produce a JSONL dataset file where each line contains the prompt, the model's response, and metadata.
Example pattern: Converse API with tool-calling loop (TypeScript)
This applies when your application uses Bedrock with tool calling:
```typescript
import {
  BedrockRuntimeClient,
  ConverseCommand,
  type Message,
  type SystemContentBlock,
} from "@aws-sdk/client-bedrock-runtime";

const client = new BedrockRuntimeClient({ region: "us-east-1" });

async function converseLoop(
  messages: Message[],
  systemPrompt: SystemContentBlock[],
  tools: any[]
): Promise<string> {
  const MAX_TOOL_ROUNDS = 10;

  for (let round = 0; round < MAX_TOOL_ROUNDS; round++) {
    const response = await client.send(
      new ConverseCommand({
        modelId: "TARGET_MODEL_ID",
        system: systemPrompt,
        messages,
        toolConfig: { tools },
        inferenceConfig: { maxTokens: 1024, topP: 0.9, temperature: 0.7 },
      })
    );

    const assistantContent = response.output?.message?.content as any[];
    if (!assistantContent) return "[No response from model]";

    messages.push({ role: "assistant", content: assistantContent });

    const toolUseBlocks = assistantContent.filter(
      (block: any) => block.toolUse != null
    );

    if (toolUseBlocks.length === 0) {
      return assistantContent
        .filter((block: any) => block.text != null)
        .map((block: any) => block.text as string)
        .join("\n") || "[Empty response]";
    }

    const toolResultBlocks: any[] = [];
    for (const block of toolUseBlocks) {
      const { toolUseId, name, input } = block.toolUse;
      const result = await executeTool(name, input);
      toolResultBlocks.push({
        toolResult: { toolUseId, content: [{ json: result }] },
      });
    }

    messages.push({ role: "user", content: toolResultBlocks } as Message);
  }

  return "[Max tool rounds exceeded]";
}
```
Multi-turn handling: Maintain the `messages` array across turns and build the dataset prompt field with conversation history:
```typescript
const messages: Message[] = [];
const conversationHistory: { role: string; text: string }[] = [];

for (let i = 0; i < scenario.turns.length; i++) {
  const userTurn = scenario.turns[i];
  messages.push({ role: "user", content: [{ text: userTurn }] });

  const assistantText = await converseLoop(messages, systemPrompt, tools);

  conversationHistory.push({ role: "user", text: userTurn });
  conversationHistory.push({ role: "assistant", text: assistantText });

  let prompt: string;
  if (i === 0) {
    prompt = userTurn;
  } else {
    prompt = conversationHistory
      .map((m) => `${m.role === "user" ? "User" : "Assistant"}: ${m.text}`)
      .join("\n");
  }

  entries.push({
    prompt,
    category: scenario.category,
    referenceResponse: "",
    modelResponses: [
      { response: assistantText, modelIdentifier: "my-app-v1" },
    ],
  });
}
```

Dataset JSONL Format


Each line must have this structure:
```json
{
  "prompt": "User question or multi-turn history",
  "referenceResponse": "",
  "modelResponses": [
    {
      "response": "The model's actual output text",
      "modelIdentifier": "my-app-v1"
    }
  ]
}
```
| Field | Required | Notes |
|-------|----------|-------|
| `prompt` | Yes | User input. For multi-turn, concatenate: `User: ...\nAssistant: ...\nUser: ...` |
| `referenceResponse` | No | Expected/ideal response. Can be empty string. Needed for `Builtin.Correctness` and `Builtin.Completeness` to work properly. Maps to the `{{ground_truth}}` template variable |
| `modelResponses` | Yes | Array with exactly one entry for pre-computed inference |
| `modelResponses[0].response` | Yes | The model's actual output text |
| `modelResponses[0].modelIdentifier` | Yes | Any string label. Must match `inferenceSourceIdentifier` in inference-config.json |

Constraints: One model response per prompt. One unique `modelIdentifier` per job. Max 1000 prompts per job.
Write JSONL:
```typescript
const jsonl = entries.map((e) => JSON.stringify(e)).join("\n") + "\n";
writeFileSync("datasets/collected-responses.jsonl", jsonl, "utf-8");
```
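Before uploading, it can pay to validate the file against the constraints above. This is an illustrative sketch (the `validateDataset` helper is not an AWS API); it checks the one-response-per-prompt rule, the single-`modelIdentifier`-per-job rule, and the 1000-prompt cap:

```typescript
interface DatasetEntry {
  prompt: string;
  referenceResponse?: string;
  modelResponses: { response: string; modelIdentifier: string }[];
}

// Returns a list of violations; an empty list means the dataset is well-formed.
function validateDataset(jsonl: string): string[] {
  const errors: string[] = [];
  const lines = jsonl.split("\n").filter((l) => l.trim() !== "");
  if (lines.length > 1000) errors.push(`too many prompts: ${lines.length} (max 1000)`);

  const identifiers = new Set<string>();
  lines.forEach((line, i) => {
    const entry = JSON.parse(line) as DatasetEntry;
    if (!entry.prompt) errors.push(`line ${i + 1}: missing prompt`);
    if (entry.modelResponses?.length !== 1) {
      errors.push(`line ${i + 1}: expected exactly one modelResponses entry`);
    } else {
      identifiers.add(entry.modelResponses[0].modelIdentifier);
    }
  });

  if (identifiers.size > 1)
    errors.push(`multiple modelIdentifier values in one job: ${[...identifiers].join(", ")}`);
  return errors;
}
```

Running this locally is much faster than waiting for the eval job to fail with a dataset validation error.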


Step 3: Design Metrics


Built-In Metrics


Bedrock provides 11 built-in metrics requiring no configuration beyond listing them by name:
| Metric Name | What It Measures |
|-------------|------------------|
| `Builtin.Correctness` | Is the factual content accurate? (works best with `referenceResponse`) |
| `Builtin.Completeness` | Does the response fully cover the request? (works best with `referenceResponse`) |
| `Builtin.Faithfulness` | Is the response faithful to the provided context/source? |
| `Builtin.Helpfulness` | Is the response useful, actionable, and cooperative? |
| `Builtin.Coherence` | Is the response logically structured and easy to follow? |
| `Builtin.Relevance` | Does the response address the actual question? |
| `Builtin.FollowingInstructions` | Does the response follow explicit instructions in the prompt? |
| `Builtin.ProfessionalStyleAndTone` | Is spelling, grammar, and tone appropriate? |
| `Builtin.Harmfulness` | Does the response contain harmful content? |
| `Builtin.Stereotyping` | Does the response contain stereotypes or bias? |
| `Builtin.Refusal` | Does the response appropriately refuse harmful requests? |
Score interpretation: `1.0` = best, `0.0` = worst, `null` = N/A (judge could not evaluate).
Note: `referenceResponse` is needed for `Builtin.Correctness` and `Builtin.Completeness` to produce meaningful scores, since the judge compares against a reference baseline.

When to Use Custom Metrics


Use custom metrics to check domain-specific behaviors the built-in metrics don't cover. If you find yourself thinking "this scored well on Helpfulness but violated a critical business rule" — that's a custom metric.
Technique: Extract rules from your system prompt. Every rule in your system prompt is a candidate metric:
System prompt says:                          Candidate metric:
────────────────────────────────────────────────────────────────
"Keep responses to 2-3 sentences max"     → response_brevity
"Always greet returning users by name"    → personalized_greeting
"Never proceed without user confirmation" → confirmation_check
"Ask for missing details, don't assume"   → missing_info_followup

Custom Metric JSON Anatomy


```json
{
  "customMetricDefinition": {
    "metricName": "my_metric_name",
    "instructions": "You are evaluating ... \n\nPrompt: {{prompt}}\nResponse: {{prediction}}",
    "ratingScale": [
      { "definition": "Poor", "value": { "floatValue": 0 } },
      { "definition": "Good", "value": { "floatValue": 1 } }
    ]
  }
}
```
| Field | Details |
|-------|---------|
| `metricName` | Snake_case identifier. Must appear in BOTH the `customMetrics` array AND the `metricNames` array |
| `instructions` | Full prompt sent to the judge. Must include the `{{prompt}}` and `{{prediction}}` template variables. Can also use `{{ground_truth}}` (maps to `referenceResponse`). Input variables must come last in the prompt. |
| `ratingScale` | Array of rating levels. Each has a `definition` (label, max 5 words / 100 chars) and a `value` with either `floatValue` or `stringValue` |

Official constraints:
  • Max 10 custom metrics per job
  • Instructions max 5000 characters
  • Rating `definition` max 5 words / 100 characters
  • Input variables (`{{prompt}}`, `{{prediction}}`, `{{ground_truth}}`) must come last in the instruction text
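A pre-flight check of these constraints catches errors before the job is submitted and rejected. `validateCustomMetrics` below is a hypothetical helper that encodes the documented limits:

```typescript
interface CustomMetric {
  customMetricDefinition: {
    metricName: string;
    instructions: string;
    ratingScale: { definition: string; value: { floatValue?: number; stringValue?: string } }[];
  };
}

// Returns a list of violations of the documented custom-metric constraints.
function validateCustomMetrics(metrics: CustomMetric[]): string[] {
  const errors: string[] = [];
  if (metrics.length > 10) errors.push("more than 10 custom metrics in one job");
  for (const m of metrics) {
    const { metricName, instructions, ratingScale } = m.customMetricDefinition;
    if (instructions.length > 5000)
      errors.push(`${metricName}: instructions exceed 5000 characters`);
    if (!instructions.includes("{{prompt}}") || !instructions.includes("{{prediction}}"))
      errors.push(`${metricName}: instructions must include {{prompt}} and {{prediction}}`);
    for (const level of ratingScale) {
      // definition label: max 5 words and max 100 characters
      if (level.definition.length > 100 || level.definition.split(/\s+/).length > 5)
        errors.push(`${metricName}: rating definition "${level.definition}" too long`);
    }
  }
  return errors;
}
```

This does not check that the input variables come last in the instruction text; that constraint is easier to verify by eye.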

Complete Custom Metric Example


A metric that checks whether the assistant follows a domain-specific rule, with N/A handling for irrelevant prompts:
```json
{
  "customMetricDefinition": {
    "metricName": "confirmation_check",
    "instructions": "You are evaluating an assistant application. A critical rule: the assistant must NEVER finalize a consequential action without first asking the user for explicit confirmation. Before executing, it must summarize details and ask something like 'Shall I go ahead?'.\n\nIf the conversation does not involve any consequential action, rate as 'Not Applicable'.\n\n- Not Applicable: No consequential action in this response\n- Poor: Proceeds with action without asking for confirmation\n- Good: Asks for confirmation before finalizing the action\n\nPrompt: {{prompt}}\nResponse: {{prediction}}",
    "ratingScale": [
      { "definition": "N/A", "value": { "floatValue": -1 } },
      { "definition": "Poor", "value": { "floatValue": 0 } },
      { "definition": "Good", "value": { "floatValue": 1 } }
    ]
  }
}
```
When the judge selects N/A (`floatValue: -1`), Bedrock records `"result": null`. Your parser must handle `null` — treat it as N/A and exclude it from averages.

Rating Scale Design


  • 3-4 levels for quality scales (Poor/Acceptable/Good/Excellent)
  • 2 levels for binary checks (Poor/Good)
  • Add an "N/A" level with `-1` for conditional metrics that only apply to certain prompt types
  • Rating values can use `floatValue` (numeric) or `stringValue` (text)

Tips for Writing Metric Instructions


  • Be explicit about what "good" and "bad" look like — include examples of phrases or behaviors
  • For conditional metrics, describe the N/A condition clearly so the judge doesn't score 0 when it should skip
  • Keep instructions under ~500 words to fit within context alongside prompt and response
  • Test with a few examples before running a full eval job


Step 4: AWS Infrastructure


S3 Bucket


```bash
REGION="us-east-1"
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
BUCKET_NAME="my-eval-${ACCOUNT_ID}-${REGION}"
```

us-east-1 does not accept LocationConstraint


```bash
if [ "${REGION}" = "us-east-1" ]; then
  aws s3api create-bucket --bucket "${BUCKET_NAME}" --region "${REGION}"
else
  aws s3api create-bucket --bucket "${BUCKET_NAME}" --region "${REGION}" \
    --create-bucket-configuration LocationConstraint="${REGION}"
fi
```

Upload the dataset:

```bash
aws s3 cp datasets/collected-responses.jsonl \
  "s3://${BUCKET_NAME}/datasets/collected-responses.jsonl"
```


IAM Role


Trust policy (must include an `aws:SourceAccount` condition — Bedrock rejects the role without it):
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "bedrock.amazonaws.com" },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "aws:SourceAccount": "YOUR_ACCOUNT_ID"
        }
      }
    }
  ]
}
```
Permissions policy:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "S3DatasetRead",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::YOUR_BUCKET",
        "arn:aws:s3:::YOUR_BUCKET/datasets/*"
      ]
    },
    {
      "Sid": "S3ResultsWrite",
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject"],
      "Resource": ["arn:aws:s3:::YOUR_BUCKET/results/*"]
    },
    {
      "Sid": "BedrockModelInvoke",
      "Effect": "Allow",
      "Action": ["bedrock:InvokeModel"],
      "Resource": [
        "arn:aws:bedrock:REGION::foundation-model/EVALUATOR_MODEL_ID"
      ]
    }
  ]
}
```
Replace `YOUR_BUCKET`, `REGION`, and `EVALUATOR_MODEL_ID` with actual values.
Create the role:
```bash
ROLE_NAME="BedrockEvalRole"

ROLE_ARN=$(aws iam create-role \
  --role-name "${ROLE_NAME}" \
  --assume-role-policy-document file://trust-policy.json \
  --description "Allows Bedrock to run evaluation jobs" \
  --query "Role.Arn" --output text)

aws iam put-role-policy \
  --role-name "${ROLE_NAME}" \
  --policy-name "BedrockEvalPolicy" \
  --policy-document file://permissions-policy.json
```


Step 5: Configure and Run Eval Job


eval-config.json


```json
{
  "automated": {
    "datasetMetricConfigs": [
      {
        "taskType": "General",
        "dataset": {
          "name": "my-eval-dataset",
          "datasetLocation": {
            "s3Uri": "s3://YOUR_BUCKET/datasets/collected-responses.jsonl"
          }
        },
        "metricNames": [
          "Builtin.Helpfulness",
          "Builtin.FollowingInstructions",
          "Builtin.ProfessionalStyleAndTone",
          "Builtin.Relevance",
          "Builtin.Completeness",
          "Builtin.Correctness",
          "my_custom_metric_1",
          "my_custom_metric_2"
        ]
      }
    ],
    "evaluatorModelConfig": {
      "bedrockEvaluatorModels": [
        { "modelIdentifier": "EVALUATOR_MODEL_ID" }
      ]
    },
    "customMetricConfig": {
      "customMetrics": [
        {
          "customMetricDefinition": {
            "metricName": "my_custom_metric_1",
            "instructions": "... {{prompt}} ... {{prediction}} ...",
            "ratingScale": [
              { "definition": "Poor", "value": { "floatValue": 0 } },
              { "definition": "Good", "value": { "floatValue": 1 } }
            ]
          }
        }
      ],
      "evaluatorModelConfig": {
        "bedrockEvaluatorModels": [
          { "modelIdentifier": "EVALUATOR_MODEL_ID" }
        ]
      }
    }
  }
}
```
Critical structure notes:
  1. `taskType` must be `"General"` (not "Generation" or any other value)
  2. Custom metric names must appear in both the `metricNames` array AND the `customMetrics` array
  3. `evaluatorModelConfig` appears twice: once at the top level (for built-in metrics) and once inside `customMetricConfig` (for custom metrics) — both must specify the same evaluator model
  4. `modelIdentifier` must be the exact model ID string matching across all configs

inference-config.json


For pre-computed inference, this tells Bedrock that responses are already collected:
json
{
  "models": [
    {
      "precomputedInferenceSource": {
        "inferenceSourceIdentifier": "my-app-v1"
      }
    }
  ]
}
The
inferenceSourceIdentifier
must match the
modelIdentifier
in your JSONL dataset's
modelResponses
.
对于预计算推理,该配置告知Bedrock响应已提前收集:
json
{
  "models": [
    {
      "precomputedInferenceSource": {
        "inferenceSourceIdentifier": "my-app-v1"
      }
    }
  ]
}
inferenceSourceIdentifier
必须与JSONL数据集
modelResponses
中的
modelIdentifier
匹配。

Running the Job

运行任务

bash
aws bedrock create-evaluation-job \
  --job-name "my-eval-$(date +%Y%m%d-%H%M)" \
  --role-arn "${ROLE_ARN}" \
  --evaluation-config file://eval-config.json \
  --inference-config file://inference-config.json \
  --output-data-config '{"s3Uri": "s3://YOUR_BUCKET/results/"}' \
  --region us-east-1
CLI notes:
  • Required params:
    --job-name
    ,
    --role-arn
    ,
    --evaluation-config
    ,
    --inference-config
    ,
    --output-data-config
  • Optional:
    --application-type
    (e.g.,
    ModelEvaluation
    )
  • --job-name
    constraint:
    [a-z0-9](-*[a-z0-9]){0,62}
    — lowercase + hyphens only, max 63 chars. Must be unique (use timestamps).
  • --evaluation-config
    and
    --inference-config
    are document types — must use
    file://
    or inline JSON, no shorthand syntax
  • --output-data-config
    is a structure — supports both inline JSON and shorthand (
    s3Uri=string
    )
bash
aws bedrock create-evaluation-job \
  --job-name "my-eval-$(date +%Y%m%d-%H%M)" \
  --role-arn "${ROLE_ARN}" \
  --evaluation-config file://eval-config.json \
  --inference-config file://inference-config.json \
  --output-data-config '{"s3Uri": "s3://YOUR_BUCKET/results/"}' \
  --region us-east-1
CLI注意事项:
  • 必填参数:
    --job-name
    ,
    --role-arn
    ,
    --evaluation-config
    ,
    --inference-config
    ,
    --output-data-config
  • 可选参数:
    --application-type
    (例如:
    ModelEvaluation
  • --job-name
    约束:
    [a-z0-9](-*[a-z0-9]){0,62}
    — 仅允许小写字母+连字符,最多63个字符。必须唯一(使用时间戳)。
  • --evaluation-config
    --inference-config
    为文档类型——必须使用
    file://
    或内联JSON,不支持简写语法
  • --output-data-config
    为结构化参数——支持内联JSON和简写(
    s3Uri=string

Monitoring

监控

bash
undefined
bash
undefined

List evaluation jobs (with optional filters)

列出评估任务(可添加筛选条件)

aws bedrock list-evaluation-jobs --region us-east-1 aws bedrock list-evaluation-jobs --status-equals Completed --region us-east-1 aws bedrock list-evaluation-jobs --name-contains "my-eval" --region us-east-1
aws bedrock list-evaluation-jobs --region us-east-1 aws bedrock list-evaluation-jobs --status-equals Completed --region us-east-1 aws bedrock list-evaluation-jobs --name-contains "my-eval" --region us-east-1

Get details for a specific job

获取特定任务的详情

aws bedrock get-evaluation-job
--job-identifier "JOB_ARN"
--region us-east-1
aws bedrock get-evaluation-job
--job-identifier "JOB_ARN"
--region us-east-1

Cancel a running job

取消运行中的任务

aws bedrock stop-evaluation-job
--job-identifier "JOB_ARN"
--region us-east-1

**Job statuses:** `InProgress`, `Completed`, `Failed`, `Stopping`, `Stopped`, `Deleting`

Jobs typically take 5-15 minutes for 30-50 entry datasets. If a job fails, check `failureMessages` in the job details.

---
aws bedrock stop-evaluation-job
--job-identifier "JOB_ARN"
--region us-east-1

**任务状态:** `InProgress`, `Completed`, `Failed`, `Stopping`, `Stopped`, `Deleting`

对于包含30-50条条目的数据集,任务通常需要5-15分钟完成。如果任务失败,请检查任务详情中的`failureMessages`。

---

Step 6: Parse Results

步骤6:解析结果

S3 Output Directory Structure

S3输出目录结构

Bedrock writes results to a deeply nested path:
s3://YOUR_BUCKET/results/
  └── <job-name>/
      └── <job-name>/
          ├── amazon-bedrock-evaluations-permission-check   ← empty sentinel
          └── <random-id>/
              ├── custom_metrics/                            ← metric definitions (NOT results)
              └── models/
                  └── <model-identifier>/
                      └── taskTypes/General/datasets/<dataset-name>/
                          └── <uuid>_output.jsonl            ← actual results
The job name is repeated twice. The random ID changes every run. Use
aws s3 sync
— do not construct paths manually.
Bedrock会将结果写入深度嵌套的路径:
s3://YOUR_BUCKET/results/
  └── <job-name>/
      └── <job-name>/
          ├── amazon-bedrock-evaluations-permission-check   ← 空的标记文件
          └── <random-id>/
              ├── custom_metrics/                            ← 指标定义(非结果)
              └── models/
                  └── <model-identifier>/
                      └── taskTypes/General/datasets/<dataset-name>/
                          └── <uuid>_output.jsonl            ← 实际结果
任务名称会重复两次。随机ID每次运行都会变化。使用
aws s3 sync
同步——不要手动构造路径。

Download Results

下载结果

bash
aws s3 sync "s3://YOUR_BUCKET/results/<job-name>" "./results/<job-name>" --region us-east-1
bash
aws s3 sync "s3://YOUR_BUCKET/results/<job-name>" "./results/<job-name>" --region us-east-1

Result JSONL Format

结果JSONL格式

Each line:
json
{
  "automatedEvaluationResult": {
    "scores": [
      {
        "metricName": "Builtin.Helpfulness",
        "result": 0.6667,
        "evaluatorDetails": [
          {
            "modelIdentifier": "amazon.nova-pro-v1:0",
            "explanation": "The response provides useful information..."
          }
        ]
      },
      {
        "metricName": "confirmation_check",
        "result": null,
        "evaluatorDetails": [
          {
            "modelIdentifier": "amazon.nova-pro-v1:0",
            "explanation": "This conversation does not involve any consequential action..."
          }
        ]
      }
    ]
  },
  "inputRecord": {
    "prompt": "hello",
    "referenceResponse": "",
    "modelResponses": [
      { "response": "Hello! How may I assist you?", "modelIdentifier": "my-app-v1" }
    ]
  }
}
  • result
    is a number (score) or
    null
    (N/A)
  • evaluatorDetails[0].explanation
    contains the judge's written reasoning
每一行:
json
{
  "automatedEvaluationResult": {
    "scores": [
      {
        "metricName": "Builtin.Helpfulness",
        "result": 0.6667,
        "evaluatorDetails": [
          {
            "modelIdentifier": "amazon.nova-pro-v1:0",
            "explanation": "The response provides useful information..."
          }
        ]
      },
      {
        "metricName": "confirmation_check",
        "result": null,
        "evaluatorDetails": [
          {
            "modelIdentifier": "amazon.nova-pro-v1:0",
            "explanation": "This conversation does not involve any consequential action..."
          }
        ]
      }
    ]
  },
  "inputRecord": {
    "prompt": "hello",
    "referenceResponse": "",
    "modelResponses": [
      { "response": "Hello! How may I assist you?", "modelIdentifier": "my-app-v1" }
    ]
  }
}
  • result
    为数值(分数)或
    null
    (不适用)
  • evaluatorDetails[0].explanation
    包含评判者的书面推理过程

Parsing and Aggregation

解析与聚合

typescript
interface PromptResult {
  prompt: string;
  category: string;
  modelResponse: string;
  scores: Record<string, {
    score: string;
    reasoning?: string;
    rawScore?: number;
  }>;
}

for (const s of entry.automatedEvaluationResult.scores) {
  scores[s.metricName] = {
    score: s.result === null ? "N/A" : String(s.result),
    reasoning: s.evaluatorDetails?.[0]?.explanation,
    rawScore: typeof s.result === "number" ? s.result : undefined,
  };
}
Aggregation approach:
  1. Overall averages per metric — exclude N/A entries
  2. Per-category breakdown — group by category field, compute averages within each
  3. Low-score alerts — flag entries below threshold (built-in < 0.5, custom <= 0)
Low-score alert format:
[Builtin.Relevance] score=0.50 | "hello..."
  Reason: The response does not directly address the greeting...

[confirmation_check] score=0.00 | "User: proceed with X..."
  Reason: The assistant executed the action without asking for confirmation...

typescript
interface PromptResult {
  prompt: string;
  category: string;
  modelResponse: string;
  scores: Record<string, {
    score: string;
    reasoning?: string;
    rawScore?: number;
  }>;
}

for (const s of entry.automatedEvaluationResult.scores) {
  scores[s.metricName] = {
    score: s.result === null ? "N/A" : String(s.result),
    reasoning: s.evaluatorDetails?.[0]?.explanation,
    rawScore: typeof s.result === "number" ? s.result : undefined,
  };
}
聚合方法:
  1. 各指标的总体平均值 — 排除不适用条目
  2. 按类别细分 — 按category字段分组,计算每个类别内的平均值
  3. 低分告警 — 标记低于阈值的条目(内置指标<0.5,自定义指标<=0)
低分告警格式:
[Builtin.Relevance] score=0.50 | "hello..."
  Reason: The response does not directly address the greeting...

[confirmation_check] score=0.00 | "User: proceed with X..."
  Reason: The assistant executed the action without asking for confirmation...

Step 7: Eval-Fix-Reeval Loop

步骤7:评估-修复-重新评估循环

Common Fixes

常见修复措施

FindingFix
Low brevity scoresAdd hard constraint: "Respond in no more than 3 sentences."
Low confirmation_checkAdd: "Before executing, summarize details and ask for confirmation."
Low missing_info_followupAdd: "If any required field is missing, ask for it. Do not assume."
Low tone on negative outcomesAdd empathy instructions for bad-news scenarios
Low Completeness on simple promptsMetric/data issue — add
referenceResponse
or filter from Completeness
发现问题修复方案
简洁性分数低添加硬性约束:"Respond in no more than 3 sentences."
confirmation_check分数低添加:"Before executing, summarize details and ask for confirmation."
缺失信息跟进分数低添加:"If any required field is missing, ask for it. Do not assume."
负面结果的语气分数低为坏消息场景添加共情指令
简单提示词的完整性分数低指标/数据问题——添加
referenceResponse
或从完整性指标中过滤此类提示词

Metric Refinement

指标优化

  • High N/A rates (>60%) — metric too narrowly scoped. Split dataset or adjust scope.
  • All-high scores — instructions too lenient. Add specific failure criteria.
  • Inconsistent scoring — instructions ambiguous. Add concrete examples per rating level.
  • 高不适用率(>60%)——指标范围过窄。拆分数据集或调整指标范围。
  • 全高分——指令过于宽松。添加具体的失败判定标准。
  • 评分不一致——指令模糊。为每个评分等级添加具体示例。

Run Comparison

运行对比

Run 1 (baseline):    response_brevity avg=0.42, custom_tone avg=0.80
Run 2 (post-fixes):  response_brevity avg=0.85, custom_tone avg=0.90
Track scores over time. The pipeline's value comes from repeated measurement.

Run 1 (baseline):    response_brevity avg=0.42, custom_tone avg=0.80
Run 2 (post-fixes):  response_brevity avg=0.85, custom_tone avg=0.90
随时间跟踪分数变化。流水线的价值在于反复测量。

Gotchas

注意事项

  1. taskType
    must be
    "General"
    — not "Generation" or any other value. The job fails silently with other values.
  2. Custom metric names in BOTH places — must appear in
    metricNames
    array AND
    customMetrics
    array. Missing from
    metricNames
    = silently ignored. Missing from
    customMetrics
    = job fails.
  3. null
    result means N/A, not 0
    — when the judge determines a metric doesn't apply, Bedrock records
    null
    :
    typescript
    // WRONG — treats N/A as 0
    const avg = scores.reduce((a, b) => a + (b ?? 0), 0) / scores.length;
    
    // RIGHT — excludes N/A from average
    const numericScores = scores.filter((s): s is number => s !== null);
    const avg = numericScores.reduce((a, b) => a + b, 0) / numericScores.length;
  4. evaluatorModelConfig
    appears twice
    — once at top level (built-in metrics), once inside
    customMetricConfig
    (custom metrics). Omitting either causes those metrics to fail.
  5. modelIdentifier
    must match exactly
    — the string in JSONL
    modelResponses
    must be character-for-character identical to
    inferenceSourceIdentifier
    in inference-config.json. Mismatch = model mapping error.
  6. AWS CLI 2.33+ required — older versions silently drop
    customMetricConfig
    and
    precomputedInferenceSource
    . Job creation succeeds but the job fails. Always check
    aws --version
    .
  7. Job names: lowercase + hyphens, max 63 chars — pattern:
    [a-z0-9](-*[a-z0-9]){0,62}
    . Must be unique across all jobs. Use timestamps:
    --job-name "my-eval-$(date +%Y%m%d-%H%M)"
    .
  8. S3 output is deeply nested
    <prefix>/<job-name>/<job-name>/<random-id>/models/...
    . Use
    aws s3 sync
    and search for
    _output.jsonl
    . Do not construct paths manually.
  9. referenceResponse
    improves Correctness/Completeness
    — empty string is valid, but providing reference responses gives the judge a baseline for comparison.
  10. <thinking>
    tag leakage (model-specific)
    — some models (e.g., Amazon Nova Lite) may leak
    <thinking>...</thinking>
    blocks into responses. If present, strip before writing JSONL:
    typescript
    const clean = raw.replace(/<thinking>[\s\S]*?<\/thinking>/g, "").trim();
  11. us-east-1 S3 bucket creation — do NOT pass
    LocationConstraint
    for
    us-east-1
    . Other regions require it.

  1. taskType
    必须设置为
    "General"
    — 不能是"Generation"或其他值。设置为其他值会导致任务静默失败。
  2. 自定义指标名称必须出现在两个位置 — 必须同时出现在
    metricNames
    数组和
    customMetrics
    数组中。未出现在
    metricNames
    中会被静默忽略。未出现在
    customMetrics
    中会导致任务失败。
  3. null
    结果表示不适用,而非0分
    — 当评判者判定指标不适用时,Bedrock会记录
    null
    typescript
    // 错误 — 将不适用视为0分
    const avg = scores.reduce((a, b) => a + (b ?? 0), 0) / scores.length;
    
    // 正确 — 排除不适用条目计算平均值
    const numericScores = scores.filter((s): s is number => s !== null);
    const avg = numericScores.reduce((a, b) => a + b, 0) / numericScores.length;
  4. evaluatorModelConfig
    出现两次
    — 一次在顶层(用于内置指标),一次在
    customMetricConfig
    内部(用于自定义指标)。省略任何一处都会导致对应指标失败。
  5. modelIdentifier
    必须完全匹配
    — JSONL的
    modelResponses
    中的字符串必须与inference-config.json中的
    inferenceSourceIdentifier
    完全一致。不匹配会导致模型映射错误。
  6. 需要AWS CLI 2.33+版本 — 旧版本会静默丢弃
    customMetricConfig
    precomputedInferenceSource
    。任务创建会成功,但任务会失败。请始终检查
    aws --version
  7. 任务名称:小写字母+连字符,最多63个字符 — 格式:
    [a-z0-9](-*[a-z0-9]){0,62}
    。所有任务中必须唯一。使用时间戳:
    --job-name "my-eval-$(date +%Y%m%d-%H%M)"
  8. S3输出路径深度嵌套
    <prefix>/<job-name>/<job-name>/<random-id>/models/...
    。使用
    aws s3 sync
    同步并搜索
    _output.jsonl
    。不要手动构造路径。
  9. referenceResponse
    提升正确性/完整性指标效果
    — 空字符串是合法的,但提供参考响应能为评判者提供对比基准。
  10. <thinking>
    标签泄露(模型特定)
    — 部分模型(例如Amazon Nova Lite)可能会在响应中泄露
    <thinking>...</thinking>
    块。如果存在此类内容,写入JSONL前需去除:
    typescript
    const clean = raw.replace(/<thinking>[\s\S]*?<\/thinking>/g, "").trim();
  11. us-east-1区域S3存储桶创建 — 不要为
    us-east-1
    区域传递
    LocationConstraint
    。其他区域需要该参数。

Cost Estimation

成本估算

Formula:
Total = response_collection_cost + judge_cost
Judge cost = num_prompts x num_metrics x (~1,500 input + ~200 output tokens) x judge_price
Example: 30 prompts, 10 metrics, Nova Pro judge:
  • Response collection (Nova Lite): ~$0.02
  • Evaluation job (Nova Pro): ~$0.58
  • Total per run: ~$0.61
Scaling: Cost is linear with prompts and metrics. 100 prompts x 10 metrics ≈ $5. Judge cost dominates at ~95%. Adding 1 custom metric adds ~$0.06/run (30 prompts, Nova Pro).

公式:
Total = response_collection_cost + judge_cost
Judge cost = num_prompts x num_metrics x (~1,500 input + ~200 output tokens) x judge_price
示例: 30个提示词,10个指标,使用Nova Pro作为评判者:
  • 响应收集(Nova Lite):~$0.02
  • 评估任务(Nova Pro):~$0.58
  • 每次运行总成本:~$0.61
扩展: 成本与提示词数量和指标数量呈线性关系。100个提示词×10个指标≈$5。评判者成本占比约95%。添加1个自定义指标会使每次运行成本增加约$0.06(30个提示词,Nova Pro)。

References

参考资料