aws-bedrock-evals

AWS Bedrock Evaluation Jobs

Overview


Amazon Bedrock Evaluation Jobs measure how well your Bedrock-powered application performs by using a separate evaluator model (the "judge") to score prompt-response pairs against a set of metrics. The judge reads each pair with metric-specific instructions and produces a numeric score plus written reasoning.
Pre-computed Inference vs Live Inference
| Mode | How it works | Use when |
|------|--------------|----------|
| Live Inference | Bedrock generates responses during the eval job | Simple prompt-in/text-out, no tool calling |
| Pre-computed Inference | You pre-collect responses and supply them in a JSONL dataset | Tool calling, multi-turn conversations, custom orchestration, models outside Bedrock |
Use pre-computed inference when your application involves tool use, agent loops, multi-turn state, or external orchestration.
Pipeline
Design Scenarios → Collect Responses → Upload to S3 → Run Eval Job → Parse Results → Act on Findings
       |                  |                  |               |               |               |
  scenarios.json    Your app's API     s3://bucket/     create-         s3 sync +       Fix prompt,
  (multi-turn)      → dataset JSONL    datasets/        evaluation-job  parse JSONL      retune metrics


Agent Behavior: Gather Inputs and Show Cost Estimate


Before generating any configs, scripts, or artifacts, you MUST gather the following from the user:
  1. AWS Region — Which region to use (default: `us-east-1`). Affects model availability and pricing.
  2. Target model — The model their application uses (e.g., `amazon.nova-lite-v1:0`, `anthropic.claude-3-haiku`).
  3. Evaluator (judge) model — The model to score responses (e.g., `amazon.nova-pro-v1:0`). Should be at least as capable as the target.
  4. Application type — Brief description of what the app does. Used to design test scenarios and derive custom metrics.
  5. Number of test scenarios — How many they plan to test (recommend 13-20 for first run).
  6. Estimated JSONL entries — Derived from scenarios x avg turns per scenario.
  7. Number of metrics — Total (built-in + custom). Recommend starting with 6 built-in + 3-5 custom.
  8. S3 bucket — Existing bucket name or confirm creation of a new one.
  9. IAM role — Existing role ARN or confirm creation of a new one.

Cost Estimate


After gathering inputs, you MUST display a cost estimate before proceeding:

Estimated Cost Summary


| Item | Details | Est. Cost |
|------|---------|-----------|
| Response collection | {N} prompts x ~{T} tokens x {target_model_price} | ${X.XX} |
| Evaluation job | {N} prompts x {M} metrics x ~1,700 tokens x {judge_model_price} | ${X.XX} |
| S3 storage | < 1 MB | < $0.01 |
| Total per run | | ~${X.XX} |
Scaling: Each additional run costs ~${X.XX}. Adding 1 custom metric adds ~${Y.YY}/run.

**Cost formulas:**
- **Response collection**: `num_prompts x avg_input_tokens x input_price + num_prompts x avg_output_tokens x output_price`
- **Evaluation job**: `num_prompts x num_metrics x ~1,500 input_tokens x judge_input_price + num_prompts x num_metrics x ~200 output_tokens x judge_output_price`

**Model pricing reference:**

| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|-------|----------------------|------------------------|
| amazon.nova-lite-v1:0 | $0.06 | $0.24 |
| amazon.nova-pro-v1:0 | $0.80 | $3.20 |
| anthropic.claude-3-haiku | $0.25 | $1.25 |
| anthropic.claude-3-sonnet | $3.00 | $15.00 |
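The cost formulas above can be sketched as a small TypeScript helper. The function names are illustrative (not part of any AWS API); prices are copied from the table, and the ~1,500-input/~200-output judge-token figures are the assumptions stated in the formulas:

```typescript
// Prices in USD per 1M tokens, from the pricing table above.
interface ModelPrice { input: number; output: number }

const PRICES: Record<string, ModelPrice> = {
  "amazon.nova-lite-v1:0": { input: 0.06, output: 0.24 },
  "amazon.nova-pro-v1:0": { input: 0.8, output: 3.2 },
  "anthropic.claude-3-haiku": { input: 0.25, output: 1.25 },
  "anthropic.claude-3-sonnet": { input: 3.0, output: 15.0 },
};

const PER_M = 1_000_000;

// Response collection: num_prompts x avg tokens, priced at the target model's rates.
function collectionCost(
  numPrompts: number,
  avgInputTokens: number,
  avgOutputTokens: number,
  target: ModelPrice,
): number {
  return (
    (numPrompts * avgInputTokens * target.input +
      numPrompts * avgOutputTokens * target.output) / PER_M
  );
}

// Evaluation job: ~1,500 input + ~200 output judge tokens per prompt-metric pair.
function evalJobCost(numPrompts: number, numMetrics: number, judge: ModelPrice): number {
  const pairs = numPrompts * numMetrics;
  return (pairs * 1500 * judge.input + pairs * 200 * judge.output) / PER_M;
}
```

For 40 prompts and 9 metrics judged by `amazon.nova-pro-v1:0`, the judge cost dominates the target-model collection cost by two orders of magnitude, which is why adding metrics is the main scaling lever.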

---

Prerequisites



AWS CLI 2.33+ required (older versions silently drop `customMetricConfig`/`precomputedInferenceSource` fields)


```bash
aws --version
```

Verify target model access


```bash
aws bedrock get-foundation-model --model-identifier "TARGET_MODEL_ID" --region REGION
```

Verify evaluator model access


```bash
aws bedrock get-foundation-model --model-identifier "EVALUATOR_MODEL_ID" --region REGION
```

Good evaluator model choices: `amazon.nova-pro-v1:0`, `anthropic.claude-3-sonnet`, `anthropic.claude-3-haiku`. The evaluator should be at least as capable as your target model.

---

Step 1: Design Test Scenarios


List the application's functional areas (e.g., greeting, booking-flow, error-handling, etc.). Each category should have 2-4 scenarios covering happy path and edge cases.
Scenario JSON format:
```json
[
  {
    "id": "greeting-known-user",
    "category": "greeting",
    "context": { "userId": "user-123" },
    "turns": ["hello"]
  },
  {
    "id": "multi-step-flow",
    "category": "core-flow",
    "context": { "userId": "user-456" },
    "turns": [
      "hello",
      "I need help with X",
      "yes, proceed with that",
      "thanks"
    ]
  }
]
```
The `context` field holds any session/user data your app needs. Each turn in the array is one user message; the collection step handles the multi-turn conversation loop.
Edge case coverage dimensions:
  • Happy path: standard usage that should work perfectly
  • Missing information: user omits required fields
  • Unavailable resources: requested item doesn't exist
  • Out-of-scope requests: user asks something the app shouldn't handle
  • Error recovery: bad input, invalid data
  • Tone stress tests: complaints, frustration
Recommended count: 13-20 scenarios producing 30-50 JSONL entries (multi-turn scenarios produce one entry per turn).
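As a sanity check before collecting responses, the scenario file can be validated with a few lines of TypeScript. `expectedEntryCount` and `underCoveredCategories` are hypothetical helper names, not Bedrock tooling; they just encode the one-entry-per-turn rule and the 2-4-scenarios-per-category guideline above:

```typescript
interface Scenario {
  id: string;
  category: string;
  context: Record<string, unknown>;
  turns: string[];
}

// Each turn of each scenario yields one JSONL entry, so the expected
// dataset size is the total number of turns across all scenarios.
function expectedEntryCount(scenarios: Scenario[]): number {
  return scenarios.reduce((sum, s) => sum + s.turns.length, 0);
}

// Coverage check: which categories have fewer than `min` scenarios?
function underCoveredCategories(scenarios: Scenario[], min = 2): string[] {
  const counts = new Map<string, number>();
  for (const s of scenarios) counts.set(s.category, (counts.get(s.category) ?? 0) + 1);
  return [...counts.entries()].filter(([, n]) => n < min).map(([c]) => c);
}
```

Running this against the two-scenario example above reports 5 expected entries and flags both categories as under-covered, which is the signal to add more scenarios before moving on.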


Step 2: Collect Responses


Collect responses from your application however it runs. The goal is to produce a JSONL dataset file where each line contains the prompt, the model's response, and metadata.
Example pattern: Converse API with tool-calling loop (TypeScript)
This applies when your application uses Bedrock with tool calling:
```typescript
import {
  BedrockRuntimeClient,
  ConverseCommand,
  type Message,
  type SystemContentBlock,
} from "@aws-sdk/client-bedrock-runtime";

const client = new BedrockRuntimeClient({ region: "us-east-1" });

async function converseLoop(
  messages: Message[],
  systemPrompt: SystemContentBlock[],
  tools: any[]
): Promise<string> {
  const MAX_TOOL_ROUNDS = 10;

  for (let round = 0; round < MAX_TOOL_ROUNDS; round++) {
    const response = await client.send(
      new ConverseCommand({
        modelId: "TARGET_MODEL_ID",
        system: systemPrompt,
        messages,
        toolConfig: { tools },
        inferenceConfig: { maxTokens: 1024, topP: 0.9, temperature: 0.7 },
      })
    );

    const assistantContent = response.output?.message?.content as any[];
    if (!assistantContent) return "[No response from model]";

    messages.push({ role: "assistant", content: assistantContent });

    const toolUseBlocks = assistantContent.filter(
      (block: any) => block.toolUse != null
    );

    if (toolUseBlocks.length === 0) {
      return assistantContent
        .filter((block: any) => block.text != null)
        .map((block: any) => block.text as string)
        .join("\n") || "[Empty response]";
    }

    const toolResultBlocks: any[] = [];
    for (const block of toolUseBlocks) {
      const { toolUseId, name, input } = block.toolUse;
      const result = await executeTool(name, input);
      toolResultBlocks.push({
        toolResult: { toolUseId, content: [{ json: result }] },
      });
    }

    messages.push({ role: "user", content: toolResultBlocks } as Message);
  }

  return "[Max tool rounds exceeded]";
}
```
Multi-turn handling: Maintain the `messages` array across turns and build the dataset prompt field with conversation history:
```typescript
const messages: Message[] = [];
const conversationHistory: { role: string; text: string }[] = [];

for (let i = 0; i < scenario.turns.length; i++) {
  const userTurn = scenario.turns[i];
  messages.push({ role: "user", content: [{ text: userTurn }] });

  const assistantText = await converseLoop(messages, systemPrompt, tools);

  conversationHistory.push({ role: "user", text: userTurn });
  conversationHistory.push({ role: "assistant", text: assistantText });

  let prompt: string;
  if (i === 0) {
    prompt = userTurn;
  } else {
    prompt = conversationHistory
      .map((m) => `${m.role === "user" ? "User" : "Assistant"}: ${m.text}`)
      .join("\n");
  }

  entries.push({
    prompt,
    category: scenario.category,
    referenceResponse: "",
    modelResponses: [
      { response: assistantText, modelIdentifier: "my-app-v1" },
    ],
  });
}
```

Dataset JSONL Format


Each line must have this structure:
```json
{
  "prompt": "User question or multi-turn history",
  "referenceResponse": "",
  "modelResponses": [
    {
      "response": "The model's actual output text",
      "modelIdentifier": "my-app-v1"
    }
  ]
}
```
| Field | Required | Notes |
|-------|----------|-------|
| `prompt` | Yes | User input. For multi-turn, concatenate: `User: ...\nAssistant: ...\nUser: ...` |
| `referenceResponse` | No | Expected/ideal response. Can be empty string. Needed for `Builtin.Correctness` and `Builtin.Completeness` to work properly. Maps to the `{{ground_truth}}` template variable |
| `modelResponses` | Yes | Array with exactly one entry for pre-computed inference |
| `modelResponses[0].response` | Yes | The model's actual output text |
| `modelResponses[0].modelIdentifier` | Yes | Any string label. Must match `inferenceSourceIdentifier` in inference-config.json |

Constraints: One model response per prompt. One unique `modelIdentifier` per job. Max 1000 prompts per job.
Write JSONL:
```typescript
const jsonl = entries.map((e) => JSON.stringify(e)).join("\n") + "\n";
writeFileSync("datasets/collected-responses.jsonl", jsonl, "utf-8");
```
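Before uploading, it can pay to validate the file against the constraints above. This is an illustrative sketch (the `validateDataset` helper is not an AWS API); it checks the one-response-per-prompt rule, the single-`modelIdentifier`-per-job rule, and the 1000-prompt cap:

```typescript
interface DatasetEntry {
  prompt: string;
  referenceResponse?: string;
  modelResponses: { response: string; modelIdentifier: string }[];
}

// Returns a list of violations; an empty list means the dataset is well-formed.
function validateDataset(jsonl: string): string[] {
  const errors: string[] = [];
  const lines = jsonl.split("\n").filter((l) => l.trim() !== "");
  if (lines.length > 1000) errors.push(`too many prompts: ${lines.length} (max 1000)`);

  const identifiers = new Set<string>();
  lines.forEach((line, i) => {
    const entry = JSON.parse(line) as DatasetEntry;
    if (!entry.prompt) errors.push(`line ${i + 1}: missing prompt`);
    if (entry.modelResponses?.length !== 1) {
      errors.push(`line ${i + 1}: expected exactly one modelResponses entry`);
    } else {
      identifiers.add(entry.modelResponses[0].modelIdentifier);
    }
  });

  if (identifiers.size > 1)
    errors.push(`multiple modelIdentifier values in one job: ${[...identifiers].join(", ")}`);
  return errors;
}
```

Running this locally is much faster than waiting for the eval job to fail with a dataset validation error.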


Step 3: Design Metrics


Built-In Metrics


Bedrock provides 11 built-in metrics requiring no configuration beyond listing them by name:
| Metric Name | What It Measures |
|-------------|------------------|
| `Builtin.Correctness` | Is the factual content accurate? (works best with `referenceResponse`) |
| `Builtin.Completeness` | Does the response fully cover the request? (works best with `referenceResponse`) |
| `Builtin.Faithfulness` | Is the response faithful to the provided context/source? |
| `Builtin.Helpfulness` | Is the response useful, actionable, and cooperative? |
| `Builtin.Coherence` | Is the response logically structured and easy to follow? |
| `Builtin.Relevance` | Does the response address the actual question? |
| `Builtin.FollowingInstructions` | Does the response follow explicit instructions in the prompt? |
| `Builtin.ProfessionalStyleAndTone` | Is spelling, grammar, and tone appropriate? |
| `Builtin.Harmfulness` | Does the response contain harmful content? |
| `Builtin.Stereotyping` | Does the response contain stereotypes or bias? |
| `Builtin.Refusal` | Does the response appropriately refuse harmful requests? |
Score interpretation: `1.0` = best, `0.0` = worst, `null` = N/A (judge could not evaluate).
Note: `referenceResponse` is needed for `Builtin.Correctness` and `Builtin.Completeness` to produce meaningful scores, since the judge compares against a reference baseline.

When to Use Custom Metrics


Use custom metrics to check domain-specific behaviors the built-in metrics don't cover. If you find yourself thinking "this scored well on Helpfulness but violated a critical business rule" — that's a custom metric.
Technique: Extract rules from your system prompt. Every rule in your system prompt is a candidate metric:
System prompt says:                          Candidate metric:
────────────────────────────────────────────────────────────────
"Keep responses to 2-3 sentences max"     → response_brevity
"Always greet returning users by name"    → personalized_greeting
"Never proceed without user confirmation" → confirmation_check
"Ask for missing details, don't assume"   → missing_info_followup

Custom Metric JSON Anatomy


```json
{
  "customMetricDefinition": {
    "metricName": "my_metric_name",
    "instructions": "You are evaluating ... \n\nPrompt: {{prompt}}\nResponse: {{prediction}}",
    "ratingScale": [
      { "definition": "Poor", "value": { "floatValue": 0 } },
      { "definition": "Good", "value": { "floatValue": 1 } }
    ]
  }
}
```
| Field | Details |
|-------|---------|
| `metricName` | Snake_case identifier. Must appear in BOTH the `customMetrics` array AND the `metricNames` array |
| `instructions` | Full prompt sent to the judge. Must include the `{{prompt}}` and `{{prediction}}` template variables. Can also use `{{ground_truth}}` (maps to `referenceResponse`). Input variables must come last in the prompt. |
| `ratingScale` | Array of rating levels. Each has a `definition` (label, max 5 words / 100 chars) and a `value` with either `floatValue` or `stringValue` |

Official constraints:
  • Max 10 custom metrics per job
  • Instructions max 5000 characters
  • Rating `definition` max 5 words / 100 characters
  • Input variables (`{{prompt}}`, `{{prediction}}`, `{{ground_truth}}`) must come last in the instruction text
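A pre-flight check of these constraints catches errors before the job is submitted and rejected. `validateCustomMetrics` below is a hypothetical helper that encodes the documented limits:

```typescript
interface CustomMetric {
  customMetricDefinition: {
    metricName: string;
    instructions: string;
    ratingScale: { definition: string; value: { floatValue?: number; stringValue?: string } }[];
  };
}

// Returns a list of violations of the documented custom-metric constraints.
function validateCustomMetrics(metrics: CustomMetric[]): string[] {
  const errors: string[] = [];
  if (metrics.length > 10) errors.push("more than 10 custom metrics in one job");
  for (const m of metrics) {
    const { metricName, instructions, ratingScale } = m.customMetricDefinition;
    if (instructions.length > 5000)
      errors.push(`${metricName}: instructions exceed 5000 characters`);
    if (!instructions.includes("{{prompt}}") || !instructions.includes("{{prediction}}"))
      errors.push(`${metricName}: instructions must include {{prompt}} and {{prediction}}`);
    for (const level of ratingScale) {
      // definition label: max 5 words and max 100 characters
      if (level.definition.length > 100 || level.definition.split(/\s+/).length > 5)
        errors.push(`${metricName}: rating definition "${level.definition}" too long`);
    }
  }
  return errors;
}
```

This does not check that the input variables come last in the instruction text; that constraint is easier to verify by eye.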

Complete Custom Metric Example


A metric that checks whether the assistant follows a domain-specific rule, with N/A handling for irrelevant prompts:
```json
{
  "customMetricDefinition": {
    "metricName": "confirmation_check",
    "instructions": "You are evaluating an assistant application. A critical rule: the assistant must NEVER finalize a consequential action without first asking the user for explicit confirmation. Before executing, it must summarize details and ask something like 'Shall I go ahead?'.\n\nIf the conversation does not involve any consequential action, rate as 'Not Applicable'.\n\n- Not Applicable: No consequential action in this response\n- Poor: Proceeds with action without asking for confirmation\n- Good: Asks for confirmation before finalizing the action\n\nPrompt: {{prompt}}\nResponse: {{prediction}}",
    "ratingScale": [
      { "definition": "N/A", "value": { "floatValue": -1 } },
      { "definition": "Poor", "value": { "floatValue": 0 } },
      { "definition": "Good", "value": { "floatValue": 1 } }
    ]
  }
}
```
When the judge selects N/A (`floatValue: -1`), Bedrock records `"result": null`. Your parser must handle `null` — treat it as N/A and exclude it from averages.

Rating Scale Design


  • 3-4 levels for quality scales (Poor/Acceptable/Good/Excellent)
  • 2 levels for binary checks (Poor/Good)
  • Add an "N/A" level with `-1` for conditional metrics that only apply to certain prompt types
  • Rating values can use `floatValue` (numeric) or `stringValue` (text)

Tips for Writing Metric Instructions


  • Be explicit about what "good" and "bad" look like — include examples of phrases or behaviors
  • For conditional metrics, describe the N/A condition clearly so the judge doesn't score 0 when it should skip
  • Keep instructions under ~500 words to fit within context alongside prompt and response
  • Test with a few examples before running a full eval job


Step 4: AWS Infrastructure


S3 Bucket


```bash
REGION="us-east-1"
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
BUCKET_NAME="my-eval-${ACCOUNT_ID}-${REGION}"
```

us-east-1 does not accept LocationConstraint


```bash
if [ "${REGION}" = "us-east-1" ]; then
  aws s3api create-bucket --bucket "${BUCKET_NAME}" --region "${REGION}"
else
  aws s3api create-bucket --bucket "${BUCKET_NAME}" --region "${REGION}" \
    --create-bucket-configuration LocationConstraint="${REGION}"
fi
```

Upload the dataset:

```bash
aws s3 cp datasets/collected-responses.jsonl \
  "s3://${BUCKET_NAME}/datasets/collected-responses.jsonl"
```


IAM Role


Trust policy (must include an `aws:SourceAccount` condition — Bedrock rejects the role without it):
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "bedrock.amazonaws.com" },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "aws:SourceAccount": "YOUR_ACCOUNT_ID"
        }
      }
    }
  ]
}
```
Permissions policy:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "S3DatasetRead",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::YOUR_BUCKET",
        "arn:aws:s3:::YOUR_BUCKET/datasets/*"
      ]
    },
    {
      "Sid": "S3ResultsWrite",
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject"],
      "Resource": ["arn:aws:s3:::YOUR_BUCKET/results/*"]
    },
    {
      "Sid": "BedrockModelInvoke",
      "Effect": "Allow",
      "Action": ["bedrock:InvokeModel"],
      "Resource": [
        "arn:aws:bedrock:REGION::foundation-model/EVALUATOR_MODEL_ID"
      ]
    }
  ]
}
```
Replace `YOUR_BUCKET`, `REGION`, and `EVALUATOR_MODEL_ID` with actual values.
Create the role:
```bash
ROLE_NAME="BedrockEvalRole"

ROLE_ARN=$(aws iam create-role \
  --role-name "${ROLE_NAME}" \
  --assume-role-policy-document file://trust-policy.json \
  --description "Allows Bedrock to run evaluation jobs" \
  --query "Role.Arn" --output text)

aws iam put-role-policy \
  --role-name "${ROLE_NAME}" \
  --policy-name "BedrockEvalPolicy" \
  --policy-document file://permissions-policy.json
```


Step 5: Configure and Run Eval Job


eval-config.json


```json
{
  "automated": {
    "datasetMetricConfigs": [
      {
        "taskType": "General",
        "dataset": {
          "name": "my-eval-dataset",
          "datasetLocation": {
            "s3Uri": "s3://YOUR_BUCKET/datasets/collected-responses.jsonl"
          }
        },
        "metricNames": [
          "Builtin.Helpfulness",
          "Builtin.FollowingInstructions",
          "Builtin.ProfessionalStyleAndTone",
          "Builtin.Relevance",
          "Builtin.Completeness",
          "Builtin.Correctness",
          "my_custom_metric_1",
          "my_custom_metric_2"
        ]
      }
    ],
    "evaluatorModelConfig": {
      "bedrockEvaluatorModels": [
        { "modelIdentifier": "EVALUATOR_MODEL_ID" }
      ]
    },
    "customMetricConfig": {
      "customMetrics": [
        {
          "customMetricDefinition": {
            "metricName": "my_custom_metric_1",
            "instructions": "... {{prompt}} ... {{prediction}} ...",
            "ratingScale": [
              { "definition": "Poor", "value": { "floatValue": 0 } },
              { "definition": "Good", "value": { "floatValue": 1 } }
            ]
          }
        }
      ],
      "evaluatorModelConfig": {
        "bedrockEvaluatorModels": [
          { "modelIdentifier": "EVALUATOR_MODEL_ID" }
        ]
      }
    }
  }
}
```
Critical structure notes:
  1. `taskType` must be `"General"` (not "Generation" or any other value)
  2. Custom metric names must appear in both the `metricNames` array AND the `customMetrics` array
  3. `evaluatorModelConfig` appears twice: once at the top level (for built-in metrics) and once inside `customMetricConfig` (for custom metrics) — both must specify the same evaluator model
  4. `modelIdentifier` must be the exact model ID string matching across all configs

inference-config.json


For pre-computed inference, this tells Bedrock that responses are already collected:
json
{
  "models": [
    {
      "precomputedInferenceSource": {
        "inferenceSourceIdentifier": "my-app-v1"
      }
    }
  ]
}
The
inferenceSourceIdentifier
must match the
modelIdentifier
in your JSONL dataset's
modelResponses
.
对于预计算推理,该配置告知Bedrock响应已提前收集:
json
{
  "models": [
    {
      "precomputedInferenceSource": {
        "inferenceSourceIdentifier": "my-app-v1"
      }
    }
  ]
}
inferenceSourceIdentifier
必须与JSONL数据集
modelResponses
中的
modelIdentifier
匹配。

Running the Job

运行任务

bash
aws bedrock create-evaluation-job \
  --job-name "my-eval-$(date +%Y%m%d-%H%M)" \
  --role-arn "${ROLE_ARN}" \
  --evaluation-config file://eval-config.json \
  --inference-config file://inference-config.json \
  --output-data-config '{"s3Uri": "s3://YOUR_BUCKET/results/"}' \
  --region us-east-1
CLI notes:
  • Required params:
    --job-name
    ,
    --role-arn
    ,
    --evaluation-config
    ,
    --inference-config
    ,
    --output-data-config
  • Optional:
    --application-type
    (e.g.,
    ModelEvaluation
    )
  • --job-name
    constraint:
    [a-z0-9](-*[a-z0-9]){0,62}
    — lowercase + hyphens only, max 63 chars. Must be unique (use timestamps).
  • --evaluation-config
    and
    --inference-config
    are document types — must use
    file://
    or inline JSON, no shorthand syntax
  • --output-data-config
    is a structure — supports both inline JSON and shorthand (
    s3Uri=string
    )
bash
aws bedrock create-evaluation-job \
  --job-name "my-eval-$(date +%Y%m%d-%H%M)" \
  --role-arn "${ROLE_ARN}" \
  --evaluation-config file://eval-config.json \
  --inference-config file://inference-config.json \
  --output-data-config '{"s3Uri": "s3://YOUR_BUCKET/results/"}' \
  --region us-east-1
CLI注意事项:
  • 必填参数:
    --job-name
    ,
    --role-arn
    ,
    --evaluation-config
    ,
    --inference-config
    ,
    --output-data-config
  • 可选参数:
    --application-type
    (例如:
    ModelEvaluation
  • --job-name
    约束:
    [a-z0-9](-*[a-z0-9]){0,62}
    — 仅允许小写字母+连字符,最多63个字符。必须唯一(使用时间戳)。
  • --evaluation-config
    --inference-config
    为文档类型——必须使用
    file://
    或内联JSON,不支持简写语法
  • --output-data-config
    为结构化参数——支持内联JSON和简写(
    s3Uri=string

Monitoring

监控

bash
undefined
bash
undefined

List evaluation jobs (with optional filters)

列出评估任务(可添加筛选条件)

aws bedrock list-evaluation-jobs --region us-east-1 aws bedrock list-evaluation-jobs --status-equals Completed --region us-east-1 aws bedrock list-evaluation-jobs --name-contains "my-eval" --region us-east-1
aws bedrock list-evaluation-jobs --region us-east-1 aws bedrock list-evaluation-jobs --status-equals Completed --region us-east-1 aws bedrock list-evaluation-jobs --name-contains "my-eval" --region us-east-1

Get details for a specific job

获取特定任务的详情

aws bedrock get-evaluation-job
--job-identifier "JOB_ARN"
--region us-east-1
aws bedrock get-evaluation-job
--job-identifier "JOB_ARN"
--region us-east-1

Cancel a running job

取消运行中的任务

aws bedrock stop-evaluation-job
--job-identifier "JOB_ARN"
--region us-east-1

**Job statuses:** `InProgress`, `Completed`, `Failed`, `Stopping`, `Stopped`, `Deleting`

Jobs typically take 5-15 minutes for 30-50 entry datasets. If a job fails, check `failureMessages` in the job details.

---
aws bedrock stop-evaluation-job
--job-identifier "JOB_ARN"
--region us-east-1

**任务状态:** `InProgress`, `Completed`, `Failed`, `Stopping`, `Stopped`, `Deleting`

对于包含30-50条条目的数据集,任务通常需要5-15分钟完成。如果任务失败,请检查任务详情中的`failureMessages`。

---

Step 6: Parse Results

步骤6:解析结果

S3 Output Directory Structure

S3输出目录结构

Bedrock writes results to a deeply nested path:
s3://YOUR_BUCKET/results/
  └── <job-name>/
      └── <job-name>/
          ├── amazon-bedrock-evaluations-permission-check   ← empty sentinel
          └── <random-id>/
              ├── custom_metrics/                            ← metric definitions (NOT results)
              └── models/
                  └── <model-identifier>/
                      └── taskTypes/General/datasets/<dataset-name>/
                          └── <uuid>_output.jsonl            ← actual results
The job name is repeated twice. The random ID changes every run. Use
aws s3 sync
— do not construct paths manually.
Bedrock会将结果写入深度嵌套的路径:
s3://YOUR_BUCKET/results/
  └── <job-name>/
      └── <job-name>/
          ├── amazon-bedrock-evaluations-permission-check   ← 空的标记文件
          └── <random-id>/
              ├── custom_metrics/                            ← 指标定义(非结果)
              └── models/
                  └── <model-identifier>/
                      └── taskTypes/General/datasets/<dataset-name>/
                          └── <uuid>_output.jsonl            ← 实际结果
任务名称会重复两次。随机ID每次运行都会变化。使用
aws s3 sync
同步——不要手动构造路径。

Download Results

下载结果

bash
aws s3 sync "s3://YOUR_BUCKET/results/<job-name>" "./results/<job-name>" --region us-east-1
bash
aws s3 sync "s3://YOUR_BUCKET/results/<job-name>" "./results/<job-name>" --region us-east-1

Result JSONL Format

结果JSONL格式

Each line:
json
{
  "automatedEvaluationResult": {
    "scores": [
      {
        "metricName": "Builtin.Helpfulness",
        "result": 0.6667,
        "evaluatorDetails": [
          {
            "modelIdentifier": "amazon.nova-pro-v1:0",
            "explanation": "The response provides useful information..."
          }
        ]
      },
      {
        "metricName": "confirmation_check",
        "result": null,
        "evaluatorDetails": [
          {
            "modelIdentifier": "amazon.nova-pro-v1:0",
            "explanation": "This conversation does not involve any consequential action..."
          }
        ]
      }
    ]
  },
  "inputRecord": {
    "prompt": "hello",
    "referenceResponse": "",
    "modelResponses": [
      { "response": "Hello! How may I assist you?", "modelIdentifier": "my-app-v1" }
    ]
  }
}
  • result
    is a number (score) or
    null
    (N/A)
  • evaluatorDetails[0].explanation
    contains the judge's written reasoning
每一行:
json
{
  "automatedEvaluationResult": {
    "scores": [
      {
        "metricName": "Builtin.Helpfulness",
        "result": 0.6667,
        "evaluatorDetails": [
          {
            "modelIdentifier": "amazon.nova-pro-v1:0",
            "explanation": "The response provides useful information..."
          }
        ]
      },
      {
        "metricName": "confirmation_check",
        "result": null,
        "evaluatorDetails": [
          {
            "modelIdentifier": "amazon.nova-pro-v1:0",
            "explanation": "This conversation does not involve any consequential action..."
          }
        ]
      }
    ]
  },
  "inputRecord": {
    "prompt": "hello",
    "referenceResponse": "",
    "modelResponses": [
      { "response": "Hello! How may I assist you?", "modelIdentifier": "my-app-v1" }
    ]
  }
}
  • result
    为数值(分数)或
    null
    (不适用)
  • evaluatorDetails[0].explanation
    包含评判者的书面推理过程

Parsing and Aggregation

解析与聚合

typescript
interface PromptResult {
  prompt: string;
  category: string;
  modelResponse: string;
  scores: Record<string, {
    score: string;
    reasoning?: string;
    rawScore?: number;
  }>;
}

for (const s of entry.automatedEvaluationResult.scores) {
  scores[s.metricName] = {
    score: s.result === null ? "N/A" : String(s.result),
    reasoning: s.evaluatorDetails?.[0]?.explanation,
    rawScore: typeof s.result === "number" ? s.result : undefined,
  };
}
Aggregation approach:
  1. Overall averages per metric — exclude N/A entries
  2. Per-category breakdown — group by category field, compute averages within each
  3. Low-score alerts — flag entries below threshold (built-in < 0.5, custom <= 0)
Low-score alert format:
[Builtin.Relevance] score=0.50 | "hello..."
  Reason: The response does not directly address the greeting...

[confirmation_check] score=0.00 | "User: proceed with X..."
  Reason: The assistant executed the action without asking for confirmation...

typescript
interface PromptResult {
  prompt: string;
  category: string;
  modelResponse: string;
  scores: Record<string, {
    score: string;
    reasoning?: string;
    rawScore?: number;
  }>;
}

for (const s of entry.automatedEvaluationResult.scores) {
  scores[s.metricName] = {
    score: s.result === null ? "N/A" : String(s.result),
    reasoning: s.evaluatorDetails?.[0]?.explanation,
    rawScore: typeof s.result === "number" ? s.result : undefined,
  };
}
聚合方法:
  1. 各指标的总体平均值 — 排除不适用条目
  2. 按类别细分 — 按category字段分组,计算每个类别内的平均值
  3. 低分告警 — 标记低于阈值的条目(内置指标<0.5,自定义指标<=0)
低分告警格式:
[Builtin.Relevance] score=0.50 | "hello..."
  Reason: The response does not directly address the greeting...

[confirmation_check] score=0.00 | "User: proceed with X..."
  Reason: The assistant executed the action without asking for confirmation...

Step 7: Eval-Fix-Reeval Loop

步骤7:评估-修复-重新评估循环

Common Fixes

常见修复措施

FindingFix
Low brevity scoresAdd hard constraint: "Respond in no more than 3 sentences."
Low confirmation_checkAdd: "Before executing, summarize details and ask for confirmation."
Low missing_info_followupAdd: "If any required field is missing, ask for it. Do not assume."
Low tone on negative outcomesAdd empathy instructions for bad-news scenarios
Low Completeness on simple promptsMetric/data issue — add
referenceResponse
or filter from Completeness
发现问题修复方案
简洁性分数低添加硬性约束:"Respond in no more than 3 sentences."
confirmation_check分数低添加:"Before executing, summarize details and ask for confirmation."
缺失信息跟进分数低添加:"If any required field is missing, ask for it. Do not assume."
负面结果的语气分数低为坏消息场景添加共情指令
简单提示词的完整性分数低指标/数据问题——添加
referenceResponse
或从完整性指标中过滤此类提示词

Metric Refinement

指标优化

  • High N/A rates (>60%) — metric too narrowly scoped. Split dataset or adjust scope.
  • All-high scores — instructions too lenient. Add specific failure criteria.
  • Inconsistent scoring — instructions ambiguous. Add concrete examples per rating level.
  • 高不适用率(>60%)——指标范围过窄。拆分数据集或调整指标范围。
  • 全高分——指令过于宽松。添加具体的失败判定标准。
  • 评分不一致——指令模糊。为每个评分等级添加具体示例。

Run Comparison

运行对比

Run 1 (baseline):    response_brevity avg=0.42, custom_tone avg=0.80
Run 2 (post-fixes):  response_brevity avg=0.85, custom_tone avg=0.90
Track scores over time. The pipeline's value comes from repeated measurement.

Run 1 (baseline):    response_brevity avg=0.42, custom_tone avg=0.80
Run 2 (post-fixes):  response_brevity avg=0.85, custom_tone avg=0.90
随时间跟踪分数变化。流水线的价值在于反复测量。

Gotchas

注意事项

  1. taskType
    must be
    "General"
    — not "Generation" or any other value. The job fails silently with other values.
  2. Custom metric names in BOTH places — must appear in
    metricNames
    array AND
    customMetrics
    array. Missing from
    metricNames
    = silently ignored. Missing from
    customMetrics
    = job fails.
  3. null
    result means N/A, not 0
    — when the judge determines a metric doesn't apply, Bedrock records
    null
    :
    typescript
    // WRONG — treats N/A as 0
    const avg = scores.reduce((a, b) => a + (b ?? 0), 0) / scores.length;
    
    // RIGHT — excludes N/A from average
    const numericScores = scores.filter((s): s is number => s !== null);
    const avg = numericScores.reduce((a, b) => a + b, 0) / numericScores.length;
  4. evaluatorModelConfig
    appears twice
    — once at top level (built-in metrics), once inside
    customMetricConfig
    (custom metrics). Omitting either causes those metrics to fail.
  5. modelIdentifier
    must match exactly
    — the string in JSONL
    modelResponses
    must be character-for-character identical to
    inferenceSourceIdentifier
    in inference-config.json. Mismatch = model mapping error.
  6. AWS CLI 2.33+ required — older versions silently drop
    customMetricConfig
    and
    precomputedInferenceSource
    . Job creation succeeds but the job fails. Always check
    aws --version
    .
  7. Job names: lowercase + hyphens, max 63 chars — pattern:
    [a-z0-9](-*[a-z0-9]){0,62}
    . Must be unique across all jobs. Use timestamps:
    --job-name "my-eval-$(date +%Y%m%d-%H%M)"
    .
  8. S3 output is deeply nested
    <prefix>/<job-name>/<job-name>/<random-id>/models/...
    . Use
    aws s3 sync
    and search for
    _output.jsonl
    . Do not construct paths manually.
  9. referenceResponse
    improves Correctness/Completeness
    — empty string is valid, but providing reference responses gives the judge a baseline for comparison.
  10. <thinking>
    tag leakage (model-specific)
    — some models (e.g., Amazon Nova Lite) may leak
    <thinking>...</thinking>
    blocks into responses. If present, strip before writing JSONL:
    typescript
    const clean = raw.replace(/<thinking>[\s\S]*?<\/thinking>/g, "").trim();
  11. us-east-1 S3 bucket creation — do NOT pass
    LocationConstraint
    for
    us-east-1
    . Other regions require it.

  1. taskType
    必须设置为
    "General"
    — 不能是"Generation"或其他值。设置为其他值会导致任务静默失败。
  2. 自定义指标名称必须出现在两个位置 — 必须同时出现在
    metricNames
    数组和
    customMetrics
    数组中。未出现在
    metricNames
    中会被静默忽略。未出现在
    customMetrics
    中会导致任务失败。
  3. null
    结果表示不适用,而非0分
    — 当评判者判定指标不适用时,Bedrock会记录
    null
    typescript
    // 错误 — 将不适用视为0分
    const avg = scores.reduce((a, b) => a + (b ?? 0), 0) / scores.length;
    
    // 正确 — 排除不适用条目计算平均值
    const numericScores = scores.filter((s): s is number => s !== null);
    const avg = numericScores.reduce((a, b) => a + b, 0) / numericScores.length;
  4. evaluatorModelConfig
    出现两次
    — 一次在顶层(用于内置指标),一次在
    customMetricConfig
    内部(用于自定义指标)。省略任何一处都会导致对应指标失败。
  5. modelIdentifier
    必须完全匹配
    — JSONL的
    modelResponses
    中的字符串必须与inference-config.json中的
    inferenceSourceIdentifier
    完全一致。不匹配会导致模型映射错误。
  6. 需要AWS CLI 2.33+版本 — 旧版本会静默丢弃
    customMetricConfig
    precomputedInferenceSource
    。任务创建会成功,但任务会失败。请始终检查
    aws --version
  7. 任务名称:小写字母+连字符,最多63个字符 — 格式:
    [a-z0-9](-*[a-z0-9]){0,62}
    。所有任务中必须唯一。使用时间戳:
    --job-name "my-eval-$(date +%Y%m%d-%H%M)"
  8. S3输出路径深度嵌套
    <prefix>/<job-name>/<job-name>/<random-id>/models/...
    。使用
    aws s3 sync
    同步并搜索
    _output.jsonl
    。不要手动构造路径。
  9. referenceResponse
    提升正确性/完整性指标效果
    — 空字符串是合法的,但提供参考响应能为评判者提供对比基准。
  10. <thinking>
    标签泄露(模型特定)
    — 部分模型(例如Amazon Nova Lite)可能会在响应中泄露
    <thinking>...</thinking>
    块。如果存在此类内容,写入JSONL前需去除:
    typescript
    const clean = raw.replace(/<thinking>[\s\S]*?<\/thinking>/g, "").trim();
  11. us-east-1区域S3存储桶创建 — 不要为
    us-east-1
    区域传递
    LocationConstraint
    。其他区域需要该参数。

Cost Estimation

成本估算

Formula:
Total = response_collection_cost + judge_cost
Judge cost = num_prompts x num_metrics x (~1,500 input + ~200 output tokens) x judge_price
Example: 30 prompts, 10 metrics, Nova Pro judge:
  • Response collection (Nova Lite): ~$0.02
  • Evaluation job (Nova Pro): ~$0.58
  • Total per run: ~$0.61
Scaling: Cost is linear with prompts and metrics. 100 prompts x 10 metrics ≈ $5. Judge cost dominates at ~95%. Adding 1 custom metric adds ~$0.06/run (30 prompts, Nova Pro).

公式:
Total = response_collection_cost + judge_cost
Judge cost = num_prompts x num_metrics x (~1,500 input + ~200 output tokens) x judge_price
示例: 30个提示词,10个指标,使用Nova Pro作为评判者:
  • 响应收集(Nova Lite):~$0.02
  • 评估任务(Nova Pro):~$0.58
  • 每次运行总成本:~$0.61
扩展: 成本与提示词数量和指标数量呈线性关系。100个提示词×10个指标≈$5。评判者成本占比约95%。添加1个自定义指标会使每次运行成本增加约$0.06(30个提示词,Nova Pro)。

References

参考资料