aws-bedrock-evals
AWS Bedrock Evaluation Jobs
Overview
Amazon Bedrock Evaluation Jobs measure how well your Bedrock-powered application performs by using a separate evaluator model (the "judge") to score prompt-response pairs against a set of metrics. The judge reads each pair with metric-specific instructions and produces a numeric score plus written reasoning.
Pre-computed Inference vs Live Inference
| Mode | How it works | Use when |
|---|---|---|
| Live Inference | Bedrock generates responses during the eval job | Simple prompt-in/text-out, no tool calling |
| Pre-computed Inference | You pre-collect responses and supply them in a JSONL dataset | Tool calling, multi-turn conversations, custom orchestration, models outside Bedrock |
Use pre-computed inference when your application involves tool use, agent loops, multi-turn state, or external orchestration.
Pipeline
```
Design Scenarios → Collect Responses → Upload to S3 → Run Eval Job → Parse Results → Act on Findings
       |                 |                  |               |               |              |
 scenarios.json    Your app's API      s3://bucket/     create-         s3 sync +     Fix prompt,
 (multi-turn)   →  dataset JSONL       datasets/      evaluation-job   parse JSONL    retune metrics
```
Agent Behavior: Gather Inputs and Show Cost Estimate
Before generating any configs, scripts, or artifacts, you MUST gather the following from the user:
- AWS Region — Which region to use (default: `us-east-1`). Affects model availability and pricing.
- Target model — The model their application uses (e.g., `amazon.nova-lite-v1:0`, `anthropic.claude-3-haiku`).
- Evaluator (judge) model — The model to score responses (e.g., `amazon.nova-pro-v1:0`). Should be at least as capable as the target.
- Application type — Brief description of what the app does. Used to design test scenarios and derive custom metrics.
- Number of test scenarios — How many they plan to test (recommend 13-20 for first run).
- Estimated JSONL entries — Derived from scenarios x avg turns per scenario.
- Number of metrics — Total (built-in + custom). Recommend starting with 6 built-in + 3-5 custom.
- S3 bucket — Existing bucket name or confirm creation of a new one.
- IAM role — Existing role ARN or confirm creation of a new one.
Cost Estimate
After gathering inputs, you MUST display a cost estimate before proceeding:
Estimated Cost Summary
| Item | Details | Est. Cost |
|---|---|---|
| Response collection | {N} prompts x ~{T} tokens x {target_model_price} | ${X.XX} |
| Evaluation job | {N} prompts x {M} metrics x ~1,700 tokens x {judge_model_price} | ${X.XX} |
| S3 storage | < 1 MB | < $0.01 |
| Total per run | | ~${X.XX} |
Scaling: Each additional run costs ~${X.XX}. Adding 1 custom metric adds ~${Y.YY}/run.
**Cost formulas:**
- **Response collection**: `num_prompts x avg_input_tokens x input_price + num_prompts x avg_output_tokens x output_price`
- **Evaluation job**: `num_prompts x num_metrics x ~1,500 input_tokens x judge_input_price + num_prompts x num_metrics x ~200 output_tokens x judge_output_price`
**Model pricing reference:**
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|-------|----------------------|------------------------|
| amazon.nova-lite-v1:0 | $0.06 | $0.24 |
| amazon.nova-pro-v1:0 | $0.80 | $3.20 |
| anthropic.claude-3-haiku | $0.25 | $1.25 |
| anthropic.claude-3-sonnet | $3.00 | $15.00 |
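The cost formulas above can be wrapped in a small helper so the agent can fill in the summary table consistently; a sketch in TypeScript, where the token averages (500 input / 300 output per prompt) and the example prompt/metric counts are assumptions you should adjust:

```typescript
// Prices are $ per 1M tokens, matching the pricing reference table.
interface Pricing {
  input: number;
  output: number;
}

function estimateRunCost(
  numPrompts: number,
  numMetrics: number,
  target: Pricing,
  judge: Pricing,
  avgInputTokens = 500,  // assumed average prompt size
  avgOutputTokens = 300  // assumed average response size
): { collection: number; evaluation: number; total: number } {
  // Response collection: prompts through the target model
  const collection =
    (numPrompts * avgInputTokens * target.input +
      numPrompts * avgOutputTokens * target.output) / 1_000_000;
  // Eval job: each prompt x metric pair costs ~1,500 judge input + ~200 judge output tokens
  const evaluation =
    (numPrompts * numMetrics * 1500 * judge.input +
      numPrompts * numMetrics * 200 * judge.output) / 1_000_000;
  return { collection, evaluation, total: collection + evaluation };
}

// Example: 40 prompts, 9 metrics, nova-lite target, nova-pro judge
const est = estimateRunCost(40, 9, { input: 0.06, output: 0.24 }, { input: 0.8, output: 3.2 });
console.log(est.total.toFixed(2)); // ≈ 0.67 per run
```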
---
Prerequisites
```bash
# AWS CLI 2.33+ required (older versions silently drop customMetricConfig/precomputedInferenceSource fields)
aws --version

# Verify target model access
aws bedrock get-foundation-model --model-identifier "TARGET_MODEL_ID" --region REGION

# Verify evaluator model access
aws bedrock get-foundation-model --model-identifier "EVALUATOR_MODEL_ID" --region REGION
```
Good evaluator model choices: `amazon.nova-pro-v1:0`, `anthropic.claude-3-sonnet`, `anthropic.claude-3-haiku`. The evaluator should be at least as capable as your target model.
---
Step 1: Design Test Scenarios
List the application's functional areas (e.g., greeting, booking-flow, error-handling, etc.). Each category should have 2-4 scenarios covering happy path and edge cases.
Scenario JSON format:
```json
[
  {
    "id": "greeting-known-user",
    "category": "greeting",
    "context": { "userId": "user-123" },
    "turns": ["hello"]
  },
  {
    "id": "multi-step-flow",
    "category": "core-flow",
    "context": { "userId": "user-456" },
    "turns": [
      "hello",
      "I need help with X",
      "yes, proceed with that",
      "thanks"
    ]
  }
]
```

The `context` field holds any session/user data your app needs. Each turn in the `turns` array is one user message; the collection step handles the multi-turn conversation loop.

Edge case coverage dimensions:
- Happy path: standard usage that should work perfectly
- Missing information: user omits required fields
- Unavailable resources: requested item doesn't exist
- Out-of-scope requests: user asks something the app shouldn't handle
- Error recovery: bad input, invalid data
- Tone stress tests: complaints, frustration
Recommended count: 13-20 scenarios producing 30-50 JSONL entries (multi-turn scenarios produce one entry per turn).
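Since each turn of a multi-turn scenario produces one JSONL entry, the expected entry count can be derived directly from the scenarios file; a minimal sketch, where the `Scenario` shape mirrors the JSON format above:

```typescript
interface Scenario {
  id: string;
  category: string;
  context: Record<string, unknown>;
  turns: string[];
}

// One dataset entry per user turn across all scenarios
function expectedEntryCount(scenarios: Scenario[]): number {
  return scenarios.reduce((sum, s) => sum + s.turns.length, 0);
}

const scenarios: Scenario[] = [
  { id: "greeting-known-user", category: "greeting", context: { userId: "user-123" }, turns: ["hello"] },
  {
    id: "multi-step-flow",
    category: "core-flow",
    context: { userId: "user-456" },
    turns: ["hello", "I need help with X", "yes, proceed with that", "thanks"],
  },
];
console.log(expectedEntryCount(scenarios)); // 5
```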
Step 2: Collect Responses
Collect responses from your application however it runs. The goal is to produce a JSONL dataset file where each line contains the prompt, the model's response, and metadata.
Example pattern: Converse API with tool-calling loop (TypeScript)
This applies when your application uses Bedrock with tool calling:
```typescript
import {
  BedrockRuntimeClient,
  ConverseCommand,
  type Message,
  type SystemContentBlock,
} from "@aws-sdk/client-bedrock-runtime";

const client = new BedrockRuntimeClient({ region: "us-east-1" });

async function converseLoop(
  messages: Message[],
  systemPrompt: SystemContentBlock[],
  tools: any[]
): Promise<string> {
  const MAX_TOOL_ROUNDS = 10;
  for (let round = 0; round < MAX_TOOL_ROUNDS; round++) {
    const response = await client.send(
      new ConverseCommand({
        modelId: "TARGET_MODEL_ID",
        system: systemPrompt,
        messages,
        toolConfig: { tools },
        inferenceConfig: { maxTokens: 1024, topP: 0.9, temperature: 0.7 },
      })
    );

    const assistantContent = response.output?.message?.content as any[];
    if (!assistantContent) return "[No response from model]";
    messages.push({ role: "assistant", content: assistantContent });

    const toolUseBlocks = assistantContent.filter(
      (block: any) => block.toolUse != null
    );
    if (toolUseBlocks.length === 0) {
      return assistantContent
        .filter((block: any) => block.text != null)
        .map((block: any) => block.text as string)
        .join("\n") || "[Empty response]";
    }

    const toolResultBlocks: any[] = [];
    for (const block of toolUseBlocks) {
      const { toolUseId, name, input } = block.toolUse;
      const result = await executeTool(name, input);
      toolResultBlocks.push({
        toolResult: { toolUseId, content: [{ json: result }] },
      });
    }
    messages.push({ role: "user", content: toolResultBlocks } as Message);
  }
  return "[Max tool rounds exceeded]";
}
```

Multi-turn handling: Maintain the `messages` array across turns and build the dataset `prompt` field with conversation history:
```typescript
const messages: Message[] = [];
const conversationHistory: { role: string; text: string }[] = [];

for (let i = 0; i < scenario.turns.length; i++) {
  const userTurn = scenario.turns[i];
  messages.push({ role: "user", content: [{ text: userTurn }] });
  const assistantText = await converseLoop(messages, systemPrompt, tools);

  conversationHistory.push({ role: "user", text: userTurn });
  conversationHistory.push({ role: "assistant", text: assistantText });

  let prompt: string;
  if (i === 0) {
    prompt = userTurn;
  } else {
    prompt = conversationHistory
      .map((m) => `${m.role === "user" ? "User" : "Assistant"}: ${m.text}`)
      .join("\n");
  }

  entries.push({
    prompt,
    category: scenario.category,
    referenceResponse: "",
    modelResponses: [
      { response: assistantText, modelIdentifier: "my-app-v1" },
    ],
  });
}
```
Dataset JSONL Format
Each line must have this structure:
```json
{
  "prompt": "User question or multi-turn history",
  "referenceResponse": "",
  "modelResponses": [
    {
      "response": "The model's actual output text",
      "modelIdentifier": "my-app-v1"
    }
  ]
}
```

| Field | Required | Notes |
|---|---|---|
| `prompt` | Yes | User input. For multi-turn, concatenate the conversation history |
| `referenceResponse` | No | Expected/ideal response. Can be empty string. Needed for `Builtin.Correctness` and `Builtin.Completeness` |
| `modelResponses` | Yes | Array with exactly one entry for pre-computed inference |
| `response` | Yes | The model's actual output text |
| `modelIdentifier` | Yes | Any string label. Must match `inferenceSourceIdentifier` in inference-config.json |

Constraints: One model response per prompt. One unique `modelIdentifier` per job. Max 1000 prompts per job.

Write JSONL:
```typescript
import { writeFileSync } from "node:fs";

const jsonl = entries.map((e) => JSON.stringify(e)).join("\n") + "\n";
writeFileSync("datasets/collected-responses.jsonl", jsonl, "utf-8");
```
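A quick pre-upload check against the dataset constraints can save a failed eval job; a sketch, assuming the JSONL format described above (the helper name and error strings are illustrative):

```typescript
interface DatasetEntry {
  prompt: string;
  referenceResponse: string;
  modelResponses: { response: string; modelIdentifier: string }[];
}

// Validates: max 1000 prompts, exactly one response per prompt,
// and a single modelIdentifier across the whole dataset.
function validateDataset(jsonl: string): string[] {
  const errors: string[] = [];
  const lines = jsonl.split("\n").filter((l) => l.trim() !== "");
  if (lines.length > 1000) errors.push("More than 1000 prompts");
  const identifiers = new Set<string>();
  lines.forEach((line, i) => {
    const entry = JSON.parse(line) as DatasetEntry;
    if (entry.modelResponses.length !== 1) {
      errors.push(`Line ${i + 1}: expected exactly one modelResponse`);
    }
    entry.modelResponses.forEach((r) => identifiers.add(r.modelIdentifier));
  });
  if (identifiers.size > 1) errors.push("Multiple modelIdentifiers in one dataset");
  return errors;
}
```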
Step 3: Design Metrics
Built-In Metrics
Bedrock provides 11 built-in metrics requiring no configuration beyond listing them by name:
| Metric Name | What It Measures |
|---|---|
| `Builtin.Correctness` | Is the factual content accurate? (works best with `referenceResponse`) |
| `Builtin.Completeness` | Does the response fully cover the request? (works best with `referenceResponse`) |
| `Builtin.Faithfulness` | Is the response faithful to the provided context/source? |
| `Builtin.Helpfulness` | Is the response useful, actionable, and cooperative? |
| `Builtin.Coherence` | Is the response logically structured and easy to follow? |
| `Builtin.Relevance` | Does the response address the actual question? |
| `Builtin.FollowingInstructions` | Does the response follow explicit instructions in the prompt? |
| `Builtin.ProfessionalStyleAndTone` | Is spelling, grammar, and tone appropriate? |
| `Builtin.Harmfulness` | Does the response contain harmful content? |
| `Builtin.Stereotyping` | Does the response contain stereotypes or bias? |
| `Builtin.Refusal` | Does the response appropriately refuse harmful requests? |

Score interpretation: `1.0` = best, `0.0` = worst, `null` = N/A (judge could not evaluate).

Note: `referenceResponse` is needed for `Builtin.Correctness` and `Builtin.Completeness` to produce meaningful scores, since the judge compares against a reference baseline.
When to Use Custom Metrics
Use custom metrics to check domain-specific behaviors the built-in metrics don't cover. If you find yourself thinking "this scored well on Helpfulness but violated a critical business rule" — that's a custom metric.
Technique: Extract rules from your system prompt. Every rule in your system prompt is a candidate metric:
System prompt says: Candidate metric:
────────────────────────────────────────────────────────────────
"Keep responses to 2-3 sentences max" → response_brevity
"Always greet returning users by name" → personalized_greeting
"Never proceed without user confirmation" → confirmation_check
"Ask for missing details, don't assume" → missing_info_followup当内置指标无法覆盖领域特定行为时,使用自定义指标。如果您发现“这个响应在Helpfulness上得分很高,但违反了关键业务规则”,那么就需要自定义指标。
Custom Metric JSON Anatomy
```json
{
  "customMetricDefinition": {
    "metricName": "my_metric_name",
    "instructions": "You are evaluating ... \n\nPrompt: {{prompt}}\nResponse: {{prediction}}",
    "ratingScale": [
      { "definition": "Poor", "value": { "floatValue": 0 } },
      { "definition": "Good", "value": { "floatValue": 1 } }
    ]
  }
}
```

| Field | Details |
|---|---|
| `metricName` | Snake_case identifier. Must appear in BOTH `metricNames` and `customMetrics` |
| `instructions` | Full prompt sent to the judge. Must include `{{prompt}}` and `{{prediction}}` |
| `ratingScale` | Array of rating levels. Each has a `definition` and a `value` |

Official constraints:
- Max 10 custom metrics per job
- Instructions max 5000 characters
- Rating `definition` max 5 words / 100 characters
- Input variables (`{{prompt}}`, `{{prediction}}`, `{{ground_truth}}`) must come last in the instruction text
Complete Custom Metric Example
A metric that checks whether the assistant follows a domain-specific rule, with N/A handling for irrelevant prompts:
```json
{
  "customMetricDefinition": {
    "metricName": "confirmation_check",
    "instructions": "You are evaluating an assistant application. A critical rule: the assistant must NEVER finalize a consequential action without first asking the user for explicit confirmation. Before executing, it must summarize details and ask something like 'Shall I go ahead?'.\n\nIf the conversation does not involve any consequential action, rate as 'Not Applicable'.\n\n- Not Applicable: No consequential action in this response\n- Poor: Proceeds with action without asking for confirmation\n- Good: Asks for confirmation before finalizing the action\n\nPrompt: {{prompt}}\nResponse: {{prediction}}",
    "ratingScale": [
      { "definition": "N/A", "value": { "floatValue": -1 } },
      { "definition": "Poor", "value": { "floatValue": 0 } },
      { "definition": "Good", "value": { "floatValue": 1 } }
    ]
  }
}
```

When the judge selects N/A (`floatValue: -1`), Bedrock records `"result": null`. Your parser must handle `null` — treat it as N/A and exclude it from averages.
Rating Scale Design
- 3-4 levels for quality scales (Poor/Acceptable/Good/Excellent)
- 2 levels for binary checks (Poor/Good)
- Add an "N/A" level with `-1` for conditional metrics that only apply to certain prompt types
- Rating values can use `floatValue` (numeric) or `stringValue` (text)
Tips for Writing Metric Instructions
- Be explicit about what "good" and "bad" look like — include examples of phrases or behaviors
- For conditional metrics, describe the N/A condition clearly so the judge doesn't score 0 when it should skip
- Keep instructions under ~500 words to fit within context alongside prompt and response
- Test with a few examples before running a full eval job
Step 4: AWS Infrastructure
S3 Bucket
```bash
REGION="us-east-1"
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
BUCKET_NAME="my-eval-${ACCOUNT_ID}-${REGION}"

# us-east-1 does not accept LocationConstraint
if [ "${REGION}" = "us-east-1" ]; then
  aws s3api create-bucket --bucket "${BUCKET_NAME}" --region "${REGION}"
else
  aws s3api create-bucket --bucket "${BUCKET_NAME}" --region "${REGION}" \
    --create-bucket-configuration LocationConstraint="${REGION}"
fi
```

Upload the dataset:

```bash
aws s3 cp datasets/collected-responses.jsonl \
  "s3://${BUCKET_NAME}/datasets/collected-responses.jsonl"
```
IAM Role
Trust policy (must include the `aws:SourceAccount` condition — Bedrock rejects the role without it):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "bedrock.amazonaws.com" },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "aws:SourceAccount": "YOUR_ACCOUNT_ID"
        }
      }
    }
  ]
}
```

Permissions policy:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "S3DatasetRead",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::YOUR_BUCKET",
        "arn:aws:s3:::YOUR_BUCKET/datasets/*"
      ]
    },
    {
      "Sid": "S3ResultsWrite",
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject"],
      "Resource": ["arn:aws:s3:::YOUR_BUCKET/results/*"]
    },
    {
      "Sid": "BedrockModelInvoke",
      "Effect": "Allow",
      "Action": ["bedrock:InvokeModel"],
      "Resource": [
        "arn:aws:bedrock:REGION::foundation-model/EVALUATOR_MODEL_ID"
      ]
    }
  ]
}
```

Replace `YOUR_BUCKET`, `REGION`, and `EVALUATOR_MODEL_ID` with actual values.

Create the role:

```bash
ROLE_NAME="BedrockEvalRole"
ROLE_ARN=$(aws iam create-role \
  --role-name "${ROLE_NAME}" \
  --assume-role-policy-document file://trust-policy.json \
  --description "Allows Bedrock to run evaluation jobs" \
  --query "Role.Arn" --output text)

aws iam put-role-policy \
  --role-name "${ROLE_NAME}" \
  --policy-name "BedrockEvalPolicy" \
  --policy-document file://permissions-policy.json
```
Step 5: Configure and Run Eval Job
eval-config.json
```json
{
  "automated": {
    "datasetMetricConfigs": [
      {
        "taskType": "General",
        "dataset": {
          "name": "my-eval-dataset",
          "datasetLocation": {
            "s3Uri": "s3://YOUR_BUCKET/datasets/collected-responses.jsonl"
          }
        },
        "metricNames": [
          "Builtin.Helpfulness",
          "Builtin.FollowingInstructions",
          "Builtin.ProfessionalStyleAndTone",
          "Builtin.Relevance",
          "Builtin.Completeness",
          "Builtin.Correctness",
          "my_custom_metric_1",
          "my_custom_metric_2"
        ]
      }
    ],
    "evaluatorModelConfig": {
      "bedrockEvaluatorModels": [
        { "modelIdentifier": "EVALUATOR_MODEL_ID" }
      ]
    },
    "customMetricConfig": {
      "customMetrics": [
        {
          "customMetricDefinition": {
            "metricName": "my_custom_metric_1",
            "instructions": "... {{prompt}} ... {{prediction}} ...",
            "ratingScale": [
              { "definition": "Poor", "value": { "floatValue": 0 } },
              { "definition": "Good", "value": { "floatValue": 1 } }
            ]
          }
        }
      ],
      "evaluatorModelConfig": {
        "bedrockEvaluatorModels": [
          { "modelIdentifier": "EVALUATOR_MODEL_ID" }
        ]
      }
    }
  }
}
```

Critical structure notes:
- `taskType` must be `"General"` (not "Generation" or any other value)
- Custom metric names must appear in both the `metricNames` array AND the `customMetrics` array
- `evaluatorModelConfig` appears twice: once at the top level (for built-in metrics) and once inside `customMetricConfig` (for custom metrics) — both must specify the same evaluator model
- `modelIdentifier` must be the exact model ID string, matching across all configs
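These structural rules are easy to violate when editing the config by hand, so a small pre-flight check helps; a sketch that validates a parsed eval-config.json against the notes above (the helper name and error strings are illustrative):

```typescript
// Checks that every custom metric is also listed in metricNames and that
// the evaluator model matches between the two evaluatorModelConfig blocks.
function checkEvalConfig(config: any): string[] {
  const errors: string[] = [];
  const auto = config.automated;
  const metricNames: string[] = auto.datasetMetricConfigs[0].metricNames;
  const customMetrics: any[] = auto.customMetricConfig?.customMetrics ?? [];

  for (const m of customMetrics) {
    const name = m.customMetricDefinition.metricName;
    if (!metricNames.includes(name)) {
      errors.push(`Custom metric "${name}" missing from metricNames`);
    }
  }

  const topJudge =
    auto.evaluatorModelConfig.bedrockEvaluatorModels[0].modelIdentifier;
  const customJudge =
    auto.customMetricConfig?.evaluatorModelConfig.bedrockEvaluatorModels[0]
      .modelIdentifier;
  if (customJudge && customJudge !== topJudge) {
    errors.push("Evaluator model differs between top-level and customMetricConfig");
  }
  return errors;
}
```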
inference-config.json
For pre-computed inference, this tells Bedrock that responses are already collected:
```json
{
  "models": [
    {
      "precomputedInferenceSource": {
        "inferenceSourceIdentifier": "my-app-v1"
      }
    }
  ]
}
```

The `inferenceSourceIdentifier` must match the `modelIdentifier` in your JSONL dataset's `modelResponses`.
Running the Job
```bash
aws bedrock create-evaluation-job \
  --job-name "my-eval-$(date +%Y%m%d-%H%M)" \
  --role-arn "${ROLE_ARN}" \
  --evaluation-config file://eval-config.json \
  --inference-config file://inference-config.json \
  --output-data-config '{"s3Uri": "s3://YOUR_BUCKET/results/"}' \
  --region us-east-1
```

CLI notes:
- Required params: `--job-name`, `--role-arn`, `--evaluation-config`, `--inference-config`, `--output-data-config`
- Optional: `--application-type` (e.g., `ModelEvaluation`)
- `--job-name` constraint: `[a-z0-9](-*[a-z0-9]){0,62}` — lowercase + hyphens only, max 63 chars. Must be unique (use timestamps).
- `--evaluation-config` and `--inference-config` are document types — must use `file://` or inline JSON, no shorthand syntax
- `--output-data-config` is a structure — supports both inline JSON and shorthand (`s3Uri=string`)
Monitoring
```bash
# List evaluation jobs (with optional filters)
aws bedrock list-evaluation-jobs --region us-east-1
aws bedrock list-evaluation-jobs --status-equals Completed --region us-east-1
aws bedrock list-evaluation-jobs --name-contains "my-eval" --region us-east-1

# Get details for a specific job
aws bedrock get-evaluation-job \
  --job-identifier "JOB_ARN" \
  --region us-east-1

# Cancel a running job
aws bedrock stop-evaluation-job \
  --job-identifier "JOB_ARN" \
  --region us-east-1
```
**Job statuses:** `InProgress`, `Completed`, `Failed`, `Stopping`, `Stopped`, `Deleting`
Jobs typically take 5-15 minutes for 30-50 entry datasets. If a job fails, check `failureMessages` in the job details.
---
Step 6: Parse Results
S3 Output Directory Structure
Bedrock writes results to a deeply nested path:

```
s3://YOUR_BUCKET/results/
└── <job-name>/
    └── <job-name>/
        ├── amazon-bedrock-evaluations-permission-check   ← empty sentinel
        └── <random-id>/
            ├── custom_metrics/                           ← metric definitions (NOT results)
            └── models/
                └── <model-identifier>/
                    └── taskTypes/General/datasets/<dataset-name>/
                        └── <uuid>_output.jsonl           ← actual results
```

The job name is repeated twice. The random ID changes every run. Use `aws s3 sync` — do not construct paths manually.
s3://YOUR_BUCKET/results/
└── <job-name>/
└── <job-name>/
├── amazon-bedrock-evaluations-permission-check ← 空的标记文件
└── <random-id>/
├── custom_metrics/ ← 指标定义(非结果)
└── models/
└── <model-identifier>/
└── taskTypes/General/datasets/<dataset-name>/
└── <uuid>_output.jsonl ← 实际结果任务名称会重复两次。随机ID每次运行都会变化。使用同步——不要手动构造路径。
aws s3 syncDownload Results
下载结果
bash
aws s3 sync "s3://YOUR_BUCKET/results/<job-name>" "./results/<job-name>" --region us-east-1bash
aws s3 sync "s3://YOUR_BUCKET/results/<job-name>" "./results/<job-name>" --region us-east-1Result JSONL Format
结果JSONL格式
Each line:
json
{
"automatedEvaluationResult": {
"scores": [
{
"metricName": "Builtin.Helpfulness",
"result": 0.6667,
"evaluatorDetails": [
{
"modelIdentifier": "amazon.nova-pro-v1:0",
"explanation": "The response provides useful information..."
}
]
},
{
"metricName": "confirmation_check",
"result": null,
"evaluatorDetails": [
{
"modelIdentifier": "amazon.nova-pro-v1:0",
"explanation": "This conversation does not involve any consequential action..."
}
]
}
]
},
"inputRecord": {
"prompt": "hello",
"referenceResponse": "",
"modelResponses": [
{ "response": "Hello! How may I assist you?", "modelIdentifier": "my-app-v1" }
]
}
}- is a number (score) or
result(N/A)null - contains the judge's written reasoning
evaluatorDetails[0].explanation
每一行:
json
{
"automatedEvaluationResult": {
"scores": [
{
"metricName": "Builtin.Helpfulness",
"result": 0.6667,
"evaluatorDetails": [
{
"modelIdentifier": "amazon.nova-pro-v1:0",
"explanation": "The response provides useful information..."
}
]
},
{
"metricName": "confirmation_check",
"result": null,
"evaluatorDetails": [
{
"modelIdentifier": "amazon.nova-pro-v1:0",
"explanation": "This conversation does not involve any consequential action..."
}
]
}
]
},
"inputRecord": {
"prompt": "hello",
"referenceResponse": "",
"modelResponses": [
{ "response": "Hello! How may I assist you?", "modelIdentifier": "my-app-v1" }
]
}
}- 为数值(分数)或
result(不适用)null - 包含评判者的书面推理过程
evaluatorDetails[0].explanation
Parsing and Aggregation
解析与聚合
typescript
interface PromptResult {
prompt: string;
category: string;
modelResponse: string;
scores: Record<string, {
score: string;
reasoning?: string;
rawScore?: number;
}>;
}
for (const s of entry.automatedEvaluationResult.scores) {
scores[s.metricName] = {
score: s.result === null ? "N/A" : String(s.result),
reasoning: s.evaluatorDetails?.[0]?.explanation,
rawScore: typeof s.result === "number" ? s.result : undefined,
};
}Aggregation approach:
- Overall averages per metric — exclude N/A entries
- Per-category breakdown — group by category field, compute averages within each
- Low-score alerts — flag entries below threshold (built-in < 0.5, custom <= 0)
Low-score alert format:
[Builtin.Relevance] score=0.50 | "hello..."
Reason: The response does not directly address the greeting...
[confirmation_check] score=0.00 | "User: proceed with X..."
Reason: The assistant executed the action without asking for confirmation...typescript
interface PromptResult {
prompt: string;
category: string;
modelResponse: string;
scores: Record<string, {
score: string;
reasoning?: string;
rawScore?: number;
}>;
}
for (const s of entry.automatedEvaluationResult.scores) {
scores[s.metricName] = {
score: s.result === null ? "N/A" : String(s.result),
reasoning: s.evaluatorDetails?.[0]?.explanation,
rawScore: typeof s.result === "number" ? s.result : undefined,
};
}聚合方法:
- 各指标的总体平均值 — 排除不适用条目
- 按类别细分 — 按category字段分组,计算每个类别内的平均值
- 低分告警 — 标记低于阈值的条目(内置指标<0.5,自定义指标<=0)
低分告警格式:
[Builtin.Relevance] score=0.50 | "hello..."
Reason: The response does not directly address the greeting...
[confirmation_check] score=0.00 | "User: proceed with X..."
Reason: The assistant executed the action without asking for confirmation...Step 7: Eval-Fix-Reeval Loop
步骤7:评估-修复-重新评估循环
Common Fixes
常见修复措施
| Finding | Fix |
|---|---|
| Low brevity scores | Add hard constraint: "Respond in no more than 3 sentences." |
| Low confirmation_check | Add: "Before executing, summarize details and ask for confirmation." |
| Low missing_info_followup | Add: "If any required field is missing, ask for it. Do not assume." |
| Low tone on negative outcomes | Add empathy instructions for bad-news scenarios |
| Low Completeness on simple prompts | Metric/data issue — add |
| 发现问题 | 修复方案 |
|---|---|
| 简洁性分数低 | 添加硬性约束:"Respond in no more than 3 sentences." |
| confirmation_check分数低 | 添加:"Before executing, summarize details and ask for confirmation." |
| 缺失信息跟进分数低 | 添加:"If any required field is missing, ask for it. Do not assume." |
| 负面结果的语气分数低 | 为坏消息场景添加共情指令 |
| 简单提示词的完整性分数低 | 指标/数据问题——添加 |
Metric Refinement
指标优化
- High N/A rates (>60%) — metric too narrowly scoped. Split dataset or adjust scope.
- All-high scores — instructions too lenient. Add specific failure criteria.
- Inconsistent scoring — instructions ambiguous. Add concrete examples per rating level.
- 高不适用率(>60%)——指标范围过窄。拆分数据集或调整指标范围。
- 全高分——指令过于宽松。添加具体的失败判定标准。
- 评分不一致——指令模糊。为每个评分等级添加具体示例。
Run Comparison
运行对比
Run 1 (baseline): response_brevity avg=0.42, custom_tone avg=0.80
Run 2 (post-fixes): response_brevity avg=0.85, custom_tone avg=0.90Track scores over time. The pipeline's value comes from repeated measurement.
Run 1 (baseline): response_brevity avg=0.42, custom_tone avg=0.80
Run 2 (post-fixes): response_brevity avg=0.85, custom_tone avg=0.90随时间跟踪分数变化。流水线的价值在于反复测量。
Gotchas
注意事项
-
must be
taskType— not "Generation" or any other value. The job fails silently with other values."General" -
Custom metric names in BOTH places — must appear inarray AND
metricNamesarray. Missing fromcustomMetrics= silently ignored. Missing frommetricNames= job fails.customMetrics -
result means N/A, not 0 — when the judge determines a metric doesn't apply, Bedrock records
null:nulltypescript// WRONG — treats N/A as 0 const avg = scores.reduce((a, b) => a + (b ?? 0), 0) / scores.length; // RIGHT — excludes N/A from average const numericScores = scores.filter((s): s is number => s !== null); const avg = numericScores.reduce((a, b) => a + b, 0) / numericScores.length; -
appears twice — once at top level (built-in metrics), once inside
evaluatorModelConfig(custom metrics). Omitting either causes those metrics to fail.customMetricConfig -
must match exactly — the string in JSONL
modelIdentifiermust be character-for-character identical tomodelResponsesin inference-config.json. Mismatch = model mapping error.inferenceSourceIdentifier -
AWS CLI 2.33+ required — older versions silently dropand
customMetricConfig. Job creation succeeds but the job fails. Always checkprecomputedInferenceSource.aws --version -
Job names: lowercase + hyphens, max 63 chars — pattern:. Must be unique across all jobs. Use timestamps:
[a-z0-9](-*[a-z0-9]){0,62}.--job-name "my-eval-$(date +%Y%m%d-%H%M)" -
S3 output is deeply nested —. Use
<prefix>/<job-name>/<job-name>/<random-id>/models/...and search foraws s3 sync. Do not construct paths manually._output.jsonl -
improves Correctness/Completeness — empty string is valid, but providing reference responses gives the judge a baseline for comparison.
referenceResponse -
tag leakage (model-specific) — some models (e.g., Amazon Nova Lite) may leak
<thinking>blocks into responses. If present, strip before writing JSONL:<thinking>...</thinking>typescriptconst clean = raw.replace(/<thinking>[\s\S]*?<\/thinking>/g, "").trim(); -
us-east-1 S3 bucket creation — do NOT passfor
LocationConstraint. Other regions require it.us-east-1
-
必须设置为
taskType— 不能是"Generation"或其他值。设置为其他值会导致任务静默失败。"General" -
自定义指标名称必须出现在两个位置 — 必须同时出现在数组和
metricNames数组中。未出现在customMetrics中会被静默忽略。未出现在metricNames中会导致任务失败。customMetrics -
结果表示不适用,而非0分 — 当评判者判定指标不适用时,Bedrock会记录
null:nulltypescript// 错误 — 将不适用视为0分 const avg = scores.reduce((a, b) => a + (b ?? 0), 0) / scores.length; // 正确 — 排除不适用条目计算平均值 const numericScores = scores.filter((s): s is number => s !== null); const avg = numericScores.reduce((a, b) => a + b, 0) / numericScores.length; -
出现两次 — 一次在顶层(用于内置指标),一次在
evaluatorModelConfig内部(用于自定义指标)。省略任何一处都会导致对应指标失败。customMetricConfig -
必须完全匹配 — JSONL的
modelIdentifier中的字符串必须与inference-config.json中的modelResponses完全一致。不匹配会导致模型映射错误。inferenceSourceIdentifier -
需要AWS CLI 2.33+版本 — 旧版本会静默丢弃和
customMetricConfig。任务创建会成功,但任务会失败。请始终检查precomputedInferenceSource。aws --version -
任务名称:小写字母+连字符,最多63个字符 — 格式:。所有任务中必须唯一。使用时间戳:
[a-z0-9](-*[a-z0-9]){0,62}。--job-name "my-eval-$(date +%Y%m%d-%H%M)" -
S3输出路径深度嵌套 —。使用
<prefix>/<job-name>/<job-name>/<random-id>/models/...同步并搜索aws s3 sync。不要手动构造路径。_output.jsonl -
提升正确性/完整性指标效果 — 空字符串是合法的,但提供参考响应能为评判者提供对比基准。
referenceResponse -
标签泄露(模型特定) — 部分模型(例如Amazon Nova Lite)可能会在响应中泄露
<thinking>块。如果存在此类内容,写入JSONL前需去除:<thinking>...</thinking>typescriptconst clean = raw.replace(/<thinking>[\s\S]*?<\/thinking>/g, "").trim(); -
us-east-1区域S3存储桶创建 — 不要为区域传递
us-east-1。其他区域需要该参数。LocationConstraint
Cost Estimation
成本估算
Formula:
Total = response_collection_cost + judge_cost
Judge cost = num_prompts x num_metrics x (~1,500 input + ~200 output tokens) x judge_priceExample: 30 prompts, 10 metrics, Nova Pro judge:
- Response collection (Nova Lite): ~$0.02
- Evaluation job (Nova Pro): ~$0.58
- Total per run: ~$0.61
Scaling: Cost is linear with prompts and metrics. 100 prompts x 10 metrics ≈ $5. Judge cost dominates at ~95%. Adding 1 custom metric adds ~$0.06/run (30 prompts, Nova Pro).
公式:
Total = response_collection_cost + judge_cost
Judge cost = num_prompts x num_metrics x (~1,500 input + ~200 output tokens) x judge_price示例: 30个提示词,10个指标,使用Nova Pro作为评判者:
- 响应收集(Nova Lite):~$0.02
- 评估任务(Nova Pro):~$0.58
- 每次运行总成本:~$0.61
扩展: 成本与提示词数量和指标数量呈线性关系。100个提示词×10个指标≈$5。评判者成本占比约95%。添加1个自定义指标会使每次运行成本增加约$0.06(30个提示词,Nova Pro)。
References
参考资料
- Model Evaluation Metrics — all 11 built-in metrics
- Custom Metrics Prompt Formats — , template variables, constraints
metricName - Prompt Datasets for Judge Evaluation — dataset JSONL format
- CreateEvaluationJob API Reference — full API spec
- AWS CLI create-evaluation-job — CLI command reference
- Amazon Bedrock Pricing — model pricing
- Model Evaluation Metrics — 所有11个内置指标
- Custom Metrics Prompt Formats — 、模板变量、约束条件
metricName - Prompt Datasets for Judge Evaluation — 数据集JSONL格式
- CreateEvaluationJob API Reference — 完整API规范
- AWS CLI create-evaluation-job — CLI命令参考
- Amazon Bedrock Pricing — 模型定价