agents-debug


Diagnose why your AgentCore agent or environment isn't working correctly.

When to use


  • Your agent is returning wrong answers or errors
  • Tool calls are failing or timing out
  • Agent works locally but fails after deploying
  • Logs aren't showing up in CloudWatch
  • The AgentCore CLI isn't working or environment seems broken
  • `agentcore` command not found or prerequisites are missing
Do NOT use for:
  • Deploy failures (CDK errors, IAM during deploy) → use agents-deploy
  • Scaffolding a new project → use agents-get-started
  • Measuring quality or setting up monitoring → use agents-optimize

Input


`$ARGUMENTS` is optional:

```
/agents-debug                      # interactive — describe what's wrong
/agents-debug traces               # read and explain recent traces
/agents-debug logs                 # search recent logs for errors
/agents-debug memory               # diagnose memory recall issues specifically
/agents-debug doctor               # check environment prerequisites
```

Process


Step 0: Determine problem type


If the developer's issue is about the CLI itself (command not found, prerequisites, environment setup), load `references/doctor.md` and follow its diagnostic checklist.
If the issue is about agent behavior (wrong answers, errors, timeouts, tool failures), continue with Step 1 below.

Step 1: Verify CLI version


Run `agentcore --version`. This skill requires v0.9.0 or later. If the version is older, tell the developer to run `agentcore update` before proceeding.
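A minimal preflight sketch (assumes a semver-style version string and a `sort` that supports `-V`):

```bash
# Fail fast if the installed CLI is older than 0.9.0
ver=$(agentcore --version | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -1)
if [ "$(printf '%s\n' "0.9.0" "$ver" | sort -V | head -1)" != "0.9.0" ]; then
  agentcore update
fi
```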

Step 2: Understand the symptom


Ask (or infer from context):
"What's happening?
  1. The agent returns an error message
  2. The agent returns a wrong or unhelpful answer
  3. A specific tool call is failing
  4. Memory isn't working (agent doesn't remember things)
  5. The agent is slow or timing out
  6. I want to understand what the agent did in a specific session"

Step 3: Read traces and logs automatically


Don't ask the developer to paste logs — read them directly.

```bash
# List recent traces
agentcore traces list --runtime <AgentName> --since 1h

# Get the most recent trace ID
agentcore traces list --runtime <AgentName> --since 1h --limit 1

# Download and read the trace
agentcore traces get <traceId> --runtime <AgentName>

# Search logs for errors
agentcore logs --runtime <AgentName> --since 1h --level error

# Search logs for a specific pattern
agentcore logs --runtime <AgentName> --since 2h --query "timeout"
agentcore logs --runtime <AgentName> --since 2h --query "model access"
```

**Important:** CloudWatch put-to-get latency is **~10 seconds end-to-end** — that's the delay from when a span is emitted to when it's readable by `agentcore traces get` or `agentcore run eval`. There is **no separate "trace ingested but eval not ready yet" window**; the same ingestion step unlocks both paths. Older skills and docs said 30–60s for traces and 2–5 minutes for evals — both are stale. If you just invoked the agent, wait ~15 seconds and both trace reads and evals will work.
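In scripts, a short sleep before reading honors that delay. A minimal sketch:

```bash
# After invoking the agent, wait out the ~10s ingestion latency, then read
sleep 15
agentcore traces list --runtime <AgentName> --since 5m --limit 1
```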

Read `agentcore/agentcore.json` to get the agent name if not provided.
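For example (a sketch only: the config file's field layout is an assumption, mirroring the `runtimes` array that `agentcore status --json` returns):

```bash
# Pull the first runtime's name from the project config (schema is an assumption)
jq -r '.runtimes[0].name' agentcore/agentcore.json
```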

Step 4: Diagnose by symptom




Symptom: "model access denied" or model error


**Most common cause:** The model isn't enabled in the Bedrock console for your region.

Fix:
  1. Go to AWS Console → Amazon Bedrock → Model access
  2. Enable the model your agent uses
  3. Wait 1–2 minutes for access to propagate

**Second cause:** The execution role is missing `bedrock:InvokeModel`.

Check:

```bash
aws iam simulate-principal-policy \
  --policy-source-arn $(agentcore status --json | jq -r '.runtimes[0].executionRoleArn') \
  --action-names bedrock:InvokeModel \
  --resource-arns "arn:aws:bedrock:*::foundation-model/*"
```

**Third cause:** Cross-region inference profiles require model access in all destination regions.

Model IDs starting with a geographic prefix are cross-region inference profiles that route requests within that geography:

| Prefix | Geography | Example destination regions |
| --- | --- | --- |
| `us.` | United States | us-east-1, us-east-2, us-west-2 |
| `eu.` | Europe | eu-central-1, eu-west-1, eu-west-2, eu-west-3 |
| `apac.` | Asia Pacific | ap-northeast-1, ap-southeast-1, ap-southeast-2, ap-south-1 |
| `global.` | All commercial regions worldwide | All supported regions |

The AgentCore CLI scaffolds `global.` by default (e.g., `global.anthropic.claude-sonnet-4-5-20250929-v1:0`). All prefixes require model access enabled in every destination region the profile covers. For `us.` profiles, enable in all US regions; for `eu.`, all EU regions; for `global.`, all supported regions. Not all models support all prefixes — `global.` is currently available for select models only. Use `global.` for maximum throughput when available, or a geographic prefix when data residency requirements constrain where inference can run. Check the Bedrock inference profiles docs for current model × prefix availability.
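To confirm access in each destination region of a `us.` profile, one hedged check (the model ID and region list are illustrative):

```bash
# Probe the base model in every US destination region of a us. profile
for region in us-east-1 us-east-2 us-west-2; do
  aws bedrock-runtime converse \
    --region "$region" \
    --model-id anthropic.claude-sonnet-4-5-20250929-v1:0 \
    --messages '[{"role":"user","content":[{"text":"ping"}]}]' \
    >/dev/null 2>&1 && echo "$region: OK" || echo "$region: access denied or unavailable"
done
```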


Symptom: Tool call failing


**Step 1:** Find the failing tool call in the trace:

```bash
agentcore traces get <traceId> --runtime <AgentName>
```

Look for tool call entries with error status.

**Step 2:** Check the gateway status:

```bash
agentcore status --type gateway
agentcore fetch access --name <AgentName> --type agent
```

**Step 3:** Common tool call failures:

**Gateway URL not set (local dev):** The `AGENTCORE_GATEWAY_*_URL` env var is only set after deploy. In `agentcore dev`, gateway tools aren't available. This is expected — the agent should handle this gracefully.

**Auth failure on tool call:**

```bash
agentcore logs --runtime <AgentName> --since 1h --query "auth"
```

Check that the credential is configured correctly: `agentcore status --type credential`

**Lambda function error:** The Lambda itself is failing. Check Lambda logs directly:

```bash
aws logs tail /aws/lambda/<function-name> --since 1h
```

**Policy denial:** If a policy engine is attached, check policy decision logs:

```bash
agentcore logs --runtime <AgentName> --since 1h --query "policy"
agentcore status --type policy-engine
```


Symptom: Wrong or unhelpful answers


**Step 1:** Read the trace to see the agent's reasoning:

```bash
agentcore traces get <traceId> --runtime <AgentName>
```

The trace shows the model's reasoning steps, tool calls made, and the final response. Look for:
  • Did the agent use the right tools?
  • Did the tool calls return the expected data?
  • Is the system prompt providing the right context?

**Step 2:** Check if memory is involved: If the agent should be using memory context but isn't, see the "Symptom: Memory not working" section later in this skill, or load `references/doctor.md` if this is an environment issue.

**Step 3:** Common causes:
  • System prompt is too vague or missing key context
  • Agent isn't calling the right tools (tool descriptions need improvement)
  • Tool is returning unexpected data format
  • Model ID is wrong for the task (e.g., using a smaller model for complex reasoning)


Symptom: Memory not working


**Memory not persisting across sessions (LTM):**

1. Verify LTM strategies are configured (SEMANTIC or USER_PREFERENCE):

   ```bash
   agentcore status --type memory --json | jq '.memories[].strategies'
   ```

2. Wait 5–30 seconds after a session ends — LTM extraction is async. The agent must finish its session before facts are extracted.

3. Use UUIDs (v4) for session IDs — the platform requires a minimum of 33 characters. Short IDs like "session-1" cause LTM to fail silently. `agentcore invoke` generates compliant IDs by default (see the sketch after this list).

4. Verify the memory resource is ACTIVE:

   ```bash
   agentcore status --type memory
   ```

**Memory not loading at session start:**

1. Check the `MEMORY_*_ID` env var is set:

   ```bash
   agentcore status --type memory --json | jq '.memories[].id'
   ```

2. Verify the `actor_id` is consistent across sessions — memory is scoped per actor.

3. Check the namespace paths in your retrieval config match the namespaces used when writing.
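For point 3 under LTM above, a minimal sketch of generating a compliant session ID yourself:

```python
import uuid

# LTM requires session IDs of at least 33 characters; a UUID4 string is 36, so it qualifies
session_id = str(uuid.uuid4())
assert len(session_id) >= 33
```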


Symptom: Agent timeout


**Step 1:** Check the trace for where time is being spent:

```bash
agentcore traces get <traceId> --runtime <AgentName>
```

Look for long-running steps — model calls, tool calls, memory operations.

**Step 2:** Common timeout causes:

**Slow agent initialization:** If the first invocation after an idle period is slow but subsequent requests are fast, the agent is spending too much time initializing. Check for heavy imports at module level, database connections in global scope, or MCP client initialization during startup. Move expensive setup into the request handler or use lazy initialization (see the sketch after this list). See the agents-harden skill for optimization guidance.

**Model call timeout:** The model is taking too long. Consider using a faster model for time-sensitive operations (e.g., Haiku instead of Sonnet for simple tasks).

**Tool call timeout:** The Lambda or external API is slow. Check the tool's own logs.

**Memory retrieval timeout:** Semantic search can be slow for large memory stores. Consider reducing `top_k` in your retrieval config.

**VPC connectivity issue:** If the agent is in a VPC, check security group rules and route tables. See agents-build (loads `references/vpc.md`) for VPC-specific debugging.
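A minimal lazy-initialization sketch (the connection helper is hypothetical; substitute your own setup):

```python
# Defer expensive setup out of module import time so startup stays fast
_db = None

def get_db():
    """Create the database connection on first use, then reuse it."""
    global _db
    if _db is None:
        _db = connect_to_database()  # hypothetical helper, not a real API
    return _db
```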


Symptom: `ServiceQuotaExceededException: maxVms limit exceeded` (despite low observed concurrency)


Your CloudWatch "concurrent sessions" metric shows modest numbers (maybe 30–50) but `InvokeAgentRuntime` calls return `ServiceQuotaExceededException: maxVms limit exceeded`.

**What's actually happening:** CloudWatch's concurrent-sessions metric is not the same as live microVM count. The `maxVms` quota counts all environments your account has active — including ones that finished their invocation but haven't been reclaimed yet. Idle-but-not-yet-reclaimed environments count against the quota until `idleRuntimeSessionTimeout` expires (default 900 seconds / 15 minutes) or you explicitly stop them.

If your code uses a new session ID per request and doesn't call `StopRuntimeSession`, every request leaves an environment sitting idle for 15 minutes counting against the quota.

Fix order (try in this order before requesting a quota increase):

1. **Call `StopRuntimeSession` after each logical request completes.** If you're not going to send more requests on this session, stop it explicitly (a context-manager variant follows after this list):

   ```python
   client.stop_runtime_session(
       agentRuntimeArn=runtime_arn,
       runtimeSessionId=session_id,
   )
   ```

2. **Reuse session IDs across related requests.** If a user interaction produces multiple backend calls, route them to the same session instead of generating a new session ID per call.

3. **Lower `idleRuntimeSessionTimeout`.** If your sessions are short-lived and you can't add `StopRuntimeSession` everywhere, lower the timeout by editing the runtime's `lifecycleConfiguration` in `agentcore/agentcore.json` and running `agentcore deploy`.

4. **Only after the above, request a quota increase.** See agents-harden (loads `references/limits.md`) — request it through the Service Quotas console (Amazon Bedrock AgentCore), not by filing a support ticket directly.

See the agents-harden Session lifecycle management section for the full pattern.
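A hedged sketch of step 1 as a context manager, so the stop call runs even when a request raises (assumes the boto3 `bedrock-agentcore` data-plane client):

```python
import contextlib
import uuid

import boto3

client = boto3.client("bedrock-agentcore")

@contextlib.contextmanager
def runtime_session(runtime_arn):
    """Hand out a compliant session ID and guarantee StopRuntimeSession afterwards."""
    session_id = str(uuid.uuid4())
    try:
        yield session_id
    finally:
        client.stop_runtime_session(
            agentRuntimeArn=runtime_arn,
            runtimeSessionId=session_id,
        )
```

Wrap each logical request in `with runtime_session(runtime_arn) as session_id:` so idle environments are reclaimed immediately instead of after the 15-minute timeout.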


Symptom: 424 Failed Dependency on invoke


This usually means the agent container failed to start or crashed during initialization.

**Step 1:** Check the agent logs for startup errors:

```bash
agentcore logs --runtime <AgentName> --since 30m --level error
```

**Step 2:** Common causes:

**Missing Python dependency:** The agent code imports a package not in `pyproject.toml`. The container starts but crashes on first request. Fix: add the dependency and redeploy (a quick local check follows below).

**Entrypoint crash:** The `main.py` throws an exception during import or `app.run()`. Check logs for the traceback.

**Container image pull failure:** If using Container build, the ECR image may not exist or the execution role lacks `ecr:BatchGetImage`. Check:

```bash
agentcore status --runtime <AgentName> --json
```

**Memory resource not ACTIVE:** If the agent code assumes memory is available but the memory resource is still in CREATING state, the entrypoint may fail. Check:

```bash
agentcore status --type memory
```

**Initialization timeout:** The agent takes too long to be ready for its first request — heavy imports at module level, synchronous database connections, or MCP client initialization during startup can exceed the service's health-check window. The symptom looks like a 424 on the first invoke but healthy on subsequent ones. Fix: move expensive setup out of module level, use lazy initialization, or warm the agent before production traffic. See the agents-harden Initialization time section for patterns.
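To catch missing dependencies and entrypoint crashes before redeploying, a quick local sketch (assumes the entrypoint module is `main.py`):

```bash
# Reproduce import-time crashes locally; a traceback here matches what the runtime logs show
python -c "import main"
```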


Symptom: Local invocations fail with connection-refused / exit code 7


Usually not an agent bug — the dev server is on a different port than you expect.

Default ports `agentcore dev` binds:

| Protocol | Default |
| --- | --- |
| HTTP | 8080 |
| MCP | 8000 |
| A2A | 9000 |

When the default is occupied (second dev session, a lingering process from a previous run, another service on 8080), the CLI auto-increments silently: 8080 → 8081 → 8082. A test harness or `curl` script hardcoded to 8080 will get `Connection refused` (curl exit code 7) while the agent is running fine on 8082.

Diagnose in this order:

1. Read the CLI banner that `agentcore dev` prints — it shows the actual bound port and URL. This is always the source of truth.

2. If the banner is gone (terminal cleared, running in background), check the log file:

   ```bash
   tail -20 agentcore/.cli/logs/dev/*.log
   ```

3. Or find the process directly:

   ```bash
   # macOS / Linux
   ps aux | grep -E 'agentcore dev|uvicorn' | grep -v grep
   lsof -iTCP -sTCP:LISTEN -n -P | grep -E '8080|8081|8082|8000|9000'
   ```

Fix options:
  • Pin the port explicitly: `agentcore dev --port 8080`
  • Kill the process squatting on the default: `lsof -tiTCP:8080 -sTCP:LISTEN | xargs kill`
  • Update the hardcoded port in your test harness to read from the CLI output or from an env var (sketch below)

This is also a common source of "works locally one day, fails the next" reports — the port shifted between runs.
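A hedged harness-side sketch of that last fix option (`AGENTCORE_DEV_PORT` is a hypothetical variable you'd export yourself; `/ping` is the runtime's health route):

```bash
# Resolve the dev server port from an env var instead of hardcoding 8080
PORT="${AGENTCORE_DEV_PORT:-8080}"
curl -sf "http://localhost:${PORT}/ping" \
  || echo "nothing on ${PORT}: check the agentcore dev banner for the real port"
```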


Symptom: Gateway tool calls failing with auth errors


**Step 1:** Verify the auth type matches the target type. This is the most common gateway error — using the wrong outbound auth for the target:

| Target type | Valid outbound auth |
| --- | --- |
| `mcp-server` | `none`, `oauth`, or IAM (SigV4 via API) |
| `lambda-function-arn` | IAM only (automatic) |
| `open-api-schema` | `oauth` or `api-key` (required) |
| `api-gateway` | `none`, `api-key`, or IAM |
| `smithy-model` | IAM or `oauth` |

**Step 2:** Check for expired OAuth tokens. If the gateway target uses OAuth, the access token may have expired. Look for auth-related errors:

```bash
agentcore logs --runtime <AgentName> --since 1h --query "auth"
agentcore logs --runtime <AgentName> --since 1h --query "401"
agentcore logs --runtime <AgentName> --since 1h --query "403"
```

If tokens are expiring, verify the OAuth credential provider's token endpoint is reachable and the client credentials are still valid. For MCP server targets with OAuth, the gateway handles token refresh automatically — if it's failing, the credential provider config may be wrong.
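To test the token endpoint by hand, a standard OAuth client-credentials probe (endpoint and credentials are placeholders):

```bash
# Request a token directly; a 4xx here means the provider config or secret is wrong
curl -sf -X POST "$TOKEN_ENDPOINT" \
  -d grant_type=client_credentials \
  -d client_id="$CLIENT_ID" \
  -d client_secret="$CLIENT_SECRET" | jq -r '.access_token'
```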
**Step 3:** Check the credential is configured:

```bash
agentcore status --type credential
agentcore status --type gateway --json
```


Symptom: No traces appearing


Wait ~15 seconds — there's a short delay (typically ~10s) between invocation and trace availability.

If still no traces after ~30 seconds:
  1. Verify observability was enabled when the agent was deployed
  2. Check the agent was actually invoked: `agentcore logs --runtime <AgentName> --since 1h`
  3. Check CloudWatch permissions on the execution role


Symptom: CloudWatch logs not appearing


This is the most common observability issue, especially for Container/Docker builds.

AgentCore doesn't capture raw stdout. It uses OpenTelemetry to ship logs to CloudWatch. Three things must be true:

**1. Your entrypoint must be wrapped with `opentelemetry-instrument`.**

CodeZip builds do this automatically. Docker/Container builds need it added manually — this is the #1 thing people miss.

In your Dockerfile CMD:

```dockerfile
# ✅ Correct — wrapped with opentelemetry-instrument
CMD ["opentelemetry-instrument", "python", "main.py"]

# ❌ Wrong — no OTEL wrapper, logs won't appear
CMD ["python", "main.py"]
```

**2. Your runtime IAM role needs CloudWatch and X-Ray permissions:**

- `logs:CreateLogGroup`, `logs:CreateLogStream`, `logs:PutLogEvents` → scoped to `/aws/bedrock-agentcore/runtimes/*`
- `xray:PutTelemetryRecords`, `xray:PutTraceSegments` → scoped to `*`

If using the AgentCore CLI with CodeZip, the CDK scaffold adds these automatically. If using a custom role or Container build, verify they're present.

**3. Use Python's `logging` module, not `print()`.**

OTEL hooks into `logging` automatically — no custom handlers needed. `print()` statements won't appear in CloudWatch.

```python
import logging

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)

# ✅ This appears in CloudWatch
logger.info("Processing request")

# ❌ This does NOT appear in CloudWatch
print("Processing request")
```

**Also verify:** CloudWatch Transaction Search is enabled in your account. Without it, traces and spans won't appear in the GenAI Observability dashboard.

Logs missing for Terraform/CDK/IaC-deployed runtimes


A common pattern: a runtime deployed via Terraform, CDK, or a custom IAM role works correctly (returns responses) but no CloudWatch log streams appear — while the same agent code deployed via the AgentCore Console logs fine.

This is almost always an IAM scoping issue. The execution role for a runtime deployed via the Console gets broad CloudWatch permissions by default. IaC templates often scope those permissions narrowly to `/aws/bedrock-agentcore/runtimes/*`, which breaks log stream creation.

The fix: `logs:DescribeLogGroups` must have `Resource: "*"`, not a scoped resource. The other logs actions can be scoped to the runtime's log group.

```json
{
  "Effect": "Allow",
  "Action": [
    "logs:DescribeLogGroups"
  ],
  "Resource": "*"
},
{
  "Effect": "Allow",
  "Action": [
    "logs:CreateLogGroup",
    "logs:CreateLogStream",
    "logs:PutLogEvents"
  ],
  "Resource": "arn:aws:logs:<REGION>:<ACCOUNT_ID>:log-group:/aws/bedrock-agentcore/runtimes/*:*"
}
```

After updating the execution role's IAM policy, redeploy the runtime with `agentcore deploy` to pick up the new permissions.


Symptom: Streaming connection drops mid-response


Your agent uses SSE or long-polling responses and the connection drops mid-stream. Symptoms in client code:
  • `RemoteProtocolError: peer closed connection without sending complete message body`
  • `IncompleteRead` exception while iterating the stream
  • Silent disconnect — no error, no `[DONE]` event, response just stops
  • Happens during multi-tool-use conversations (5+ sequential tool calls)
  • Fails well before any client-side timeout

**Root cause:** Infrastructure-layer idle timeout on streaming connections. If no data flows on the response stream for several minutes (a silent period while a tool executes, for example), a load balancer in front of the runtime terminates the TCP connection.

The timeout is on data flowing through the stream, not on the request total duration. As long as you emit bytes periodically, the connection stays open.

**Fix: emit keepalive events during long-running tool executions.**

Python pattern for a streaming entrypoint:

```python
import asyncio
import json
from bedrock_agentcore.runtime import BedrockAgentCoreApp

app = BedrockAgentCoreApp()

async def emit_keepalive(tool_task):
    """Yield heartbeat events every 30s while tool_task is running."""
    while not tool_task.done():
        yield f"data: {json.dumps({'type': 'heartbeat'})}\n\n"
        try:
            await asyncio.wait_for(asyncio.shield(tool_task), timeout=30)
        except asyncio.TimeoutError:
            continue  # tool still running, emit another heartbeat

@app.entrypoint
async def invoke(payload, context):
    async def stream():
        tool_task = asyncio.create_task(run_long_tool(payload))

        # Emit heartbeats while the tool runs
        async for event in emit_keepalive(tool_task):
            yield event

        # Tool completed — emit the real result
        result = await tool_task
        yield f"data: {json.dumps({'type': 'result', 'content': result})}\n\n"
        yield "data: [DONE]\n\n"

    return stream()
```

Pick a heartbeat interval of ~30 seconds. Too long risks hitting the idle timeout; too short wastes bandwidth.

On the client side, filter heartbeat events before surfacing bytes to the user:

```python
for chunk in response.iter_lines():
    if not chunk:
        continue
    data = json.loads(chunk.removeprefix(b"data: "))
    if data.get("type") == "heartbeat":
        continue  # ignore keepalives
    # process real events
```

**Alternative:** use the SDK's async task API for fire-and-forget patterns. If the client doesn't need to wait for the result, register the work via `add_async_task`/`complete_async_task` and return the invocation immediately. See the agents-harden Long-running background tasks section.


Symptom: Traces appear merged across concurrent agent invocations


You run multiple agent invocations in parallel with unique `runtimeSessionId` values, but the AI Observability dashboard groups them as one session — making it impossible to isolate a single run. Data plane logs show the session IDs are correctly unique 1:1 with request IDs, but the trace view still merges them.

**Most common cause:** the caller isn't enabling Active Tracing, so upstream spans arrive with `Sampled=0`. AgentCore respects upstream trace-sampling decisions by default. If the parent context says "don't sample," spans drop and concurrent invocations can appear merged in the dashboard.

Fix by caller type:

**Lambda caller:** Enable Active Tracing on the Lambda function.

```bash
aws lambda update-function-configuration \
  --function-name my-caller-function \
  --tracing-config Mode=Active
```

Or in the Lambda console: Configuration → Monitoring and operations tools → AWS X-Ray → Active tracing.

**ECS / EC2 / container caller:** Initialize the AWS X-Ray SDK and ensure outbound calls to AgentCore are instrumented. For Python, use `aws-xray-sdk` and patch the SDK:

```python
from aws_xray_sdk.core import xray_recorder, patch_all

patch_all()  # patches boto3, requests, etc.
```

**Direct SDK caller without X-Ray:** If you can't enable upstream tracing, force the runtime to sample by setting an environment variable on the agent: `OTEL_TRACES_SAMPLER=always_on`

This makes the runtime sample every trace regardless of the parent context's sampling decision. Trade-off: higher tracing costs, but the traces are correct.

Also check: invoking with the endpoint ARN instead of the agent ARN


If traces show only a single top-level `AgentCore.Runtime.Invoke` span with no child spans, check the ARN your caller is using. The invoke target should be the agent runtime ARN:

`arn:aws:bedrock-agentcore:<region>:<account>:runtime/<runtime-name>`

Not the endpoint ARN:

`arn:aws:bedrock-agentcore:<region>:<account>:runtime/<runtime-name>/runtime-endpoint/DEFAULT`

Invoking with the endpoint ARN can bypass the full trace instrumentation path. This is a subtle trap — both ARNs produce successful responses, but only the agent ARN produces complete traces.
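A hedged boto3 sketch of a correct caller (region, account, runtime name, and payload shape are placeholders):

```python
import uuid

import boto3

client = boto3.client("bedrock-agentcore")
response = client.invoke_agent_runtime(
    # Runtime ARN, not .../runtime-endpoint/DEFAULT, so child spans are captured
    agentRuntimeArn="arn:aws:bedrock-agentcore:us-east-1:123456789012:runtime/my-agent",
    runtimeSessionId=str(uuid.uuid4()),
    payload=b'{"prompt": "hello"}',
)
```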


Symptom: Runtime stuck in DELETING for hours


You called `DeleteAgentRuntime`, got a successful response with `status: DELETING`, and the runtime has been stuck in that state for more than 30 minutes. Attempting to delete the default endpoint separately returns `ConflictException: Default endpoints are removed when you delete the agent.`

**What's happening:** The deletion workflow is stuck on the service side. Retrying `DeleteAgentRuntime` won't help — the call succeeds immediately (returning DELETING) but the back-end workflow is the thing that's stuck. Customer-side tooling can't force-complete it.

What to do:

1. **Do not keep retrying.** It won't unstick the workflow.
2. **Open an AWS Support case** at https://console.aws.amazon.com/support. Include:
   • AWS Account ID
   • Region
   • Runtime ARN (or `agentRuntimeId`)
   • The `requestId` and timestamp of the original `DeleteAgentRuntime` call (from CloudTrail; see the lookup sketch below)
   • How long the runtime has been in DELETING state
3. **Work around it in the meantime.** Deploy a new runtime with a different name if you need to keep shipping. Don't let the stuck resource block your work.

Orphaned resources from a stuck deletion (ENIs, workload identities) may need manual cleanup from the service team as part of the same case.
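To pull the `requestId` for the support case, a CloudTrail lookup sketch:

```bash
# Find recent DeleteAgentRuntime events and extract their timestamps and request IDs
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=DeleteAgentRuntime \
  --max-results 5 \
  --query 'Events[].CloudTrailEvent' --output text \
  | jq -r '[.eventTime, .requestID] | @tsv'
```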


Framework-specific issues


**LangGraph — model format:** Older versions of `langchain-aws` required the model ID without the cross-region prefix. Recent versions may support cross-region inference profiles — check your installed version:

```bash
pip show langchain-aws | grep Version
```

If you hit model errors with LangGraph, try the non-prefixed ID:

```python
# If your langchain-aws version errors on cross-region prefixes:
llm = init_chat_model("anthropic.claude-sonnet-4-5-20250929-v1:0", model_provider="bedrock_converse")

# If your version supports cross-region profiles (us. = US, eu. = Europe, apac. = Asia Pacific, global. = worldwide):
llm = init_chat_model("global.anthropic.claude-sonnet-4-5-20250929-v1:0", ...)
```

Verify against the current langchain-aws release notes: https://github.com/langchain-ai/langchain-aws/releases — cross-region inference profile support has been evolving.

**Google ADK — Gemini only:**
ADK only works with Gemini models. If you're seeing model errors with ADK, check that `GEMINI_API_KEY` is set and you're using a `gemini-*` model ID.

**A2A agents — wrong port:**
A2A servers must run on port 9000. If your A2A agent isn't responding, check it's not accidentally running on 8080.

---

Reading a trace


A trace shows the full execution path of one agent invocation. Key sections:
  • Model invocations — what the model was asked and what it responded
  • Tool calls — which tools were called, with what inputs, and what they returned
  • Memory operations — what was read from and written to memory
  • Policy decisions — what was allowed or denied (if policy engine is attached)
  • Latency breakdown — time spent in each component

```bash
# Download trace to a file for detailed inspection
agentcore traces get <traceId> --runtime <AgentName> --output trace.json
cat trace.json | jq '.trace.orchestrationTrace.modelInvocationOutput'
```

Output


  • Diagnosis of the specific failure with root cause
  • Specific fix commands or code changes
  • Explanation of what the trace shows (if reading traces)
  • Handoff to the appropriate skill when the fix is outside debug's scope

After diagnosis — handoff


Once you've identified the root cause, hand off to the skill that owns the fix:

| Root cause | Hand off to | Detail |
| --- | --- | --- |
| Memory misconfigured (wrong strategy, namespace, wiring) | agents-build | Load `references/memory.md` |
| Agent invocation from app not working (auth, URL, streaming) | agents-build | Load `references/integrate.md` |
| VPC connectivity (can't reach RDS, no internet, AZ error) | agents-build | Load `references/vpc.md` |
| Multi-agent delegation not working | agents-build | Load `references/multi-agent.md` |
| Custom request headers not reaching agent code | agents-build | Load `references/request-headers.md` |
| Cross-account invocation from an app in another account | agents-build | Load `references/integrate.md` (cross-account section) |
| Gateway auth misconfigured (401, wrong auth type) | agents-connect | Gateway auth matrix |
| Gateway target type question (Lambda vs OpenAPI vs MCP vs API Gateway) | agents-connect | "What Gateway is and isn't" section |
| Policy denying unexpectedly (Cedar, access denied on tool) | agents-connect | Load `references/policy.md` |
| Observability not set up (no logs, no traces appearing) | agents-optimize | Load `references/observability.md` |
| Cold start / initialization too slow | agents-harden | Initialization time section |
| Session lifecycle / `maxVms` / `StopRuntimeSession` | agents-harden | Session lifecycle management section |
| Long-running background tasks being reclaimed | agents-harden | Long-running background tasks section |
| JWT inbound auth failing (403, `allowedClients`/`allowedAudience`, issuer mismatch) | agents-harden | Inbound auth section |
| Throttling / quota error / limit increase request | agents-harden | Load `references/limits.md` |
| Deploy artifact stale or wrong version | agents-deploy | Redeploy workflow |
| Environment broken (CLI, credentials, Node, uv) | Load `references/doctor.md` | Self-contained in this skill |

State the diagnosis clearly, then tell the developer which skill to use next. If the agent can load the referenced skill in the same session, do so.