agents-debug


Diagnose why your AgentCore agent or environment isn't working correctly.

When to use


  • Your agent is returning wrong answers or errors
  • Tool calls are failing or timing out
  • Agent works locally but fails after deploying
  • Logs aren't showing up in CloudWatch
  • The AgentCore CLI isn't working or environment seems broken
  • `agentcore` command not found or prerequisites are missing
Do NOT use for:
  • Deploy failures (CDK errors, IAM during deploy) → use agents-deploy
  • Scaffolding a new project → use agents-get-started
  • Measuring quality or setting up monitoring → use agents-optimize

Input


`$ARGUMENTS` is optional:

```
/agents-debug                      # interactive — describe what's wrong
/agents-debug traces               # read and explain recent traces
/agents-debug logs                 # search recent logs for errors
/agents-debug memory               # diagnose memory recall issues specifically
/agents-debug doctor               # check environment prerequisites
```

Process


Step 0: Determine problem type


If the developer's issue is about the CLI itself (command not found, prerequisites, environment setup), load `references/doctor.md` and follow its diagnostic checklist.
If the issue is about agent behavior (wrong answers, errors, timeouts, tool failures), continue with Step 1 below.

Step 1: Verify CLI version


Run `agentcore --version`. This skill requires v0.9.0 or later. If the version is older, tell the developer to run `agentcore update` before proceeding.
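A minimal preflight sketch (assumes a semver-style version string and a `sort` that supports `-V`):

```bash
# Fail fast if the installed CLI is older than 0.9.0
ver=$(agentcore --version | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -1)
if [ "$(printf '%s\n' "0.9.0" "$ver" | sort -V | head -1)" != "0.9.0" ]; then
  agentcore update
fi
```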

Step 2: Understand the symptom


Ask (or infer from context):
"What's happening?
  1. The agent returns an error message
  2. The agent returns a wrong or unhelpful answer
  3. A specific tool call is failing
  4. Memory isn't working (agent doesn't remember things)
  5. The agent is slow or timing out
  6. I want to understand what the agent did in a specific session"

Step 3: Read traces and logs automatically


Don't ask the developer to paste logs — read them directly.

```bash
# List recent traces
agentcore traces list --runtime <AgentName> --since 1h

# Get the most recent trace ID
agentcore traces list --runtime <AgentName> --since 1h --limit 1

# Download and read the trace
agentcore traces get <traceId> --runtime <AgentName>

# Search logs for errors
agentcore logs --runtime <AgentName> --since 1h --level error

# Search logs for a specific pattern
agentcore logs --runtime <AgentName> --since 2h --query "timeout"
agentcore logs --runtime <AgentName> --since 2h --query "model access"
```

**Important:** CloudWatch put-to-get latency is **~10 seconds end-to-end** — that's the delay from when a span is emitted to when it's readable by `agentcore traces get` or `agentcore run eval`. There is **no separate "trace ingested but eval not ready yet" window**; the same ingestion step unlocks both paths. Older skills and docs said 30–60s for traces and 2–5 minutes for evals — both are stale. If you just invoked the agent, wait ~15 seconds and both trace reads and evals will work.
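In scripts, a short sleep before reading honors that delay. A minimal sketch:

```bash
# After invoking the agent, wait out the ~10s ingestion latency, then read
sleep 15
agentcore traces list --runtime <AgentName> --since 5m --limit 1
```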

Read `agentcore/agentcore.json` to get the agent name if not provided.
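For example (a sketch only: the config file's field layout is an assumption, mirroring the `runtimes` array that `agentcore status --json` returns):

```bash
# Pull the first runtime's name from the project config (schema is an assumption)
jq -r '.runtimes[0].name' agentcore/agentcore.json
```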

Step 4: Diagnose by symptom




Symptom: "model access denied" or model error


**Most common cause:** The model isn't enabled in the Bedrock console for your region.

Fix:
  1. Go to AWS Console → Amazon Bedrock → Model access
  2. Enable the model your agent uses
  3. Wait 1–2 minutes for access to propagate

**Second cause:** The execution role is missing `bedrock:InvokeModel`.

Check:

```bash
aws iam simulate-principal-policy \
  --policy-source-arn $(agentcore status --json | jq -r '.runtimes[0].executionRoleArn') \
  --action-names bedrock:InvokeModel \
  --resource-arns "arn:aws:bedrock:*::foundation-model/*"
```

**Third cause:** Cross-region inference profiles require model access in all destination regions.

Model IDs starting with a geographic prefix are cross-region inference profiles that route requests within that geography:

| Prefix | Geography | Example destination regions |
| --- | --- | --- |
| `us.` | United States | us-east-1, us-east-2, us-west-2 |
| `eu.` | Europe | eu-central-1, eu-west-1, eu-west-2, eu-west-3 |
| `apac.` | Asia Pacific | ap-northeast-1, ap-southeast-1, ap-southeast-2, ap-south-1 |
| `global.` | All commercial regions worldwide | All supported regions |

The AgentCore CLI scaffolds `global.` by default (e.g., `global.anthropic.claude-sonnet-4-5-20250929-v1:0`). All prefixes require model access enabled in every destination region the profile covers. For `us.` profiles, enable in all US regions; for `eu.`, all EU regions; for `global.`, all supported regions. Not all models support all prefixes — `global.` is currently available for select models only. Use `global.` for maximum throughput when available, or a geographic prefix when data residency requirements constrain where inference can run. Check the Bedrock inference profiles docs for current model × prefix availability.
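To confirm access in each destination region of a `us.` profile, one hedged check (the model ID and region list are illustrative):

```bash
# Probe the base model in every US destination region of a us. profile
for region in us-east-1 us-east-2 us-west-2; do
  aws bedrock-runtime converse \
    --region "$region" \
    --model-id anthropic.claude-sonnet-4-5-20250929-v1:0 \
    --messages '[{"role":"user","content":[{"text":"ping"}]}]' \
    >/dev/null 2>&1 && echo "$region: OK" || echo "$region: access denied or unavailable"
done
```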


Symptom: Tool call failing


**Step 1:** Find the failing tool call in the trace:

```bash
agentcore traces get <traceId> --runtime <AgentName>
```

Look for tool call entries with error status.

**Step 2:** Check the gateway status:

```bash
agentcore status --type gateway
agentcore fetch access --name <AgentName> --type agent
```

**Step 3:** Common tool call failures:

**Gateway URL not set (local dev):** The `AGENTCORE_GATEWAY_*_URL` env var is only set after deploy. In `agentcore dev`, gateway tools aren't available. This is expected — the agent should handle this gracefully.

**Auth failure on tool call:**

```bash
agentcore logs --runtime <AgentName> --since 1h --query "auth"
```

Check that the credential is configured correctly: `agentcore status --type credential`

**Lambda function error:** The Lambda itself is failing. Check Lambda logs directly:

```bash
aws logs tail /aws/lambda/<function-name> --since 1h
```

**Policy denial:** If a policy engine is attached, check policy decision logs:

```bash
agentcore logs --runtime <AgentName> --since 1h --query "policy"
agentcore status --type policy-engine
```


Symptom: Wrong or unhelpful answers


**Step 1:** Read the trace to see the agent's reasoning:

```bash
agentcore traces get <traceId> --runtime <AgentName>
```

The trace shows the model's reasoning steps, tool calls made, and the final response. Look for:
  • Did the agent use the right tools?
  • Did the tool calls return the expected data?
  • Is the system prompt providing the right context?

**Step 2:** Check if memory is involved: If the agent should be using memory context but isn't, see the "Symptom: Memory not working" section later in this skill, or load `references/doctor.md` if this is an environment issue.

**Step 3:** Common causes:
  • System prompt is too vague or missing key context
  • Agent isn't calling the right tools (tool descriptions need improvement)
  • Tool is returning unexpected data format
  • Model ID is wrong for the task (e.g., using a smaller model for complex reasoning)


Symptom: Memory not working


**Memory not persisting across sessions (LTM):**

1. Verify LTM strategies are configured (SEMANTIC or USER_PREFERENCE):

   ```bash
   agentcore status --type memory --json | jq '.memories[].strategies'
   ```

2. Wait 5–30 seconds after a session ends — LTM extraction is async. The agent must finish its session before facts are extracted.

3. Use UUIDs (v4) for session IDs — the platform requires a minimum of 33 characters. Short IDs like "session-1" cause LTM to fail silently. `agentcore invoke` generates compliant IDs by default (see the sketch after this list).

4. Verify the memory resource is ACTIVE:

   ```bash
   agentcore status --type memory
   ```

**Memory not loading at session start:**

1. Check the `MEMORY_*_ID` env var is set:

   ```bash
   agentcore status --type memory --json | jq '.memories[].id'
   ```

2. Verify the `actor_id` is consistent across sessions — memory is scoped per actor.

3. Check the namespace paths in your retrieval config match the namespaces used when writing.
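For point 3 under LTM above, a minimal sketch of generating a compliant session ID yourself:

```python
import uuid

# LTM requires session IDs of at least 33 characters; a UUID4 string is 36, so it qualifies
session_id = str(uuid.uuid4())
assert len(session_id) >= 33
```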


Symptom: Agent timeout


**Step 1:** Check the trace for where time is being spent:

```bash
agentcore traces get <traceId> --runtime <AgentName>
```

Look for long-running steps — model calls, tool calls, memory operations.

**Step 2:** Common timeout causes:

**Slow agent initialization:** If the first invocation after an idle period is slow but subsequent requests are fast, the agent is spending too much time initializing. Check for heavy imports at module level, database connections in global scope, or MCP client initialization during startup. Move expensive setup into the request handler or use lazy initialization (see the sketch after this list). See the agents-harden skill for optimization guidance.

**Model call timeout:** The model is taking too long. Consider using a faster model for time-sensitive operations (e.g., Haiku instead of Sonnet for simple tasks).

**Tool call timeout:** The Lambda or external API is slow. Check the tool's own logs.

**Memory retrieval timeout:** Semantic search can be slow for large memory stores. Consider reducing `top_k` in your retrieval config.

**VPC connectivity issue:** If the agent is in a VPC, check security group rules and route tables. See agents-build (loads `references/vpc.md`) for VPC-specific debugging.
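A minimal lazy-initialization sketch (the connection helper is hypothetical; substitute your own setup):

```python
# Defer expensive setup out of module import time so startup stays fast
_db = None

def get_db():
    """Create the database connection on first use, then reuse it."""
    global _db
    if _db is None:
        _db = connect_to_database()  # hypothetical helper, not a real API
    return _db
```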


Symptom: `ServiceQuotaExceededException: maxVms limit exceeded` (despite low observed concurrency)


Your CloudWatch "concurrent sessions" metric shows modest numbers (maybe 30–50) but `InvokeAgentRuntime` calls return `ServiceQuotaExceededException: maxVms limit exceeded`.

**What's actually happening:** CloudWatch's concurrent-sessions metric is not the same as live microVM count. The `maxVms` quota counts all environments your account has active — including ones that finished their invocation but haven't been reclaimed yet. Idle-but-not-yet-reclaimed environments count against the quota until `idleRuntimeSessionTimeout` expires (default 900 seconds / 15 minutes) or you explicitly stop them.

If your code uses a new session ID per request and doesn't call `StopRuntimeSession`, every request leaves an environment sitting idle for 15 minutes counting against the quota.

Fix order (try in this order before requesting a quota increase):

1. **Call `StopRuntimeSession` after each logical request completes.** If you're not going to send more requests on this session, stop it explicitly (a context-manager variant follows after this list):

   ```python
   client.stop_runtime_session(
       agentRuntimeArn=runtime_arn,
       runtimeSessionId=session_id,
   )
   ```

2. **Reuse session IDs across related requests.** If a user interaction produces multiple backend calls, route them to the same session instead of generating a new session ID per call.

3. **Lower `idleRuntimeSessionTimeout`.** If your sessions are short-lived and you can't add `StopRuntimeSession` everywhere, lower the timeout by editing the runtime's `lifecycleConfiguration` in `agentcore/agentcore.json` and running `agentcore deploy`.

4. **Only after the above, request a quota increase.** See agents-harden (loads `references/limits.md`) — request it through the Service Quotas console (Amazon Bedrock AgentCore), not by filing a support ticket directly.

See the agents-harden Session lifecycle management section for the full pattern.
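A hedged sketch of step 1 as a context manager, so the stop call runs even when a request raises (assumes the boto3 `bedrock-agentcore` data-plane client):

```python
import contextlib
import uuid

import boto3

client = boto3.client("bedrock-agentcore")

@contextlib.contextmanager
def runtime_session(runtime_arn):
    """Hand out a compliant session ID and guarantee StopRuntimeSession afterwards."""
    session_id = str(uuid.uuid4())
    try:
        yield session_id
    finally:
        client.stop_runtime_session(
            agentRuntimeArn=runtime_arn,
            runtimeSessionId=session_id,
        )
```

Wrap each logical request in `with runtime_session(runtime_arn) as session_id:` so idle environments are reclaimed immediately instead of after the 15-minute timeout.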


Symptom: 424 Failed Dependency on invoke


This usually means the agent container failed to start or crashed during initialization.

**Step 1:** Check the agent logs for startup errors:

```bash
agentcore logs --runtime <AgentName> --since 30m --level error
```

**Step 2:** Common causes:

**Missing Python dependency:** The agent code imports a package not in `pyproject.toml`. The container starts but crashes on first request. Fix: add the dependency and redeploy (a quick local check follows below).

**Entrypoint crash:** The `main.py` throws an exception during import or `app.run()`. Check logs for the traceback.

**Container image pull failure:** If using Container build, the ECR image may not exist or the execution role lacks `ecr:BatchGetImage`. Check:

```bash
agentcore status --runtime <AgentName> --json
```

**Memory resource not ACTIVE:** If the agent code assumes memory is available but the memory resource is still in CREATING state, the entrypoint may fail. Check:

```bash
agentcore status --type memory
```

**Initialization timeout:** The agent takes too long to be ready for its first request — heavy imports at module level, synchronous database connections, or MCP client initialization during startup can exceed the service's health-check window. The symptom looks like a 424 on the first invoke but healthy on subsequent ones. Fix: move expensive setup out of module level, use lazy initialization, or warm the agent before production traffic. See the agents-harden Initialization time section for patterns.
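To catch missing dependencies and entrypoint crashes before redeploying, a quick local sketch (assumes the entrypoint module is `main.py`):

```bash
# Reproduce import-time crashes locally; a traceback here matches what the runtime logs show
python -c "import main"
```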


Symptom: Local invocations fail with connection-refused / exit code 7


Usually not an agent bug — the dev server is on a different port than you expect.

Default ports `agentcore dev` binds:

| Protocol | Default |
| --- | --- |
| HTTP | 8080 |
| MCP | 8000 |
| A2A | 9000 |

When the default is occupied (second dev session, a lingering process from a previous run, another service on 8080), the CLI auto-increments silently: 8080 → 8081 → 8082. A test harness or `curl` script hardcoded to 8080 will get `Connection refused` (curl exit code 7) while the agent is running fine on 8082.

Diagnose in this order:

1. Read the CLI banner that `agentcore dev` prints — it shows the actual bound port and URL. This is always the source of truth.

2. If the banner is gone (terminal cleared, running in background), check the log file:

   ```bash
   tail -20 agentcore/.cli/logs/dev/*.log
   ```

3. Or find the process directly:

   ```bash
   # macOS / Linux
   ps aux | grep -E 'agentcore dev|uvicorn' | grep -v grep
   lsof -iTCP -sTCP:LISTEN -n -P | grep -E '8080|8081|8082|8000|9000'
   ```

Fix options:
  • Pin the port explicitly: `agentcore dev --port 8080`
  • Kill the process squatting on the default: `lsof -tiTCP:8080 -sTCP:LISTEN | xargs kill`
  • Update the hardcoded port in your test harness to read from the CLI output or from an env var (sketch below)

This is also a common source of "works locally one day, fails the next" reports — the port shifted between runs.
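A hedged harness-side sketch of that last fix option (`AGENTCORE_DEV_PORT` is a hypothetical variable you'd export yourself; `/ping` is the runtime's health route):

```bash
# Resolve the dev server port from an env var instead of hardcoding 8080
PORT="${AGENTCORE_DEV_PORT:-8080}"
curl -sf "http://localhost:${PORT}/ping" \
  || echo "nothing on ${PORT}: check the agentcore dev banner for the real port"
```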


Symptom: Gateway tool calls failing with auth errors


**Step 1:** Verify the auth type matches the target type. This is the most common gateway error — using the wrong outbound auth for the target:

| Target type | Valid outbound auth |
| --- | --- |
| `mcp-server` | `none`, `oauth`, or IAM (SigV4 via API) |
| `lambda-function-arn` | IAM only (automatic) |
| `open-api-schema` | `oauth` or `api-key` (required) |
| `api-gateway` | `none`, `api-key`, or IAM |
| `smithy-model` | IAM or `oauth` |

**Step 2:** Check for expired OAuth tokens. If the gateway target uses OAuth, the access token may have expired. Look for auth-related errors:

```bash
agentcore logs --runtime <AgentName> --since 1h --query "auth"
agentcore logs --runtime <AgentName> --since 1h --query "401"
agentcore logs --runtime <AgentName> --since 1h --query "403"
```

If tokens are expiring, verify the OAuth credential provider's token endpoint is reachable and the client credentials are still valid. For MCP server targets with OAuth, the gateway handles token refresh automatically — if it's failing, the credential provider config may be wrong.
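To test the token endpoint by hand, a standard OAuth client-credentials probe (endpoint and credentials are placeholders):

```bash
# Request a token directly; a 4xx here means the provider config or secret is wrong
curl -sf -X POST "$TOKEN_ENDPOINT" \
  -d grant_type=client_credentials \
  -d client_id="$CLIENT_ID" \
  -d client_secret="$CLIENT_SECRET" | jq -r '.access_token'
```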
**Step 3:** Check the credential is configured:

```bash
agentcore status --type credential
agentcore status --type gateway --json
```


Symptom: No traces appearing


Wait ~15 seconds — there's a short delay (typically ~10s) between invocation and trace availability.

If still no traces after ~30 seconds:
  1. Verify observability was enabled when the agent was deployed
  2. Check the agent was actually invoked: `agentcore logs --runtime <AgentName> --since 1h`
  3. Check CloudWatch permissions on the execution role


Symptom: CloudWatch logs not appearing


This is the most common observability issue, especially for Container/Docker builds.

AgentCore doesn't capture raw stdout. It uses OpenTelemetry to ship logs to CloudWatch. Three things must be true:

**1. Your entrypoint must be wrapped with `opentelemetry-instrument`.**

CodeZip builds do this automatically. Docker/Container builds need it added manually — this is the #1 thing people miss.

In your Dockerfile CMD:

```dockerfile
# ✅ Correct — wrapped with opentelemetry-instrument
CMD ["opentelemetry-instrument", "python", "main.py"]

# ❌ Wrong — no OTEL wrapper, logs won't appear
CMD ["python", "main.py"]
```

**2. Your runtime IAM role needs CloudWatch and X-Ray permissions:**

- `logs:CreateLogGroup`, `logs:CreateLogStream`, `logs:PutLogEvents` → scoped to `/aws/bedrock-agentcore/runtimes/*`
- `xray:PutTelemetryRecords`, `xray:PutTraceSegments` → scoped to `*`

If using the AgentCore CLI with CodeZip, the CDK scaffold adds these automatically. If using a custom role or Container build, verify they're present.

**3. Use Python's `logging` module, not `print()`.**

OTEL hooks into `logging` automatically — no custom handlers needed. `print()` statements won't appear in CloudWatch.

```python
import logging

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)

# ✅ This appears in CloudWatch
logger.info("Processing request")

# ❌ This does NOT appear in CloudWatch
print("Processing request")
```

**Also verify:** CloudWatch Transaction Search is enabled in your account. Without it, traces and spans won't appear in the GenAI Observability dashboard.

Logs missing for Terraform/CDK/IaC-deployed runtimes


A common pattern: a runtime deployed via Terraform, CDK, or a custom IAM role works correctly (returns responses) but no CloudWatch log streams appear — while the same agent code deployed via the AgentCore Console logs fine.

This is almost always an IAM scoping issue. The execution role for a runtime deployed via the Console gets broad CloudWatch permissions by default. IaC templates often scope those permissions narrowly to `/aws/bedrock-agentcore/runtimes/*`, which breaks log stream creation.

The fix: `logs:DescribeLogGroups` must have `Resource: "*"`, not a scoped resource. The other logs actions can be scoped to the runtime's log group.

```json
{
  "Effect": "Allow",
  "Action": [
    "logs:DescribeLogGroups"
  ],
  "Resource": "*"
},
{
  "Effect": "Allow",
  "Action": [
    "logs:CreateLogGroup",
    "logs:CreateLogStream",
    "logs:PutLogEvents"
  ],
  "Resource": "arn:aws:logs:<REGION>:<ACCOUNT_ID>:log-group:/aws/bedrock-agentcore/runtimes/*:*"
}
```

After updating the execution role's IAM policy, redeploy the runtime with `agentcore deploy` to pick up the new permissions.


Symptom: Streaming connection drops mid-response


Your agent uses SSE or long-polling responses and the connection drops mid-stream. Symptoms in client code:
  • `RemoteProtocolError: peer closed connection without sending complete message body`
  • `IncompleteRead` exception while iterating the stream
  • Silent disconnect — no error, no `[DONE]` event, response just stops
  • Happens during multi-tool-use conversations (5+ sequential tool calls)
  • Fails well before any client-side timeout

**Root cause:** Infrastructure-layer idle timeout on streaming connections. If no data flows on the response stream for several minutes (a silent period while a tool executes, for example), a load balancer in front of the runtime terminates the TCP connection.

The timeout is on data flowing through the stream, not on the request total duration. As long as you emit bytes periodically, the connection stays open.

**Fix: emit keepalive events during long-running tool executions.**

Python pattern for a streaming entrypoint:

```python
import asyncio
import json
from bedrock_agentcore.runtime import BedrockAgentCoreApp

app = BedrockAgentCoreApp()

async def emit_keepalive(tool_task):
    """Yield heartbeat events every 30s while tool_task is running."""
    while not tool_task.done():
        yield f"data: {json.dumps({'type': 'heartbeat'})}\n\n"
        try:
            await asyncio.wait_for(asyncio.shield(tool_task), timeout=30)
        except asyncio.TimeoutError:
            continue  # tool still running, emit another heartbeat

@app.entrypoint
async def invoke(payload, context):
    async def stream():
        tool_task = asyncio.create_task(run_long_tool(payload))

        # Emit heartbeats while the tool runs
        async for event in emit_keepalive(tool_task):
            yield event

        # Tool completed — emit the real result
        result = await tool_task
        yield f"data: {json.dumps({'type': 'result', 'content': result})}\n\n"
        yield "data: [DONE]\n\n"

    return stream()
```

Pick a heartbeat interval of ~30 seconds. Too long risks hitting the idle timeout; too short wastes bandwidth.

On the client side, filter heartbeat events before surfacing bytes to the user:

```python
for chunk in response.iter_lines():
    if not chunk:
        continue
    data = json.loads(chunk.removeprefix(b"data: "))
    if data.get("type") == "heartbeat":
        continue  # ignore keepalives
    # process real events
```

**Alternative:** use the SDK's async task API for fire-and-forget patterns. If the client doesn't need to wait for the result, register the work via `add_async_task`/`complete_async_task` and return the invocation immediately. See the agents-harden Long-running background tasks section.


Symptom: Traces appear merged across concurrent agent invocations


You run multiple agent invocations in parallel with unique `runtimeSessionId` values, but the AI Observability dashboard groups them as one session — making it impossible to isolate a single run. Data plane logs show the session IDs are correctly unique 1:1 with request IDs, but the trace view still merges them.

**Most common cause:** the caller isn't enabling Active Tracing, so upstream spans arrive with `Sampled=0`. AgentCore respects upstream trace-sampling decisions by default. If the parent context says "don't sample," spans drop and concurrent invocations can appear merged in the dashboard.

Fix by caller type:

**Lambda caller:** Enable Active Tracing on the Lambda function.

```bash
aws lambda update-function-configuration \
  --function-name my-caller-function \
  --tracing-config Mode=Active
```

Or in the Lambda console: Configuration → Monitoring and operations tools → AWS X-Ray → Active tracing.

**ECS / EC2 / container caller:** Initialize the AWS X-Ray SDK and ensure outbound calls to AgentCore are instrumented. For Python, use `aws-xray-sdk` and patch the SDK:

```python
from aws_xray_sdk.core import xray_recorder, patch_all

patch_all()  # patches boto3, requests, etc.
```

**Direct SDK caller without X-Ray:** If you can't enable upstream tracing, force the runtime to sample by setting an environment variable on the agent: `OTEL_TRACES_SAMPLER=always_on`

This makes the runtime sample every trace regardless of the parent context's sampling decision. Trade-off: higher tracing costs, but the traces are correct.

Also check: invoking with the endpoint ARN instead of the agent ARN


If traces show only a single top-level `AgentCore.Runtime.Invoke` span with no child spans, check the ARN your caller is using. The invoke target should be the agent runtime ARN:

`arn:aws:bedrock-agentcore:<region>:<account>:runtime/<runtime-name>`

Not the endpoint ARN:

`arn:aws:bedrock-agentcore:<region>:<account>:runtime/<runtime-name>/runtime-endpoint/DEFAULT`

Invoking with the endpoint ARN can bypass the full trace instrumentation path. This is a subtle trap — both ARNs produce successful responses, but only the agent ARN produces complete traces.
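A hedged boto3 sketch of a correct caller (region, account, runtime name, and payload shape are placeholders):

```python
import uuid

import boto3

client = boto3.client("bedrock-agentcore")
response = client.invoke_agent_runtime(
    # Runtime ARN, not .../runtime-endpoint/DEFAULT, so child spans are captured
    agentRuntimeArn="arn:aws:bedrock-agentcore:us-east-1:123456789012:runtime/my-agent",
    runtimeSessionId=str(uuid.uuid4()),
    payload=b'{"prompt": "hello"}',
)
```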


Symptom: Runtime stuck in DELETING for hours


You called `DeleteAgentRuntime`, got a successful response with `status: DELETING`, and the runtime has been stuck in that state for more than 30 minutes. Attempting to delete the default endpoint separately returns `ConflictException: Default endpoints are removed when you delete the agent.`

**What's happening:** The deletion workflow is stuck on the service side. Retrying `DeleteAgentRuntime` won't help — the call succeeds immediately (returning DELETING) but the back-end workflow is the thing that's stuck. Customer-side tooling can't force-complete it.

What to do:

1. **Do not keep retrying.** It won't unstick the workflow.
2. **Open an AWS Support case** at https://console.aws.amazon.com/support. Include:
   • AWS Account ID
   • Region
   • Runtime ARN (or `agentRuntimeId`)
   • The `requestId` and timestamp of the original `DeleteAgentRuntime` call (from CloudTrail; see the lookup sketch below)
   • How long the runtime has been in DELETING state
3. **Work around it in the meantime.** Deploy a new runtime with a different name if you need to keep shipping. Don't let the stuck resource block your work.

Orphaned resources from a stuck deletion (ENIs, workload identities) may need manual cleanup from the service team as part of the same case.
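To pull the `requestId` for the support case, a CloudTrail lookup sketch:

```bash
# Find recent DeleteAgentRuntime events and extract their timestamps and request IDs
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=DeleteAgentRuntime \
  --max-results 5 \
  --query 'Events[].CloudTrailEvent' --output text \
  | jq -r '[.eventTime, .requestID] | @tsv'
```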


Framework-specific issues


**LangGraph — model format:** Older versions of `langchain-aws` required the model ID without the cross-region prefix. Recent versions may support cross-region inference profiles — check your installed version:

```bash
pip show langchain-aws | grep Version
```

If you hit model errors with LangGraph, try the non-prefixed ID:

```python
# If your langchain-aws version errors on cross-region prefixes:
llm = init_chat_model("anthropic.claude-sonnet-4-5-20250929-v1:0", model_provider="bedrock_converse")

# If your version supports cross-region profiles (us. = US, eu. = Europe, apac. = Asia Pacific, global. = worldwide):
llm = init_chat_model("global.anthropic.claude-sonnet-4-5-20250929-v1:0", ...)
```

Verify against the current langchain-aws release notes: https://github.com/langchain-ai/langchain-aws/releases — cross-region inference profile support has been evolving.

**Google ADK — Gemini only:**
ADK only works with Gemini models. If you're seeing model errors with ADK, check that `GEMINI_API_KEY` is set and you're using a `gemini-*` model ID.

**A2A agents — wrong port:**
A2A servers must run on port 9000. If your A2A agent isn't responding, check it's not accidentally running on 8080.

---

Reading a trace


A trace shows the full execution path of one agent invocation. Key sections:
  • Model invocations — what the model was asked and what it responded
  • Tool calls — which tools were called, with what inputs, and what they returned
  • Memory operations — what was read from and written to memory
  • Policy decisions — what was allowed or denied (if policy engine is attached)
  • Latency breakdown — time spent in each component

```bash
# Download trace to a file for detailed inspection
agentcore traces get <traceId> --runtime <AgentName> --output trace.json
cat trace.json | jq '.trace.orchestrationTrace.modelInvocationOutput'
```

Output


  • Diagnosis of the specific failure with root cause
  • Specific fix commands or code changes
  • Explanation of what the trace shows (if reading traces)
  • Handoff to the appropriate skill when the fix is outside debug's scope

After diagnosis — handoff


Once you've identified the root cause, hand off to the skill that owns the fix:

| Root cause | Hand off to | Detail |
| --- | --- | --- |
| Memory misconfigured (wrong strategy, namespace, wiring) | agents-build | Load `references/memory.md` |
| Agent invocation from app not working (auth, URL, streaming) | agents-build | Load `references/integrate.md` |
| VPC connectivity (can't reach RDS, no internet, AZ error) | agents-build | Load `references/vpc.md` |
| Multi-agent delegation not working | agents-build | Load `references/multi-agent.md` |
| Custom request headers not reaching agent code | agents-build | Load `references/request-headers.md` |
| Cross-account invocation from an app in another account | agents-build | Load `references/integrate.md` (cross-account section) |
| Gateway auth misconfigured (401, wrong auth type) | agents-connect | Gateway auth matrix |
| Gateway target type question (Lambda vs OpenAPI vs MCP vs API Gateway) | agents-connect | "What Gateway is and isn't" section |
| Policy denying unexpectedly (Cedar, access denied on tool) | agents-connect | Load `references/policy.md` |
| Observability not set up (no logs, no traces appearing) | agents-optimize | Load `references/observability.md` |
| Cold start / initialization too slow | agents-harden | Initialization time section |
| Session lifecycle / `maxVms` / `StopRuntimeSession` | agents-harden | Session lifecycle management section |
| Long-running background tasks being reclaimed | agents-harden | Long-running background tasks section |
| JWT inbound auth failing (403, `allowedClients`/`allowedAudience`, issuer mismatch) | agents-harden | Inbound auth section |
| Throttling / quota error / limit increase request | agents-harden | Load `references/limits.md` |
| Deploy artifact stale or wrong version | agents-deploy | Redeploy workflow |
| Environment broken (CLI, credentials, Node, uv) | Load `references/doctor.md` | Self-contained in this skill |

State the diagnosis clearly, then tell the developer which skill to use next. If the agent can load the referenced skill in the same session, do so.