truefoundry-monitor

Routing note: For ambiguous user intents, use the shared clarification templates in references/intent-clarification.md.

Monitor Deployment

Track a TrueFoundry deployment rollout to completion, reporting status at each stage and diagnosing failures automatically.

When to Use

- After `tfy apply` or `tfy deploy`, to track rollout progress
- User says "monitor", "watch deployment", "is my deploy done", "check rollout"
- Called automatically by the `deploy` skill after a successful apply/deploy

When NOT to Use

- User wants to deploy → prefer `deploy` skill; ask if the user wants another valid path
- User wants to list all apps → prefer `applications` skill; ask if the user wants another valid path
- User wants to read historical logs → prefer `logs` skill; ask if the user wants another valid path

</objective> <instructions>

</objective> <instructions>

CRITICAL BEHAVIOR RULES

RULE 1: Once monitoring starts, you MUST poll until a terminal state or timeout. Do NOT stop early. Do NOT ask the user "should I keep checking?" — just keep checking.

RULE 2: Do NOT end your response while the deployment is in a non-terminal state (BUILDING, INITIALIZED, ROLLOUT_STARTED). If you are about to stop and the status is non-terminal, you are violating this rule — continue polling.

RULE 3: Between each poll, briefly tell the user what you're waiting for. Do NOT silently loop, but also do NOT ask for permission to continue.

Required Information

Before monitoring, you need:

- Workspace FQN (`TFY_WORKSPACE_FQN`). HARD RULE: Never auto-pick. Always ask the user to confirm.
- Application name: the service or job name being deployed

If invoked right after a deploy, both should already be known from the deploy context.

Execution Priority

For all status checks, use MCP tool calls first:

tfy_applications_list(filters={"workspace_fqn": "WORKSPACE_FQN", "application_name": "APP_NAME"})

If MCP tool calls are unavailable, fall back to direct API via `tfy-api.sh`. When using direct API, set `TFY_API_SH` to the full path of this skill's `scripts/tfy-api.sh`. See `references/tfy-api-setup.md` for paths per agent.

Monitoring Flow

Step 1: Initial Status Check

bash

TFY_API_SH=~/.claude/skills/truefoundry-monitor/scripts/tfy-api.sh
bash $TFY_API_SH GET '/api/svc/v1/apps?workspaceFqn=WORKSPACE_FQN&applicationName=APP_NAME'

Extract from the response at `data[0]` (the application object):

- `deployment.currentStatus.status`: the deployment status enum
- `deployment.currentStatus.transition`: current transition (e.g., `BUILDING`, `DEPLOYING`)
- `deployment.currentStatus.state.isTerminalState`: boolean, most reliable terminal check
- `deployment.currentStatus.state.display`: human-readable state
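The field paths above can be read defensively from the parsed JSON. The following is a minimal Python sketch, not part of the skill itself: the function name `extract_status`, the returned key names, and the `sample` payload are illustrative assumptions shaped only after the paths listed above.

```python
from typing import Any, Optional


def extract_status(response: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Pull the four status fields out of an app-list response.

    Returns None when `data` is empty (application not found).
    """
    apps = response.get("data") or []
    if not apps:
        return None
    current = (apps[0].get("deployment") or {}).get("currentStatus") or {}
    state = current.get("state") or {}
    return {
        "status": current.get("status"),
        "transition": current.get("transition"),
        "is_terminal": bool(state.get("isTerminalState", False)),
        "display": state.get("display"),
    }


# Hypothetical response, shaped after the field paths listed above.
sample = {
    "data": [
        {
            "deployment": {
                "currentStatus": {
                    "status": "ROLLOUT_STARTED",
                    "transition": "DEPLOYING",
                    "state": {"isTerminalState": False, "display": "Deploying"},
                }
            }
        }
    ]
}
print(extract_status(sample))
```

Returning `None` for an empty `data` array keeps the "Application Not Found" case separate from a deployment that exists but has not reported a status yet.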

Step 2: Poll Until Terminal State

The API response has two key fields: `status` (the deployment status) and `transition` (what's happening now). Use `state.isTerminalState` as the authoritative check for whether to stop polling.

Status values (from `deployment.currentStatus.status`):

| Status | Terminal? | Action |
|--------|-----------|--------|
| `INITIALIZED` | No | Report "Deployment initialized, waiting...", continue polling |
| `BUILDING` | No | Report "Build in progress", continue polling |
| `BUILD_SUCCESS` | No | Report "Build succeeded, deploying...", continue polling |
| `BUILD_FAILED` | Yes | Fetch build logs, report failure |
| `ROLLOUT_STARTED` | No | Report "Rollout started", continue polling |
| `DEPLOY_SUCCESS` | Yes | Report success with endpoint URL |
| `DEPLOY_FAILED` | Yes | Fetch pod logs, diagnose failure |
| `DEPLOY_FAILED_WITH_RETRY` | No | Report "Deploy failed, retrying...", continue polling |
| `PAUSED` | Yes | Report paused/stopped |
| `FAILED` | Yes | Report general failure |
| `CANCELLED` | Yes | Report cancelled |

Transition values (from `deployment.currentStatus.transition`):

| Transition | Meaning |
|------------|---------|
| `BUILDING` | Image build is in progress |
| `DEPLOYING` | Pods are being created/updated |
| `REUSING_EXISTING_BUILD` | Skipping build, reusing cached image |
| `COMPONENTS_DEPLOYING` | Multi-component deployment in progress |
| `WAITING` | Waiting for resources |

Best practice: Always check `deployment.currentStatus.state.isTerminalState === true` to decide whether to stop polling, rather than matching individual status strings. The `state.display` field gives a human-friendly label.
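As a rough illustration of this best practice, the Python sketch below trusts the boolean flag whenever it is present. Falling back to the status table when the flag is missing is an assumption added here for robustness, and the names `should_stop_polling` and `TERMINAL_STATUSES` are hypothetical.

```python
# Terminal statuses copied from the status table above; used only as a fallback.
TERMINAL_STATUSES = frozenset(
    {"BUILD_FAILED", "DEPLOY_SUCCESS", "DEPLOY_FAILED", "PAUSED", "FAILED", "CANCELLED"}
)


def should_stop_polling(current_status: dict) -> bool:
    """Decide whether polling can stop, preferring the isTerminalState flag.

    `current_status` is the object found at deployment.currentStatus.
    """
    flag = (current_status.get("state") or {}).get("isTerminalState")
    if isinstance(flag, bool):
        return flag
    # Assumption: consult the table only when the flag is absent.
    return current_status.get("status") in TERMINAL_STATUSES
```

Because the flag wins whenever it exists, a status value added to the platform later would still be handled correctly as long as the API marks it terminal.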

Polling schedule:

- First 2 minutes: check every 15 seconds
- Minutes 2-5: check every 30 seconds
- After 5 minutes: check every 60 seconds
- Timeout after 10 minutes: report current state and suggest the user check manually

Between polls, tell the user what you're waiting for. Do not silently loop. Do NOT ask "should I continue?" — just continue.
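The schedule and the "report, then keep going" rule can be sketched as a small loop. This is a minimal Python illustration under stated assumptions: `fetch_status` is a hypothetical callable that returns the `deployment.currentStatus` object, and `report`, `sleep`, and `clock` are injectable only so the loop can be exercised without real waiting.

```python
import time
from typing import Callable, Optional

TIMEOUT_SECONDS = 10 * 60


def next_interval(elapsed: float) -> Optional[int]:
    """Seconds to wait before the next poll, or None once the timeout is reached."""
    if elapsed >= TIMEOUT_SECONDS:
        return None
    if elapsed < 2 * 60:
        return 15
    if elapsed < 5 * 60:
        return 30
    return 60


def monitor(
    fetch_status: Callable[[], dict],
    report: Callable[[str], None] = print,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Poll until state.isTerminalState is true or 10 minutes have elapsed."""
    start = clock()
    while True:
        current = fetch_status()
        elapsed = clock() - start
        if (current.get("state") or {}).get("isTerminalState") is True:
            report(f"Finished: {current.get('status')} after {elapsed:.0f}s")
            return current
        wait = next_interval(elapsed)
        if wait is None:
            report(f"Timed out after 10 minutes; last status: {current.get('status')}")
            return current
        # Tell the user what we are waiting for, then continue without asking.
        report(f"Status: {current.get('status')}. Next check in {wait}s...")
        sleep(wait)
```

The loop has exactly two exits, a terminal state or the timeout, which mirrors the rule that monitoring never stops early or pauses to ask for permission.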

Step 3: On Success

When `state.isTerminalState` is `true` and status is `DEPLOY_SUCCESS`:

1. Report the final status
2. Show replicas ready (e.g., "2/2 replicas ready")
3. Show the endpoint URL if the service has an exposed port
4. Optionally run a quick health check on the endpoint:

bash

# Only if the service exposes an HTTP port
curl -sf -o /dev/null -w '%{http_code}' "https://ENDPOINT_URL/health" || true


Report the HTTP status code. Do not fail the monitor if the health check fails — just report it.
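For agents that script the check instead of shelling out to `curl`, the same non-fatal behavior could look like the Python sketch below. The function name `health_check`, the 5-second timeout, and the injectable `opener` parameter are illustrative assumptions, not part of the skill.

```python
import urllib.error
import urllib.request
from typing import Callable, Optional


def health_check(
    url: str,
    timeout: float = 5.0,
    opener: Callable = urllib.request.urlopen,
) -> Optional[int]:
    """Return the HTTP status code, or None if the endpoint is unreachable.

    HTTP and network errors are swallowed so a failed check cannot abort the monitor.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code  # 4xx/5xx still carry a status code worth reporting
    except (urllib.error.URLError, OSError):
        return None  # DNS failure, connection refused, timeout, ...
```

A `None` result would be reported as "health check could not reach the endpoint" rather than treated as a deployment failure.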

Step 4: On Failure

When status is `BUILD_FAILED`, `DEPLOY_FAILED`, `FAILED`, or `CANCELLED`:

1. **Fetch recent logs** using the `logs` skill or direct API:

bash


# Get the app ID first from the status response
TFY_API_SH=~/.claude/skills/truefoundry-monitor/scripts/tfy-api.sh

# Fetch recent logs (last 5 minutes)
bash $TFY_API_SH GET '/api/svc/v1/logs/WORKSPACE_ID/download?applicationFqn=APP_FQN&startTs=START_TS&endTs=END_TS'


2. **Identify the failure cause** from the logs (OOMKilled, CrashLoopBackOff, ImagePullBackOff, port mismatch, etc.)
3. **Suggest a fix** based on the error:

| Error Pattern | Suggested Fix |
|---------------|---------------|
| `OOMKilled` | Increase `memory_limit` in manifest |
| `CrashLoopBackOff` | Check startup command and logs for crash reason |
| `ImagePullBackOff` | Verify image URI and registry credentials |
| Port mismatch | Ensure manifest port matches what the app listens on |
| `Readiness probe failed` | Check health probe path and startup time |
| Build error | Check Dockerfile and build logs |

4. **Report summary** with: error type, relevant log excerpt (max 20 lines), and suggested fix
5. **Do NOT auto-retry.** Present the diagnosis and let the user decide next steps.
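The pattern table above lends itself to a simple lookup. The Python sketch below is one possible shape, assuming the listed strings appear verbatim in the fetched log text; the names `KNOWN_FAILURES` and `diagnose` are hypothetical. It only produces a diagnosis and never retries.

```python
from typing import Optional

# (pattern, suggested fix) pairs taken from the table above. Port mismatches and
# build errors have no fixed log signature there, so they are not matched here.
KNOWN_FAILURES = [
    ("OOMKilled", "Increase memory_limit in manifest"),
    ("CrashLoopBackOff", "Check startup command and logs for crash reason"),
    ("ImagePullBackOff", "Verify image URI and registry credentials"),
    ("Readiness probe failed", "Check health probe path and startup time"),
]

MAX_EXCERPT_LINES = 20


def diagnose(log_text: str) -> Optional[dict]:
    """Return the first known failure found in the logs, with an excerpt and a fix."""
    lines = log_text.splitlines()
    for pattern, fix in KNOWN_FAILURES:
        for index, line in enumerate(lines):
            if pattern in line:
                # Keep the lines leading up to the match, capped at 20 in total.
                start = max(0, index - MAX_EXCERPT_LINES + 1)
                return {
                    "error": pattern,
                    "excerpt": lines[start : index + 1],
                    "suggested_fix": fix,
                }
    return None
```

When `diagnose` returns `None`, the summary would fall back to showing the raw log excerpt and leaving the cause for the user to judge.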

Presenting Status Updates

Use a consistent format for each status update:

Monitoring: my-service in cluster:workspace
Status: ROLLOUT_STARTED | Transition: DEPLOYING
Display: Deploying (1/2 replicas ready)
Elapsed: 45s
Next check in 15s...

Final summary on success:

Deployment complete: my-service
Status: DEPLOY_SUCCESS
Replicas: 2/2 ready
Endpoint: https://my-service-ws.example.com
Health check: 200 OK
Total time: 1m 32s

Final summary on failure:

Deployment failed: my-service
Status: DEPLOY_FAILED
Error: CrashLoopBackOff — container exited with code 1
Log excerpt:
  > ModuleNotFoundError: No module named 'flask'
Suggested fix: Add 'flask' to requirements.txt and redeploy
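To keep the in-progress block identical from one poll to the next, the layout can be generated rather than typed by hand. This is a minimal Python sketch; `format_elapsed` and `format_update` are hypothetical helper names, and the sample values are the ones from the example above.

```python
def format_elapsed(seconds: int) -> str:
    """Render 45 as '45s' and 92 as '1m 32s', matching the examples above."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes}m {secs}s" if minutes else f"{secs}s"


def format_update(app, workspace, status, transition, display, elapsed, next_check):
    """Build the in-progress status block shown between polls."""
    return "\n".join(
        [
            f"Monitoring: {app} in {workspace}",
            f"Status: {status} | Transition: {transition}",
            f"Display: {display}",
            f"Elapsed: {format_elapsed(elapsed)}",
            f"Next check in {next_check}s...",
        ]
    )


print(
    format_update(
        "my-service",
        "cluster:workspace",
        "ROLLOUT_STARTED",
        "DEPLOYING",
        "Deploying (1/2 replicas ready)",
        45,
        15,
    )
)
```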

</instructions>

<success_criteria>

</instructions>

<success_criteria>

Success Criteria

- Deployment status is tracked from current state to a terminal state
- User sees clear progress updates at each polling interval
- On success: replicas, endpoint URL, and optional health check are reported
- On failure: logs are fetched, root cause is identified, and a fix is suggested
- Monitor times out gracefully after 10 minutes with a status summary
- The user is never left waiting without feedback

</success_criteria>

</success_criteria>

Composability

- Before monitoring: Use `deploy` skill to deploy, then monitor
- On failure: Use `logs` skill for deeper log analysis
- Check app details: Use `applications` skill for full app info
- Fix and redeploy: Use `deploy` skill to apply fixes

</references> <troubleshooting>

</references> <troubleshooting>

Error Handling

Application Not Found

Application "APP_NAME" not found in workspace "WORKSPACE_FQN".
Check:
- Application name is spelled correctly
- The deploy/apply command completed successfully
- You're checking the correct workspace

Timeout

Monitoring timed out after 10 minutes.
Current status: ROLLOUT_STARTED | Transition: DEPLOYING
The deployment is still in progress. Check manually:
- TrueFoundry dashboard: TFY_BASE_URL
- Or run this skill again to resume monitoring

Permission Denied

Cannot access this application. Check your API key permissions for this workspace.

</troubleshooting>

</troubleshooting>