truefoundry-monitor

Routing note: For ambiguous user intents, use the shared clarification templates in references/intent-clarification.md.
<objective>
<objective>

Monitor Deployment


Track a TrueFoundry deployment rollout to completion, reporting status at each stage and diagnosing failures automatically.

When to Use


  • After `tfy apply` or `tfy deploy` to track rollout progress
  • User says "monitor", "watch deployment", "is my deploy done", "check rollout"
  • Called automatically by the `deploy` skill after a successful apply/deploy

When NOT to Use


  • User wants to deploy → prefer `deploy` skill; ask if the user wants another valid path
  • User wants to list all apps → prefer `applications` skill; ask if the user wants another valid path
  • User wants to read historical logs → prefer `logs` skill; ask if the user wants another valid path
</objective> <instructions>
</objective> <instructions>

CRITICAL BEHAVIOR RULES


RULE 1: Once monitoring starts, you MUST poll until a terminal state or timeout. Do NOT stop early. Do NOT ask the user "should I keep checking?" — just keep checking.
RULE 2: Do NOT end your response while the deployment is in a non-terminal state (BUILDING, INITIALIZED, ROLLOUT_STARTED). If you are about to stop and the status is non-terminal, you are violating this rule — continue polling.
RULE 3: Between each poll, briefly tell the user what you're waiting for. Do NOT silently loop, but also do NOT ask for permission to continue.

Required Information


Before monitoring, you need:
  1. Workspace FQN (`TFY_WORKSPACE_FQN`). HARD RULE: Never auto-pick. Always ask the user to confirm.
  2. Application name: the service or job name being deployed
If invoked right after a deploy, both should already be known from the deploy context.

Execution Priority


For all status checks, use MCP tool calls first:
```
tfy_applications_list(filters={"workspace_fqn": "WORKSPACE_FQN", "application_name": "APP_NAME"})
```
If MCP tool calls are unavailable, fall back to direct API via `tfy-api.sh`.
When using direct API, set `TFY_API_SH` to the full path of this skill's `scripts/tfy-api.sh`. See `references/tfy-api-setup.md` for paths per agent.

Monitoring Flow


Step 1: Initial Status Check


```bash
TFY_API_SH=~/.claude/skills/truefoundry-monitor/scripts/tfy-api.sh
bash "$TFY_API_SH" GET '/api/svc/v1/apps?workspaceFqn=WORKSPACE_FQN&applicationName=APP_NAME'
```
Extract from the response at `data[0]` (the application object):
  • `deployment.currentStatus.status`: the deployment status enum
  • `deployment.currentStatus.transition`: current transition (e.g., `BUILDING`, `DEPLOYING`)
  • `deployment.currentStatus.state.isTerminalState`: boolean, most reliable terminal check
  • `deployment.currentStatus.state.display`: human-readable state
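The four fields listed above can be pulled out of a response in one pass. The following is a minimal sketch, assuming `jq` is installed and that the response has the `data[0].deployment.currentStatus` shape described here. The function name `extract_status_fields` and the fallback placeholders (`UNKNOWN`, `-`) are illustrative and not part of the skill.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the skill): reads an apps API response
# on stdin and prints status, transition, isTerminalState, and display as
# one tab-separated line. Missing fields fall back to placeholders so the
# column positions stay stable. Assumes jq is available.
extract_status_fields() {
  jq -r '.data[0].deployment.currentStatus
         | [ (.status // "UNKNOWN"),
             (.transition // "-"),
             (.state.isTerminalState // false | tostring),
             (.state.display // "-") ]
         | @tsv'
}
```

One possible use is to pipe the Step 1 call into the helper and split the line with `IFS=$'\t' read -r status transition is_terminal display`.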

Step 2: Poll Until Terminal State


The API response has two key fields: `status` (the deployment status) and `transition` (what's happening now). Use `state.isTerminalState` as the authoritative check for whether to stop polling.
Status values (from `deployment.currentStatus.status`):

| Status | Terminal? | Action |
|---------------|---------------|---------------|
| `INITIALIZED` | No | Report "Deployment initialized, waiting...", continue polling |
| `BUILDING` | No | Report "Build in progress", continue polling |
| `BUILD_SUCCESS` | No | Report "Build succeeded, deploying...", continue polling |
| `BUILD_FAILED` | Yes | Fetch build logs, report failure |
| `ROLLOUT_STARTED` | No | Report "Rollout started", continue polling |
| `DEPLOY_SUCCESS` | Yes | Report success with endpoint URL |
| `DEPLOY_FAILED` | Yes | Fetch pod logs, diagnose failure |
| `DEPLOY_FAILED_WITH_RETRY` | No | Report "Deploy failed, retrying...", continue polling |
| `PAUSED` | Yes | Report paused/stopped |
| `FAILED` | Yes | Report general failure |
| `CANCELLED` | Yes | Report cancelled |
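
The status table can be expressed as a small lookup, which may be useful as a fallback when `state.isTerminalState` is missing from a response. This is a sketch only: the function name `status_action` is hypothetical, the messages for terminal statuses are paraphrased from the Action column, and treating an unknown status as non-terminal (so that polling continues until the timeout) is an assumption.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the skill): prints a progress message
# for a deployment status and returns 0 if the status is terminal,
# 1 if polling should continue. Unknown statuses are assumed non-terminal.
status_action() {
  case "$1" in
    INITIALIZED)              echo "Deployment initialized, waiting..."; return 1 ;;
    BUILDING)                 echo "Build in progress"; return 1 ;;
    BUILD_SUCCESS)            echo "Build succeeded, deploying..."; return 1 ;;
    ROLLOUT_STARTED)          echo "Rollout started"; return 1 ;;
    DEPLOY_FAILED_WITH_RETRY) echo "Deploy failed, retrying..."; return 1 ;;
    BUILD_FAILED)             echo "Build failed: fetch build logs"; return 0 ;;
    DEPLOY_SUCCESS)           echo "Deployment succeeded"; return 0 ;;
    DEPLOY_FAILED)            echo "Deploy failed: fetch pod logs"; return 0 ;;
    PAUSED)                   echo "Deployment paused"; return 0 ;;
    FAILED)                   echo "Deployment failed"; return 0 ;;
    CANCELLED)                echo "Deployment cancelled"; return 0 ;;
    *)                        echo "Unknown status: $1"; return 1 ;;
  esac
}
```

The return code is only a fallback signal; when the API supplies `state.isTerminalState`, that field remains the authoritative stop condition.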
Transition values (from `deployment.currentStatus.transition`):

| Transition | Meaning |
|---------------|---------------|
| `BUILDING` | Image build is in progress |
| `DEPLOYING` | Pods are being created/updated |
| `REUSING_EXISTING_BUILD` | Skipping build, reusing cached image |
| `COMPONENTS_DEPLOYING` | Multi-component deployment in progress |
| `WAITING` | Waiting for resources |
Best practice: Always check `deployment.currentStatus.state.isTerminalState === true` to decide whether to stop polling, rather than matching individual status strings. The `state.display` field gives a human-friendly label.
Polling schedule:
  • First 2 minutes: check every 15 seconds
  • Minutes 2-5: check every 30 seconds
  • After 5 minutes: check every 60 seconds
  • Timeout after 10 minutes — report current state and suggest the user check manually
Between polls, tell the user what you're waiting for. Do not silently loop. Do NOT ask "should I continue?" — just continue.
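
The polling schedule above reduces to a function of elapsed time. The sketch below is one way to encode it; the name `poll_interval` is hypothetical, and the boundary handling (exactly 120 s moves to the 30 s tier, exactly 600 s counts as timed out) is an assumption, since the schedule does not specify it.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the skill): given the seconds elapsed
# since monitoring started, prints how long to wait before the next poll.
# Returns 1 and prints nothing once the 10-minute timeout is reached.
poll_interval() {
  local elapsed=$1
  if   (( elapsed >= 600 )); then return 1
  elif (( elapsed < 120 ));  then echo 15
  elif (( elapsed < 300 ));  then echo 30
  else                            echo 60
  fi
}

# Loop outline (commented out; the status check itself is the Step 1 call):
# start=$SECONDS
# while wait_s=$(poll_interval $(( SECONDS - start ))); do
#   ... check status, stop if terminal, tell the user what is pending ...
#   sleep "$wait_s"
# done
# ... on loop exit without a terminal state, report the timeout ...
```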

Step 3: On Success


When `state.isTerminalState` is `true` and status is `DEPLOY_SUCCESS`:
  1. Report the final status
  2. Show replicas ready (e.g., "2/2 replicas ready")
  3. Show the endpoint URL if the service has an exposed port
  4. Optionally run a quick health check on the endpoint:
```bash
# Only if the service exposes an HTTP port
curl -sf -o /dev/null -w '%{http_code}' "https://ENDPOINT_URL/health" || true
```
Report the HTTP status code. Do not fail the monitor if the health check fails; just report it.

Step 4: On Failure


When status is `BUILD_FAILED`, `DEPLOY_FAILED`, `FAILED`, or `CANCELLED`:

1. **Fetch recent logs** using the `logs` skill or direct API:

```bash
# Get the app ID first from the status response
TFY_API_SH=~/.claude/skills/truefoundry-monitor/scripts/tfy-api.sh

# Fetch recent logs (last 5 minutes)
bash "$TFY_API_SH" GET '/api/svc/v1/logs/WORKSPACE_ID/download?applicationFqn=APP_FQN&startTs=START_TS&endTs=END_TS'
```

2. **Identify the failure cause** from the logs (OOMKilled, CrashLoopBackOff, ImagePullBackOff, port mismatch, etc.)
3. **Suggest a fix** based on the error:

| Error Pattern | Suggested Fix |
|---------------|---------------|
| `OOMKilled` | Increase `memory_limit` in manifest |
| `CrashLoopBackOff` | Check startup command and logs for crash reason |
| `ImagePullBackOff` | Verify image URI and registry credentials |
| Port mismatch | Ensure manifest port matches what the app listens on |
| `Readiness probe failed` | Check health probe path and startup time |
| Build error | Check Dockerfile and build logs |

4. **Report summary** with: error type, relevant log excerpt (max 20 lines), and suggested fix
5. **Do NOT auto-retry.** Present the diagnosis and let the user decide next steps.
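
The literal patterns in the table lend themselves to a simple text match over the fetched logs. The following is a rough sketch under stated assumptions: `suggest_fix` is a hypothetical name, patterns are checked in table order rather than by position in the log, and the "Port mismatch" and "Build error" rows are left out because the table gives no literal string to match for them.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the skill): reads log text on stdin and
# prints the first error pattern (in table order) found in it, together
# with the suggested fix. Only rows with a literal pattern are covered.
suggest_fix() {
  local logs
  logs="$(cat)"
  case "$logs" in
    *OOMKilled*)
      echo "OOMKilled: increase memory_limit in the manifest" ;;
    *CrashLoopBackOff*)
      echo "CrashLoopBackOff: check the startup command and logs for the crash reason" ;;
    *ImagePullBackOff*)
      echo "ImagePullBackOff: verify the image URI and registry credentials" ;;
    *"Readiness probe failed"*)
      echo "Readiness probe failed: check the health probe path and startup time" ;;
    *)
      echo "No known pattern matched: review the log excerpt manually" ;;
  esac
}
```

The output is a starting point for the diagnosis in the report summary, not a replacement for reading the log excerpt.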

Presenting Status Updates


Use a consistent format for each status update:
```
Monitoring: my-service in cluster:workspace
Status: ROLLOUT_STARTED | Transition: DEPLOYING
Display: Deploying (1/2 replicas ready)
Elapsed: 45s
Next check in 15s...
```
Final summary on success:
```
Deployment complete: my-service
Status: DEPLOY_SUCCESS
Replicas: 2/2 ready
Endpoint: https://my-service-ws.example.com
Health check: 200 OK
Total time: 1m 32s
```
Final summary on failure:
```
Deployment failed: my-service
Status: DEPLOY_FAILED
Error: CrashLoopBackOff (container exited with code 1)
Log excerpt:
  > ModuleNotFoundError: No module named 'flask'
Suggested fix: Add 'flask' to requirements.txt and redeploy
```
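
To keep in-progress updates uniform across polls, the status block above can be produced by a pair of small formatting helpers. This is an illustrative sketch: the names `format_elapsed` and `print_status_update` and the argument order are assumptions, and the layout simply mirrors the sample update shown earlier.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not part of the skill) that reproduce the
# in-progress status block.

# Formats a duration in seconds as "45s" or "1m 32s".
format_elapsed() {
  local s=$1
  if (( s < 60 )); then
    printf '%ds\n' "$s"
  else
    printf '%dm %ds\n' $(( s / 60 )) $(( s % 60 ))
  fi
}

# Args: app name, workspace FQN, status, transition, display label,
#       elapsed seconds, seconds until the next check.
print_status_update() {
  printf 'Monitoring: %s in %s\n' "$1" "$2"
  printf 'Status: %s | Transition: %s\n' "$3" "$4"
  printf 'Display: %s\n' "$5"
  printf 'Elapsed: %s\n' "$(format_elapsed "$6")"
  printf 'Next check in %ss...\n' "$7"
}
```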
</instructions>
<success_criteria>
</instructions>
<success_criteria>

Success Criteria


  • Deployment status is tracked from current state to a terminal state
  • User sees clear progress updates at each polling interval
  • On success: replicas, endpoint URL, and optional health check are reported
  • On failure: logs are fetched, root cause is identified, and a fix is suggested
  • Monitor times out gracefully after 10 minutes with a status summary
  • The user is never left waiting without feedback
</success_criteria>
<references>
</success_criteria>
<references>

Composability


  • Before monitoring: Use `deploy` skill to deploy, then monitor
  • On failure: Use `logs` skill for deeper log analysis
  • Check app details: Use `applications` skill for full app info
  • Fix and redeploy: Use `deploy` skill to apply fixes
</references> <troubleshooting>
</references> <troubleshooting>

Error Handling


Application Not Found


Application "APP_NAME" not found in workspace "WORKSPACE_FQN".
Check:
- Application name is spelled correctly
- The deploy/apply command completed successfully
- You're checking the correct workspace

Timeout


Monitoring timed out after 10 minutes.
Current status: ROLLOUT_STARTED | Transition: DEPLOYING
The deployment is still in progress. Check manually:
- TrueFoundry dashboard: TFY_BASE_URL
- Or run this skill again to resume monitoring

Permission Denied


Cannot access this application. Check your API key permissions for this workspace.
</troubleshooting>
</troubleshooting>