truefoundry-monitor
<objective>
Routing note: For ambiguous user intents, use the shared clarification templates in references/intent-clarification.md.
Monitor Deployment
Track a TrueFoundry deployment rollout to completion, reporting status at each stage and diagnosing failures automatically.
When to Use
- After `tfy apply` or `tfy deploy`, to track rollout progress
- User says "monitor", "watch deployment", "is my deploy done", "check rollout"
- Called automatically by the `deploy` skill after a successful apply/deploy
When NOT to Use
- User wants to deploy → prefer the `deploy` skill; ask if the user wants another valid path
- User wants to list all apps → prefer the `applications` skill; ask if the user wants another valid path
- User wants to read historical logs → prefer the `logs` skill; ask if the user wants another valid path
CRITICAL BEHAVIOR RULES
RULE 1: Once monitoring starts, you MUST poll until a terminal state or timeout. Do NOT stop early. Do NOT ask the user "should I keep checking?"; just keep checking.
RULE 2: Do NOT end your response while the deployment is in a non-terminal state (BUILDING, INITIALIZED, ROLLOUT_STARTED). If you are about to stop and the status is non-terminal, you are violating this rule; continue polling.
RULE 3: Between each poll, briefly tell the user what you're waiting for. Do NOT silently loop, but also do NOT ask for permission to continue.
Required Information
Before monitoring, you need:
- Workspace FQN (`TFY_WORKSPACE_FQN`). HARD RULE: Never auto-pick. Always ask the user to confirm.
- Application name: the service or job name being deployed
If invoked right after a deploy, both should already be known from the deploy context.
Execution Priority
For all status checks, use MCP tool calls first:
```
tfy_applications_list(filters={"workspace_fqn": "WORKSPACE_FQN", "application_name": "APP_NAME"})
```
If MCP tool calls are unavailable, fall back to direct API via `tfy-api.sh`.
When using direct API, set `TFY_API_SH` to the full path of this skill's `scripts/tfy-api.sh`. See `references/tfy-api-setup.md` for paths per agent.
Monitoring Flow
Step 1: Initial Status Check
```bash
TFY_API_SH=~/.claude/skills/truefoundry-monitor/scripts/tfy-api.sh
bash $TFY_API_SH GET '/api/svc/v1/apps?workspaceFqn=WORKSPACE_FQN&applicationName=APP_NAME'
```
Extract from the response at `data[0]` (the application object):
- `deployment.currentStatus.status`: the deployment status enum
- `deployment.currentStatus.transition`: current transition (e.g., `BUILDING`, `DEPLOYING`)
- `deployment.currentStatus.state.isTerminalState`: boolean, most reliable terminal check
- `deployment.currentStatus.state.display`: human-readable state
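The four fields listed above can be pulled out of the response in one pass. The following is a minimal sketch, assuming `jq` is installed and that the response body is piped in on stdin; the function name `extract_status` is hypothetical and not part of the skill's scripts.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads the apps API response JSON on stdin and prints
# status, transition, terminal flag, and display label as tab-separated fields.
# Assumes jq is available on PATH.
extract_status() {
  jq -r '.data[0].deployment.currentStatus
         | [.status, .transition, (.state.isTerminalState | tostring), .state.display]
         | @tsv'
}
```

A caller might read the fields with `IFS=$'\t' read -r status transition terminal display`, then branch on `terminal` rather than on the status string.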
Step 2: Poll Until Terminal State
The API response has two key fields: `status` (the deployment status) and `transition` (what's happening now). Use `state.isTerminalState` as the authoritative check for whether to stop polling.
Status values (from `deployment.currentStatus.status`):
| Status | Terminal? | Action |
|---|---|---|
| `INITIALIZED` | No | Report "Deployment initialized, waiting...", continue polling |
| `BUILDING` | No | Report "Build in progress", continue polling |
|  | No | Report "Build succeeded, deploying...", continue polling |
| `BUILD_FAILED` | Yes | Fetch build logs, report failure |
| `ROLLOUT_STARTED` | No | Report "Rollout started", continue polling |
| `DEPLOY_SUCCESS` | Yes | Report success with endpoint URL |
| `DEPLOY_FAILED` | Yes | Fetch pod logs, diagnose failure |
|  | No | Report "Deploy failed, retrying...", continue polling |
|  | Yes | Report paused/stopped |
| `FAILED` | Yes | Report general failure |
| `CANCELLED` | Yes | Report cancelled |
Transition values (from `deployment.currentStatus.transition`):
| Transition | Meaning |
|---|---|
| `BUILDING` | Image build is in progress |
| `DEPLOYING` | Pods are being created/updated |
|  | Skipping build, reusing cached image |
|  | Multi-component deployment in progress |
|  | Waiting for resources |
Best practice: Always check `deployment.currentStatus.state.isTerminalState === true` to decide whether to stop polling, rather than matching individual status strings. The `state.display` field gives a human-friendly label.
Polling schedule:
- First 2 minutes: check every 15 seconds
- Minutes 2-5: check every 30 seconds
- After 5 minutes: check every 60 seconds
- Timeout after 10 minutes — report current state and suggest the user check manually
Between polls, tell the user what you're waiting for. Do not silently loop. Do NOT ask "should I continue?" — just continue.
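The polling schedule and timeout above can be sketched as a small loop. This is an illustration under stated assumptions, not the skill's own implementation: `poll_interval`, `poll_until_terminal`, and `TIMEOUT_SECONDS` are hypothetical names, and the terminal check is passed in as a caller-supplied command that exits 0 once `isTerminalState` is true. Elapsed time is approximated by summing the sleep intervals, so time spent in the check itself is not counted.

```bash
#!/usr/bin/env bash
# Seconds to wait before the next poll, given seconds elapsed so far.
poll_interval() {
  local elapsed=$1
  if   (( elapsed < 120 )); then echo 15   # first 2 minutes
  elif (( elapsed < 300 )); then echo 30   # minutes 2-5
  else                           echo 60   # after 5 minutes
  fi
}

TIMEOUT_SECONDS=600  # 10-minute timeout from the schedule

# $1: hypothetical command that exits 0 when the deployment is terminal.
# Prints a progress line between polls and never asks for permission.
poll_until_terminal() {
  local check_terminal=$1 elapsed=0 delay
  while (( elapsed < TIMEOUT_SECONDS )); do
    if "$check_terminal"; then
      echo "terminal after ${elapsed}s"
      return 0
    fi
    delay=$(poll_interval "$elapsed")
    echo "Elapsed: ${elapsed}s. Next check in ${delay}s..."
    sleep "$delay"
    elapsed=$(( elapsed + delay ))
  done
  echo "timed out after ${elapsed}s"
  return 1
}
```

On timeout the function returns non-zero so the caller can report the current state and suggest a manual check.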
Step 3: On Success
When `state.isTerminalState` is `true` and status is `DEPLOY_SUCCESS`:
- Report the final status
- Show replicas ready (e.g., "2/2 replicas ready")
- Show the endpoint URL if the service has an exposed port
- Optionally run a quick health check on the endpoint:
```bash
# Only if the service exposes an HTTP port
curl -sf -o /dev/null -w '%{http_code}' "https://ENDPOINT_URL/health" || true
```
Report the HTTP status code. Do not fail the monitor if the health check fails; just report it.
Step 4: On Failure
When status is `BUILD_FAILED`, `DEPLOY_FAILED`, `FAILED`, or `CANCELLED`:
1. **Fetch recent logs** using the `logs` skill or direct API:
```bash
# Get the app ID first from the status response
TFY_API_SH=~/.claude/skills/truefoundry-monitor/scripts/tfy-api.sh

# Fetch recent logs (last 5 minutes)
bash $TFY_API_SH GET '/api/svc/v1/logs/WORKSPACE_ID/download?applicationFqn=APP_FQN&startTs=START_TS&endTs=END_TS'
```
2. **Identify the failure cause** from the logs (OOMKilled, CrashLoopBackOff, ImagePullBackOff, port mismatch, etc.)
3. **Suggest a fix** based on the error:
| Error Pattern | Suggested Fix |
|---------------|---------------|
| `OOMKilled` | Increase `memory_limit` in manifest |
| `CrashLoopBackOff` | Check startup command and logs for crash reason |
| `ImagePullBackOff` | Verify image URI and registry credentials |
| Port mismatch | Ensure manifest port matches what the app listens on |
| `Readiness probe failed` | Check health probe path and startup time |
| Build error | Check Dockerfile and build logs |
4. **Report summary** with: error type, relevant log excerpt (max 20 lines), and suggested fix
5. **Do NOT auto-retry.** Present the diagnosis and let the user decide next steps.
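The pattern-to-fix table lends itself to a simple lookup. The sketch below covers only the rows that have a literal log string to match; port mismatches and build errors have no fixed signature in the table and still need manual reading. The names `suggest_fix` and `log_excerpt` are hypothetical, the fallback message is an addition for illustration, and taking the last 20 lines is one assumed way to satisfy the "max 20 lines" limit.

```bash
#!/usr/bin/env bash
# Map a known error string found in the log text to the suggested fix.
# The first matching pattern wins when several appear in the same log.
suggest_fix() {
  local log_text=$1
  case "$log_text" in
    *OOMKilled*)                echo "Increase memory_limit in manifest" ;;
    *CrashLoopBackOff*)         echo "Check startup command and logs for crash reason" ;;
    *ImagePullBackOff*)         echo "Verify image URI and registry credentials" ;;
    *"Readiness probe failed"*) echo "Check health probe path and startup time" ;;
    *)                          echo "No known pattern matched; review the logs manually" ;;
  esac
}

# Trim log text on stdin to at most the last 20 lines for the summary.
log_excerpt() {
  tail -n 20
}
```

The function only produces a suggestion; in keeping with step 5, nothing here retries or redeploys.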
Presenting Status Updates
Use a consistent format for each status update:
```
Monitoring: my-service in cluster:workspace
Status: ROLLOUT_STARTED | Transition: DEPLOYING
Display: Deploying (1/2 replicas ready)
Elapsed: 45s
Next check in 15s...
```
Final summary on success:
```
Deployment complete: my-service
Status: DEPLOY_SUCCESS
Replicas: 2/2 ready
Endpoint: https://my-service-ws.example.com
Health check: 200 OK
Total time: 1m 32s
```
Final summary on failure:
```
Deployment failed: my-service
Status: DEPLOY_FAILED
Error: CrashLoopBackOff - container exited with code 1
Log excerpt:
> ModuleNotFoundError: No module named 'flask'
Suggested fix: Add 'flask' to requirements.txt and redeploy
```
<success_criteria>
Success Criteria
- Deployment status is tracked from current state to a terminal state
- User sees clear progress updates at each polling interval
- On success: replicas, endpoint URL, and optional health check are reported
- On failure: logs are fetched, root cause is identified, and a fix is suggested
- Monitor times out gracefully after 10 minutes with a status summary
- The user is never left waiting without feedback
</success_criteria>
<references>
Composability
- Before monitoring: Use the `deploy` skill to deploy, then monitor
- On failure: Use the `logs` skill for deeper log analysis
- Check app details: Use the `applications` skill for full app info
- Fix and redeploy: Use the `deploy` skill to apply fixes
Error Handling
Application Not Found
```
Application "APP_NAME" not found in workspace "WORKSPACE_FQN".
Check:
- Application name is spelled correctly
- The deploy/apply command completed successfully
- You're checking the correct workspace
```
Timeout
```
Monitoring timed out after 10 minutes.
Current status: ROLLOUT_STARTED | Transition: DEPLOYING
The deployment is still in progress. Check manually:
- TrueFoundry dashboard: TFY_BASE_URL
- Or run this skill again to resume monitoring
```
Permission Denied
```
Cannot access this application. Check your API key permissions for this workspace.
```