ci-monitoring
CI Monitoring
Overview
Monitor CI pipeline and resolve failures until green.
CRITICAL: CI is validation, not discovery.
If CI finds a bug you didn't find locally, your local testing was insufficient.

Before blaming CI, ask yourself:
- Did you run all tests locally?
- Did you test against local services (postgres, redis)?
- Did you run the same checks CI runs?
- Did you run integration tests, not just unit tests with mocks?
CI should fail only because of environment differences, flaky tests, or infrastructure issues, never because of bugs you could have caught locally.
Core principle: CI failures are blockers. But they should never be surprises.
Announce at start: "I'm monitoring CI and will resolve any failures."
The CI Loop
```
       PR Created
            │
            ▼
     ┌─────────────┐
     │ Wait for CI │
     └──────┬──────┘
            │
            ▼
     ┌─────────────┐
     │ CI Status?  │
     └──────┬──────┘
            │
     ┌──────┴───────┐
     │              │
   Green        Red/Failed
     │              │
     ▼              ▼
┌─────────┐  ┌─────────────┐
│  MERGE  │  │  Diagnose   │
│ THE PR  │  │  failure    │
└────┬────┘  └──────┬──────┘
     │              │
     ▼              ▼
┌─────────┐  ┌─────────────┐
│ Continue│  │  Fixable?   │
│ to next │  └──────┬──────┘
│ issue   │         │
└─────────┘   ┌─────┴─────────┐
              │               │
             Yes              No
              │               │
              ▼               ▼
         ┌─────────┐   ┌─────────────┐
         │ Fix and │   │ Document as │
         │ push    │   │ unresolvable│
         └────┬────┘   └─────────────┘
              │
              └────► Back to "Wait for CI"
```

CRITICAL: Green CI = Merge Immediately
When CI passes, you MUST merge the PR and continue working.
Do NOT:
- Stop and report "CI is green, ready for review"
- Wait for user confirmation
- Summarize and ask what to do next
DO:
- Merge the PR immediately: `gh pr merge [PR_NUMBER] --squash --delete-branch`
- Mark the linked issue as Done
- Continue to the next issue in scope
```bash
# When CI passes
gh pr merge [PR_NUMBER] --squash --delete-branch

# Update linked issue status
gh issue edit [ISSUE_NUMBER] --remove-label "status:in-review" --add-label "status:done"

# Continue to next issue (do not stop)
```

**The only exception:** PRs with `do-not-merge` label require explicit user action.
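
The merge rule and its one exception can be read as a small decision function. The sketch below is illustrative only: the `PullRequestSnapshot` shape, the `CheckState` values, and the action names are hypothetical and are not the JSON that `gh` returns. It also assumes that a PR with no reported checks should be treated as still waiting.

```typescript
// Hypothetical shapes for illustration; not the schema of any `gh` command.
type CheckState = "passing" | "failing" | "pending";
type Action = "merge" | "wait" | "diagnose" | "hold-for-user";

interface PullRequestSnapshot {
  checks: CheckState[];
  labels: string[];
}

function nextAction(pr: PullRequestSnapshot): Action {
  // Any failure is a blocker: diagnose it even if other checks are still running.
  if (pr.checks.some((c) => c === "failing")) return "diagnose";
  // Assumption: no checks reported yet means CI has not started, so keep waiting.
  if (pr.checks.length === 0 || pr.checks.some((c) => c === "pending")) return "wait";
  // All green: merge immediately unless the PR carries the do-not-merge label.
  return pr.labels.includes("do-not-merge") ? "hold-for-user" : "merge";
}
```

Checking for failures before pending checks means diagnosis can start as soon as one check goes red, rather than after the whole run finishes.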
Checking CI Status
Using GitHub CLI
```bash
# Check all CI checks
gh pr checks [PR_NUMBER]

# Watch CI in real-time
gh pr checks [PR_NUMBER] --watch

# Get detailed status
gh pr view [PR_NUMBER] --json statusCheckRollup
```

Expected Output
```
All checks were successful
0 failing, 0 pending, 5 passing

CHECKS
✓  build          1m23s
✓  lint           45s
✓  test           3m12s
✓  typecheck      1m05s
✓  security-scan  2m30s
```

Handling Failures
Step 1: Identify the Failure
```bash
# Get failed check details
gh pr checks [PR_NUMBER]

# View workflow run logs
gh run view [RUN_ID] --log-failed
```

Step 2: Diagnose the Cause
Common failure types:
| Type | Symptoms | Cause |
|---|---|---|
| Test failure | Failure appears in test output | Code bug or test bug |
| Build failure | Compilation errors | Type errors, syntax errors |
| Lint failure | Style violations | Formatting, conventions |
| Typecheck failure | Type errors | Missing types, wrong types |
| Timeout | Job exceeded time limit | Performance issue or stuck test |
| Flaky test | Passes locally, fails in CI; passes on CI retry | Race condition, environment difference |
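
Once the failing checks are known, the first diagnostic move is usually to rerun the matching command locally. The sketch below maps check names to the `pnpm` commands this document uses in Step 3. It assumes the check names match those in the sample output (`build`, `lint`, `test`, `typecheck`); `localCommandsFor` is a hypothetical helper, and checks with no local equivalent are returned separately rather than guessed at.

```typescript
// Assumption: CI check names match the sample output in this document.
const LOCAL_COMMANDS = new Map<string, string>([
  ["test", "pnpm test"],
  ["build", "pnpm build"],
  ["lint", "pnpm lint"],
  ["typecheck", "pnpm typecheck"],
]);

interface ReproPlan {
  commands: string[]; // local commands to run, in the order the checks were given
  unmapped: string[]; // failed checks with no known local command
}

function localCommandsFor(failedChecks: string[]): ReproPlan {
  const plan: ReproPlan = { commands: [], unmapped: [] };
  for (const name of failedChecks) {
    const command = LOCAL_COMMANDS.get(name);
    if (command === undefined) {
      plan.unmapped.push(name);
    } else if (!plan.commands.includes(command)) {
      plan.commands.push(command);
    }
  }
  return plan;
}
```

Timeouts and flaky tests do not map to a single command, so they are deliberately left out of the table.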
Step 3: Fix the Issue
Test Failures
```bash
# Reproduce locally
pnpm test

# Run specific failing test
pnpm test --grep "test name"

# Fix the code or test
# Commit and push
```

Build Failures
```bash
# Reproduce locally
pnpm build

# Fix compilation errors
# Commit and push
```

Lint Failures
```bash
# Check lint errors
pnpm lint

# Auto-fix what's possible
pnpm lint:fix

# Manually fix remaining
# Commit and push
```

Type Failures
```bash
# Check type errors
pnpm typecheck

# Fix type issues
# Commit and push
```

Step 4: Push Fix and Wait
```bash
# Commit fix
git add .
git commit -m "fix(ci): Resolve test failure in user validation"

# Push
git push

# Wait for CI again
gh pr checks [PR_NUMBER] --watch
```

Step 5: Repeat Until Green
Loop through diagnose → fix → push → wait until all checks pass.
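
As a rough sketch, the loop can be written with the CI-facing steps injected as callbacks. `waitForCi` and `tryFix` are hypothetical names standing in for "watch the checks" and "diagnose, fix, and push". The `maxAttempts` cap is an assumption added here so the sketch cannot spin forever; the document itself simply says to repeat until green or until the failure is documented as unresolvable.

```typescript
type CiResult = { green: true } | { green: false; failure: string };

interface CiSteps {
  waitForCi(): Promise<CiResult>;             // e.g. wraps watching the PR checks
  tryFix(failure: string): Promise<boolean>;  // true if a fix was committed and pushed
}

async function runCiLoop(
  steps: CiSteps,
  maxAttempts = 5, // assumption: a safety cap, not part of the original process
): Promise<"green" | "unresolvable"> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const result = await steps.waitForCi();
    if (result.green) return "green";
    // Red: diagnose and attempt a fix; if none is possible, stop and document it.
    const fixed = await steps.tryFix(result.failure);
    if (!fixed) return "unresolvable";
  }
  return "unresolvable";
}
```

Injecting the steps keeps the control flow testable without touching a real repository.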
Flaky Tests
Identifying Flakiness
```
Test passes locally
Test fails in CI
Test passes on retry in CI
```

Handling Flakiness
- Don't just retry - Find the root cause
- Check for race conditions - Timing-dependent code
- Check for environment differences - Paths, env vars, services
- Check for state pollution - Tests affecting each other
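
The timing example in this section calls a `waitFor` helper without defining it. Test libraries often ship one; where none is available, a minimal polling version might look like the sketch below. The option names `timeoutMs` and `intervalMs` and their defaults are assumptions chosen for illustration.

```typescript
interface WaitForOptions {
  timeoutMs?: number;  // give up after this long (assumed default: 5000)
  intervalMs?: number; // delay between polls (assumed default: 50)
}

async function waitFor(
  condition: () => boolean | Promise<boolean>,
  { timeoutMs = 5000, intervalMs = 50 }: WaitForOptions = {},
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (true) {
    // Poll the condition instead of sleeping for a fixed, hopeful duration.
    if (await condition()) return;
    if (Date.now() >= deadline) {
      throw new Error(`waitFor: condition not met within ${timeoutMs}ms`);
    }
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

Unlike a fixed delay, this returns as soon as the condition holds and fails loudly when it never does, which turns a silent race into a diagnosable timeout.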
```typescript
// Common flaky pattern: timing dependency

// BAD
await saveData();
await delay(100); // Hoping 100ms is enough
const result = await loadData();

// GOOD: Wait for condition
await saveData();
await waitFor(() => dataExists());
const result = await loadData();
```

Unresolvable Failures
Sometimes failures can't be fixed in the current PR:
Legitimate Unresolvable Cases
| Case | Example |
|---|---|
| CI infrastructure issue | Service down, rate limited |
| Pre-existing flaky test | Not introduced by this PR |
| Upstream dependency issue | External API changed |
| Requires manual intervention | Needs secrets, permissions |
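
The table can be paired with a simple guard so that "unresolvable" is never used as an escape hatch for real bugs. In this sketch the `FailureAssessment` shape and the category names are hypothetical; the rule it encodes is the one this document states: the root cause must be identified, it must not come from the PR's own changes, and it must fall into one of the legitimate cases.

```typescript
// Hypothetical category names mirroring the four rows of the table.
type UnresolvableCase =
  | "ci-infrastructure"
  | "pre-existing-flaky-test"
  | "upstream-dependency"
  | "manual-intervention";

interface FailureAssessment {
  rootCause: string;           // empty means the cause has not been identified yet
  causedByThisPr: boolean;
  category?: UnresolvableCase; // undefined if it fits none of the legitimate cases
}

function mayTreatAsUnresolvable(a: FailureAssessment): boolean {
  if (a.causedByThisPr) return false;           // real failures must be fixed
  if (a.rootCause.trim() === "") return false;  // no diagnosis, no exemption
  return a.category !== undefined;
}
```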
Process for Unresolvable
- Document the issue
```bash
gh pr comment [PR_NUMBER] --body "## CI Issue
The \`security-scan\` check is failing due to a known issue with the scanner service (see #999).
This is not related to changes in this PR. The scan passes when run locally.
Requesting bypass approval from @maintainer."
```
- Create issue if new
```bash
gh issue create \
  --title "CI: Security scanner service timeout" \
  --body "The security scanner is timing out in CI..."
```
- Request bypass if appropriate. Some teams allow merging with known infrastructure failures.
- Do NOT merge with real failures. If the failure is from your code, it must be fixed.
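
Quoting a multi-line Markdown body inside a shell string is easy to get wrong, so it can help to build the comment text separately. The sketch below assembles a comment in the same shape as the example; `CiIssueReport` and `buildCiIssueComment` are hypothetical names, and the wording is only one possible template.

```typescript
interface CiIssueReport {
  checkName: string;      // e.g. "security-scan"
  reason: string;         // short phrase completing "is failing due to ..."
  trackingIssue?: number; // existing issue number, if one exists
  maintainer?: string;    // handle to ask for bypass approval, without the "@"
}

function buildCiIssueComment(r: CiIssueReport): string {
  const reference = r.trackingIssue !== undefined ? ` (see #${r.trackingIssue})` : "";
  const lines = [
    "## CI Issue",
    "",
    `The \`${r.checkName}\` check is failing due to ${r.reason}${reference}.`,
    "This is not related to changes in this PR.",
  ];
  if (r.maintainer !== undefined) {
    lines.push("", `Requesting bypass approval from @${r.maintainer}.`);
  }
  return lines.join("\n");
}
```

The resulting text can be written to a file and posted with `gh pr comment [PR_NUMBER] --body-file <file>`, which avoids escaping backticks in the shell.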
CI Best Practices
Run Locally First (MANDATORY)
CI is the last resort, not the first check.
Before pushing, run EVERYTHING CI will run:
```bash
# Run the same checks CI will run
pnpm lint
pnpm typecheck
pnpm test              # Unit tests
pnpm test:integration  # Integration tests against real services
pnpm build

# If you have database changes
docker-compose up -d postgres
pnpm migrate
```

**If your project has docker-compose services:**
- Start them before testing: `docker-compose up -d`
- Run integration tests against real services
- Verify migrations apply to real database
- Don't rely on mocks alone
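
To make the local run repeatable, the command list can be generated and executed in order, stopping at the first failure. This is a sketch under assumptions: `prePushCommands` and `runUntilFailure` are hypothetical helpers, the script names are the ones used in this document, and starting the database before the checks is a choice made here so that integration tests find their services.

```typescript
function prePushCommands(opts: { hasDatabaseChanges: boolean }): string[] {
  const commands: string[] = [];
  if (opts.hasDatabaseChanges) {
    // Assumption: bring services up first so integration tests can reach them.
    commands.push("docker-compose up -d postgres", "pnpm migrate");
  }
  commands.push(
    "pnpm lint",
    "pnpm typecheck",
    "pnpm test",
    "pnpm test:integration",
    "pnpm build",
  );
  return commands;
}

// Runs commands in order; returns the first one that fails, or null if all pass.
function runUntilFailure(commands: string[], run: (command: string) => boolean): string | null {
  for (const command of commands) {
    if (!run(command)) return command;
  }
  return null;
}
```

In Node, `run` could wrap `child_process.spawnSync(command, { shell: true, stdio: "inherit" })` and report whether the exit status was zero.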
**Skill:** `local-service-testing`

Commit Incrementally
Don't push 10 commits at once. Push smaller changes:
```bash
# Small fix, push, verify
git push

# Wait for CI
gh pr checks --watch

# Then next change
```

Monitor Actively
Don't "push and forget":
```bash
# Watch CI after each push
gh pr checks [PR_NUMBER] --watch
```

Checklist
For each CI run:
- Waited for CI to complete
- All checks examined
- Failures diagnosed (if any)
- Fixes implemented (if needed)
- Re-pushed and re-checked (if fixed)
- All green
When CI is green:
- PR merged immediately (`gh pr merge --squash --delete-branch`)
- Linked issue marked Done
- Continued to next issue (do NOT stop and report)
For unresolvable issues:
- Root cause identified
- Not caused by PR changes
- Documented in PR comment
- Issue created if new problem
- Bypass approval requested if appropriate
Integration
This skill is called by:
- `issue-driven-development` - Step 13
- `autonomous-orchestration` - Main loop and bootstrap
This skill follows:
- `pr-creation` - PR exists
This skill completes:
- The PR lifecycle - merge is the final step, not "verification-before-merge"
This skill may trigger:
- `error-recovery` - If CI reveals deeper issues