fix-buildkite-ci
Fix Buildkite CI
Overview
Diagnose Buildkite failures programmatically and avoid guessing from UI screenshots. Prefer structured build/job JSON plus artifact inspection to find the exact failing test case and mismatch, then implement the smallest correct fix.
Target Selection
Resolve the triage target with this precedence:
- If the user provides a Buildkite build URL, use that build directly.
- Else if the user specifies a branch and/or a pipeline (for example main-cron, pull-request), use the specified scope.
- Else default to the current git branch and inspect the checks for the PR associated with that branch.
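The precedence above can be sketched as a small function. This is only an illustration: the names `TriageTarget` and `resolve_target`, and the shape of the returned value, are hypothetical and not part of any Buildkite tooling.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TriageTarget:
    kind: str    # "build_url", "scope", or "current_branch" (hypothetical labels)
    value: dict  # details the later steps would need


def resolve_target(build_url: Optional[str] = None,
                   branch: Optional[str] = None,
                   pipeline: Optional[str] = None,
                   current_branch: Optional[str] = None) -> TriageTarget:
    # 1. An explicit build URL always wins.
    if build_url:
        return TriageTarget("build_url", {"url": build_url})
    # 2. Otherwise use whatever branch and/or pipeline the user named.
    if branch or pipeline:
        return TriageTarget("scope", {"branch": branch, "pipeline": pipeline})
    # 3. Otherwise fall back to the current git branch and its PR checks.
    return TriageTarget("current_branch", {"branch": current_branch})
```

For instance, `resolve_target(pipeline="main-cron")` yields a `scope` target even when a current branch is also known, because an explicit scope takes precedence over the default.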
Workflow
- Identify the failing Buildkite build(s).
- Retrieve build JSON and list failed jobs.
- Pull job logs and extract the first concrete failure signal.
- Inspect artifacts when top-level logs are truncated.
- Map failure to root cause and apply a focused fix.
- Verify locally where feasible and summarize evidence.
Use the bk CLI first. If auth is unavailable, use public Buildkite JSON/log/artifact endpoints via curl.
For exact commands and endpoint patterns, read references/buildkite-ci-triage.md.
Step 1: Identify Failing Buildkite Checks
When no explicit target is given, find the PR for the current branch first, then run gh pr checks <PR_NUMBER> to find failing checks and capture Buildkite URLs (.../builds/<N>).
If the user specifies a branch/pipeline, list and filter builds with bk build list using those parameters.
If the user provides a Buildkite build URL, skip discovery and start from that build number.
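As a rough sketch of the "capture Buildkite URLs" part, the snippet below pulls build numbers out of text that has already been captured from the failing checks. It assumes the URLs look like `https://buildkite.com/<org>/<pipeline>/builds/<N>`; the function name and the returned dictionary keys are hypothetical.

```python
import re
from typing import Dict, List

# Assumed URL shape: https://buildkite.com/<org>/<pipeline>/builds/<N>
BUILD_URL_RE = re.compile(
    r"https://buildkite\.com/(?P<org>[^/\s]+)/(?P<pipeline>[^/\s]+)/builds/(?P<number>\d+)"
)


def extract_buildkite_builds(checks_output: str) -> List[Dict[str, object]]:
    """Return each distinct Buildkite build referenced in the given text."""
    seen = set()
    builds = []
    for match in BUILD_URL_RE.finditer(checks_output):
        key = (match["org"], match["pipeline"], int(match["number"]))
        if key in seen:
            continue  # several failing jobs often point at the same build
        seen.add(key)
        builds.append({
            "org": key[0],
            "pipeline": key[1],
            "number": key[2],
            "url": match.group(0),  # URL without any trailing job fragment
        })
    return builds
```

The function does no filtering by check state, so pass it only the lines for failing checks.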
Step 2: Pull Build JSON and Failed Jobs
Fetch builds/<N>.json, then list failed jobs by non-zero exit_status.
Capture at least:
- pipeline
- build number
- job id
- job name
- exit status
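A minimal sketch of the filtering step, applied to build JSON that has already been fetched and parsed. Only `exit_status` is named in this document; the other field names (`jobs`, `number`, `id`, `name`) are assumptions about the JSON shape and should be checked against a real response.

```python
from typing import Any, Dict, List


def failed_jobs(build: Dict[str, Any], pipeline: str) -> List[Dict[str, Any]]:
    """List jobs whose exit_status is present and non-zero."""
    rows = []
    for job in build.get("jobs", []):  # "jobs" is an assumed key
        status = job.get("exit_status")
        if status is None:
            continue  # job has no exit status (not finished, or not a command step)
        try:
            code = int(status)  # tolerate both 105 and "105"
        except (TypeError, ValueError):
            continue
        if code != 0:
            rows.append({
                "pipeline": pipeline,
                "build_number": build.get("number"),  # assumed key
                "job_id": job.get("id"),              # assumed key
                "job_name": job.get("name"),          # assumed key
                "exit_status": code,
            })
    return rows
```

Each returned row carries the five fields listed above, so it can be passed straight into the log-fetching step.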
Step 3: Extract the Concrete Failure
Fetch each failed job log and search for high-signal patterns:
- query result mismatch
- [Diff] (-expected|+actual)
- query is expected to fail with error:
- panic/assertion lines
- deterministic simulation error markers
- OOM/timeout/cancellation markers
Stop once you have one concrete failing file/case and mismatch.
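The search can be sketched as a scan that stops at the first matching line and returns it with a little context. The first three patterns are the literal strings listed above. The panic/assertion regex is an assumption (Rust-style messages). Simulation and OOM/timeout/cancellation markers are left out because their exact wording is not given here.

```python
import re
from typing import Optional, Tuple

SIGNALS = [
    ("result mismatch", re.compile(r"query result mismatch")),
    ("diff", re.compile(r"\[Diff\] \(-expected\|\+actual\)")),
    ("expected error", re.compile(r"query is expected to fail with error:")),
    # Assumed wording for panic/assertion lines; adjust to the real logs.
    ("panic/assertion", re.compile(r"panicked at|assertion.*failed")),
]


def first_failure_signal(log_text: str,
                         context: int = 3) -> Optional[Tuple[str, int, str]]:
    """Return (label, 1-based line number, surrounding lines) for the first hit."""
    lines = log_text.splitlines()
    for index, line in enumerate(lines):
        for label, pattern in SIGNALS:
            if pattern.search(line):
                start = max(0, index - context)
                snippet = "\n".join(lines[start:index + context + 1])
                return label, index + 1, snippet
    return None  # no high-signal line; fall back to artifacts
```

Returning the surrounding lines matters because the failing file or case name usually appears just before the mismatch line.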
Step 4: Fall Back to Artifacts
If logs only show wrapper errors (for example, command exited with status), inspect artifacts from the same job, especially:
- risedev-logs.zip
- risedev-logs/nodetype-*.log
Extract and search artifact logs for the exact mismatch.
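One way to search a downloaded archive without unpacking it to disk is sketched below. It assumes the artifact is an ordinary zip whose log members end in `.log`; the function name and its parameters are hypothetical.

```python
import fnmatch
import io
import re
import zipfile
from typing import Iterator, Tuple


def search_zip_logs(zip_bytes: bytes,
                    pattern: str,
                    name_glob: str = "*.log") -> Iterator[Tuple[str, int, str]]:
    """Yield (member name, 1-based line number, line) for every matching log line."""
    regex = re.compile(pattern)
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            # "*" in fnmatch also matches "/", so nested paths are included.
            if not fnmatch.fnmatchcase(name, name_glob):
                continue
            text = archive.read(name).decode("utf-8", errors="replace")
            for lineno, line in enumerate(text.splitlines(), start=1):
                if regex.search(line):
                    yield name, lineno, line
```

Decoding with `errors="replace"` keeps the scan going when a log contains stray binary bytes.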
Step 5: Apply Focused Fixes
Prefer minimal fixes tied to evidence:
- SQLLogicTest mismatch: update expected sections in the correct .slt/.slt.part file only when the query output change is intentional.
- Wrong runtime behavior: fix source code and keep tests as-is.
- Flaky/cancellation-only signal (143): treat as infra/cancel unless corroborated by product errors.
Avoid broad "retry and hope" actions without root-cause evidence.
Step 6: Verify and Report
Run the narrowest local check that validates the fix when possible. If full validation is not feasible, state it explicitly.
Always report:
- failing check/build/job identifiers
- failing file/test/case
- exact mismatch/error evidence
- applied fix (files changed)
- verification status and remaining risk
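To keep reports consistent, the five items above could be collected in a small structure and rendered in a fixed order. This is a hypothetical helper: the class name, field names, and default strings are illustrative only.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TriageReport:
    identifiers: str    # failing check/build/job identifiers
    failing_case: str   # failing file/test/case
    evidence: str       # exact mismatch/error evidence
    files_changed: List[str] = field(default_factory=list)
    verification: str = "not verified locally"
    remaining_risk: str = "unknown"

    def render(self) -> str:
        """Render the report as one line per required item."""
        fix = ", ".join(self.files_changed) or "none"
        return "\n".join([
            f"- failing check/build/job: {self.identifiers}",
            f"- failing file/test/case: {self.failing_case}",
            f"- evidence: {self.evidence}",
            f"- applied fix (files changed): {fix}",
            f"- verification: {self.verification}; remaining risk: {self.remaining_risk}",
        ])
```

The defaults are deliberately pessimistic, so a report that skips local verification says so explicitly instead of leaving the field blank.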
Buildkite-Specific Heuristics
- Exit code 105: often wrapper failure from docker-compose/plugin; inspect SLT/e2e logs for true mismatch.
- Exit code 4: common in simulation/recovery steps; inspect uploaded simulation logs.
- Exit code 143: usually cancellation/termination, not a deterministic product regression.
- raw_log_url may be null in JSON; use explicit job log endpoints by job id.
- Prefer JSON endpoints plus jq; avoid scraping large HTML pages.
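The exit-code heuristics can be kept as a lookup table so triage output always names the next thing to inspect. This is a sketch: the hint strings paraphrase the list above, and the helper names are hypothetical.

```python
# Hints paraphrased from the heuristics above; they are starting points, not verdicts.
EXIT_CODE_HINTS = {
    105: "likely docker-compose/plugin wrapper failure; inspect SLT/e2e logs for the true mismatch",
    4: "common in simulation/recovery steps; inspect uploaded simulation logs",
    143: "usually cancellation/termination, not a deterministic product regression",
}


def exit_code_hint(code: int) -> str:
    """Return the triage hint for an exit code, or a generic fallback."""
    return EXIT_CODE_HINTS.get(code, "no heuristic recorded; read the job log")


def is_cancellation_only(code: int, has_product_error: bool) -> bool:
    """Treat 143 as infra/cancel unless a product error corroborates a real failure."""
    return code == 143 and not has_product_error
```

Keeping the `has_product_error` check separate from the table reflects the rule that a 143 is dismissed only when no product error corroborates it.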