fix-buildkite-ci

Fix Buildkite CI

Overview

Diagnose Buildkite failures programmatically and avoid guessing from UI screenshots. Prefer structured build/job JSON plus artifact inspection to find the exact failing test case and mismatch, then implement the smallest correct fix.

Target Selection

Resolve triage target with this precedence:

If user provides a Buildkite build URL, use that build directly.
Else if user specifies a branch and/or a pipeline (for example `pull-request`, `main-cron`), use the specified scope.
Else default to the current git branch and inspect the checks for the PR associated with that branch.
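
The precedence above can be sketched as a small dispatcher. This is only an illustration: the function name `resolve_target` and the returned `mode` labels are hypothetical and do not come from the skill itself.

```python
from typing import Optional


def resolve_target(build_url: Optional[str] = None,
                   branch: Optional[str] = None,
                   pipeline: Optional[str] = None) -> dict:
    """Pick the triage target using the documented precedence.

    The "mode" labels are hypothetical names chosen for this sketch.
    """
    # 1. An explicit build URL always wins.
    if build_url:
        return {"mode": "build_url", "build_url": build_url}
    # 2. Otherwise honor an explicit branch and/or pipeline scope.
    if branch or pipeline:
        return {"mode": "scope", "branch": branch, "pipeline": pipeline}
    # 3. Otherwise fall back to the PR checks of the current git branch.
    return {"mode": "current_branch_pr"}
```

Keeping the decision in one place makes it easy to confirm that a build URL is never overridden by a branch or pipeline hint.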

Workflow

Identify the failing Buildkite build(s).
Retrieve build JSON and list failed jobs.
Pull job logs and extract the first concrete failure signal.
Inspect artifacts when top-level logs are truncated.
Map failure to root cause and apply a focused fix.
Verify locally where feasible and summarize evidence.

Use `bk` CLI first. If auth is unavailable, use public Buildkite JSON/log/artifact endpoints via `curl`.

For exact commands and endpoint patterns, read `references/buildkite-ci-triage.md`.

Step 1: Identify Failing Buildkite Checks

When no explicit target is given, find the PR for the current branch first, then run `gh pr checks <PR_NUMBER>` to find failing checks and capture Buildkite URLs (`.../builds/<N>`).

If user specifies a branch/pipeline, list and filter builds with `bk build list` using those parameters. If user provides a Buildkite build URL, skip discovery and start from that build number.
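
As a rough sketch of the capture step, the helper below pulls Buildkite build URLs and build numbers out of saved check output. It assumes URLs of the form `https://buildkite.com/<org>/<pipeline>/builds/<N>`; the helper name and the sample text are hypothetical, and the caller is expected to pass only the lines for failing checks.

```python
import re

# Assumed URL shape: https://buildkite.com/<org>/<pipeline>/builds/<N>
BUILD_URL_RE = re.compile(r"https://buildkite\.com/[\w.-]+/[\w.-]+/builds/(\d+)")


def buildkite_builds(checks_output: str) -> list[tuple[str, int]]:
    """Return unique (build_url, build_number) pairs found in the text."""
    seen: dict[str, int] = {}
    for match in BUILD_URL_RE.finditer(checks_output):
        # group(0) is the URL without any "#job" fragment; group(1) is <N>.
        seen.setdefault(match.group(0), int(match.group(1)))
    return list(seen.items())
```

Deduplicating by URL matters because several failing checks often point at jobs within the same build.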

Step 2: Pull Build JSON and Failed Jobs

Fetch `builds/<N>.json`, then list failed jobs by non-zero `exit_status`.

Capture at least:

pipeline
build number
job id
job name
exit status
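
A minimal sketch of the filtering step, assuming the build JSON has already been downloaded and parsed. The field names `jobs`, `id`, `name`, and `number` are assumptions about the JSON shape (only `exit_status` is named above), so adjust them to what the endpoint actually returns.

```python
from typing import Any


def failed_jobs(build: dict[str, Any], pipeline: str) -> list[dict[str, Any]]:
    """List jobs with a non-zero exit_status, keeping the fields worth capturing.

    Assumed (unverified) shape: build["number"], build["jobs"][i]["id"|"name"].
    """
    rows = []
    for job in build.get("jobs", []):
        status = job.get("exit_status")
        if status in (None, ""):
            continue  # job has no exit status (e.g. never ran)
        if int(status) == 0:
            continue  # job passed
        rows.append({
            "pipeline": pipeline,
            "build_number": build.get("number"),
            "job_id": job.get("id"),
            "job_name": job.get("name"),
            "exit_status": int(status),  # normalize string or int to int
        })
    return rows
```

The `int(...)` conversion is defensive: it lets the same code handle an exit status serialized either as a number or as a string.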

Step 3: Extract the Concrete Failure

Fetch each failed job log and search for high-signal patterns:

`query result mismatch`
`[Diff] (-expected|+actual)`
`query is expected to fail with error:`
panic/assertion lines
deterministic simulation error markers
OOM/timeout/cancellation markers

Stop once you have one concrete failing file/case and mismatch.
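
One way to apply "stop at the first concrete signal" to a downloaded log is sketched below. The three literal patterns come from the list above; the panic/assertion regex is an assumption about Rust-style messages, and the function name and `kind` labels are hypothetical.

```python
import re
from typing import Optional

PATTERNS = [
    ("slt_mismatch", re.compile(r"query result mismatch")),
    ("slt_diff", re.compile(r"\[Diff\] \(-expected\|\+actual\)")),
    ("expected_error", re.compile(r"query is expected to fail with error:")),
    # Assumption: panics/assertions look like Rust messages.
    ("panic", re.compile(r"panicked at|assertion.*failed")),
]


def first_signal(log_text: str, context: int = 3) -> Optional[dict]:
    """Return the first matching line plus a few following lines, or None."""
    lines = log_text.splitlines()
    for i, line in enumerate(lines):
        for kind, pattern in PATTERNS:
            if pattern.search(line):
                return {
                    "kind": kind,
                    "line": i + 1,  # 1-based line number in the log
                    "excerpt": lines[i:i + context + 1],
                }
    return None
```

Scanning line by line, rather than pattern by pattern, ensures the earliest failure in the log is reported even when later lines match a higher-priority pattern.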

Step 4: Fall Back to Artifacts

If logs only show wrapper errors (for example, command exited with status), inspect artifacts from the same job, especially:

`risedev-logs.zip`
`risedev-logs/nodetype-*.log`

Extract and search artifact logs for the exact mismatch.
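
The extract-and-search step might look like the following sketch, which reads a downloaded zip from memory and scans every `.log` member for a marker string. The helper name is hypothetical, and the sketch does not assume any particular member names inside the archive.

```python
import io
import zipfile


def search_artifact_zip(zip_bytes: bytes, needle: str) -> list[tuple[str, int, str]]:
    """Return (member_name, line_number, line) for each .log line containing needle."""
    hits = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for name in zf.namelist():
            if not name.endswith(".log"):
                continue  # skip non-log members
            # Decode leniently: logs may contain stray non-UTF-8 bytes.
            text = zf.read(name).decode("utf-8", errors="replace")
            for lineno, line in enumerate(text.splitlines(), start=1):
                if needle in line:
                    hits.append((name, lineno, line))
    return hits
```

Passing a marker such as `query result mismatch` as `needle` reuses the same high-signal strings used for the top-level job logs.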

Step 5: Apply Focused Fixes

Prefer minimal fixes tied to evidence:

SQLLogicTest mismatch: update expected sections in the correct `.slt`/`.slt.part` file only when query output change is intentional.
Wrong runtime behavior: fix source code and keep tests as-is.
Flaky/cancellation-only signal (`143`): treat as infra/cancel unless corroborated by product errors.

Avoid broad "retry and hope" actions without root-cause evidence.

Step 6: Verify and Report

Run the narrowest local check that validates the fix when possible. If full validation is not feasible, state it explicitly.

Always report:

failing check/build/job identifiers
failing file/test/case
exact mismatch/error evidence
applied fix (files changed)
verification status and remaining risk

Buildkite-Specific Heuristics

Exit code `105`: often wrapper failure from docker-compose/plugin; inspect SLT/e2e logs for true mismatch.
Exit code `4`: common in simulation/recovery steps; inspect uploaded simulation logs.
Exit code `143`: usually cancellation/termination, not a deterministic product regression.
`raw_log_url` may be null in JSON; use explicit job log endpoints by job id.
Prefer JSON endpoints plus `jq`; avoid scraping large HTML pages.
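
The exit-code heuristics can be kept as a small lookup table so that triage notes stay consistent. This is a sketch only: the hint wording paraphrases the list above, and the function names are hypothetical.

```python
EXIT_CODE_HINTS = {
    105: "likely wrapper failure (docker-compose/plugin); inspect SLT/e2e logs",
    4: "common in simulation/recovery steps; inspect uploaded simulation logs",
    143: "usually cancellation/termination, not a deterministic regression",
}


def exit_code_hint(code: int) -> str:
    """Return the heuristic for a job exit code, or a generic fallback."""
    return EXIT_CODE_HINTS.get(
        code, "no specific heuristic; find the first concrete failure in the job log"
    )


def is_cancellation_only(code: int, has_product_error: bool) -> bool:
    """Treat 143 as infra/cancel unless a product error corroborates it."""
    return code == 143 and not has_product_error
```

The second helper encodes the rule that a `143` exit only counts as a product failure when the logs also show a real error.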
