fix-buildkite-ci

Fix Buildkite CI

Overview

Diagnose Buildkite failures programmatically and avoid guessing from UI screenshots. Prefer structured build/job JSON plus artifact inspection to find the exact failing test case and mismatch, then implement the smallest correct fix.

Target Selection

Resolve triage target with this precedence:
  • If user provides a Buildkite build URL, use that build directly.
  • Else if user specifies a branch and/or a pipeline (for example pull-request, main-cron), use the specified scope.
  • Else default to the current git branch and inspect the checks for the PR associated with that branch.
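
The precedence above can be sketched as a small function. This is only an illustration: the names TriageTarget and resolve_target, and the shape of the returned value, are hypothetical and not part of any Buildkite tooling.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TriageTarget:
    kind: str    # "build_url", "scope", or "current_branch" (hypothetical labels)
    value: dict  # details needed by the later steps


def resolve_target(build_url: Optional[str] = None,
                   branch: Optional[str] = None,
                   pipeline: Optional[str] = None,
                   current_branch: str = "HEAD") -> TriageTarget:
    # 1. An explicit build URL wins over everything else.
    if build_url:
        return TriageTarget("build_url", {"url": build_url})
    # 2. A user-specified branch and/or pipeline narrows the scope.
    if branch or pipeline:
        return TriageTarget("scope", {"branch": branch, "pipeline": pipeline})
    # 3. Otherwise fall back to the current git branch and its PR checks.
    return TriageTarget("current_branch", {"branch": current_branch})
```

Keeping the three cases in one ordered function makes it hard to accidentally honor a branch filter when a build URL was also given.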

Workflow

  1. Identify the failing Buildkite build(s).
  2. Retrieve build JSON and list failed jobs.
  3. Pull job logs and extract the first concrete failure signal.
  4. Inspect artifacts when top-level logs are truncated.
  5. Map failure to root cause and apply a focused fix.
  6. Verify locally where feasible and summarize evidence.
Use bk CLI first. If auth is unavailable, use public Buildkite JSON/log/artifact endpoints via curl.
For exact commands and endpoint patterns, read references/buildkite-ci-triage.md.

Step 1: Identify Failing Buildkite Checks

When no explicit target is given, find the PR for the current branch first, then run gh pr checks <PR_NUMBER> to find failing checks and capture Buildkite URLs (.../builds/<N>).
If user specifies a branch/pipeline, list and filter builds with bk build list using those parameters. If user provides a Buildkite build URL, skip discovery and start from that build number.
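
As a rough sketch of the "capture Buildkite URLs" step, the helper below pulls organization, pipeline, and build number out of any text that contains build links (for example, saved output of the checks command). The function name extract_builds is hypothetical, and the URL shape https://buildkite.com/<org>/<pipeline>/builds/<N> is assumed here.

```python
import re

# Assumed URL shape: https://buildkite.com/<org>/<pipeline>/builds/<N>
BUILD_URL = re.compile(
    r"https://buildkite\.com/(?P<org>[^/\s]+)/(?P<pipeline>[^/\s]+)/builds/(?P<number>\d+)"
)


def extract_builds(text: str) -> list:
    """Return unique (org, pipeline, build_number) tuples in order of appearance."""
    found = [
        (m["org"], m["pipeline"], int(m["number"]))
        for m in BUILD_URL.finditer(text)
    ]
    # dict.fromkeys drops duplicates while preserving first-seen order.
    return list(dict.fromkeys(found))
```

Links that carry a job anchor after the build number (such as "#<job-id>") still match, because the pattern stops at the digits.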

Step 2: Pull Build JSON and Failed Jobs

Fetch builds/<N>.json, then list failed jobs by non-zero exit_status.
Capture at least:
  • pipeline
  • build number
  • job id
  • job name
  • exit status
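
A minimal sketch of the filtering step, assuming the build JSON has already been fetched and parsed into a dict. The field names used here ("number", "jobs", "id", "name", "exit_status") are assumptions about the JSON layout and should be checked against a real response; failed_jobs is a hypothetical helper name.

```python
def failed_jobs(build: dict, pipeline: str) -> list:
    """Collect the fields worth recording for every job with a non-zero exit status."""
    rows = []
    for job in build.get("jobs", []):
        status = job.get("exit_status")
        # Jobs that never ran, and non-command steps, may carry no exit status.
        if status is None:
            continue
        # The status may arrive as a number or a numeric string; normalize it.
        if int(status) != 0:
            rows.append({
                "pipeline": pipeline,
                "build_number": build.get("number"),
                "job_id": job.get("id"),
                "job_name": job.get("name"),
                "exit_status": int(status),
            })
    return rows
```

Usage would look like failed_jobs(json.load(f), "pull-request") on a saved copy of the build JSON.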

Step 3: Extract the Concrete Failure

Fetch each failed job log and search for high-signal patterns:
  • query result mismatch
  • [Diff] (-expected|+actual)
  • query is expected to fail with error:
  • panic/assertion lines
  • deterministic simulation error markers
  • OOM/timeout/cancellation markers
Stop once you have one concrete failing file/case and mismatch.
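
One way to search a downloaded log for these patterns is sketched below. The three quoted strings come from the list above; the panic, assertion, and OOM/timeout/cancellation regexes are guesses at typical wording and are assumptions, as are the names PATTERNS and first_signal. Deterministic simulation markers are left out because their exact text is not given here.

```python
import re

PATTERNS = [
    ("result_mismatch", re.compile(r"query result mismatch")),
    ("diff", re.compile(r"\[Diff\] \(-expected\|\+actual\)")),
    ("expected_error", re.compile(r"query is expected to fail with error:")),
    # Assumed wording for panic/assertion lines.
    ("panic_or_assert", re.compile(r"panicked at|assertion.*failed", re.IGNORECASE)),
    # Assumed wording for OOM/timeout/cancellation markers.
    ("resource_or_cancel",
     re.compile(r"\bOOM\b|out of memory|timed out|timeout|cancell?ed", re.IGNORECASE)),
]


def first_signal(log_text: str):
    """Return (line_number, label, line) for the first high-signal line, or None."""
    for number, line in enumerate(log_text.splitlines(), start=1):
        for label, pattern in PATTERNS:
            if pattern.search(line):
                return number, label, line.strip()
    return None
```

Returning only the first hit mirrors the "stop once you have one concrete failure" rule; later matches are often cascading noise.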

Step 4: Fall Back to Artifacts

If logs only show wrapper errors (for example, command exited with status), inspect artifacts from the same job, especially:
  • risedev-logs.zip
  • risedev-logs/nodetype-*.log
Extract and search artifact logs for the exact mismatch.
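
The extract-and-search step can be done without unpacking to disk. The sketch below assumes the artifact has already been downloaded as bytes; search_zip_logs is a hypothetical helper, and the ".log" suffix filter is an assumption based on the file names listed above.

```python
import io
import re
import zipfile


def search_zip_logs(zip_bytes: bytes, pattern: str, suffix: str = ".log") -> list:
    """Return (member_name, line_number, line) for each matching line in log members."""
    regex = re.compile(pattern)
    hits = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if not name.endswith(suffix):
                continue
            # Logs may contain stray bytes; replace them rather than fail.
            text = archive.read(name).decode("utf-8", errors="replace")
            for number, line in enumerate(text.splitlines(), start=1):
                if regex.search(line):
                    hits.append((name, number, line.strip()))
    return hits
```

For example, search_zip_logs(data, r"query result mismatch") lists every member and line where the mismatch text appears.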

Step 5: Apply Focused Fixes

Prefer minimal fixes tied to evidence:
  • SQLLogicTest mismatch: update expected sections in the correct .slt/.slt.part file only when query output change is intentional.
  • Wrong runtime behavior: fix source code and keep tests as-is.
  • Flaky/cancellation-only signal (143): treat as infra/cancel unless corroborated by product errors.
Avoid broad "retry and hope" actions without root-cause evidence.

Step 6: Verify and Report

Run the narrowest local check that validates the fix when possible. If full validation is not feasible, state it explicitly.
Always report:
  • failing check/build/job identifiers
  • failing file/test/case
  • exact mismatch/error evidence
  • applied fix (files changed)
  • verification status and remaining risk

Buildkite-Specific Heuristics

  • Exit code 105: often wrapper failure from docker-compose/plugin; inspect SLT/e2e logs for true mismatch.
  • Exit code 4: common in simulation/recovery steps; inspect uploaded simulation logs.
  • Exit code 143: usually cancellation/termination, not a deterministic product regression.
  • raw_log_url may be null in JSON; use explicit job log endpoints by job id.
  • Prefer JSON endpoints plus jq; avoid scraping large HTML pages.
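
The exit-code heuristics above can be kept as a small lookup so that triage notes stay consistent. This is only a sketch: EXIT_CODE_HINTS and exit_code_hint are hypothetical names, and the hint strings simply restate the list above.

```python
EXIT_CODE_HINTS = {
    105: "likely wrapper failure (docker-compose/plugin); inspect SLT/e2e logs for the true mismatch",
    4: "common in simulation/recovery steps; inspect uploaded simulation logs",
    143: "usually cancellation/termination, not a deterministic product regression",
}


def exit_code_hint(code: int) -> str:
    """Map a job exit code to a next-step hint, with a generic fallback."""
    return EXIT_CODE_HINTS.get(code, "no specific heuristic; read the job log directly")
```

The hints are starting points only; a concrete mismatch from the logs or artifacts still takes priority over the exit code.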