Fix Buildkite CI
Overview
Diagnose Buildkite failures programmatically and avoid guessing from UI screenshots. Prefer structured build/job JSON plus artifact inspection to find the exact failing test case and mismatch, then implement the smallest correct fix.
Target Selection
Resolve triage target with this precedence:
- If user provides a Buildkite build URL, use that build directly.
- Else if user specifies a branch and/or a pipeline, use the specified scope.
- Else default to the current git branch and inspect the checks for the PR associated with that branch.
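The precedence above can be sketched as a small resolver. This is a minimal illustration, not part of the original workflow: `TriageTarget`, `resolve_target`, and the `kind` labels are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TriageTarget:
    """Hypothetical container describing what to triage."""
    kind: str  # "build_url", "scoped", or "current_branch"
    build_url: Optional[str] = None
    branch: Optional[str] = None
    pipeline: Optional[str] = None


def resolve_target(
    build_url: Optional[str] = None,
    branch: Optional[str] = None,
    pipeline: Optional[str] = None,
    current_branch: Optional[str] = None,
) -> TriageTarget:
    """Apply the precedence: explicit build URL, then branch/pipeline, then current branch."""
    if build_url:
        # An explicit build URL wins over everything else.
        return TriageTarget(kind="build_url", build_url=build_url)
    if branch or pipeline:
        # Either field alone is enough to define a scope.
        return TriageTarget(kind="scoped", branch=branch, pipeline=pipeline)
    # Fall back to the branch checked out locally; the caller looks up its PR checks.
    return TriageTarget(kind="current_branch", branch=current_branch)
```

Resolving the target once up front keeps the later steps from mixing builds from different scopes.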
Workflow
- Identify the failing Buildkite build(s).
- Retrieve build JSON and list failed jobs.
- Pull job logs and extract the first concrete failure signal.
- Inspect artifacts when top-level logs are truncated.
- Map failure to root cause and apply a focused fix.
- Verify locally where feasible and summarize evidence.
Prefer an authenticated CLI first. If auth is unavailable, use the public Buildkite JSON/log/artifact endpoints.
For exact commands and endpoint patterns, read references/buildkite-ci-triage.md.
Step 1: Identify Failing Buildkite Checks
When no explicit target is given, find the PR for the current branch first, then list its checks to find the failing ones and capture their Buildkite URLs.
If user specifies a branch/pipeline, list builds and filter them using those parameters.
If user provides a Buildkite build URL, skip discovery and start from that build number.
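Starting from a build URL means pulling the organization, pipeline, and build number out of it. The sketch below assumes the usual `https://buildkite.com/<org>/<pipeline>/builds/<number>` URL shape; `BuildRef` and `parse_build_url` are hypothetical helper names, and the organization and pipeline in the usage note are placeholders.

```python
import re
from typing import NamedTuple, Optional

# Assumed URL shape: https://buildkite.com/<org>/<pipeline>/builds/<number>
BUILD_URL_RE = re.compile(
    r"https://buildkite\.com/(?P<org>[^/\s]+)/(?P<pipeline>[^/\s]+)/builds/(?P<number>\d+)"
)


class BuildRef(NamedTuple):
    org: str
    pipeline: str
    number: int


def parse_build_url(text: str) -> Optional[BuildRef]:
    """Return the first Buildkite build reference found in text, or None."""
    match = BUILD_URL_RE.search(text)
    if match is None:
        return None
    return BuildRef(match["org"], match["pipeline"], int(match["number"]))
```

Because the function uses `search`, it also works on a line of check output that merely contains the URL, for example `parse_build_url("details: https://buildkite.com/example-org/example-pipeline/builds/123#job")`.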
Step 2: Pull Build JSON and Failed Jobs
Fetch the build JSON, then list failed jobs by non-zero exit status.
Capture at least:
- pipeline
- build number
- job id
- job name
- exit status
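Once the build JSON is loaded, the fields above can be collected with a short filter. This is a sketch under assumptions: the key names `jobs`, `id`, `name`, `exit_status`, and `number` reflect how Buildkite build JSON is commonly shaped and should be checked against the actual payload; `failed_jobs` is a hypothetical helper.

```python
from typing import Any, Dict, List


def failed_jobs(build: Dict[str, Any], pipeline: str) -> List[Dict[str, Any]]:
    """Collect jobs with a non-zero exit status from an already-parsed build JSON."""
    rows = []
    for job in build.get("jobs", []):  # assumed key: "jobs"
        raw = job.get("exit_status")   # assumed key: "exit_status"
        if raw is None:
            # Jobs that never ran (waits, blocked steps) have no exit status.
            continue
        try:
            code = int(raw)  # tolerate both 1 and "1"
        except (TypeError, ValueError):
            continue
        if code != 0:
            rows.append(
                {
                    "pipeline": pipeline,
                    "build_number": build.get("number"),
                    "job_id": job.get("id"),
                    "job_name": job.get("name"),
                    "exit_status": code,
                }
            )
    return rows
```

Keeping the job id in each row matters later, since logs and artifacts are looked up per job.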
Step 3: Extract the Concrete Failure
Fetch each failed job log and search for high-signal patterns:
- [Diff] (-expected|+actual)
- query is expected to fail with error:
- panic/assertion lines
- deterministic simulation error markers
- OOM/timeout/cancellation markers
Stop once you have one concrete failing file/case and mismatch.
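The pattern search can be sketched as a priority-ordered scan over the log text. The two literal markers come from the list above; the panic and resource regexes are assumptions about typical log wording and will likely need tuning. `SIGNALS` and `first_failure_signal` are hypothetical names.

```python
import re
from typing import Any, Dict, Optional

# Ordered by priority: a concrete diff outranks a generic timeout/cancel marker.
SIGNALS = [
    ("slt_diff", re.compile(re.escape("[Diff] (-expected|+actual)"))),
    ("expected_error", re.compile(re.escape("query is expected to fail with error:"))),
    # Assumed wording for panics and assertions; adjust to the real logs.
    ("panic", re.compile(r"panicked at|assertion.*failed", re.IGNORECASE)),
    # Assumed wording for OOM/timeout/cancellation markers.
    ("resource", re.compile(r"out of memory|\bOOM\b|timed out|timeout|cancell?ed", re.IGNORECASE)),
]


def first_failure_signal(log_text: str, context: int = 5) -> Optional[Dict[str, Any]]:
    """Return the first line matching the highest-priority pattern, plus trailing context."""
    lines = log_text.splitlines()
    for label, pattern in SIGNALS:
        for index, line in enumerate(lines):
            if pattern.search(line):
                return {
                    "label": label,
                    "line_no": index + 1,  # 1-based, as shown in log viewers
                    "excerpt": lines[index : index + 1 + context],
                }
    return None
```

Scanning by pattern priority rather than by line order means a stray timeout setting near the top of a log does not mask a real diff further down.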
Step 4: Fall Back to Artifacts
If logs only show wrapper errors (for example, command exited with status), inspect artifacts from the same job, especially:
- risedev-logs/nodetype-*.log
Extract and search artifact logs for the exact mismatch.
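Searching extracted artifacts is essentially a recursive grep. The sketch below assumes the artifacts have already been downloaded and unpacked under a local directory; `grep_artifacts` and the default `**/*.log` glob are illustrative choices, not part of the original workflow.

```python
from pathlib import Path
from typing import Iterable, List, Tuple


def grep_artifacts(
    root: str, needles: Iterable[str], pattern: str = "**/*.log"
) -> List[Tuple[str, int, str]]:
    """Return (relative path, line number, line) for each line containing any needle."""
    needles = list(needles)
    hits = []
    for path in sorted(Path(root).glob(pattern)):
        if not path.is_file():
            continue
        # errors="replace" keeps the scan going if a log has invalid bytes.
        with path.open(encoding="utf-8", errors="replace") as handle:
            for line_no, line in enumerate(handle, start=1):
                if any(needle in line for needle in needles):
                    rel = path.relative_to(root).as_posix()
                    hits.append((rel, line_no, line.rstrip("\n")))
    return hits
```

A call such as `grep_artifacts("artifacts", ["[Diff]", "panicked at"])` returns every matching line with its file and line number, which is usually enough to name the failing case.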
Step 5: Apply Focused Fixes
Prefer minimal fixes tied to evidence:
- SQLLogicTest mismatch: update the expected sections in the correct test file only when the query output change is intentional.
- Wrong runtime behavior: fix source code and keep tests as-is.
- Flaky/cancellation-only signal: treat as infra/cancel unless corroborated by product errors.
Avoid broad "retry and hope" actions without root-cause evidence.
Step 6: Verify and Report
Run the narrowest local check that validates the fix when possible. If full validation is not feasible, state it explicitly.
Always report:
- failing check/build/job identifiers
- failing file/test/case
- exact mismatch/error evidence
- applied fix (files changed)
- verification status and remaining risk
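One way to keep reports consistent is a small record type that renders the five items above in order. This is only a sketch: `TriageReport`, its field names, and the default strings are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TriageReport:
    """Hypothetical record covering the items every report should include."""
    identifiers: str   # failing check/build/job identifiers
    failing_case: str  # failing file/test/case
    evidence: str      # exact mismatch or error text
    files_changed: List[str] = field(default_factory=list)
    verification: str = "not verified locally"
    remaining_risk: str = "unknown"

    def render(self) -> str:
        """Render the report as plain text, one item per line."""
        changed = ", ".join(self.files_changed) if self.files_changed else "none"
        return "\n".join(
            [
                f"Failing check/build/job: {self.identifiers}",
                f"Failing file/test/case: {self.failing_case}",
                f"Evidence: {self.evidence}",
                f"Applied fix (files changed): {changed}",
                f"Verification: {self.verification}; remaining risk: {self.remaining_risk}",
            ]
        )
```

The defaults are deliberately pessimistic, so a report that skips verification says so explicitly instead of implying success.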
Buildkite-Specific Heuristics
- Exit code : often wrapper failure from docker-compose/plugin; inspect SLT/e2e logs for true mismatch.
- Exit code : common in simulation/recovery steps; inspect uploaded simulation logs.
- Exit code : usually cancellation/termination, not a deterministic product regression.
- may be null in JSON; use explicit job log endpoints by job id.
- Prefer JSON endpoints plus ; avoid scraping large HTML pages.