The smoke gate is binary: pass or fail. Corpus benchmarks captured over time form a curve, and curves catch regressions the gate misses. A win rate that slowly creeps from 100% to 85% is "still passing" by smoke, but it is a real degradation.
This skill reads every persisted run in `docs/benchmarks/runs/*.json` and reports first→last deltas plus a per-run series, flagging regressions in win rate or latency.
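To make the first→last delta concrete, here is a minimal sketch of the comparison. It is not the code in `trend.mjs`. The run shape (`timestamp`, `winRate`, `avgLatencyMs`), the function name `firstLastDelta`, and the sample values are all assumptions made up for illustration; the real run files may use different field names.

```javascript
// Hypothetical run shape: { timestamp, winRate, avgLatencyMs }.
// The actual JSON files under docs/benchmarks/runs/ may differ.
function firstLastDelta(runs, field) {
  if (runs.length === 0) return null;
  // Sort oldest to newest so "first" and "last" are well defined.
  // ISO-8601 timestamps in the same format sort correctly as strings.
  const sorted = [...runs].sort((a, b) =>
    a.timestamp < b.timestamp ? -1 : a.timestamp > b.timestamp ? 1 : 0
  );
  const first = sorted[0][field];
  const last = sorted[sorted.length - 1][field];
  return {
    first,
    last,
    delta: last - first,
    // Ratio is undefined when the first value is zero.
    ratio: first === 0 ? null : last / first,
  };
}

// Illustrative values only, deliberately out of order.
const sampleRuns = [
  { timestamp: "2025-01-03T00:00:00Z", winRate: 0.85, avgLatencyMs: 180 },
  { timestamp: "2025-01-01T00:00:00Z", winRate: 1.0, avgLatencyMs: 100 },
  { timestamp: "2025-01-02T00:00:00Z", winRate: 0.95, avgLatencyMs: 120 },
];

console.log(firstLastDelta(sampleRuns, "winRate"));
console.log(firstLastDelta(sampleRuns, "avgLatencyMs"));
```

Sorting before comparing matters because files returned by a directory listing are not guaranteed to be in chronological order.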
- Run the trend script from the project root:

  ```bash
  node plugins/ruflo-cost-tracker/scripts/trend.mjs
  ```

Optional env:
- — emit JSON instead of markdown
- — consider only the most recent N runs
- Inspect the drift summary: first vs last on win rate, avg latency, p99, escalation rate, and speedup vs Gemini.
- Inspect the per-run series: one row per run, including Sonnet 4.6 and Opus 4.7 baseline latencies if those baselines were enabled at run time.
- Regression flags: the script emits callouts when:
  - Win rate dropped between the first and last run
  - Avg latency rose to ≥ 1.5× its first-run value
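
The two flag conditions above can be sketched as a small predicate. This is an illustration under assumptions, not the script's actual implementation: the field names `winRate` and `avgLatencyMs`, the function name `regressionFlags`, and the message strings are hypothetical. Only the two conditions and the 1.5× threshold come from the text.

```javascript
// Threshold from the text: avg latency at or above 1.5x the first run.
const LATENCY_RATIO_LIMIT = 1.5;

// `firstRun` and `lastRun` use hypothetical fields { winRate, avgLatencyMs }.
function regressionFlags(firstRun, lastRun) {
  const flags = [];
  // Any drop in win rate between first and last run is flagged.
  if (lastRun.winRate < firstRun.winRate) {
    flags.push("win rate dropped");
  }
  // Guard against a zero baseline before taking the ratio.
  if (
    firstRun.avgLatencyMs > 0 &&
    lastRun.avgLatencyMs / firstRun.avgLatencyMs >= LATENCY_RATIO_LIMIT
  ) {
    flags.push("avg latency rose to 1.5x or more");
  }
  return flags;
}

// Illustrative values only.
console.log(
  regressionFlags(
    { winRate: 1.0, avgLatencyMs: 100 },
    { winRate: 0.85, avgLatencyMs: 150 }
  )
);
```

Note that this compares only the endpoints, as the text describes, so a dip that recovers by the last run would not be flagged; the per-run series is where such a dip would show up.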