/cheat-bump — Rubric / Bucket Upgrade
Two modes:
| Mode | Trigger | What it does | Validation Strength |
|---|---|---|---|
| Full rubric bump | --propose "<new formula>" | Modify formula / dimensions / weights | Mandatory 5-step process + cross-model audit |
| Bucket-only recalibration | --bucket-only | Only re-derive bucket boundaries | Automatic data derivation, no audit |
Full rubric bump strictly follows the 5 steps in shared-references/bump-validation-protocol.md. Bucket-only follows the lightweight path — see Phase B below.
Overview
Entry: User triggers /cheat-bump
↓
[Phase A0: Detect Call Mode]
↓
├─ --bucket-only → [Phase B: Lightweight Bucket Recalibration]
└─ --propose → [Phase 0~8: Full Rubric Bump]
Phase A0: Call Mode Diversion (Do First)
Read user parameters:
- Contains --bucket-only → proceed to Phase B (lightweight recalibration)
- Contains --propose → proceed to Phase 0~8 (full rubric bump)
- Neither → Ask user: "What do you want to do? 1) Adjust rubric formula / add/remove dimensions → --propose; 2) Only re-derive bucket boundaries → --bucket-only"
If user says "I think ER is too low and want to adjust" → it's the --propose path.
If user says "My account has grown, buckets are no longer accurate" → it's the --bucket-only path.
The two paths cannot be mixed — only one type of operation per action.
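The diversion rule above can be sketched as a small function. This is only an illustration, assuming the user parameters arrive as a list of argument strings; `detect_mode` is a hypothetical name, not part of the skill.

```python
from typing import Optional


def detect_mode(args: list[str]) -> Optional[str]:
    """Return 'bucket-only', 'propose', or None when the user must be asked.

    Hypothetical helper: assumes the flags appear verbatim in `args`.
    """
    has_bucket = "--bucket-only" in args
    has_propose = "--propose" in args
    # The two paths cannot be mixed: only one type of operation per action.
    if has_bucket and has_propose:
        raise ValueError("--propose and --bucket-only cannot be mixed")
    if has_bucket:
        return "bucket-only"  # Phase B
    if has_propose:
        return "propose"      # full rubric bump
    return None               # neither flag: ask the user which path they want
```

Returning `None` rather than guessing keeps the "ask the user" branch explicit.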
Full Rubric Bump Workflow
[User: upgrade rubric --propose "ER×1.5→2.0, remove NA, add MS"]
↓
[Phase 0: Pre-threshold Check]
↓
[Phase 1: Write Complete New Formula Equation]
↓
[Phase 2: Full Re-scoring of Calibration Pool]
↓
[Phase 3: Calculate Ranking Consistency]
↓
[Phase 4: Mandatory Cross-Model Independent Audit]
↓
[Phase 5: Implementation + Cleanup Pass]
↓
[Phase 6: Append Re-scored Line to Bottom of All Calibration Sample Prediction Files]
Constants
- READINESS_HEURISTIC: a judgment call by Claude, not a hard threshold
  - Default reference: calibration pool ≥5 samples + at least 1 cross-sample observation supported by ≥3 samples
  - Claude can still propose a bump with few samples if observation signals are exceptionally strong:
    - N=3, but a strong counterexample completely overturns current rubric assumptions (e.g., composite 8.5 vs actual performance 50k, a ≥3x deviation)
    - A single post shows an extreme phenomenon (e.g., a single meme with ≥2000 likes in the comment section)
  - Claude can also reject a bump despite sufficient samples if evidence is weak:
    - N=10, but observations are low-confidence fragmented patterns with no clear direction
    - User reviews contain numerous non-serious judgments like "just glanced at it"
  - The prediction header or cheat-bump output must state whether this proposal is default-aligned or judgment-driven, giving users a basis for review
- THRESHOLD = 0.8 — Consistency threshold between new ranking and actual performance ranking (4/5). This is hard-coded — statistical rigidity for bump validation
- CROSS_MODEL_AUDIT = true — Call external LLM for independent audit. false is only used for offline scenarios
- REQUIRE_CONFIRM = true — Require explicit user confirmation "yes, bump" before implementation
Inputs
| Required | Source |
|---|---|
| --propose text | User parameters; ask if missing |
| | User project root |
| All files | Calibration pool data |
| | State file |
Workflow
Phase 0: Pre-threshold Check
Check item by item according to the "When to Prohibit" section in bump-validation-protocol.md:
| Check | Failure Handling |
|---|---|
| Total calibration pool samples vs observation strength | Claude's judgment, following READINESS_HEURISTIC: default ≥5 samples, but exceptions are allowed (strong counterexamples / strong memes). If the default is not met, Claude must explicitly explain why a bump is still proposed ("Although only N=3 samples, entry X has composite Y vs actual performance Z, a W-fold deviation"), allowing user review |
| Number of new calibrations since last bump vs observation maturity | Claude's judgment: the default suggests ≥3 new samples, but if 3 consecutive samples all provide strong evidence pointing in the same direction → no need to wait |
| in_progress_session == null | Reject: "You have an in-progress prediction that is not completed. Finish that process first or clear the state" |
| Trigger conditions are met (systematic deviation / new cross-sample observations / sufficient evidence for new dimensions) | Warn but do not block; ask user why they want to bump now |
Pass → proceed to Phase 1.
Phase 1: Write Complete New Formula Equation
Do not only accept brief user descriptions. Expand it into a complete equation:
Current: v2 composite = (ER×1.5 + SR×1.5 + HP×1.5 + QL + NA + AB + SAT) / 8.5 × 2.0
Proposed: v2.1 composite = (ER×2.0 + HP×1.5 + MS×1.5 + QL + SR + TS + SAT) / 9.0 × 2.0
Summary of changes:
- ER ×1.5 → ×2.0 (increased)
- SR ×1.5 → ×1.0 (decreased)
- Added MS ×1.5 (Memetic Shareability)
- Added TS ×1.0 (Topic Shareability)
- Removed NA (overlaps with HP)
- Removed AB (replaced by TS)
- Normalization constant 8.5 → 9.0
- Total number of formula dimensions: 7 → 7 (net change 0)
If user's proposal is vague (e.g., "Increase ER weight a bit") → ask for specific values, do not guess.
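To make the expansion concrete, here is a minimal sketch of how the two equations above could be represented and checked. The weight tables come straight from the formulas; the names `V2`, `V2_1`, `normalizer` and `composite` are hypothetical, and the 0-5 range for dimension scores is an assumption (it is what makes the composite top out at 10).

```python
# Dimension weights taken from the v2 and v2.1 equations above.
V2 = {"ER": 1.5, "SR": 1.5, "HP": 1.5, "QL": 1.0, "NA": 1.0, "AB": 1.0, "SAT": 1.0}
V2_1 = {"ER": 2.0, "HP": 1.5, "MS": 1.5, "QL": 1.0, "SR": 1.0, "TS": 1.0, "SAT": 1.0}


def normalizer(weights: dict[str, float]) -> float:
    """The normalization constant is the sum of the weights (8.5 for v2, 9.0 for v2.1)."""
    return sum(weights.values())


def composite(scores: dict[str, float], weights: dict[str, float], scale: float = 2.0) -> float:
    """Weighted sum of dimension scores, normalized, then scaled by 2.0."""
    missing = set(weights) - set(scores)
    if missing:
        # Mirrors the rule "ask for specific values, do not guess".
        raise KeyError(f"missing dimension scores: {sorted(missing)}")
    weighted = sum(scores[dim] * w for dim, w in weights.items())
    return weighted / normalizer(weights) * scale


# The summary of changes can be derived instead of written by hand.
removed = set(V2) - set(V2_1)   # dimensions dropped by the proposal
added = set(V2_1) - set(V2)     # dimensions introduced by the proposal
```

Deriving `removed` and `added` from the weight tables is one way to keep the written summary of changes in sync with the equation itself.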
Phase 2: Full Re-scoring of Calibration Pool (Mandatory Blind Sub-agent)
Glob all prediction files with complete review sections → calibration pool.
Bump is the highest-risk action of the tool — all re-scoring must go through cheat-score-blind sub-agent. Inline re-scoring means the main Claude has already seen actual performance, making rank consistency overfitting rather than real signals.
Mandatory Constraints
- No self-scored fallback accepted: a self-scored flag exists elsewhere, but /cheat-bump does not have one. If Task tool is unavailable → abort bump, report to user "Resolve Task tool issues first before bumping"
- No "I only recalculate composite without re-scoring dimensions" — even if the new formula only adjusts weights without adding dimensions, all dimensions of each prediction must be re-reviewed by the sub-agent. Reason: Old dimension scores may already be contaminated; weight changes cannot guarantee old dimensions are still valid
For Each Prediction:
- Parse the prediction file to get the corresponding script path (from the Script Path header field)
- Verify that the script file exists and that its hash matches the one recorded in the header; if not → warn (script has been modified) but still spawn sub-agent
- Spawn cheat-score-blind sub-agent via Task tool:
Spawn cheat-score-blind sub-agent.
Input:
script_path: <Script Path from prediction header>
rubric_notes_path: rubric_notes.md
sidecar_path: .cheat-cache/bump-rescores/<prediction-id>.json
Task: Score the script according to the current formula in rubric_notes (already updated to new version vN+1).
Return strict JSON. Write to sidecar file for batch reading by the main bump process.
Do not read state file / predictions/ / videos/ or any other files.
Do not ask user — you have no user.
Do not read this prediction file itself — only look at script + rubric.
- Wait for sub-agent to complete → read sidecar JSON → main process calculates composite using new formula
- Write the "re-score table" to .cheat-cache/bump-rescores.json (summary). Mark each entry with blind: true; in Phase 6, this field is written along with the new score to the Re-scored line in the prediction file
Honest Labeling of Contamination
Even with sub-agent, two types of residual contamination must be honestly labeled in the bump report:
| Type | Source | Label Field |
|---|---|---|
| Model prior contamination | Sub-agent is still Claude, sharing RLHF | model_prior_warning: true (default true, cannot be turned off) |
| User's own rubric design bias | rubric_notes.md is written by user, naturally fitting their own content | rubric_self_designed: true (default true, cannot be turned off) |
These two remind users that channel C (cross-model audit) is indispensable. The end of the bump report must state: "The above rank consistency is within channel A. Final decision must wait for channel C audit approval."
Failure Modes
| Symptom | Handling |
|---|---|
| Script file for a prediction is missing | Sub-agent skips this entry, main process summarizes "N entries excluded due to missing script". If remaining valid pool < MIN_SAMPLES → abort bump |
| Sub-agent call fails | Resend Task up to 3 times; if it still fails → mark this entry as failed and exclude it from the calibration pool |
| Task tool is completely unavailable | Abort bump, prompt user "Task tool is a hard dependency for bump. If in offline environment, run /cheat-bump --bucket-only to use lightweight branch" |
| Sub-agent output contains contamination_signal | Flag the entry as suspicious but do not exclude it; list these suspicious entries at the end of the bump report for user review |
Phase 3: Calculate Ranking Consistency
For each sample:
new_composite_rank: Rank sorted by new formula
actual_plays_rank: Rank sorted by actual plays
delta: |new_rank - actual_rank|
Output comparison table:
| Sample | composite (v2) | composite (v2.1) | rank (new) | actual | rank (actual) | delta |
|---|---|---|---|---|---|---|
| Hamster | 9.41 | 9.55 | 1 | 1.248M | 1 | 0 |
| Stop Expecting | 8.24 | 9.11 | 2 | 711K | 2 | 0 |
| Boss Nonsense | 7.65 | 8.11 | 4 | 396K | 3 | 1 |
| Job Hunting Paradox | 8.47 | 7.56 | 5 | 168K | 4 | 1 |
| Who Asked You | 8.24 | 7.00 | 6 | 117K | 5 | 1 |
Ranking consistency: 4/5 with |delta| ≤ 1
Pairwise no-regression: All pairs correctly ranked by old formula are not reversed under new formula ✓
Judgment:
- Ranking consistency < THRESHOLD (0.8) → Local rejection: report the failure explicitly and stop; do not proceed to Phase 4
- Pairwise regression occurs → Local rejection
THRESHOLD is hard-coded in the protocol. Temporarily lowering it is not allowed (that is itself a meta-decision requiring its own bump).
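The two local checks can be sketched as follows. This is an illustration under stated assumptions, not the protocol's reference implementation: consistency is taken to be the share of samples with |delta| ≤ 1, ties are broken by input order, and the sample values are toy numbers invented for the example.

```python
from itertools import combinations

THRESHOLD = 0.8  # hard-coded, as in the Constants section


def ranks(values: list[float]) -> list[int]:
    """Rank 1 = largest value. Ties are broken by input order (an assumption)."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def ranking_consistency(new_composites, actual_plays, tolerance=1) -> float:
    """Share of samples whose new rank is within `tolerance` of the actual rank."""
    new_r, act_r = ranks(new_composites), ranks(actual_plays)
    hits = sum(1 for n, a in zip(new_r, act_r) if abs(n - a) <= tolerance)
    return hits / len(new_r)


def pairwise_regressions(old, new, actual) -> list[tuple[int, int]]:
    """Pairs ordered correctly by the old formula but reversed by the new one."""
    bad = []
    for i, j in combinations(range(len(actual)), 2):
        truth = actual[i] - actual[j]
        old_correct = (old[i] - old[j]) * truth > 0
        new_reversed = (new[i] - new[j]) * truth < 0
        if old_correct and new_reversed:
            bad.append((i, j))
    return bad


# Toy data (hypothetical): the new formula swaps samples 1 and 2.
old = [9.0, 8.0, 7.0, 6.0, 5.0]
new = [9.5, 7.0, 8.0, 6.5, 5.5]
actual = [100, 80, 60, 40, 20]

consistency = ranking_consistency(new, actual)
regressions = pairwise_regressions(old, new, actual)
local_pass = consistency >= THRESHOLD and not regressions
```

In this toy case every rank stays within one position, yet the swapped pair is still caught by the pairwise check, which is why both conditions are evaluated.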
Phase 4: Mandatory Cross-Model Independent Audit (Mandatory, except escape hatch)
prompt:
You are an independent reviewer. Below is a rubric formula that a content creator is preparing to upgrade.
Please independently judge two things:
1. Ranking consistency: Is the ranking of samples by the new formula consistent with the ranking of actual performance in ≥80% of samples?
2. Explanatory power: Does the new formula better explain the actual performance distribution of the calibration pool compared to the old formula?
Data:
Old formula: (ER×1.5 + SR×1.5 + HP×1.5 + QL + NA + AB + SAT) / 8.5 × 2.0
New formula: (ER×2.0 + HP×1.5 + MS×1.5 + QL + SR + TS + SAT) / 9.0 × 2.0
Calibration pool:
[Full JSON of re-score table from Phase 2]
Ranking comparison:
[Full JSON of table from Phase 3]
Output format:
- Judgment: PASS or REJECT
- Reason: ≥100 words
- Key risks: [List potential issues of new formula if any]
Receive external LLM response → parse judgment.
Judgment logic:
- Local PASS + External PASS → Pass, proceed to Phase 5
- Local PASS + External REJECT → Treat as REJECT. Conflict means at least one party's interpretation is unstable
- Local REJECT → Already terminated in Phase 3
- mcp__llm-chat__chat unavailable → Gracefully degrade to self-audit and mark last_bump_self_audited: true in the state file
  - Only rely on local judgment
  - The mark persists in the state file; cheat-status keeps prompting the user "This bump was self-audited, it is recommended to configure mcp__llm-chat__chat"
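The judgment logic above is small enough to sketch as a pure function. The name `audit_outcome` and the returned dictionary shape are hypothetical; `None` stands in for "external auditor unavailable".

```python
from typing import Optional


def audit_outcome(local_pass: bool, external_verdict: Optional[str]) -> dict:
    """Combine the local (Phase 3) result with the external audit verdict.

    external_verdict is "PASS", "REJECT", or None when the external
    auditor cannot be reached (hypothetical encoding).
    """
    if not local_pass:
        # Already terminated in Phase 3; the external audit is never called.
        return {"proceed": False, "self_audited": False, "reason": "local reject"}
    if external_verdict is None:
        # Graceful degradation: rely on local judgment only and record it.
        return {"proceed": True, "self_audited": True, "reason": "external audit unavailable"}
    if external_verdict == "PASS":
        return {"proceed": True, "self_audited": False, "reason": "local and external pass"}
    # Any disagreement is treated as a rejection.
    return {"proceed": False, "self_audited": False, "reason": "external reject"}
```

The `self_audited` flag corresponds to `last_bump_self_audited` in the state file.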
Phase 5: Implementation + Cleanup Pass
After passing audit, REQUIRE_CONFIRM=true → Ask user: "New formula passed local and external audits. Final confirmation: Execute bump implementation? This will modify rubric_notes.md + rubric-memo.md and delete several absorbed observations. Only execute if you answer 'yes, bump'."
After user confirmation:
5a. Update rubric_notes.md (Only use general language, no video names / actual performance data)
- Update top metadata:
**Current Version**: vN+1
**Last bumped at**: <ISO 8601>
**Upgrade memos**: See [rubric-memo.md](rubric-memo.md)
(pointer, do not copy memo content)
- Add a line to version quick reference table (only include version number + formula signature, no evidence samples)
- Update "Current Scoring Dimensions" section (remove NA / AB, add MS / TS)
- If new dimensions need an anchor explanation in the derived evidence section → use general language:
- ✅ Allowed: "Derived evidence: High abstract density samples → CC=1 → Low reach"
- ❌ Prohibited: "Derived evidence: 'Stop Expecting' CC=1 → Actual performance 137K" (video name + actual performance number)
- If prohibited pattern is hit → move this section to "Derived Evidence" sub-section in rubric-memo.md, replace with general language in place
5b. Write Memo to rubric-memo.md (Append mode, do not overwrite history)
Append a memo section to the end of the file according to Step 5 in bump-validation-protocol.md + templates/rubric-memo.template.md:
- Trigger observation (include real observation ID)
- Evidence data (Full re-score table + ranking comparison of calibration pool, include real video names + actual performance)
- Derived evidence (Include real sample names + actual performance)
- Diagnosis
- New formula
- Cross-model audit conclusion reference (include model name + judgment + reason excerpt)
- Known limitations
Never overwrite existing content in rubric-memo.md — bump memos accumulate in chronological order.
5c. Cleanup Pass (According to "Mandatory Cleanup Pass Timing" in observation-lifecycle.md)
Execute within rubric_notes.md (do not modify rubric-memo.md):
- Observations that have been absorbed into new dimensions → Delete (e.g., Observation E absorbed into MS → delete Observation E)
- Observations overturned by new data → Delete
- Unresolved observations → Move to "Pending Validation Hypotheses" section of new version
- Validated "rules" → Move to "Rule Precipitation Area"
5d. Organize + Self-Check
- Re-read the entire rubric_notes.md to ensure readers can understand the current rules within 60 seconds → trigger additional cleanup if it exceeds 600 lines
- Self-check leak guard: run
grep -E '[0-9]+[[:space:]]*[wWmMkK]|plays|actual performance|actual'
on rubric_notes.md → if any hits → abort bump + rollback, and prompt the user "rubric_notes.md contains prohibited content (actual performance / play counts)". This content belongs in rubric-memo.md, not rubric_notes.md
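For environments where shelling out to grep is inconvenient, the same leak guard could be sketched in Python. This is an assumption-laden illustration: `find_leaks` is a hypothetical helper, and the pattern mirrors the grep expression above using Python's `re` syntax, where `\d` and `\s` are supported directly.

```python
import re

# Digits followed by a magnitude suffix (w/m/k), or the words that signal
# real performance data. Case-sensitive, like the grep expression.
LEAK_PATTERN = re.compile(r"\d+\s*[wWmMkK]|plays|actual performance|actual")


def find_leaks(text: str) -> list[tuple[int, str]]:
    """Return (line_number, line) for every line that trips the leak guard."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if LEAK_PATTERN.search(line)
    ]


allowed = "Derived evidence: High abstract density samples -> CC=1 -> Low reach"
prohibited = "Derived evidence: 'Stop Expecting' CC=1 -> Actual performance 137K"
```

A non-empty result from `find_leaks` on the contents of rubric_notes.md would correspond to the "abort bump + rollback" branch. Note that the pattern is deliberately broad, so a phrase such as "5 weeks" would also be flagged.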
Phase 6: Batch Update of Calibration Samples
For each calibration sample's prediction file, append to the bottom (do not modify prediction section or review section):
---
**Re-scored under v2.1 on 2026-05-04**: composite=8.24 → 9.11 (blind: true)
(Full re-calculated during rubric bump, independently scored by cheat-score-blind sub-agent; see v2 → v2.1 upgrade memo in rubric-memo.md)
The blind field is required: it tells future readers of this record "This is channel B isolated scoring, not self-scored by main Claude". If a prediction was excluded in Phase 2 due to sub-agent failure → no Re-scored line is added (keep as is).
Use Edit tool to match the end of each file.
Phase 7: Update State File
{
"rubric_version": "v2.1",
"last_bump_at": "<ISO timestamp>",
"last_bump_self_audited": false,
"consecutive_directional_errors": [],
"calibration_samples_at_last_bump": <current value>
}
Clear consecutive_directional_errors: the new rubric starts counting again.
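A minimal sketch of the state update, written as a pure function over an already-loaded dictionary so that no state file name has to be assumed. `apply_bump_to_state` is a hypothetical name, and `pool_size` is an assumption about what the `<current value>` placeholder above refers to.

```python
from datetime import datetime, timezone


def apply_bump_to_state(state: dict, new_version: str, self_audited: bool, pool_size: int) -> dict:
    """Return a copy of the state with the bump fields updated."""
    updated = dict(state)  # leave the caller's dictionary untouched
    updated.update(
        {
            "rubric_version": new_version,
            "last_bump_at": datetime.now(timezone.utc).isoformat(),
            "last_bump_self_audited": self_audited,
            # The new rubric starts counting directional errors from scratch.
            "consecutive_directional_errors": [],
            # Assumption: the current number of calibration samples.
            "calibration_samples_at_last_bump": pool_size,
        }
    )
    return updated
```

Fields not related to the bump are carried over unchanged.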
Phase 8: Console Report
✅ Rubric upgraded from v2 → v2.1
Changes:
- ER ×1.5 → ×2.0
- SR ×1.5 → ×1.0
- Added MS / TS
- Removed NA / AB
Calibration pool re-scoring: 5/5 passed ranking check (4/5 consistent + 0 pairwise regression)
Cross-model audit: ✅ PASS
Cleanup pass: Deleted Observations D and E (absorbed into QL redefinition and MS dimension)
Scoring will use v2.1 formula starting from next prediction.
All historical prediction files have been appended with Re-scored marks.
Phase B: Bucket-Only Recalibration (Lightweight Branch)
/cheat-bump --bucket-only [--scheme ratio|absolute|percentile]
Essential difference from full bump: Bucket boundaries are not part of the rules, they are data-derived quantities. Re-deriving them does not require cross-model audit — the derivation algorithm is deterministic with no "judgment" component.
B1: Select Algorithm (Automatically Derived Based on Available Sample Count, Scheme Not Stored in State)
| Algorithm | Applicable | Boundary Derivation Method |
|---|---|---|
| ratio (default for N=1-4) | Small sample size | Median of last 1 / last 3 samples × {0.3 / 1 / 3 / 10 / 30} |
| absolute (default for N=5-9) | Medium sample size | Median of entire calibration pool × {0.3 / 1 / 3 / 10 / 30}, fixed boundaries |
| percentile (default for N≥10) | Large sample size | Actual performance percentiles of calibration pool {30 / 60 / 85 / 95 / 100} |
The --scheme parameter allows users to explicitly override the default:
- --scheme ratio forces use of ratio (even if N≥5)
- --scheme absolute forces use of absolute
- --scheme percentile forces use of percentile (requires N≥3, otherwise error)
If --scheme is not specified → automatically derived according to the table above.
The old design had a state field for the scheme, removed in v1.1. All skills derive the algorithm in real time based on calibration_samples, so there is no need to persist "which one is currently used". This avoids state inconsistency issues like "forgot to sync after switching scheme".
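The selection rule can be expressed as a stateless function, which is the point of deriving the scheme in real time. A minimal sketch, with `select_scheme` as a hypothetical name:

```python
from typing import Optional

SCHEMES = ("ratio", "absolute", "percentile")


def select_scheme(n_samples: int, override: Optional[str] = None) -> str:
    """Pick the bucket algorithm from the sample count, unless overridden."""
    if n_samples < 1:
        raise ValueError("at least one calibration sample is required")
    if override is not None:
        if override not in SCHEMES:
            raise ValueError(f"unknown scheme: {override}")
        if override == "percentile" and n_samples < 3:
            raise ValueError("percentile requires N >= 3")
        return override
    # Defaults from the table: N=1-4 ratio, N=5-9 absolute, N>=10 percentile.
    if n_samples <= 4:
        return "ratio"
    if n_samples <= 9:
        return "absolute"
    return "percentile"
```

Because the result depends only on its arguments, there is nothing to keep in sync after a scheme switch.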
B2: Derive New Boundaries
Read all samples with recorded actual_plays from the calibration pool.
Ratio Mode:
baseline = median(last 3 actual_plays)
buckets = {
"Decline": (-inf, baseline * 0.3),
"Stable": (baseline * 0.3, baseline * 1),
"Hit": (baseline * 1, baseline * 3),
"Small Viral": (baseline * 3, baseline * 10),
"Big Viral": (baseline * 10, +inf),
}
Absolute Mode:
baseline = median(all calibration pool actual_plays)
buckets = {
"Bottom": (-inf, baseline * 0.3),
"Base Audience": (baseline * 0.3, baseline * 1),
"Hit": (baseline * 1, baseline * 3),
"Viral": (baseline * 3, baseline * 10),
"Phenomenal": (baseline * 10, +inf),
}
Percentile Mode:
sorted_plays = sorted(all calibration pool actual_plays)
buckets = {
"Bottom": ≤ p30,
"Base Audience": p30 - p60,
"Hit": p60 - p85,
"Small Viral": p85 - p95,
"Big Viral": ≥ p95,
}
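The three derivations above can be sketched in runnable form. Several details are assumptions rather than part of the source: samples are taken to be in chronological order (most recent last), percentiles use the nearest-rank method since no interpolation rule is given, and a value exactly on a boundary falls into the upper bucket. All function names are hypothetical.

```python
from bisect import bisect_right
from statistics import median

MULTIPLIERS = (0.3, 1, 3, 10)  # the four cut points between five buckets
ABSOLUTE_NAMES = ("Bottom", "Base Audience", "Hit", "Viral", "Phenomenal")


def ratio_boundaries(actual_plays: list[float]) -> list[float]:
    """Baseline = median of the last 3 samples (fewer if the pool is smaller)."""
    baseline = median(actual_plays[-3:])
    return [baseline * m for m in MULTIPLIERS]


def absolute_boundaries(actual_plays: list[float]) -> list[float]:
    """Baseline = median of the entire calibration pool."""
    baseline = median(actual_plays)
    return [baseline * m for m in MULTIPLIERS]


def percentile_boundaries(actual_plays: list[float], percentiles=(30, 60, 85, 95)) -> list[float]:
    """Nearest-rank percentiles (an assumed method), using integer arithmetic."""
    ordered = sorted(actual_plays)
    n = len(ordered)
    # (p * n + 99) // 100 is ceil(p * n / 100) without floating-point error.
    return [ordered[max(0, (p * n + 99) // 100 - 1)] for p in percentiles]


def bucket_name(value: float, boundaries: list[float], names=ABSOLUTE_NAMES) -> str:
    """Map a play count to its bucket; boundary values go to the upper bucket."""
    return names[bisect_right(boundaries, value)]


# The five performances from the B3 report example.
pool = [15_000, 38_000, 42_000, 56_000, 180_000]
absolute = absolute_boundaries(pool)
```

With this pool the absolute boundaries come out at roughly 12.6K, 42K, 126K and 420K, matching the B3 report.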
B3: Report Changes + User Confirmation
Current bucket scheme: ratio
Proposed scheme: absolute
Baseline: 42K median (based on 5 calibration samples)
New boundaries:
- Bottom: < 12.6K
- Base Audience: 12.6K - 42K
- Hit: 42K - 126K
- Viral: 126K - 420K
- Phenomenal: > 420K
Derivation explanation:
- 5 actual performances: 15K / 38K / 42K / 56K / 180K
- Median is 42K, new buckets derived by ×{0.3, 1, 3, 10}
Confirm application? (yes / no)
B4: Implementation
After user confirmation:
- Edit the "Bucket Scheme" section, replacing it with the new table
- Update the state (the bucket scheme itself is not persisted; it is derived in real time during the next cheat-predict)
- Append a change record to the top of the bucket section:
v2 buckets recalibrated on YYYY-MM-DD: scheme=absolute, baseline=42K (based on N=5 samples)
- Do not modify any prediction files — bucket tags in historical predictions remain as they are (judgments made under the scheme at the time of writing the sample)
B5: Impact on Future Predictions
Starting from the next cheat-predict, new buckets will be derived. Bucket tags in historical prediction files will not be recalculated: buckets are semantic judgments made at prediction time, and post-hoc rewriting would destroy blindness.
What Phase B Does Not Do
- No re-calculation of composite (formula remains unchanged)
- No re-review of observation sections (rubric remains unchanged)
- No cross-model audit (deterministic derivation requires no judgment)
- No strict sample count threshold (judged by Claude according to READINESS_HEURISTIC; ratio mode can run with N=1)
Key Rules
- 5 steps cannot be skipped (only for full rubric bump). Reject any request to "run a simplified version first"
- THRESHOLD is hard-coded (only for full rubric bump). Dynamic adjustment is not allowed
- Cross-model audit is default (only for full rubric bump). Turning off audit requires explicit marking in state file
- Cleanup pass is part of bump (only for full rubric bump). Bump cannot be completed without cleaning observation sections
- REQUIRE_CONFIRM (both modes). Must get explicit user confirmation "yes, bump" or "yes, recalibrate" before final implementation
- Bucket recalibration does not modify historical predictions. Buckets are prediction-time semantics, post-hoc rewriting destroys blindness
Refusals
- "Skip calibration pool re-scoring, directly change formula" → Reject. Rule #2
- "Skip cheat-score-blind sub-agent, main Claude can re-score directly" → Reject. Bump does not accept any self-scored fallback — if sub-agent is unavailable → abort bump, do not accept "self-audit"
- "Skip external LLM audit" → Only allowed if CROSS_MODEL_AUDIT = false is explicitly set
- "Adjust THRESHOLD to 3/5 this time to let it pass" → Reject. Changing THRESHOLD is a meta-level bump
- "Keep all old observations as history" → Violates Rule #3
- "Bump first, do cleanup next time" → Reject. Cleanup is part of bump
- "Only recalculate composite without re-scoring dimensions" → Reject. New weights × old dimensions are still old contamination. Each dimension must be re-reviewed by sub-agent
- "Write full memo into top of rubric_notes.md for easy reading" → Reject. rubric_notes.md is whitelisted for blind sub-agent — containing video names / actual performance → leaks through whitelist. Memo is written in rubric-memo.md (outside whitelist), rubric_notes.md only contains formula + general language dimension definitions + pointer
- "Keep real video names in derived evidence section to make rubric more specific" → Reject. Must use general language in rubric_notes.md ("high abstract density samples"); derived evidence with video names is written in rubric-memo.md
Integration
- Upstream: the skill that detects ≥3 same-direction deviations proposes running /cheat-bump
- Dependencies: mcp__llm-chat__chat (if configured) + Task tool (spawn cheat-score-blind)
- Modifications:
  - rubric_notes.md (structural update, never write real video names / actual performance)
  - rubric-memo.md (new: append full memo, including evidence + derived evidence)
  - All calibration sample prediction files (append Re-scored line, do not modify prediction section)
- Downstream: Next cheat-predict automatically uses new rubric_version for scoring