systematic-debugging

Use this skill when encountering any error, test failure, or unexpected behavior, before proposing a fix.

NPX Install

npx skill4agent add zhucl1006/ailesuperpowers systematic-debugging

SKILL.md Content

Systematic Debugging

Overview

Random fixes waste time and introduce new bugs. Quick patches mask the root problem.
Core Principle: Always find the root cause before attempting a fix. Symptom-based fixes fail.
Shortcuts that skip this process defeat the purpose of debugging.

Iron Rule

NO FIXES WITHOUT ROOT CAUSE INVESTIGATION FIRST
You cannot propose a fix if you haven't completed the first phase.

When to Use

For any technical issue:
  • Test failures
  • Production errors
  • Unexpected behavior
  • Performance issues
  • Build failures
  • Integration problems
Use this especially when:
  • Under time pressure (urgency makes it easy to guess)
  • "Just a quick fix" seems obvious
  • You've already tried multiple fixes
  • Previous fixes didn't work
  • You don't fully understand the problem
Do NOT skip this when:
  • The problem seems simple (simple bugs also have root causes)
  • You're in a hurry (hurry guarantees rework)
  • Managers want an immediate fix (systematic approach is faster than chaos)

Four Phases

You must complete each phase before moving to the next.

Phase 1: Root Cause Investigation

Before attempting any fix:
  1. Read error messages carefully
    • Don't skip past errors or warnings
    • They often contain precise solutions
    • Read the full stack trace
    • Note line numbers, file paths, error codes
  2. Reproduce consistently
    • Can you trigger it reliably?
    • What are the exact steps?
    • Does it happen every time?
    • If non-reproducible → collect more data, don't guess
  3. Check recent changes
    • What change might have caused this?
    • Git diff, recent commits
    • New dependencies, configuration changes
    • Environment differences
  4. Gather evidence in multi-component systems
When the system has multiple components (CI → Build → Signing, API → Service → Database):
Before proposing a fix, add diagnostic tools:
For EACH component boundary:
  - Log what data enters component
  - Log what data exits component
  - Verify environment/config propagation
  - Check state at each layer

Run once to gather evidence showing WHERE it breaks
THEN analyze evidence to identify failing component
THEN investigate that specific component
Example (multi-layer system):

```bash
# Layer 1: Workflow
echo "=== Secrets available in workflow: ==="
# Report SET/UNSET without echoing the secret value itself
echo "IDENTITY: $([ -n "${IDENTITY:-}" ] && echo SET || echo UNSET)"

# Layer 2: Build script
echo "=== Env vars in build script: ==="
env | grep IDENTITY || echo "IDENTITY not in environment"

# Layer 3: Signing script
echo "=== Keychain state: ==="
security list-keychains
security find-identity -v

# Layer 4: Actual signing
codesign --sign "$IDENTITY" --verbose=4 "$APP"
```

This reveals which layer failed (Secrets → Workflow ✓, Workflow → Build ✗).
  5. Trace data flow
When errors are deep in the call stack:
See root-cause-tracing.md in this directory for full backward tracing techniques.
Quick version:
  • Where does the bad value come from?
  • What called this with the bad value?
  • Keep tracing until you find the source
  • Fix at the source, not the symptom
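
The quick version above can be sketched in bash. The function names here (load_config, build, sign_app) are hypothetical; the point is that the captured trace is read backward to find where the bad value first enters:

```bash
#!/usr/bin/env bash
# Hypothetical three-layer call chain: the symptom surfaces in
# sign_app, but the empty value originates in load_config.

load_config() {
  # Root cause: the config key is misspelled, so nothing is read.
  echo ""                              # should have been the signing identity
}

build() {
  local identity
  identity=$(load_config)
  echo "build: identity='${identity}'" # trace: already empty at this layer
  sign_app "$identity"
}

sign_app() {
  local identity="$1"
  echo "sign: identity='${identity}'"  # symptom appears here
  [ -n "$identity" ] || { echo "ERROR: no signing identity"; return 1; }
}

# Run once, capture the trace, and read it backward: the empty value
# enters at load_config, so that is where the fix belongs.
trace=$(build 2>&1) || true
echo "$trace"
```

Fixing sign_app to tolerate an empty identity would only mask the symptom; the trace shows the source is load_config.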

Phase 2: Pattern Analysis

Find patterns before fixing:
  1. Look for working examples
    • Find similar working code in the same codebase
    • What works that's similar to what's broken?
  2. Compare to references
    • If implementing a pattern, read the reference implementation fully
    • Don't skim - read every line
    • Understand the pattern fully before applying
  3. Identify differences
    • What's different between working and broken?
    • List every difference, no matter how small
    • Don't assume "that doesn't matter"
  4. Understand dependencies
    • What other components does this require?
    • What settings, configurations, environment?
    • What assumptions does it make?
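
Step 3 above ("list every difference, no matter how small") can be made mechanical with diff; the config files and keys below are hypothetical:

```bash
#!/usr/bin/env bash
# Compare a known-working config against the broken one and list
# every difference, however small. (Hypothetical keys and files.)

working=$(mktemp) && broken=$(mktemp)

cat > "$working" <<'EOF'
signing_identity=Developer ID
keychain=build.keychain
timeout=30
EOF

cat > "$broken" <<'EOF'
signing_identity=Developer ID
keychain=login.keychain
timeout=30
EOF

# diff exits 1 when the files differ; capture the differences
# instead of treating that as a failure.
differences=$(diff "$working" "$broken" || true)
echo "$differences"

rm -f "$working" "$broken"
```

Every line in the output is a candidate root cause; none of them is "too small to matter" until ruled out.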

Phase 3: Hypothesis and Testing

Scientific Method:
  1. Form a single hypothesis
    • State clearly: "I think X is the root cause because Y"
    • Write it down
    • Be specific, not vague
  2. Minimal testing
    • Make the smallest possible change to test the hypothesis
    • One variable at a time
    • Don't fix multiple issues at once
  3. Verify before proceeding
    • Did it work? Yes → Phase 4
    • Didn't work? Form a new hypothesis
    • Don't add more fixes on top
  4. When you don't know
    • Say "I don't understand X"
    • Don't pretend to know
    • Ask for help
    • Research further
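
Steps 1-2 above might look like this as a one-off script: one written-down hypothesis, one variable changed, one observation. The MODE variable and the failing step are simulated for illustration:

```bash
#!/usr/bin/env bash
# Phase 3 sketch: state a single hypothesis, change exactly one
# variable, and observe. (Hypothetical MODE variable and step.)

step() {
  # Simulated behavior: succeeds only when MODE=strict is NOT set.
  [ "${MODE:-}" != "strict" ]
}

# Hypothesis: "step fails because MODE=strict leaks in from the CI env."
MODE=strict
if step; then before=pass; else before=fail; fi

# Minimal test: change exactly that one variable, nothing else.
unset MODE
if step; then after=pass; else after=fail; fi

echo "before=$before after=$after"
# before=fail, after=pass confirms the hypothesis; anything else refutes it.
```

Changing two variables at once would make either outcome uninterpretable, which is why the phase insists on one variable at a time.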

Phase 4: Implementation

Fix the root cause, not the symptom:
  1. Create a failing test case
    • Simplest reproduction possible
    • Automated test if possible
    • One-off test script if no framework
    • Must have this before fixing
    • Use the superpowers:test-driven-development skill to write correct failing tests
  2. Implement a single fix
    • Address the identified root cause
    • One change at a time
    • No "while I'm here" improvements
    • No bundled refactoring
  3. Verify the fix
    • Do tests pass now?
    • No other tests broken?
    • Is the problem truly solved?
  4. If the fix doesn't work
    • Stop
    • Count: How many fixes have you tried?
    • If < 3: Return to Phase 1, re-analyze with new info
    • If ≥ 3: Stop and question the architecture (Step 5 below)
    • Don't attempt fix #4 without architectural discussion
  5. If 3+ fixes fail: Architectural Problem
Patterns indicating architectural issues:
  • Each fix reveals new shared state/coupling/problems in different places
  • Fix requires "massive refactoring" to implement
  • Each fix creates new symptoms elsewhere
Stop and ask fundamentals:
  • Is this pattern fundamentally sound?
  • Are we "sticking with it purely out of inertia"?
  • Should we refactor the architecture or keep fixing symptoms?
Discuss with your human partner before attempting more fixes
This isn't a failed hypothesis - it's a flawed architecture.
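
Step 1 of this phase - a one-off test script when no framework is available - might look like the sketch below; the trim function and its earlier bug are hypothetical:

```bash
#!/usr/bin/env bash
# One-off reproduction script: encode the expected behavior as an
# assertion, watch it fail, fix the root cause, then re-run.

# Hypothetical function under test, with the root cause already fixed:
# it previously stripped only trailing (not leading) whitespace.
trim() {
  local s="$1"
  s="${s#"${s%%[![:space:]]*}"}"   # strip leading whitespace
  s="${s%"${s##*[![:space:]]}"}"   # strip trailing whitespace
  printf '%s' "$s"
}

# Simplest possible reproduction, as an automated check:
actual=$(trim "  hello  ")
expected="hello"

if [ "$actual" = "$expected" ]; then
  result=PASS
else
  result=FAIL
fi
echo "trim test: $result (got '$actual')"
```

Running this before the fix must print FAIL; only then does a PASS after the fix prove anything.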

Red Flags - Stop and Follow the Process

If you catch yourself thinking:
  • "Quick fix now, investigate later"
  • "Try changing X to see if it works"
  • "Add multiple changes, run tests"
  • "Skip testing, I'll verify manually"
  • "Probably X, let me fix that"
  • "I don't fully understand, but this might work"
  • "The pattern says X, but I'll adapt it differently"
  • "Here are the main issues: [list of uninvestigated fixes]"
  • Propose solutions before tracing data flow
  • "Just try one more fix" (after 2+ attempts)
  • Each fix reveals new problems in different places
All of these mean: Stop. Return to Phase 1.
If 3+ fixes fail: Question the architecture (see Phase 4, Step 5)

Signals from Your Human Partner That You're Doing It Wrong

Watch for these redirects:
  • "Is that what's happening?" - You assumed without verifying
  • "Would it tell us...?" - You should add evidence collection
  • "Stop guessing" - You're proposing fixes without understanding
  • "Ultrathink this" - Question fundamentals, not just symptoms
  • "Are we stuck?" (frustration) - Your approach isn't working
When you see these: Stop. Return to Phase 1.

Common Rationalizations

| Excuse | Reality |
| --- | --- |
| "The problem is simple, no need for process" | Simple problems still have root causes. The process is fast for simple bugs. |
| "It's an emergency, no time for process" | Systematic debugging is faster than guess-and-check whack-a-mole. |
| "Try it first, then investigate" | The first fix sets the pattern. Do it right from the start. |
| "I'll write tests after confirming the fix works" | Untested fixes don't last. Tests prove it first. |
| "Multiple fixes at once saves time" | Can't isolate what works. Introduces new bugs. |
| "The reference is too long, I'll adapt the pattern" | Partial understanding guarantees mistakes. Read it fully. |
| "I see the problem, let me fix it" | Seeing the symptom ≠ understanding the root cause. |
| "Just try one more fix" (after 2+ failures) | 3+ failures = architectural problem. Question the pattern; stop fixing symptoms. |

Quick Reference

| Phase | Main Activities | Success Criteria |
| --- | --- | --- |
| 1. Root Cause | Read errors, reproduce, check changes, gather evidence | Understand what and why |
| 2. Pattern | Find working examples, compare | Identify differences |
| 3. Hypothesis | Form theory, minimal testing | Confirmed or new hypothesis |
| 4. Implementation | Create test, fix, verify | Error resolved, tests pass |

When the Process Shows "No Root Cause"

If systematic investigation shows the problem is truly environmental, time-dependent, or external:
  1. You've completed the process
  2. Document what you investigated
  3. Implement appropriate handling (retries, timeouts, error messages)
  4. Add monitoring/logging for future investigation
But: 95% of "no root cause" cases are incomplete investigations.
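
Item 3 above (retries, timeouts) can be sketched as retry with exponential backoff; the flaky external call below is simulated with a counter file:

```bash
#!/usr/bin/env bash
# Retry with backoff for a step whose failures are genuinely external.
# The flaky external call is simulated with a counter file (hypothetical).

counter=$(mktemp)
echo 0 > "$counter"

flaky_step() {
  # Simulated external call: fails twice, then succeeds.
  local n
  n=$(($(cat "$counter") + 1))
  echo "$n" > "$counter"
  [ "$n" -ge 3 ]
}

attempts=1
max_attempts=5
delay=0                             # real use: start at 1s and double
until flaky_step; do
  if [ "$attempts" -ge "$max_attempts" ]; then
    echo "giving up after $max_attempts attempts"
    exit 1
  fi
  sleep "$delay"
  delay=$((delay * 2 + 1))          # exponential backoff: 0, 1, 3, ...
  attempts=$((attempts + 1))
done
echo "succeeded on attempt $attempts"
rm -f "$counter"
```

Bounding the retries and logging each attempt keeps this handling honest: if the "environmental" failure persists past the cap, that is new evidence for another investigation pass.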

Companion Techniques

These techniques are part of systematic debugging and can be found in this directory:
  • root-cause-tracing.md - Trace errors backward through the call stack to find the original trigger
  • defense-in-depth.md - Add multiple layers of validation after finding the root cause
  • condition-based-waiting.md - Replace arbitrary timeouts with conditional polling
Related Skills:
  • superpowers:test-driven-development - For creating failing test cases (Phase 4, Step 1)
  • superpowers:verify-before-complete - Verify fixes work before declaring success
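
The condition-based-waiting technique listed above can be sketched as a polling loop with a timeout in place of a fixed sleep; the background step and file path below are simulated:

```bash
#!/usr/bin/env bash
# Condition-based waiting: poll for the actual condition instead of
# sleeping a fixed, arbitrary amount. (Simulated background step.)

ready_file=$(mktemp -u)                # path only; the file doesn't exist yet
( sleep 0.2; touch "$ready_file" ) &   # simulated slow background step

wait_for() {                           # wait_for <path> <max_polls>
  local path="$1" max_polls="$2" polls=0
  until [ -e "$path" ]; do
    if [ "$polls" -ge "$max_polls" ]; then
      return 1                         # condition never became true
    fi
    sleep 0.1                          # poll interval: 0.1s per iteration
    polls=$((polls + 1))
  done
}

if wait_for "$ready_file" 50; then     # up to 50 polls ≈ 5s ceiling
  status=ready
else
  status=timeout
fi
echo "status: $status"
wait                                   # reap the background job
rm -f "$ready_file"
```

Unlike `sleep 5`, this returns as soon as the condition holds and fails loudly when it never does, instead of racing against a guessed duration.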

Real-World Impact

From debugging sessions:
  • Systematic approach: 15-30 minutes to fix
  • Random fix approach: 2-3 hours of whack-a-mole
  • First-fix success rate: 95% vs 40%
  • New bugs introduced: Near zero vs common