shared-bug-investigation

Bug Investigation - Scientific Method

Apply scientific methodology to investigate and resolve software bugs systematically.

Scientific Method Process

  1. Observe - Gather data about the problem
  2. Hypothesize - Form testable explanations (ranked by likelihood)
  3. Experiment - Test hypotheses with controlled changes
  4. Analyze - Interpret results objectively
  5. Conclude - Identify root cause and validate fix

Core Principles

  • Context first - Understand the project before investigating
  • Hypothesis-driven - Never jump to solutions without forming testable hypotheses
  • Isolate variables - Change one thing at a time
  • Reproduce reliably - Can't fix what you can't reproduce
  • Root causes over symptoms - Dig deeper than surface fixes
  • Validate rigorously - Confirm fix resolves issue without regressions

Investigation Workflow

Phase 1: Project Context (2-5 min)

Discover before investigating:
  • Language & version (Python 3.11, Java 17, Go 1.21, etc.)
  • Build system (Gradle, npm, Cargo, Make, etc.)
  • Key dependencies & frameworks
  • Architecture pattern (MVC, microservices, etc.)
  • Testing setup
Quick discovery:
  # Find package managers
  view package.json / requirements.txt / Cargo.toml / pom.xml

  # Check config files
  view .env / config.yml / settings.py

  # Identify entry points
  view main.* / index.* / app.*

Output: One-line context
  Python 3.11, Flask API, PostgreSQL, pytest, Docker
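The discovery step above can be sketched as a small script. This is a minimal illustration, not part of the original workflow: the `MANIFESTS` table and the `summarize_context` function are hypothetical names, and the labels are assumptions about what each file usually implies.

```python
from pathlib import Path

# Hypothetical mapping from manifest file name to a short label; extend as needed.
MANIFESTS = {
    "package.json": "Node.js (npm)",
    "requirements.txt": "Python (pip)",
    "pyproject.toml": "Python (pyproject)",
    "Cargo.toml": "Rust (Cargo)",
    "pom.xml": "Java (Maven)",
    "build.gradle": "Java/Kotlin (Gradle)",
    "go.mod": "Go (modules)",
    "Makefile": "Make",
    "Dockerfile": "Docker",
}


def summarize_context(root: str) -> str:
    """Return a one-line summary of the manifests found directly under root."""
    base = Path(root)
    found = [label for name, label in MANIFESTS.items() if (base / name).is_file()]
    return ", ".join(found) if found else "no known manifest found"
```

Calling `summarize_context(".")` from the project root gives a rough starting point for the one-line context; versions and frameworks still need to be read from the files themselves.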

Phase 2: Problem Definition

Gather:
  • Error messages (full text, codes)
  • Stack traces / logs
  • Steps to reproduce
  • Expected vs actual behavior
  • Environment (OS, version, config)
  • Reproducibility (always/sometimes/rare)
Document:
Bug: [Short description]
Reproduces: [Always/Sometimes/Rarely/Unable]
Error: [Key error message]
Steps:
1. [Action 1]
2. [Action 2]
3. [Failure occurs]
Expected: [What should happen]
Actual: [What happens]
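If reports are kept in code rather than free text, the template above could be captured in a small data class. This is only a sketch: `BugReport` and its field names are hypothetical and simply mirror the template.

```python
from dataclasses import dataclass, field


@dataclass
class BugReport:
    """Hypothetical container mirroring the fields of the template above."""
    description: str
    reproduces: str            # e.g. "Always", "Sometimes", or "Unable"
    error: str
    steps: list[str] = field(default_factory=list)
    expected: str = ""
    actual: str = ""

    def render(self) -> str:
        """Format the report in the same layout as the template."""
        lines = [
            f"Bug: {self.description}",
            f"Reproduces: {self.reproduces}",
            f"Error: {self.error}",
            "Steps:",
        ]
        lines += [f"{i}. {step}" for i, step in enumerate(self.steps, start=1)]
        lines += [f"Expected: {self.expected}", f"Actual: {self.actual}"]
        return "\n".join(lines)
```

Forcing every field to be filled in makes a missing piece, such as reproduction steps, visible before hypotheses are formed.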
收集信息:
  • 错误信息(完整文本、错误码)
  • 堆栈跟踪/日志
  • 复现步骤
  • 预期与实际行为对比
  • 运行环境(操作系统、版本、配置)
  • 复现概率(总是/有时/极少)
文档记录:
Bug: [简短描述]
复现概率: [总是/有时/无法复现]
错误信息: [关键错误内容]
步骤:
1. [操作1]
2. [操作2]
3. [出现故障]
预期结果: [应该发生的情况]
实际结果: [实际发生的情况]

Phase 3: Hypotheses (Ranked)

Form 2-4 testable hypotheses, ranked by likelihood.
H1: [Most likely cause]
  • Evidence for: [Why this is likely]
  • Test: [How to prove/disprove]
  • If true: [Expected result]
  • If false: [Expected result]
H2: [Alternative cause]
  • Evidence for: [Supporting observations]
  • Test: [Falsifiable experiment]
Common categories:
  • Logic errors (off-by-one, wrong operator, incorrect condition)
  • State issues (race condition, uninitialized, stale data)
  • Type/data (null/nil, type mismatch, parsing error)
  • Concurrency (data race, deadlock, thread safety)
  • Integration (API mismatch, version incompatibility)
  • Environment (config, platform-specific, resource limits)
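Ranking can be made explicit with a few lines of code. The sketch below assumes a subjective likelihood score per hypothesis; the `Hypothesis` class, its fields, and the `rank` function are hypothetical names chosen to match the template above.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    cause: str
    likelihood: float   # subjective estimate in [0, 1]; an assumption, not a measurement
    test: str           # how to prove or disprove the hypothesis
    if_true: str        # result expected if the hypothesis holds
    if_false: str       # result expected if it does not


def rank(hypotheses: list[Hypothesis]) -> list[Hypothesis]:
    """Order hypotheses from most to least likely, so H1 is tested first."""
    return sorted(hypotheses, key=lambda h: h.likelihood, reverse=True)


def label(hypotheses: list[Hypothesis]) -> list[str]:
    """Produce 'H1: ...', 'H2: ...' lines in ranked order."""
    return [f"H{i}: {h.cause}" for i, h in enumerate(rank(hypotheses), start=1)]
```

Writing down `if_true` and `if_false` before running the test keeps the experiment falsifiable: the outcome is compared against a prediction instead of being explained after the fact.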

Phase 4: Experiment

For each hypothesis:
Test H1: [Hypothesis name]
  • Change: [One variable to modify]
  • Measure: [What to observe]
  • Method: [Specific steps]
  • Result: [Actual outcome]
  • Conclusion: [Validated/Invalidated]
Techniques:
  • Add logging at key points
  • Use debugger breakpoints
  • Binary search (remove half the code)
  • Minimal reproduction (strip to essentials)
  • Diff working vs broken states
  • Isolate components
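The "one variable" rule can be enforced mechanically. The sketch below assumes the system under test can be driven from a configuration dictionary and that a caller-supplied `reproduces_bug` check exists; both, along with the function name `one_at_a_time`, are hypothetical.

```python
from typing import Any, Callable


def one_at_a_time(
    baseline: dict[str, Any],
    candidates: dict[str, Any],
    reproduces_bug: Callable[[dict[str, Any]], bool],
) -> dict[str, bool]:
    """Change one setting at a time, starting from a known-bad baseline.

    Returns, for each candidate key, whether the bug still reproduces
    after changing only that key.
    """
    results = {}
    for key, new_value in candidates.items():
        trial = {**baseline, key: new_value}   # copy; only `key` differs
        results[key] = reproduces_bug(trial)
    return results


# Toy stand-in for a real reproduction check: the bug needs more than one worker.
def toy_check(config: dict[str, Any]) -> bool:
    return config["workers"] > 1


baseline = {"cache": True, "workers": 4, "timeout": 30}
outcome = one_at_a_time(baseline, {"cache": False, "workers": 1, "timeout": 60}, toy_check)
```

In this toy run only the `workers` change makes the bug disappear, which points the next experiment at concurrency and leaves caching and timeouts aside.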

Phase 5: Root Cause

Identified: [Clear statement of actual cause]
Evidence: [Chain from observation → hypothesis → validation]
Why it occurred:
  • Immediate: [Technical reason]
  • Contributing: [What enabled this]
  • Systemic: [Deeper issue if any]

Phase 6: Solution & Validation

Fix: [Specific changes to make]
Why this works: [Explain causal connection]
Validation:
  1. Reproduce bug (confirm failure)
  2. Apply fix
  3. Retest (confirm success)
  4. Test edge cases
  5. Run test suite (no regressions)
  6. Add test for this bug
Prevention:
  • [Test to add]
  • [Assertion to include]
  • [Pattern to avoid]
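As an illustration of "add test for this bug", here is a sketch of a fix together with its regression test. The function `last_n` and the bug it once had are invented for the example; the tests are written in the plain-function style that pytest collects, but they also run as ordinary Python.

```python
def last_n(items: list, n: int) -> list:
    """Return the last n items (hypothetical function under test).

    A buggy version used items[-n:] unconditionally, which returns the
    whole list when n == 0 because -0 is the same index as 0.
    """
    if n <= 0:
        return []
    return items[-n:]


def test_last_n_zero_returns_empty():
    # Regression test for the original bug: n == 0 must yield an empty list.
    assert last_n([1, 2, 3], 0) == []


def test_last_n_edge_cases():
    assert last_n([1, 2, 3], 2) == [2, 3]
    assert last_n([1, 2, 3], 5) == [1, 2, 3]   # n larger than the list
    assert last_n([], 1) == []
```

The regression test should fail against the unfixed code and pass after the fix; running it in both states covers validation steps 1 and 3.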

Investigation Techniques

Binary Search: Remove half the code, test if bug persists. Repeat on failing half until isolated.
Minimal Reproduction: Strip to <50 lines that reproduce issue. Removes noise.
Differential Testing: Compare working vs broken (commits, versions, configs).
Strategic Logging: Add prints at key decision points to trace execution flow.
Rubber Duck: Explain code line-by-line aloud. Often reveals logic errors.
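Binary search also applies to an ordered history of changes, which is the idea behind `git bisect`. The sketch below is a generic version under three assumptions: the first candidate is known good, the last is known bad, and once the bug appears it stays. The names `first_bad` and `is_bad` are hypothetical.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def first_bad(candidates: Sequence[T], is_bad: Callable[[T], bool]) -> T:
    """Return the first bad candidate.

    Assumes candidates[0] is good, candidates[-1] is bad, and every
    candidate after the first bad one is also bad.
    """
    lo, hi = 0, len(candidates) - 1        # invariant: lo is good, hi is bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(candidates[mid]):
            hi = mid                       # bug already present at mid
        else:
            lo = mid                       # bug introduced after mid
    return candidates[hi]


# Toy history: the bug is introduced in "c5".
commits = [f"c{i}" for i in range(8)]
culprit = first_bad(commits, lambda c: int(c[1:]) >= 5)
```

Each check halves the remaining range, so eight candidates need at most three checks. With an intermittent bug the monotonicity assumption breaks, and `is_bad` would need to repeat its check several times before answering.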

Common Bug Patterns (Language-Agnostic)

Logic Errors:
  • Off-by-one: i < n vs i <= n
  • Wrong operator: && vs ||, == vs =
  • Negation errors: !condition logic flipped
State Issues:
State Issues:
  • Race conditions: concurrent access without synchronization
  • Uninitialized: using variable before setting value
  • Stale state: using outdated cached data
Type/Data:
  • Null/nil dereference
  • Type coercion errors
  • Integer overflow
  • Floating-point precision
Concurrency:
  • Deadlock: mutual waiting for locks
  • Data race: unsynchronized shared access
  • Thread safety: non-thread-safe code on multiple threads
Integration:
  • API contract mismatch
  • Version incompatibility
  • Missing dependencies
  • Incorrect configuration
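Although the patterns are language-agnostic, a few of them are easy to show concretely. The Python sketch below illustrates an off-by-one bound, a floating-point comparison, and a shared counter guarded by a lock; all names are hypothetical and chosen for the example.

```python
import math
import threading

values = [10, 20, 30]

# Off-by-one: range(len(values) + 1) would index one past the end.
total = 0
for i in range(len(values)):          # i < n, not i <= n
    total += values[i]

# Floating-point precision: compare with a tolerance instead of ==.
naive_equal = (0.1 + 0.2 == 0.3)
tolerant_equal = math.isclose(0.1 + 0.2, 0.3)


# Race condition: increments from several threads need synchronization.
class SafeCounter:
    def __init__(self) -> None:
        self.value = 0
        self._lock = threading.Lock()

    def increment(self) -> None:
        with self._lock:              # without the lock, updates can be lost
            self.value += 1


counter = SafeCounter()


def add_many(n: int) -> None:
    for _ in range(n):
        counter.increment()


threads = [threading.Thread(target=add_many, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Here `naive_equal` is `False` while `tolerant_equal` is `True`, and the locked counter ends at 4000. The unlocked variant may or may not lose updates on a given run, which is why race conditions tend to show up as intermittent failures.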

Output Format

Bug Investigation: [Name]

Context
[One-line: Language, framework, architecture]

Problem
Error: [Message]
Reproduces: [Always/Sometimes]
Steps: [1,2,3]

Hypotheses
H1: [Most likely] - Test by [method]
H2: [Alternative] - Test by [method]

Investigation
Tested H1: [Result - validated/invalidated]
[If needed] Tested H2: [Result]

Root Cause
[One sentence explanation]
Evidence: [What confirmed it]

Solution
Fix: [Specific change]
Why it works: [Explanation]
Validated: [Tested successfully]

Prevention
  • Test added: [Description]
  • Warning signs: [What to watch for]

Quick Decision Tree

Symptom → Likely Category:
  • Intermittent failure → Concurrency/state
  • Always fails same way → Logic error
  • Null/nil crash → Type/data
  • Specific environment only → Configuration
  • Performance degradation → Resource/algorithm
  • After dependency update → Integration
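The decision tree is effectively a lookup table, and could be encoded as one if triage is scripted. This is a minimal sketch: the `LIKELY_CATEGORY` table and the `likely_category` function are hypothetical, and the result is only a starting category for hypotheses, not a diagnosis.

```python
# Symptom -> likely category, copied from the list above.
LIKELY_CATEGORY = {
    "intermittent failure": "Concurrency/state",
    "always fails same way": "Logic error",
    "null/nil crash": "Type/data",
    "specific environment only": "Configuration",
    "performance degradation": "Resource/algorithm",
    "after dependency update": "Integration",
}


def likely_category(symptom: str) -> str:
    """Return the first category to investigate for a known symptom."""
    key = symptom.strip().lower()
    return LIKELY_CATEGORY.get(key, "Unknown: gather more data")
```

An unknown symptom falls back to gathering more data instead of guessing, which keeps the hypothesis step grounded in observation.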

Critical Reminders

  • Start with context discovery
  • Form hypotheses before coding
  • Change ONE variable at a time
  • Reproduce before fixing
  • Validate fix rigorously
  • Add regression test
  • Document root cause