Troubleshooting Guide Creator

An expert in creating structured guides for diagnosing and resolving problems.

Core Principles

Problem-Centric Structure

yaml

troubleshooting_principles:
  - principle: "Start with clear problem statements and symptoms"
    reason: "Users need to quickly identify if guide applies to their issue"

  - principle: "Use If-Then logic flows for decision trees"
    reason: "Systematic elimination of possible causes"

  - principle: "Organize solutions by likelihood and impact"
    reason: "Try simple/common fixes first, escalate to complex"

  - principle: "Follow logical diagnostic sequence (simple to complex)"
    reason: "Minimize time to resolution"

  - principle: "Include verification steps after each fix"
    reason: "Confirm the issue is actually resolved"

  - principle: "Provide rollback instructions"
    reason: "Allow safe recovery if fix causes new issues"

User Experience Focus

Write for the target audience's level
Use consistent formatting
Give an estimated time for each step
Include screenshots and examples where possible

Standard Guide Template

Troubleshooting: [Problem Title]

Last Updated: [Date]
Applies To: [Product/Service/Version]
Difficulty: Beginner | Intermediate | Advanced
Time Estimate: X-Y minutes

Problem Statement

Symptoms

Users experiencing this issue will observe:

Symptom 1 (observable behavior)
Symptom 2 (error message or code)
Symptom 3 (system state)

Error Messages

[Exact error message or code]

Affected Components

Component A
Component B

Impact

Severity: Critical | High | Medium | Low
Affected Users: All users | Specific group | Single user
Business Impact: [Description]

Quick Checks (2-5 minutes)

Before diving into detailed troubleshooting, verify these common causes:

Check 1: [Most Common Cause]

Time: 30 seconds

bash

# Command to verify
[diagnostic command]

Expected Output: [What you should see]
If this fails: Continue to Check 2

Check 2: [Second Most Common Cause]

Time: 1 minute

[Steps to verify]
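
Each quick check pairs a command with the output it should produce. A minimal sketch of that pairing follows; `run_check` is a hypothetical helper name, not part of the template, and the substring match is an assumed comparison rule.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run one quick check and compare its output with
# the expected text. Returns 0 when the output contains it, 1 otherwise.
run_check() {
    local name="$1" expected="$2"
    shift 2
    local actual
    # Capture stdout and stderr; a failing command must not abort the script.
    actual=$("$@" 2>&1) || true
    if [[ "$actual" == *"$expected"* ]]; then
        echo "PASS: $name"
        return 0
    fi
    echo "FAIL: $name (expected '$expected', got '$actual')"
    return 1
}
```

A call such as `run_check "Check 1" "active" systemctl is-active myapp` would print one PASS or FAIL line; `myapp` is a placeholder service name.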

Diagnostic Steps

Step 1: Gather Information

Collect the following before proceeding:

Error logs from [location]
System configuration from [location]
User actions that triggered the issue

bash

# Commands to gather diagnostic info
[command 1]
[command 2]

Step 2: Identify the Root Cause

Use this decision tree to identify the cause:

Start
  │
  ├─ Is [condition A] true?
  │    ├─ YES → Go to Solution A
  │    └─ NO → Continue
  │
  ├─ Is [condition B] true?
  │    ├─ YES → Go to Solution B
  │    └─ NO → Continue
  │
  └─ None of the above → Escalate to Support

Solutions

Solution A: [Fix Name]

Difficulty: Easy
Time: 5 minutes
Risk: Low

Prerequisites

[Prerequisite 1]
[Prerequisite 2]

Steps

Step 1 Title
```bash
[command]
```
Expected output: [description]
Step 2 Title: [Instructions]
Step 3 Title: [Instructions]

Verify Fix

bash

[verification command]

Success Indicator: [What to look for]

Rollback (if needed)

bash

[rollback command]
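
The fix, verify, and rollback parts of a solution can be chained so that a failed verification triggers the rollback automatically. The sketch below assumes each part is a shell command string supplied by the guide author; `apply_fix` is a hypothetical name.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: apply a fix, verify it, and roll back on failure.
# All three arguments are command strings supplied by the guide author.
apply_fix() {
    local fix_cmd="$1" verify_cmd="$2" rollback_cmd="$3"
    if ! bash -c "$fix_cmd"; then
        echo "fix failed; rolling back"
        bash -c "$rollback_cmd"
        return 1
    fi
    if bash -c "$verify_cmd"; then
        echo "fix verified"
        return 0
    fi
    echo "verification failed; rolling back"
    bash -c "$rollback_cmd"
    return 1
}
```

Passing the commands as strings keeps the wrapper generic, at the cost of one extra shell per step.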

Solution B: [Fix Name]

Difficulty: Medium
Time: 15 minutes
Risk: Medium

[Same structure as Solution A]

Prevention

To prevent this issue from recurring:

Monitoring: Set up alerts for [metric]
Configuration: Ensure [setting] is properly configured
Process: Follow [procedure] when making changes
Training: Educate team on [best practice]

Escalation

If the above solutions don't resolve the issue:

When to Escalate

Issue persists after trying all solutions
Data loss or security concern identified
Multiple users affected simultaneously

Information to Provide

Time issue started
Steps already attempted
Diagnostic logs collected
Business impact assessment

Contact

Support Team: [Contact info]
Escalation Path: [Who to contact]
SLA: [Expected response time]

Related Resources

[Link to related guide]
[Link to documentation]
[Link to FAQ]

Revision History

Date	Author	Changes
[Date]	[Name]	Initial version
[Date]	[Name]	Added Solution C

---

Diagnostic Patterns

Layer-by-Layer Approach

Network Connectivity Troubleshooting

Layer 1: Physical

Check cable connections
Verify link lights are active
Test with known-good cable

Layer 2: Data Link

Verify MAC address is learned
Check for VLAN misconfigurations
Review spanning tree state

Layer 3: Network

Verify IP configuration
Test ping to gateway
Check routing table

Layer 4: Transport

Verify service is listening on correct port
Check firewall rules
Test with telnet/nc to port

Layer 7: Application

Check application logs
Verify configuration files
Test with curl/wget
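
The layer order above can be encoded as a list of checks that stops at the first failing layer. This is a rough sketch: `check_layers` and the `label::command` argument format are assumptions made for illustration, and the addresses in the usage comment are placeholders.

```bash
#!/usr/bin/env bash
# Hypothetical runner: each argument is "label::command". Layers run in
# the order given, and the function stops at the first one that fails.
check_layers() {
    local entry label cmd
    for entry in "$@"; do
        label="${entry%%::*}"   # text before the first "::"
        cmd="${entry#*::}"      # text after the first "::"
        if bash -c "$cmd" >/dev/null 2>&1; then
            echo "OK   $label"
        else
            echo "FAIL $label"
            return 1
        fi
    done
}

# Example call (placeholder addresses, Linux ping options):
#   check_layers \
#       "L3 gateway::ping -c 1 -W 2 192.0.2.1" \
#       "L4 port::nc -z -w 2 192.0.2.10 443" \
#       "L7 HTTP::curl -fsS -o /dev/null https://example.com/health"
```

Stopping at the first failure reflects the idea that a broken lower layer makes results from the layers above it unreliable.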

Binary Elimination Method

Identifying Faulty Component

Use binary search to isolate the issue:

Step 1: Test Midpoint

Test the system at the midpoint of the data flow:

[Client] → [Load Balancer] → [App Server] → [Database]
                  ↑
            Test here first

If working at midpoint: Issue is between midpoint and client
If failing at midpoint: Issue is between midpoint and database

Step 2: Narrow Down

Repeat the process, testing the midpoint of the remaining segment.

Step 3: Isolate

Continue until you've identified the specific failing component.
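
The three steps amount to a binary search over the components in data-flow order. The sketch below assumes a probe command (hypothetical, supplied by the reader) that takes a component index and succeeds only when the path from that component through to the last one works. It also assumes exactly one fault, so the end-to-end path (index 0) fails.

```bash
#!/usr/bin/env bash
# Hypothetical bisect: prints the index of the faulty component.
#   $1  probe command; "probe N" succeeds when the path from component N
#       to the last component works
#   $2  number of components (index 0 is the client end)
last_failing() {
    local probe="$1" count="$2"
    local lo=0 hi=$((count - 1)) mid
    while (( lo < hi )); do
        mid=$(( (lo + hi + 1) / 2 ))
        if "$probe" "$mid"; then
            hi=$((mid - 1))   # works at midpoint: fault is toward the client
        else
            lo=$mid           # fails at midpoint: fault is here or toward the database
        fi
    done
    echo "$lo"
}
```

With four components this needs at most two probes instead of testing every hop in turn.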

Symptom-Based Decision Tree

Application Not Responding

┌─ Can you reach the server at all?
│
├─ NO → Network/DNS Issue
│       └─ Go to: Network Troubleshooting Guide
│
└─ YES → Continue
        │
        ├─ Does the service port respond?
        │
        ├─ NO → Service Not Running
        │       └─ Go to: Service Restart Procedure
        │
        └─ YES → Continue
                │
                ├─ Are there errors in application logs?
                │
                ├─ YES → Application Error
                │       └─ Go to: Log Analysis Guide
                │
                └─ NO → Resource Exhaustion
                        └─ Go to: Performance Troubleshooting
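
The same tree can be written as a chain of conditions. In this sketch the three answers are passed in as "yes" or "no"; `triage` is a hypothetical function name, and gathering the answers is left to the reader.

```bash
#!/usr/bin/env bash
# Hypothetical triage mirroring the tree above. Each argument is "yes" or "no":
#   $1 server reachable, $2 service port responds, $3 errors in application logs
triage() {
    local reachable="$1" port_open="$2" log_errors="$3"
    if [[ "$reachable" != yes ]]; then
        echo "Network/DNS Issue"
    elif [[ "$port_open" != yes ]]; then
        echo "Service Not Running"
    elif [[ "$log_errors" == yes ]]; then
        echo "Application Error"
    else
        echo "Resource Exhaustion"
    fi
}
```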

---

Log Analysis Guide

Common Log Locations

yaml

linux_logs:
  system:
    - /var/log/syslog
    - /var/log/messages
    - journalctl -xe

  application:
    - /var/log/[app-name]/
    - ~/.pm2/logs/
    - docker logs [container]

  web_server:
    nginx:
      - /var/log/nginx/error.log
      - /var/log/nginx/access.log
    apache:
      - /var/log/apache2/error.log
      - /var/log/httpd/error_log

  database:
    postgresql:
      - /var/log/postgresql/
    mysql:
      - /var/log/mysql/error.log
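
Which of these locations exist depends on the distribution and on what is installed, so it can help to filter the list first. A small sketch, with `existing_logs` as a hypothetical helper name:

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the candidate log paths that exist and are
# readable by the current user, one per line.
existing_logs() {
    local path
    for path in "$@"; do
        if [[ -r "$path" ]]; then
            echo "$path"
        fi
    done
}

# Example call:
#   existing_logs /var/log/syslog /var/log/messages /var/log/nginx/error.log
```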

Log Analysis Commands

bash

# Find errors in the last 100 lines
tail -n 100 /var/log/app.log | grep -i error

# Show the 50 most recent error lines (with their timestamps)
grep -i error /var/log/app.log | tail -n 50

# Watch the log in real time
tail -f /var/log/app.log | grep --line-buffered -i error

# Count identical error lines (lines that differ only by timestamp count separately)
grep -i error /var/log/app.log | sort | uniq -c | sort -rn | head -20

# Find entries around a specific time
awk '/2024-01-15 14:3[0-5]/' /var/log/app.log

# Extract specific fields (JSON logs)
jq 'select(.level == "error") | {time, message}' /var/log/app.json

# Search compressed logs
zgrep -i error /var/log/app.log.*.gz
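
Counting with `sort | uniq -c` groups only identical lines, so entries that carry a timestamp rarely collapse into one type. The sketch below strips a leading timestamp before counting. The `YYYY-MM-DD HH:MM:SS` prefix is an assumption about the log format, and `count_errors` is a hypothetical name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: count error messages by text, after removing an
# assumed leading "YYYY-MM-DD HH:MM:SS " timestamp from each line.
count_errors() {
    grep -i error "$1" |
        sed -E 's/^[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2} //' |
        sort | uniq -c | sort -rn
}
```

Logs with a different prefix would need the `sed` pattern adjusted to match.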

Error Pattern Recognition

Common Error Patterns

Connection Errors

Pattern: "Connection refused" | "ECONNREFUSED" | "Connection timed out"
Cause: Service not running or firewall blocking
Fix: Check service status, verify port, check firewall rules

Memory Errors

Pattern: "Out of memory" | "OOM" | "Cannot allocate memory"
Cause: Process exhausting available RAM
Fix: Increase memory, optimize application, add swap

Disk Errors

Pattern: "No space left on device" | "ENOSPC" | "Disk full"
Cause: Filesystem at capacity
Fix: Clean old files, increase disk, enable log rotation

Permission Errors

Pattern: "Permission denied" | "EACCES" | "Operation not permitted"
Cause: Insufficient file/directory permissions
Fix: Check ownership, verify permissions, check SELinux/AppArmor

Database Errors

Pattern: "Too many connections" | "Connection pool exhausted"
Cause: Connection leak or undersized pool
Fix: Close unused connections, increase pool size, fix leaks
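
The patterns above map naturally onto a `case` statement. In this sketch the category names (`connection`, `memory`, and so on) and the function name `classify_error` are illustrative choices. Matching is case-sensitive and by substring, so a short token such as `OOM` can also match unrelated text.

```bash
#!/usr/bin/env bash
# Hypothetical classifier built from the patterns listed above.
classify_error() {
    case "$1" in
        *"Connection refused"*|*ECONNREFUSED*|*"Connection timed out"*)
            echo connection ;;
        *"Out of memory"*|*OOM*|*"Cannot allocate memory"*)
            echo memory ;;
        *"No space left on device"*|*ENOSPC*|*"Disk full"*)
            echo disk ;;
        *"Permission denied"*|*EACCES*|*"Operation not permitted"*)
            echo permission ;;
        *"Too many connections"*|*"Connection pool exhausted"*)
            echo database ;;
        *)
            echo unknown ;;
    esac
}
```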

---

Specific Problem Templates

API Not Responding

Troubleshooting: API Not Responding

Quick Diagnosis Script

bash

#!/bin/bash
# api-health-check.sh

API_URL="${1:-http://localhost:8080}"
TIMEOUT=5

echo "=== API Health Check ==="
echo "Target: $API_URL"
echo

# Host and port parsed from the URL (port defaults to 80 when none is given)
HOST=$(echo "$API_URL" | sed 's|.*://||' | cut -d'/' -f1 | cut -d':' -f1)
PORT=$(echo "$API_URL" | grep -oP ':\K[0-9]+' || echo "80")

# 1. DNS Resolution
echo "1. DNS Resolution..."
# dig exits 0 even when nothing resolves, so test for a non-empty answer
host=$(dig +short "$HOST" 2>/dev/null)
if [ -n "$host" ]; then
    echo "   ✅ DNS resolves to: $host"
else
    echo "   ❌ DNS resolution failed"
fi

# 2. Port Connectivity
echo "2. Port Connectivity..."
if nc -z -w "$TIMEOUT" "$HOST" "$PORT" 2>/dev/null; then
    echo "   ✅ Port $PORT is open"
else
    echo "   ❌ Port $PORT is not reachable"
fi

# 3. HTTP Response
echo "3. HTTP Response..."
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" --connect-timeout "$TIMEOUT" "$API_URL/health" 2>/dev/null)
if [ "$HTTP_CODE" = "200" ]; then
    echo "   ✅ Health endpoint returns 200"
elif [ -n "$HTTP_CODE" ] && [ "$HTTP_CODE" != "000" ]; then
    echo "   ⚠️ Health endpoint returns $HTTP_CODE"
else
    echo "   ❌ No HTTP response"
fi

# 4. Response Time
echo "4. Response Time..."
RESPONSE_TIME=$(curl -s -o /dev/null -w "%{time_total}" --connect-timeout "$TIMEOUT" "$API_URL/health" 2>/dev/null)
if (( $(echo "$RESPONSE_TIME < 1" | bc -l) )); then
    echo "   ✅ Response time: ${RESPONSE_TIME}s"
else
    echo "   ⚠️ Slow response: ${RESPONSE_TIME}s"
fi

echo
echo "=== Check Complete ==="

Decision Tree

API Not Responding
       │
       ├─ Can you ping the server?
       │   ├─ NO → Check network/DNS
       │   └─ YES ↓
       │
       ├─ Is the service running?
       │   ├─ NO → Start/restart service
       │   └─ YES ↓
       │
       ├─ Is the port listening?
       │   ├─ NO → Check service configuration
       │   └─ YES ↓
       │
       ├─ Does health check pass?
       │   ├─ NO → Check dependencies (DB, cache)
       │   └─ YES ↓
       │
       └─ Check application logs for errors

Database Connection Issues

Troubleshooting: Database Connection Failed

Symptoms

Application shows "Connection refused" or "Connection timed out"
Error: "FATAL: too many connections for role"
Error: "FATAL: password authentication failed"

Quick Checks

1. Verify Database is Running

bash

# PostgreSQL
sudo systemctl status postgresql
pg_isready -h localhost -p 5432

# MySQL
sudo systemctl status mysql
mysqladmin -u root -p ping

2. Test Connection

bash

# PostgreSQL
psql -h localhost -U username -d database -c "SELECT 1"

# MySQL
mysql -h localhost -u username -p -e "SELECT 1"

3. Check Connection Count

sql

-- PostgreSQL
SELECT count(*) FROM pg_stat_activity;
SELECT setting AS max_connections FROM pg_settings WHERE name = 'max_connections';

-- MySQL
SHOW STATUS LIKE 'Threads_connected';
SHOW VARIABLES LIKE 'max_connections';
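
The two numbers are most useful as a ratio. The sketch below turns them into a percentage and a warning. The 80% threshold and the name `conn_pressure` are assumptions for illustration, not values from the guide.

```bash
#!/usr/bin/env bash
# Hypothetical helper: report connection usage as a whole percentage.
#   $1 current connection count, $2 configured maximum
conn_pressure() {
    local current="$1" max="$2"
    local pct=$(( current * 100 / max ))
    if (( pct >= 80 )); then   # assumed warning threshold
        echo "WARN ${pct}%"
    else
        echo "OK ${pct}%"
    fi
}

# Example call for PostgreSQL (connection settings taken from the environment):
#   conn_pressure "$(psql -Atc 'SELECT count(*) FROM pg_stat_activity')" \
#                 "$(psql -Atc 'SHOW max_connections')"
```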

Solutions

Solution 1: Restart Connection Pool

bash

# If using PgBouncer
sudo systemctl restart pgbouncer

# Application restart
sudo systemctl restart myapp

Solution 2: Clear Idle Connections

sql

-- PostgreSQL: Kill idle connections older than 10 minutes
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
  AND state_change < NOW() - INTERVAL '10 minutes';

Solution 3: Increase Max Connections

sql

-- PostgreSQL (requires restart)
ALTER SYSTEM SET max_connections = 200;

-- MySQL (can be done live; the value is lost on restart unless it is also set in the server configuration)
SET GLOBAL max_connections = 200;

---

Quality Assurance Checklist

Pre-Publication Review

Troubleshooting Guide Quality Checklist

Accuracy

All commands tested and verified
Output examples are accurate
Links to resources are valid
Version numbers are current

Completeness

All common causes covered
Rollback instructions provided
Escalation path defined
Prevention tips included

Usability

Clear success/failure criteria
Time estimates accurate
Difficulty levels appropriate
Tested by someone unfamiliar with issue

Formatting

Consistent heading structure
Code blocks properly formatted
Decision trees clear
Screenshots/diagrams where helpful

Maintenance

Last updated date included
Revision history maintained
Owner/contact identified
Review schedule established
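
Part of the checklist can be checked mechanically, for example that a guide contains the main template sections. The sketch below assumes the guide is a text or markdown file whose headings match the template names. The function name `missing_sections` and the choice of five sections are illustrative.

```bash
#!/usr/bin/env bash
# Hypothetical lint: list template sections missing from a guide file.
# A section counts as present when a line consists of its name,
# optionally preceded by markdown "#" heading markers.
missing_sections() {
    local file="$1" section status=0
    for section in "Problem Statement" "Symptoms" "Solutions" \
                   "Escalation" "Revision History"; do
        if ! grep -qE "^#* ?${section}\$" "$file"; then
            echo "missing: $section"
            status=1
        fi
    done
    return "$status"
}
```

Items such as "All commands tested and verified" still need a human reviewer.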

---

Best Practices

Start with symptoms: the user should quickly see whether the guide applies
Simplest fix first: check the obvious causes before complex diagnostics
Include verification steps: how to tell that the problem is solved
Document rollback: a way to revert if the fix does not help
State the time: the user should know how long each step takes
Test on newcomers: the guide should work for people who do not know the system
Update regularly: an outdated guide is worse than no guide
Include an escalation path: when to escalate and whom to contact
