Troubleshooting Guide Creator
Expert in creating structured guides for diagnosing and resolving problems.
Problem-Centric Structure
yaml
troubleshooting_principles:
  - principle: "Start with clear problem statements and symptoms"
    reason: "Users need to quickly identify if guide applies to their issue"
  - principle: "Use If-Then logic flows for decision trees"
    reason: "Systematic elimination of possible causes"
  - principle: "Organize solutions by likelihood and impact"
    reason: "Try simple/common fixes first, escalate to complex"
  - principle: "Follow logical diagnostic sequence (simple to complex)"
    reason: "Minimize time to resolution"
  - principle: "Include verification steps after each fix"
    reason: "Confirm the issue is actually resolved"
  - principle: "Provide rollback instructions"
    reason: "Allow safe recovery if fix causes new issues"
User Experience Focus
- Write for the target audience's skill level
- Use consistent formatting
- Give an estimated time for each step
- Include screenshots and examples where possible
Standard Guide Template
Troubleshooting: [Problem Title]
Last Updated: [Date]
Applies To: [Product/Service/Version]
Difficulty: Beginner | Intermediate | Advanced
Time Estimate: X-Y minutes
Problem Statement
Users experiencing this issue will observe:
Error Messages
[Exact error message or code]
Affected Components
- Severity: Critical | High | Medium | Low
- Affected Users: All users | Specific group | Single user
- Business Impact: [Description]
Quick Checks (2-5 minutes)
Before diving into detailed troubleshooting, verify these common causes:
Check 1: [Most Common Cause]
Command to verify
[diagnostic command]
**Expected Output:** [What you should see]
**If this fails:** Continue to Check 2
Check 2: [Second Most Common Cause]
Time: 1 minute
[Steps to verify]
Diagnostic Steps
Step 1: Gather Information
Collect the following before proceeding:
Commands to gather diagnostic info
Step 2: Identify the Root Cause
Use this decision tree to identify the cause:
Start
│
├─ Is [condition A] true?
│ ├─ YES → Go to Solution A
│ └─ NO → Continue
│
├─ Is [condition B] true?
│ ├─ YES → Go to Solution B
│ └─ NO → Continue
│
└─ None of the above → Escalate to Support
Solution A: [Fix Name]
Difficulty: Easy
Time: 5 minutes
Risk: Low
Prerequisites
- Step 1 Title
  Expected output: [description]
- Step 2 Title
  [Instructions]
- Step 3 Title
  [Instructions]
bash
[verification command]
Success Indicator: [What to look for]
Rollback (if needed)
Solution B: [Fix Name]
Difficulty: Medium
Time: 15 minutes
Risk: Medium
[Same structure as Solution A]
To prevent this issue from recurring:
- Monitoring: Set up alerts for [metric]
- Configuration: Ensure [setting] is properly configured
- Process: Follow [procedure] when making changes
- Training: Educate team on [best practice]
If the above solutions don't resolve the issue:
When to Escalate
Information to Provide
- Support Team: [Contact info]
- Escalation Path: [Who to contact]
- SLA: [Expected response time]
Related Resources
- [Link to related guide]
- [Link to documentation]
- [Link to FAQ]
Revision History
| Date | Author | Changes |
|---|---|---|
| [Date] | [Name] | Initial version |
| [Date] | [Name] | Added Solution C |
Layer-by-Layer Approach
Network Connectivity Troubleshooting
Layer 1: Physical
Layer 2: Data Link
Layer 3: Network
Layer 4: Transport
Layer 7: Application
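The layer headings above list no commands, so here is a minimal sketch of one commonly used Linux probe per layer. It assumes iproute2 (`ip`), `ping`, `nc`, and `curl` are available. The function name `layer_probe`, the default interface `eth0`, and the default port 80 are hypothetical choices for illustration. The function only prints the command, so it can be reviewed before it is run.

```shell
#!/usr/bin/env bash
# layer_probe LAYER HOST [PORT] [IFACE]
# Prints one typical diagnostic command for the given OSI layer.
# The tool choice per layer is an illustrative assumption, not a fixed procedure.
layer_probe() {
  local layer=$1 host=$2 port=${3:-80} iface=${4:-eth0}
  case "$layer" in
    1) echo "ip link show $iface" ;;           # link state and carrier
    2) echo "ip neigh show dev $iface" ;;      # ARP / neighbor table
    3) echo "ping -c 3 $host" ;;               # IP reachability
    4) echo "nc -z -w 5 $host $port" ;;        # TCP port open?
    7) echo "curl -sS -o /dev/null -w '%{http_code}' http://$host:$port/" ;;  # HTTP status
    *) echo "unknown layer: $layer" >&2; return 1 ;;
  esac
}
```

Working upward from layer 1 and stopping at the first probe that fails keeps the search aligned with the simple-to-complex sequence recommended earlier.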
Binary Elimination Method
Identifying Faulty Component
Use binary search to isolate the issue:
Step 1: Test Midpoint
Test the system at the midpoint of the data flow:
[Client] → [Load Balancer] → [App Server] → [Database]
↑
Test here first
If working at midpoint: the issue is between the client and the midpoint
If failing at midpoint: the issue is at the midpoint or between the midpoint and the database
Step 2: Narrow Down
Repeat the process, testing the midpoint of the remaining segment.
Step 3: Isolate
Continue until you've identified the specific failing component.
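The three steps above can be sketched as a small bisection loop. This is a sketch under stated assumptions: there is a single fault, the end-to-end request is already known to fail, and the caller supplies a probe command that succeeds when a request injected at a given component works. `find_faulty` and the probe are hypothetical names.

```shell
#!/usr/bin/env bash
# find_faulty PROBE COMPONENT...
# COMPONENTs are listed in data-flow order, client side first.
# PROBE COMPONENT must return 0 when a request injected at COMPONENT succeeds,
# meaning that component and everything downstream of it is healthy.
find_faulty() {
  local probe=$1; shift
  local -a parts=("$@")
  local lo=0 hi=$(( ${#parts[@]} - 1 )) mid
  while (( lo < hi )); do
    mid=$(( (lo + hi + 1) / 2 ))
    if "$probe" "${parts[mid]}"; then
      hi=$(( mid - 1 ))   # works from here on: the fault is closer to the client
    else
      lo=$mid             # still failing: the fault is here or further downstream
    fi
  done
  echo "${parts[lo]}"
}
```

With four components the loop needs at most two probes. A call might look like `find_faulty my_probe client load-balancer app-server database`, where `my_probe` wraps whatever request makes sense for each hop.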
Symptom-Based Decision Tree
Application Not Responding
┌─ Can you reach the server at all?
│
├─ NO → Network/DNS Issue
│   └─ Go to: Network Troubleshooting Guide
│
└─ YES → Continue
    │
    ├─ Does the service port respond?
    │
    ├─ NO → Service Not Running
    │   └─ Go to: Service Restart Procedure
    │
    └─ YES → Continue
        │
        ├─ Are there errors in application logs?
        │
        ├─ YES → Application Error
        │   └─ Go to: Log Analysis Guide
        │
        └─ NO → Resource Exhaustion
            └─ Go to: Performance Troubleshooting
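The tree above maps directly onto an if/elif chain. The sketch below is one possible encoding. The three `check_*` helpers are hypothetical defaults that assume Linux `ping` (where `-W` is a timeout in seconds), `nc`, and a plain-text log file, and each can be redefined to suit the environment.

```shell
#!/usr/bin/env bash
# Default checks (assumptions; override as needed).
check_reachable()  { ping -c 1 -W 2 "$1" >/dev/null 2>&1; }
check_port()       { nc -z -w 2 "$1" "$2" >/dev/null 2>&1; }
check_log_errors() { [ -r "$1" ] && tail -n 200 "$1" | grep -qi error; }

# triage HOST PORT LOGFILE
# Walks the "Application Not Responding" tree and prints the branch reached.
triage() {
  local host=$1 port=$2 log=$3
  if ! check_reachable "$host"; then
    echo "Network/DNS Issue -> Network Troubleshooting Guide"
  elif ! check_port "$host" "$port"; then
    echo "Service Not Running -> Service Restart Procedure"
  elif check_log_errors "$log"; then
    echo "Application Error -> Log Analysis Guide"
  else
    echo "Resource Exhaustion -> Performance Troubleshooting"
  fi
}
```

Keeping each question in its own function makes it possible to swap a check (for example, an HTTP probe instead of `ping` on networks that drop ICMP) without touching the tree itself.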
Common Log Locations
yaml
linux_logs:
  system:
    - /var/log/syslog
    - /var/log/messages
    - journalctl -xe
  application:
    - /var/log/[app-name]/
    - ~/.pm2/logs/
    - docker logs [container]
  web_server:
    nginx:
      - /var/log/nginx/error.log
      - /var/log/nginx/access.log
    apache:
      - /var/log/apache2/error.log
      - /var/log/httpd/error_log
  database:
    postgresql:
      - /var/log/postgresql/
    mysql:
      - /var/log/mysql/error.log
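Not every path in the list exists on every host: `/var/log/syslog` and `/var/log/apache2/` are typical of Debian and Ubuntu, while `/var/log/messages` and `/var/log/httpd/` are typical of RHEL-family systems. A small helper, hypothetically named `existing_logs`, can filter a candidate list down to what is readable on the current machine.

```shell
#!/usr/bin/env bash
# existing_logs PATH...
# Prints only the candidate paths that exist and are readable by the current user.
existing_logs() {
  local p
  for p in "$@"; do
    [ -r "$p" ] && echo "$p"
  done
  return 0
}
```

For example, `existing_logs /var/log/syslog /var/log/messages` shows which of the two system logs applies before any `grep` is attempted.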
Log Analysis Commands
Find errors in last 100 lines
tail -n 100 /var/log/app.log | grep -i error
Show the 50 most recent error lines (with timestamps, if the log records them)
grep -i error /var/log/app.log | tail -n 50
Watch log in real-time
tail -f /var/log/app.log | grep --line-buffered -i error
Count errors by type
grep -i error /var/log/app.log | sort | uniq -c | sort -rn | head -n 20
Find entries around specific time
awk '/2024-01-15 14:3[0-5]/' /var/log/app.log
Extract specific fields (JSON logs)
jq 'select(.level == "error") | {time, message}' /var/log/app.json
Search compressed logs
zgrep -i error /var/log/app.log.*.gz
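One caveat with the "Count errors by type" pipeline: `uniq -c` groups only identical lines, so a log whose lines begin with a timestamp yields one group per line. The sketch below strips a leading `YYYY-MM-DD HH:MM:SS` timestamp and masks digit runs before counting. The timestamp layout is an assumption based on the `awk` example, and `count_error_types` is a hypothetical name.

```shell
#!/usr/bin/env bash
# count_error_types LOGFILE
# Groups error lines after removing an assumed leading "YYYY-MM-DD HH:MM:SS"
# timestamp and replacing every run of digits with N, most frequent first.
count_error_types() {
  grep -i error "$1" \
    | sed -E 's/^[0-9]{4}-[0-9]{2}-[0-9]{2}[ T][0-9:.,]+ *//; s/[0-9]+/N/g' \
    | sort | uniq -c | sort -rn | head -n 20
}
```

Masking numbers merges messages such as "timeout after 30s" and "timeout after 45s" into one group. Logs in another format need a different `sed` expression.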
Error Pattern Recognition
Common Error Patterns
Connection Errors
Pattern: "Connection refused" | "ECONNREFUSED" | "Connection timed out"
Cause: Service not running or firewall blocking
Fix: Check service status, verify port, check firewall rules
Memory Errors
Pattern: "Out of memory" | "OOM" | "Cannot allocate memory"
Cause: Process exhausting available RAM
Fix: Increase memory, optimize application, add swap
Disk Space Errors
Pattern: "No space left on device" | "ENOSPC" | "Disk full"
Cause: Filesystem at capacity
Fix: Clean old files, increase disk, enable log rotation
Permission Errors
Pattern: "Permission denied" | "EACCES" | "Operation not permitted"
Cause: Insufficient file/directory permissions
Fix: Check ownership, verify permissions, check SELinux/AppArmor
Database Errors
Pattern: "Too many connections" | "Connection pool exhausted"
Cause: Connection leak or undersized pool
Fix: Close unused connections, increase pool size, fix leaks
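The pattern lists above can be turned into a simple classifier for triaging individual log lines. This is a sketch: `classify_error` is a hypothetical helper, matching is case-insensitive substring matching, and the short `oom` pattern also matches unrelated words such as "room", so it may need tightening for real logs.

```shell
#!/usr/bin/env bash
# classify_error LINE
# Prints the category whose pattern appears in LINE, or "unknown".
classify_error() {
  local line
  line=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$line" in
    *"connection refused"*|*econnrefused*|*"connection timed out"*) echo connection ;;
    *"too many connections"*|*"connection pool exhausted"*)         echo database ;;
    *"out of memory"*|*oom*|*"cannot allocate memory"*)             echo memory ;;
    *"no space left on device"*|*enospc*|*"disk full"*)             echo disk ;;
    *"permission denied"*|*eacces*|*"operation not permitted"*)     echo permission ;;
    *) echo unknown ;;
  esac
}
```

The returned category points at the matching Cause and Fix entries, which keeps the first response to a new error consistent across the team.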
Specific Problem Templates
Troubleshooting: API Not Responding
Quick Diagnosis Script
#!/usr/bin/env bash
# api-health-check.sh
# Usage: ./api-health-check.sh <base-url>
# API_URL and TIMEOUT are never set in the original snippet; taking them from
# the first argument and the environment is an assumption.
API_URL="${1:?usage: $0 <base-url>}"
TIMEOUT="${TIMEOUT:-5}"
HOST=$(echo "$API_URL" | sed 's|.*://||' | cut -d'/' -f1 | cut -d':' -f1)
# Use the explicit port if the URL has one, else 443 for https and 80 otherwise.
PORT=$(echo "$API_URL" | sed -nE 's#^[a-zA-Z]+://[^/:]+:([0-9]+).*#\1#p')
if [ -z "$PORT" ]; then
  case "$API_URL" in https://*) PORT=443 ;; *) PORT=80 ;; esac
fi

# 1. DNS Resolution
echo "1. DNS Resolution..."
# dig exits 0 even when nothing resolves, so check for non-empty output.
ADDR=$(dig +short "$HOST" 2>/dev/null)
if [ -n "$ADDR" ]; then
  echo " ✅ DNS resolves to: $ADDR"
else
  echo " ❌ DNS resolution failed"
fi

# 2. Port Connectivity
echo "2. Port Connectivity..."
if nc -z -w "$TIMEOUT" "$HOST" "$PORT" 2>/dev/null; then
  echo " ✅ Port $PORT is open"
else
  echo " ❌ Port $PORT is not reachable"
fi

# 3. HTTP Response
echo "3. HTTP Response..."
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" --connect-timeout "$TIMEOUT" "$API_URL/health" 2>/dev/null)
if [ "$HTTP_CODE" = "200" ]; then
  echo " ✅ Health endpoint returns 200"
elif [ -n "$HTTP_CODE" ] && [ "$HTTP_CODE" != "000" ]; then
  echo " ⚠️ Health endpoint returns $HTTP_CODE"
else
  echo " ❌ No HTTP response"
fi

# 4. Response Time (only meaningful if step 3 got a response)
echo "4. Response Time..."
RESPONSE_TIME=$(curl -s -o /dev/null -w "%{time_total}" --connect-timeout "$TIMEOUT" "$API_URL/health" 2>/dev/null)
# bc prints 1 when the comparison holds; fall back to 999 if curl printed nothing.
if (( $(echo "${RESPONSE_TIME:-999} < 1" | bc -l) )); then
  echo " ✅ Response time: ${RESPONSE_TIME}s"
else
  echo " ⚠️ Slow response: ${RESPONSE_TIME}s"
fi
echo
echo "=== Check Complete ==="
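The script derives the host and port from `API_URL` with `sed`, `cut`, and a default port. As an alternative sketch, the same split can be done with shell parameter expansion alone, which avoids spawning extra processes. `parse_url` is a hypothetical helper; the 443/80 defaults follow the usual https/http convention, and IPv6 literals and `user:pass@host` forms are deliberately not handled.

```shell
#!/usr/bin/env bash
# parse_url URL
# Prints "HOST PORT" for an http(s) URL.
parse_url() {
  local url=$1 scheme rest hostport host port
  scheme=${url%%://*}      # text before "://"
  rest=${url#*://}         # text after "://"
  hostport=${rest%%/*}     # drop the path
  host=${hostport%%:*}     # drop ":port" if present
  if [ "$hostport" != "$host" ]; then
    port=${hostport##*:}
  elif [ "$scheme" = "https" ]; then
    port=443
  else
    port=80
  fi
  echo "$host $port"
}
```

A caller can split the result with `read -r HOST PORT <<< "$(parse_url "$API_URL")"` and pass both values to `nc`.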
Decision Tree
API Not Responding
│
├─ Can you ping the server?
│ ├─ NO → Check network/DNS
│ └─ YES ↓
│
├─ Is the service running?
│ ├─ NO → Start/restart service
│ └─ YES ↓
│
├─ Is the port listening?
│ ├─ NO → Check service configuration
│ └─ YES ↓
│
├─ Does health check pass?
│ ├─ NO → Check dependencies (DB, cache)
│ └─ YES ↓
│
└─ Check application logs for errors
Database Connection Issues
Troubleshooting: Database Connection Failed
- Application shows "Connection refused" or "Connection timed out"
- Error: "FATAL: too many connections for role"
- Error: "FATAL: password authentication failed"
1. Verify Database is Running
sudo systemctl status postgresql
pg_isready -h localhost -p 5432
sudo systemctl status mysql
mysqladmin -u root -p ping
2. Test Connection
psql -h localhost -U username -d database -c "SELECT 1"
mysql -h localhost -u username -p -e "SELECT 1"
3. Check Connection Count
sql
-- PostgreSQL
SELECT count(*) FROM pg_stat_activity;
SELECT setting AS max_connections FROM pg_settings WHERE name = 'max_connections';
-- MySQL
SHOW STATUS LIKE 'Threads_connected';
SHOW VARIABLES LIKE 'max_connections';
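The two numbers from the queries above are most useful as a ratio. The sketch below computes the percentage of connection slots in use and signals when it crosses a threshold. `conn_usage` is a hypothetical helper, and the default threshold of 80 percent is an arbitrary value chosen for illustration.

```shell
#!/usr/bin/env bash
# conn_usage CURRENT MAX [WARN_PCT]
# Prints the percentage of connection slots in use (integer division) and
# returns 1 when it is at or above WARN_PCT (default 80).
conn_usage() {
  local current=$1 max=$2 warn=${3:-80} pct
  pct=$(( current * 100 / max ))
  echo "${pct}%"
  [ "$pct" -lt "$warn" ]
}
```

On PostgreSQL the inputs could come from `psql -At -c 'SELECT count(*) FROM pg_stat_activity'` and `psql -At -c 'SHOW max_connections'`. A non-zero return suggests checking for connection leaks before raising the limit.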
Solution 1: Restart Connection Pool
If using PgBouncer
sudo systemctl restart pgbouncer
Application restart
sudo systemctl restart myapp
Solution 2: Clear Idle Connections
sql
-- PostgreSQL: Kill idle connections older than 10 minutes
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
AND state_change < NOW() - INTERVAL '10 minutes';
Solution 3: Increase Max Connections
sql
-- PostgreSQL (requires restart)
ALTER SYSTEM SET max_connections = 200;
-- MySQL (takes effect immediately, but is lost on restart unless also set in the config file)
SET GLOBAL max_connections = 200;
Quality Assurance Checklist
Pre-Publication Review
Troubleshooting Guide Quality Checklist
- Start with symptoms: the user should quickly see whether the guide applies
- Simplest fix first: check obvious causes before complex diagnostics
- Include verification steps: how to tell that the problem is solved
- Document rollback: a way to revert if the fix does not help
- State the time: the user should know how long each step takes
- Test on newcomers: the guide should work for people who do not know the system
- Update regularly: an outdated guide is worse than no guide
- Include an escalation path: when to escalate and to whom