Loading...
Loading...
Эксперт по troubleshooting гайдам. Используй для создания диагностических процедур, FAQ и решения проблем.
npx skill4agent add dengineproblem/agents-monorepo troubleshooting-guidetroubleshooting_principles:
- principle: "Start with clear problem statements and symptoms"
reason: "Users need to quickly identify if guide applies to their issue"
- principle: "Use If-Then logic flows for decision trees"
reason: "Systematic elimination of possible causes"
- principle: "Organize solutions by likelihood and impact"
reason: "Try simple/common fixes first, escalate to complex"
- principle: "Follow logical diagnostic sequence (simple to complex)"
reason: "Minimize time to resolution"
- principle: "Include verification steps after each fix"
reason: "Confirm the issue is actually resolved"
- principle: "Provide rollback instructions"
reason: "Allow safe recovery if fix causes new issues"# Troubleshooting: [Problem Title]
**Last Updated:** [Date]
**Applies To:** [Product/Service/Version]
**Difficulty:** Beginner | Intermediate | Advanced
**Time Estimate:** X-Y minutes
---
## Problem Statement
### Symptoms
Users experiencing this issue will observe:
- [ ] Symptom 1 (observable behavior)
- [ ] Symptom 2 (error message or code)
- [ ] Symptom 3 (system state)
### Error Messages
### Affected Components
- Component A
- Component B
### Impact
- **Severity:** Critical | High | Medium | Low
- **Affected Users:** All users | Specific group | Single user
- **Business Impact:** [Description]
---
## Quick Checks (2-5 minutes)
Before diving into detailed troubleshooting, verify these common causes:
### Check 1: [Most Common Cause]
**Time:** 30 seconds
```bash
# Command to verify
[diagnostic command]# Commands to gather diagnostic info
[command 1]
[command 2]Start
│
├─ Is [condition A] true?
│ ├─ YES → Go to Solution A
│ └─ NO → Continue
│
├─ Is [condition B] true?
│ ├─ YES → Go to Solution B
│ └─ NO → Continue
│
└─ None of the above → Escalate to Support[command][verification command][rollback command]| Date | Author | Changes |
|---|---|---|
| [Date] | [Name] | Initial version |
| [Date] | [Name] | Added Solution C |
---
## Diagnostic Patterns
### Layer-by-Layer Approach
```markdown
## Network Connectivity Troubleshooting
### Layer 1: Physical
- [ ] Check cable connections
- [ ] Verify link lights are active
- [ ] Test with known-good cable
### Layer 2: Data Link
- [ ] Verify MAC address is learned
- [ ] Check for VLAN misconfigurations
- [ ] Review spanning tree state
### Layer 3: Network
- [ ] Verify IP configuration
- [ ] Test ping to gateway
- [ ] Check routing table
### Layer 4: Transport
- [ ] Verify service is listening on correct port
- [ ] Check firewall rules
- [ ] Test with telnet/nc to port
### Layer 7: Application
- [ ] Check application logs
- [ ] Verify configuration files
- [ ] Test with curl/wget## Identifying Faulty Component
Use binary search to isolate the issue:
### Step 1: Test Midpoint
Test the system at the midpoint of the data flow:
**If working at midpoint:** Issue is between midpoint and client
**If failing at midpoint:** Issue is between midpoint and database
### Step 2: Narrow Down
Repeat the process, testing the midpoint of the remaining segment.
### Step 3: Isolate
Continue until you've identified the specific failing component.## Application Not Responding
undefinedlinux_logs:
system:
- /var/log/syslog
- /var/log/messages
- journalctl -xe
application:
- /var/log/[app-name]/
- ~/.pm2/logs/
- docker logs [container]
web_server:
nginx:
- /var/log/nginx/error.log
- /var/log/nginx/access.log
apache:
- /var/log/apache2/error.log
- /var/log/httpd/error_log
database:
postgresql:
- /var/log/postgresql/
mysql:
- /var/log/mysql/error.log# Find errors in last 100 lines
tail -100 /var/log/app.log | grep -i error
# Find errors with timestamp
grep -i error /var/log/app.log | tail -50
# Watch log in real-time
tail -f /var/log/app.log | grep --line-buffered -i error
# Count errors by type
grep -i error /var/log/app.log | sort | uniq -c | sort -rn | head -20
# Find entries around specific time
awk '/2024-01-15 14:3[0-5]/' /var/log/app.log
# Extract specific fields (JSON logs)
cat /var/log/app.json | jq 'select(.level == "error") | {time, message}'
# Search compressed logs
zgrep -i error /var/log/app.log.*.gz## Common Error Patterns
### Connection Errors
### Memory Errors
### Disk Errors
### Permission Errors
### Database Errorsundefined# Troubleshooting: API Not Responding
## Quick Diagnosis Script
```bash
#!/bin/bash
# api-health-check.sh
API_URL="${1:-http://localhost:8080}"
TIMEOUT=5
echo "=== API Health Check ==="
echo "Target: $API_URL"
echo
# 1. DNS Resolution
echo "1. DNS Resolution..."
if host=$(dig +short $(echo $API_URL | sed 's|.*://||' | cut -d'/' -f1 | cut -d':' -f1) 2>/dev/null); then
echo " ✅ DNS resolves to: $host"
else
echo " ❌ DNS resolution failed"
fi
# 2. Port Connectivity
echo "2. Port Connectivity..."
PORT=$(echo $API_URL | grep -oP ':\K[0-9]+' || echo "80")
HOST=$(echo $API_URL | sed 's|.*://||' | cut -d'/' -f1 | cut -d':' -f1)
if nc -z -w $TIMEOUT $HOST $PORT 2>/dev/null; then
echo " ✅ Port $PORT is open"
else
echo " ❌ Port $PORT is not reachable"
fi
# 3. HTTP Response
echo "3. HTTP Response..."
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" --connect-timeout $TIMEOUT "$API_URL/health" 2>/dev/null)
if [ "$HTTP_CODE" = "200" ]; then
echo " ✅ Health endpoint returns 200"
elif [ -n "$HTTP_CODE" ] && [ "$HTTP_CODE" != "000" ]; then
echo " ⚠️ Health endpoint returns $HTTP_CODE"
else
echo " ❌ No HTTP response"
fi
# 4. Response Time
echo "4. Response Time..."
RESPONSE_TIME=$(curl -s -o /dev/null -w "%{time_total}" --connect-timeout $TIMEOUT "$API_URL/health" 2>/dev/null)
if (( $(echo "$RESPONSE_TIME < 1" | bc -l) )); then
echo " ✅ Response time: ${RESPONSE_TIME}s"
else
echo " ⚠️ Slow response: ${RESPONSE_TIME}s"
fi
echo
echo "=== Check Complete ==="API Not Responding
│
├─ Can you ping the server?
│ ├─ NO → Check network/DNS
│ └─ YES ↓
│
├─ Is the service running?
│ ├─ NO → Start/restart service
│ └─ YES ↓
│
├─ Is the port listening?
│ ├─ NO → Check service configuration
│ └─ YES ↓
│
├─ Does health check pass?
│ ├─ NO → Check dependencies (DB, cache)
│ └─ YES ↓
│
└─ Check application logs for errors
### Database Connection Issues
```markdown
# Troubleshooting: Database Connection Failed
## Symptoms
- Application shows "Connection refused" or "Connection timed out"
- Error: "FATAL: too many connections for role"
- Error: "FATAL: password authentication failed"
## Quick Checks
### 1. Verify Database is Running
```bash
# PostgreSQL
sudo systemctl status postgresql
pg_isready -h localhost -p 5432
# MySQL
sudo systemctl status mysql
mysqladmin -u root -p ping# PostgreSQL
psql -h localhost -U username -d database -c "SELECT 1"
# MySQL
mysql -h localhost -u username -p -e "SELECT 1"-- PostgreSQL
SELECT count(*) FROM pg_stat_activity;
SELECT max_connections FROM pg_settings WHERE name = 'max_connections';
-- MySQL
SHOW STATUS LIKE 'Threads_connected';
SHOW VARIABLES LIKE 'max_connections';# If using PgBouncer
sudo systemctl restart pgbouncer
# Application restart
sudo systemctl restart myapp-- PostgreSQL: Kill idle connections older than 10 minutes
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
AND state_change < NOW() - INTERVAL '10 minutes';-- PostgreSQL (requires restart)
ALTER SYSTEM SET max_connections = 200;
-- MySQL (can be done live)
SET GLOBAL max_connections = 200;
---
## Quality Assurance Checklist
### Pre-Publication Review
```markdown
## Troubleshooting Guide Quality Checklist
### Accuracy
- [ ] All commands tested and verified
- [ ] Output examples are accurate
- [ ] Links to resources are valid
- [ ] Version numbers are current
### Completeness
- [ ] All common causes covered
- [ ] Rollback instructions provided
- [ ] Escalation path defined
- [ ] Prevention tips included
### Usability
- [ ] Clear success/failure criteria
- [ ] Time estimates accurate
- [ ] Difficulty levels appropriate
- [ ] Tested by someone unfamiliar with issue
### Formatting
- [ ] Consistent heading structure
- [ ] Code blocks properly formatted
- [ ] Decision trees clear
- [ ] Screenshots/diagrams where helpful
### Maintenance
- [ ] Last updated date included
- [ ] Revision history maintained
- [ ] Owner/contact identified
- [ ] Review schedule established