proxmox-admin
Proxmox VE Operations Expertise
AI Agent Skill: Operational knowledge for managing Proxmox Virtual Environment infrastructure
Overview
This skill provides AI agents with operational expertise for Proxmox VE, covering:
- VM and LXC lifecycle management - From creation to decommissioning
- Storage operations - Configuration, content management, backup strategies
- High Availability - HA groups, resource management, failover
- Cluster operations - Multi-node management, migration, replication
- Certificate management - Installation, renewal, ACME integration
- ACME configuration - Provider setup, certificate ordering, automation
- Notifications - Target configuration, delivery verification, alerting
- Troubleshooting - Common issues, API quirks, resolution patterns
- Security - Permission models, API token best practices
- Performance - Monitoring, resource optimization
Target audience: AI agents performing day-to-day Proxmox operations, infrastructure automation, or incident response.
Architecture Overview
Proxmox VE Cluster Concepts
Node: Physical server running Proxmox VE
- Hosts VMs and LXC containers
- Provides local storage
- Participates in cluster quorum
Storage: Shared or local storage backends
- Types: Directory, LVM, ZFS, Ceph, NFS, iSCSI
- Content types: Images, ISOs, backups, templates
- Can be node-local or cluster-shared
Networking: Virtual networking infrastructure
- Linux bridges for VM/LXC connectivity
- VLANs for network segmentation
- SDN (Software Defined Networking) for advanced scenarios
Cluster: Group of nodes working together
- Shared configuration via pmxcfs
- HA for automatic failover
- Live migration between nodes
Operations Playbook
1. VM Lifecycle Management
Create → Configure → Monitor → Backup → Delete
Creation
1. Get next available VMID
2. Create VM with basic config (CPU, memory, OS type)
3. Add disk(s) from storage
4. Configure network interface(s)
5. Set boot order
6. Start VM
Key considerations:
- Choose appropriate storage for disk (performance vs capacity)
- Use virtio drivers for best performance (requires guest support)
- Configure QEMU guest agent for better management
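The creation steps above can be sketched as a small command builder. This is an illustration under assumptions, not part of the skill itself: it assumes the standard `qm` and `pvesh` CLIs on a node, and the storage name `local-lvm`, the bridge `vmbr0`, the sizes, and the function names are placeholders chosen for the example. Nothing is executed; the functions only return commands.

```python
import shlex


def build_qm_create(vmid: int, name: str, storage: str = "local-lvm",
                    disk_gb: int = 32, bridge: str = "vmbr0",
                    cores: int = 2, memory_mb: int = 2048) -> list:
    """Return the argv for one `qm create` call covering steps 2-5."""
    return [
        "qm", "create", str(vmid),
        "--name", name,
        "--ostype", "l26",                    # Linux guest with a 2.6+ kernel
        "--cores", str(cores),
        "--memory", str(memory_mb),
        "--scsihw", "virtio-scsi-pci",        # virtio SCSI controller
        "--scsi0", f"{storage}:{disk_gb}",    # allocate a new disk of disk_gb GiB
        "--net0", f"virtio,bridge={bridge}",  # virtio NIC attached to a Linux bridge
        "--boot", "order=scsi0",
        "--agent", "enabled=1",               # expect a QEMU guest agent in the guest
    ]


def creation_plan(vmid: int, name: str) -> list:
    """Steps 1-6 as shell command strings, in order."""
    return [
        "pvesh get /cluster/nextid",          # step 1: ask the cluster for a free VMID
        shlex.join(build_qm_create(vmid, name)),
        f"qm start {vmid}",                   # step 6
    ]
```

In practice the VMID returned by the first command would be fed into the second; the plan keeps them as separate strings so each step can be reviewed before it runs.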
Configuration
1. Review current config
2. Resize resources (CPU, memory, disk) as needed
3. Add/remove network interfaces
4. Configure firewall rules
5. Set up snapshots for rollback capability
Best practices:
- Snapshot before major changes
- Use cloud-init for automated provisioning
- Enable QEMU guest agent for graceful operations
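As one way to apply "snapshot before major changes", the sketch below pairs a `qm snapshot` call with a `qm resize` call. The snapshot naming scheme and the function name are assumptions made for this example; only the two `qm` subcommands come from the Proxmox CLI.

```python
from datetime import datetime, timezone


def snapshot_then_resize(vmid: int, disk: str, grow_by: str, now=None) -> list:
    """Return two argv lists: take a snapshot, then grow a disk.

    `grow_by` uses the `qm resize` size syntax, e.g. "+10G".
    """
    now = now or datetime.now(timezone.utc)
    # Hypothetical naming scheme: letters, digits and underscores only.
    snap = "pre_change_" + now.strftime("%Y%m%d_%H%M%S")
    return [
        ["qm", "snapshot", str(vmid), snap, "--description", f"before resizing {disk}"],
        ["qm", "resize", str(vmid), disk, grow_by],
    ]
```

Rolling back would then be `qm rollback <vmid> <snapshot name>`, using the name generated here.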
Monitoring
1. Check VM status (running, stopped, paused)
2. Monitor resource usage (CPU, memory, disk I/O)
3. Review task history for recent operations
4. Check logs for errors or warnings
Metrics to watch:
- CPU usage and steal time
- Memory pressure and swap usage
- Disk I/O wait times
- Network throughput
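A rough sketch of a status check follows. It assumes a dictionary shaped like the data returned by the API's `/nodes/{node}/qemu/{vmid}/status/current` endpoint (`status`, `cpu` as a 0-1 fraction, `mem` and `maxmem` in bytes); the thresholds and the function name are illustrative. It covers only CPU and memory; steal time, swap, and I/O wait need other data sources.

```python
def summarize_vm_status(status: dict, mem_warn: float = 0.9,
                        cpu_warn: float = 0.85) -> list:
    """Return human-readable warnings for one VM status record."""
    state = status.get("status", "unknown")
    if state != "running":
        # Resource figures are not meaningful for a VM that is not running.
        return [f"VM is {state}"]
    warnings = []
    maxmem = status.get("maxmem") or 0
    if maxmem and status.get("mem", 0) / maxmem >= mem_warn:
        warnings.append("memory usage above threshold")
    if status.get("cpu", 0.0) >= cpu_warn:
        warnings.append("CPU usage above threshold")
    return warnings
```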
Backup
1. Create snapshot for quick rollback
2. Schedule backup job (vzdump)
3. Verify backup completed successfully
4. Test restore periodically
5. Prune old backups to manage space
Backup strategies:
- Snapshot mode: Lowest downtime; the guest keeps running (containers need snapshot-capable storage)
- Suspend mode: Pauses VM during backup
- Stop mode: Stops VM for consistent backup
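The three modes map directly onto the `--mode` option of `vzdump`. The builder below is a minimal sketch; the default compression and the function name are choices made for this example.

```python
VALID_MODES = ("snapshot", "suspend", "stop")


def build_vzdump(vmid: int, storage: str, mode: str = "snapshot",
                 compress: str = "zstd") -> list:
    """Return the argv for a one-off backup of a single guest."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown backup mode: {mode!r}")
    return [
        "vzdump", str(vmid),
        "--mode", mode,
        "--storage", storage,    # target storage must allow the 'backup' content type
        "--compress", compress,
    ]
```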
Decommissioning
1. Create final backup
2. Stop VM gracefully
3. Remove from HA if configured
4. Delete VM and associated disks
5. Clean up firewall rules
6. Update documentation
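The first four decommissioning steps can be expressed as an ordered command plan. This sketch assumes the `vzdump`, `qm`, and `ha-manager` CLIs; the function name and default timeout are illustrative. Cleaning up cluster-level firewall rules and updating documentation remain manual steps.

```python
def decommission_plan(vmid: int, backup_storage: str, ha_managed: bool,
                      shutdown_timeout: int = 120) -> list:
    """Return argv lists in the order they should run."""
    plan = [
        # 1. Final backup while the VM still exists.
        ["vzdump", str(vmid), "--storage", backup_storage, "--mode", "snapshot"],
        # 2. Graceful shutdown; the guest gets shutdown_timeout seconds.
        ["qm", "shutdown", str(vmid), "--timeout", str(shutdown_timeout)],
    ]
    if ha_managed:
        # 3. Take the resource out of HA management.
        plan.append(["ha-manager", "remove", f"vm:{vmid}"])
    # 4. Delete the VM and its disks; --purge also drops job references.
    plan.append(["qm", "destroy", str(vmid), "--purge", "1"])
    return plan
```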
2. LXC Container Management
Containers vs VMs:
- Lighter weight (shared kernel)
- Faster startup times
- Lower overhead
- Less isolation than VMs
Container Operations
1. Create from template
2. Configure resources (CPU, memory, swap)
3. Add mount points for storage
4. Configure network
5. Start container
6. Access via console or SSH
Key differences from VMs:
- Use mp0, mp1 for mount points (not disk0, disk1)
- No BIOS/UEFI configuration
- Direct kernel access (privileged) or restricted (unprivileged)
- Faster snapshot/restore operations
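To make the mount-point naming concrete, here is a hedged sketch of a `pct create` builder. The storage name, sizes, network settings, and function name are placeholders for this example; the template argument is whatever volume ID the template has on your storage.

```python
def build_pct_create(vmid: int, template: str, hostname: str,
                     storage: str = "local-lvm", rootfs_gb: int = 8,
                     mounts=None, unprivileged: bool = True) -> list:
    """Return the argv for creating a container from a template.

    `mounts` is a list of (size_gb, path_inside_container) tuples.
    """
    cmd = [
        "pct", "create", str(vmid), template,
        "--hostname", hostname,
        "--rootfs", f"{storage}:{rootfs_gb}",
        "--memory", "1024",
        "--swap", "512",
        "--net0", "name=eth0,bridge=vmbr0,ip=dhcp",
        "--unprivileged", "1" if unprivileged else "0",
    ]
    # Extra volumes are numbered mp0, mp1, ... rather than disk0, disk1.
    for index, (size_gb, path) in enumerate(mounts or []):
        cmd += [f"--mp{index}", f"{storage}:{size_gb},mp={path}"]
    return cmd
```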
3. Storage Management
Storage Configuration
1. List available storage
2. Add new storage backend (NFS, Ceph, etc.)
3. Configure content types (images, backups, ISOs)
4. Set storage as default for specific content
5. Monitor storage usage
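Steps 2 and 3 might look like the following for an NFS backend. This is a sketch: the set of content types is a subset known to the author of this example and may not match every release, and the function name is hypothetical.

```python
# Content type identifiers accepted by --content (subset; may vary by release).
KNOWN_CONTENT = {"images", "rootdir", "vztmpl", "backup", "iso", "snippets"}


def build_add_nfs(storage_id: str, server: str, export: str, content: list) -> list:
    """Return the argv for registering an NFS export as a storage."""
    unknown = sorted(set(content) - KNOWN_CONTENT)
    if unknown:
        raise ValueError(f"unknown content types: {unknown}")
    return [
        "pvesm", "add", "nfs", storage_id,
        "--server", server,
        "--export", export,
        "--content", ",".join(content),
    ]
```

`pvesm status` then lists every configured storage with its usage, which covers steps 1 and 5.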
Content Management
1. Upload ISOs/templates to storage
2. Download from URL to storage
3. List storage content
4. Delete unused content
5. Restore files from backups
Backup Management
1. Create backup jobs (manual or scheduled)
2. Configure retention policy
3. Prune old backups automatically
4. Restore from backup
5. Verify backup integrity
Backup best practices:
- Use compression to save space
- Store backups on separate storage
- Test restore procedures regularly
- Document backup schedules
- Monitor backup job success/failure
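A retention policy is typically passed to `vzdump` or a storage definition as a `prune-backups` string. The helper below is a small sketch that assembles such a string; the function name and the chosen subset of keep options are assumptions of this example.

```python
def prune_backups_option(keep_last: int = 0, keep_daily: int = 0,
                         keep_weekly: int = 0, keep_monthly: int = 0) -> str:
    """Build a value such as 'keep-last=3,keep-daily=7'."""
    pairs = (
        ("keep-last", keep_last),
        ("keep-daily", keep_daily),
        ("keep-weekly", keep_weekly),
        ("keep-monthly", keep_monthly),
    )
    parts = [f"{key}={value}" for key, value in pairs if value > 0]
    if not parts:
        # Refuse to emit an empty policy rather than guess the intent.
        raise ValueError("at least one keep-* value must be positive")
    return ",".join(parts)
```

For example, `vzdump 101 --storage backups --prune-backups keep-last=3,keep-daily=7` would keep the three newest backups plus one per day for a week.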
4. High Availability (HA)
HA Configuration
1. Create HA group (define node priorities)
2. Add VM/LXC to HA management
3. Configure HA settings (max relocate, max restart)
4. Monitor HA status
5. Test failover scenarios
HA States:
- started: Resource running on assigned node
- stopped: Resource intentionally stopped
- fence: Waiting for the failed node to be fenced; the resource is then recovered on another node
- error: HA manager encountered an error
When to use HA:
- Critical services requiring high uptime
- Automatic failover needed
- Cluster has 3+ nodes (for quorum)
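The 3+ node guideline follows from majority quorum: with one vote per node (the default, assumed here), a two-node cluster cannot lose a node and stay quorate. The sketch below shows that arithmetic along with the `ha-manager` calls for steps 1-3. It assumes a release that still uses HA groups; function names are illustrative.

```python
def votes_needed(total_nodes: int) -> int:
    """Strict majority, assuming one vote per node."""
    return total_nodes // 2 + 1


def tolerated_failures(total_nodes: int) -> int:
    """How many nodes can fail while the rest stay quorate."""
    return total_nodes - votes_needed(total_nodes)


def build_groupadd(group: str, priorities: dict) -> list:
    """HA group with node priorities, e.g. {'node1': 2, 'node2': 1}."""
    nodes = ",".join(f"{node}:{prio}" for node, prio in priorities.items())
    return ["ha-manager", "groupadd", group, "--nodes", nodes]


def build_ha_add(vmid: int, group: str, max_restart: int = 1,
                 max_relocate: int = 1) -> list:
    """Put a VM under HA management within a group."""
    return [
        "ha-manager", "add", f"vm:{vmid}",
        "--group", group,
        "--max_restart", str(max_restart),
        "--max_relocate", str(max_relocate),
    ]
```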
5. Migration
Live Migration (Online)
1. Verify target node has resources
2. Check shared storage access
3. Initiate migration
4. Monitor migration progress
5. Verify VM running on new node
Requirements:
- Shared storage for VM disks
- Network connectivity between nodes
- Compatible CPU types (or CPU flags masked)
Offline Migration
1. Stop VM/LXC
2. Migrate to target node
3. Start on new node
Use cases:
- No shared storage available
- Maintenance on source node
- CPU incompatibility
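The choice between live and offline migration can be sketched as a guard around `qm migrate`. Following the requirements listed above, this example refuses a live migration when the disks are not on shared storage; the function name and the strictness of that guard are assumptions of the sketch.

```python
def build_migrate(vmid: int, target_node: str, live: bool,
                  shared_storage: bool) -> list:
    """Return the argv for migrating a VM to target_node."""
    if live and not shared_storage:
        # Mirrors the stated requirement; fall back to offline migration instead.
        raise ValueError("live migration requested without shared storage")
    cmd = ["qm", "migrate", str(vmid), target_node]
    if live:
        cmd.append("--online")  # keep the VM running during the move
    return cmd
```

For an offline migration the VM is stopped first, migrated with the command returned for `live=False`, and started again on the target node.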
Troubleshooting Guide
Common Issues
1. VM Won't Start
Symptoms: Start operation fails or VM immediately stops
Causes:
- Insufficient resources on node
- Storage unavailable
- Stale lock left on the VM by an interrupted operation (e.g. backup, migration, snapshot)
- Configuration error
Resolution:
1. Check node resources (memory, CPU)
2. Verify storage is mounted and accessible
3. Clear the lock if it is stale (qm unlock <vmid>)
4. Review VM config for errors
5. Check logs: /var/log/pve/tasks/
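For step 3, a lock usually shows up as a `lock:` line in the output of `qm config <vmid>`. The helper below is a sketch that extracts it from that text, assuming the usual `key: value` layout; the function names are hypothetical, and whether a lock is actually stale still needs a human or task-log check before `qm unlock` is run.

```python
from typing import Optional


def find_lock(config_text: str) -> Optional[str]:
    """Return the lock type (e.g. 'backup') or None if the VM is not locked."""
    for line in config_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "lock":
            return value.strip()
    return None


def diagnostic_commands(vmid: int) -> list:
    """Read-only commands matching resolution steps 2-4."""
    return [
        ["pvesm", "status"],            # is every storage active?
        ["qm", "config", str(vmid)],    # review config, look for a lock line
    ]
```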
2. Migration Fails
Symptoms: Migration operation errors or times out
Causes:
- Network connectivity issues
- Storage not shared
- CPU incompatibility
- Insufficient resources on target
Resolution:
1. Verify network between nodes
2. Check storage is accessible from both nodes
3. Review CPU flags compatibility
4. Ensure target node has capacity
5. Try offline migration if live fails
3. Backup Job Fails
Symptoms: Backup task shows error status
Causes:
- Insufficient storage space
- VM locked by another operation
- Snapshot creation failed
- Network timeout (for remote storage)
Resolution:
1. Check storage space availability
2. Verify no other operations running on VM
3. Try manual backup to isolate issue
4. Review backup job logs
5. Prune old backups to free space
4. HA Failover Not Working
Symptoms: VM doesn't restart on another node after failure
Causes:
- Cluster quorum lost
- HA service not running
- Fencing not configured
- All nodes in HA group unavailable
Resolution:
1. Check cluster quorum status
2. Verify HA service running on all nodes
3. Review HA group configuration
4. Check fencing configuration
5. Manually start VM if needed
5. Storage Performance Issues
Symptoms: Slow VM performance, high I/O wait
Causes:
- Storage backend overloaded
- Network bottleneck (for remote storage)
- Disk cache settings suboptimal
- Too many VMs on same storage
Resolution:
1. Monitor storage backend performance
2. Check network throughput to storage
3. Adjust VM disk cache settings
4. Distribute VMs across multiple storage
5. Consider faster storage tier
More troubleshooting: See proxmox-troubleshooting.md
Security Best Practices
API Token Management
Token creation:
- Create dedicated user for automation
- Assign minimal required permissions
- Generate API token (not password)
- Store token securely (environment variables, secrets manager)
- Rotate tokens periodically
Permission model:
- Use roles to group permissions
- Assign roles to users/tokens
- Follow principle of least privilege
- Audit permission usage regularly
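As an illustration of keeping the token out of code, the sketch below reads it from environment variables and builds the `PVEAPIToken` authorization header used by the Proxmox API. The environment variable names and the function name are hypothetical choices for this example.

```python
import os
from typing import Mapping


def token_auth_header(env: Mapping = os.environ) -> dict:
    """Build the Authorization header for API token authentication.

    Expects (hypothetical names):
      PVE_TOKEN_USER   e.g. automation@pve
      PVE_TOKEN_ID     the token name chosen at creation
      PVE_TOKEN_SECRET the secret shown once when the token was created
    """
    user = env["PVE_TOKEN_USER"]
    token_id = env["PVE_TOKEN_ID"]
    secret = env["PVE_TOKEN_SECRET"]
    # Header format: PVEAPIToken=<user>@<realm>!<token id>=<secret>
    return {"Authorization": f"PVEAPIToken={user}!{token_id}={secret}"}
```

A missing variable raises `KeyError`, which fails fast instead of sending an unauthenticated request.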
Access Control
User management:
- Use realms (PAM, LDAP, AD) for authentication
- Create groups for role-based access
- Assign users to groups
- Review access periodically
Network security:
- Restrict API access by IP (firewall rules)
- Use SSL/TLS for API connections
- Enable two-factor authentication for users
- Monitor authentication logs
Performance Optimization
Resource Allocation
CPU:
- Don't overcommit CPU cores excessively
- Use CPU limits for non-critical VMs
- Pin CPUs for latency-sensitive workloads
- Monitor CPU steal time
Memory:
- Enable ballooning for dynamic allocation
- Set appropriate memory limits
- Monitor swap usage (should be minimal)
- Use hugepages for large memory VMs
Disk I/O:
- Use virtio-scsi for best performance
- Enable discard/TRIM for SSDs
- Configure appropriate I/O scheduler
- Monitor disk latency and throughput
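Several of these settings can be applied with a single `qm set` call. The following is a sketch only: the volume ID, the choice of `scsi0`, and the function name are placeholders, and suitable values depend on the workload.

```python
def build_tuning(vmid: int, volume: str, balloon_mb=None, cpulimit=None) -> list:
    """Return the argv for applying disk, memory and CPU tuning options.

    `volume` is the existing volume ID of the disk, e.g. 'local-lvm:vm-101-disk-0'.
    """
    cmd = [
        "qm", "set", str(vmid),
        "--scsihw", "virtio-scsi-pci",
        # Re-declare the disk with discard (TRIM) and SSD emulation enabled.
        "--scsi0", f"{volume},discard=on,ssd=1",
    ]
    if balloon_mb is not None:
        cmd += ["--balloon", str(balloon_mb)]   # balloon target in MiB
    if cpulimit is not None:
        cmd += ["--cpulimit", str(cpulimit)]    # cap on CPU time, in cores
    return cmd
```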
Monitoring Strategy
Key metrics:
- Node CPU, memory, disk usage
- VM resource consumption
- Storage performance (IOPS, latency)
- Network throughput
- Task completion times
Monitoring tools:
- Built-in Proxmox metrics (RRD data)
- External monitoring (Prometheus, Grafana)
- Log aggregation (syslog, ELK stack)
- Alerting for critical thresholds
Operational Workflows
For detailed step-by-step workflows, see:
- proxmox-workflows.md - Common operational patterns
For troubleshooting details, see:
- proxmox-troubleshooting.md - API quirks and solutions
Quick Reference
VM States
- running: VM is powered on
- stopped: VM is powered off
- paused: VM execution suspended
- suspended: VM state saved to disk
Storage Types
- dir: Directory-based storage
- lvm: LVM volume groups
- zfspool: Local ZFS pools
- rbd: Ceph RBD
- nfs: NFS shares
- iscsi: iSCSI targets
Backup Modes
- snapshot: Lowest downtime; containers need snapshot-capable storage
- suspend: Pauses VM during backup
- stop: Stops VM for backup
HA States
- started: Running on assigned node
- stopped: Intentionally stopped
- fence: Waiting for node fencing, then recovery elsewhere
- error: HA manager error
License
MIT License - Part of @bldg-7/proxmox-mcp package