proxmox-admin

Proxmox VE Operations Expertise

AI Agent Skill: Operational knowledge for managing Proxmox Virtual Environment infrastructure

Overview

This skill provides AI agents with operational expertise for Proxmox VE, covering:

VM and LXC lifecycle management - From creation to decommissioning
Storage operations - Configuration, content management, backup strategies
High Availability - HA groups, resource management, failover
Cluster operations - Multi-node management, migration, replication
Certificate management - Installation, renewal, ACME integration
ACME configuration - Provider setup, certificate ordering, automation
Notifications - Target configuration, delivery verification, alerting
Troubleshooting - Common issues, API quirks, resolution patterns
Security - Permission models, API token best practices
Performance - Monitoring, resource optimization

Target audience: AI agents performing day-to-day Proxmox operations, infrastructure automation, or incident response.

Architecture Overview

Proxmox VE Cluster Concepts

Node: Physical server running Proxmox VE

Hosts VMs and LXC containers
Provides local storage
Participates in cluster quorum

Storage: Shared or local storage backends

Types: Directory, LVM, ZFS, Ceph, NFS, iSCSI
Content types: Images, ISOs, backups, templates
Can be node-local or cluster-shared

Networking: Virtual networking infrastructure

Linux bridges for VM/LXC connectivity
VLANs for network segmentation
SDN (Software Defined Networking) for advanced scenarios

Cluster: Group of nodes working together

Shared configuration via pmxcfs
HA for automatic failover
Live migration between nodes

Operations Playbook

1. VM Lifecycle Management

Create → Configure → Monitor → Backup → Delete

Creation

1. Get next available VMID
2. Create VM with basic config (CPU, memory, OS type)
3. Add disk(s) from storage
4. Configure network interface(s)
5. Set boot order
6. Start VM

Key considerations:

Choose appropriate storage for disk (performance vs capacity)
Use virtio drivers for best performance (requires guest support)
Configure QEMU guest agent for better management
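
The creation steps above can be sketched as a small command builder. This is a minimal illustration, assuming the `qm` CLI is used on the node; the function name `build_vm_create_commands` and all default values (VMID 100, storage `local-lvm`, bridge `vmbr0`) are placeholders rather than part of this skill.

```python
import shlex


def build_vm_create_commands(vmid, name, storage, disk_gb=32,
                             memory_mb=2048, cores=2, bridge="vmbr0"):
    """Return the qm invocations for the creation steps as argv lists.

    Hypothetical helper: it only builds the commands, it does not run them.
    """
    create = [
        "qm", "create", str(vmid),
        "--name", name,
        "--memory", str(memory_mb),
        "--cores", str(cores),
        "--ostype", "l26",                    # Linux guest (2.6+ kernel)
        "--scsihw", "virtio-scsi-pci",        # virtio SCSI controller
        "--scsi0", f"{storage}:{disk_gb}",    # allocate a new disk, size in GiB
        "--net0", f"virtio,bridge={bridge}",  # virtio NIC attached to the bridge
        "--boot", "order=scsi0",              # boot from the new disk
        "--agent", "1",                       # enable the guest agent option
    ]
    start = ["qm", "start", str(vmid)]
    return [create, start]


if __name__ == "__main__":
    # Step 1 (next free VMID) is usually `pvesh get /cluster/nextid`;
    # 100 is only a placeholder here.
    for argv in build_vm_create_commands(100, "demo-vm", "local-lvm"):
        print(shlex.join(argv))
```

Returning argv lists instead of one shell string keeps quoting out of the picture and lets an agent log or review each step before running it.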

Configuration

1. Review current config
2. Resize resources (CPU, memory, disk) as needed
3. Add/remove network interfaces
4. Configure firewall rules
5. Set up snapshots for rollback capability

Best practices:

Snapshot before major changes
Use cloud-init for automated provisioning
Enable QEMU guest agent for graceful operations

Monitoring

1. Check VM status (running, stopped, paused)
2. Monitor resource usage (CPU, memory, disk I/O)
3. Review task history for recent operations
4. Check logs for errors or warnings

Metrics to watch:

CPU usage and steal time
Memory pressure and swap usage
Disk I/O wait times
Network throughput
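
As a rough sketch of the status and resource checks above, the function below turns one status sample into a list of warnings. It assumes a dict shaped like a VM status response with the fields `status`, `cpu` (a fraction between 0 and 1), `mem`, and `maxmem` (bytes); the thresholds and the name `summarize_vm_status` are illustrative assumptions.

```python
def summarize_vm_status(sample, cpu_warn=0.85, mem_warn=0.90):
    """Return human-readable warnings for one VM status sample.

    Thresholds are hypothetical defaults; tune them per workload.
    """
    warnings = []
    if sample["status"] != "running":
        # Resource figures are not meaningful for a VM that is not running.
        warnings.append(f"VM is {sample['status']}")
        return warnings
    if sample["cpu"] >= cpu_warn:
        warnings.append(f"CPU at {sample['cpu']:.0%}")
    mem_ratio = sample["mem"] / sample["maxmem"]
    if mem_ratio >= mem_warn:
        warnings.append(f"memory at {mem_ratio:.0%}")
    return warnings
```

An empty list means nothing crossed a threshold for that sample; steal time, swap, and I/O wait would have to come from inside the guest or from node-level metrics.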

Backup

1. Create snapshot for quick rollback
2. Schedule backup job (vzdump)
3. Verify backup completed successfully
4. Test restore periodically
5. Prune old backups to manage space

Backup strategies:

Snapshot mode: Lowest downtime; VMs are backed up live, while containers need a storage that supports snapshots
Suspend mode: Pauses VM during backup
Stop mode: Stops VM for consistent backup
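
The three modes map onto the `--mode` option of `vzdump`. The sketch below builds such a command and rejects unknown modes; the helper name `build_vzdump_command` and the default `zstd` compression are assumptions made for illustration.

```python
VALID_MODES = ("snapshot", "suspend", "stop")


def build_vzdump_command(vmid, storage, mode="snapshot", compress="zstd"):
    """Build a vzdump argv list for a one-off backup (hypothetical helper)."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown backup mode: {mode!r}")
    return [
        "vzdump", str(vmid),
        "--mode", mode,          # snapshot, suspend, or stop
        "--storage", storage,    # target storage that allows backup content
        "--compress", compress,  # compress the archive to save space
    ]
```

For example, `build_vzdump_command(100, "backup-nfs", mode="stop")` yields a command that trades downtime for a consistent archive.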

Decommissioning

1. Create final backup
2. Stop VM gracefully
3. Remove from HA if configured
4. Delete VM and associated disks
5. Clean up firewall rules
6. Update documentation
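
A possible way to keep the decommissioning order explicit is to generate the commands as an ordered plan. This sketch assumes the `vzdump`, `qm`, and `ha-manager` CLIs; the function name, the 120 second shutdown timeout, and the storage name are placeholders. Steps 5 and 6 (firewall cleanup and documentation) are left as manual follow-ups.

```python
def decommission_plan(vmid, backup_storage, ha_managed=False):
    """Return the ordered CLI steps for retiring a VM (commands are not run)."""
    steps = [
        # 1. Final backup while the VM is still intact
        ["vzdump", str(vmid), "--mode", "snapshot", "--storage", backup_storage],
        # 2. Graceful shutdown, giving the guest time to stop
        ["qm", "shutdown", str(vmid), "--timeout", "120"],
    ]
    if ha_managed:
        # 3. Remove from HA first so the HA manager does not restart the VM
        steps.append(["ha-manager", "remove", f"vm:{vmid}"])
    # 4. Delete the VM and its disks; --purge also drops job references
    steps.append(["qm", "destroy", str(vmid), "--purge", "1"])
    return steps
```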

2. LXC Container Management

Containers vs VMs:

Lighter weight (shared kernel)
Faster startup times
Lower overhead
Less isolation than VMs

Container Operations

1. Create from template
2. Configure resources (CPU, memory, swap)
3. Add mount points for storage
4. Configure network
5. Start container
6. Access via console or SSH

Key differences from VMs:

Use `mp0`, `mp1` for mount points (VM disks instead use keys such as `scsi0` or `virtio0`)
No BIOS/UEFI configuration
Privileged containers run container root as host root; unprivileged containers map it to an unprivileged host user ID (more restricted, and the safer choice)
Faster snapshot/restore operations
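
The container steps can be sketched the same way with `pct`. This is an illustrative builder, not a definitive recipe: the function name, sizes, the `/data` mount path, and the DHCP network setting are all assumptions.

```python
def build_pct_create_command(vmid, template, hostname, storage,
                             rootfs_gb=8, memory_mb=512, swap_mb=512,
                             cores=1, data_gb=None, bridge="vmbr0"):
    """Build a `pct create` argv list (hypothetical helper, nothing is run)."""
    argv = [
        "pct", "create", str(vmid), template,
        "--hostname", hostname,
        "--memory", str(memory_mb),
        "--swap", str(swap_mb),
        "--cores", str(cores),
        "--rootfs", f"{storage}:{rootfs_gb}",            # root disk, size in GiB
        "--net0", f"name=eth0,bridge={bridge},ip=dhcp",  # one NIC on the bridge
        "--unprivileged", "1",                           # unprivileged container
    ]
    if data_gb is not None:
        # Extra volume as mount point mp0, mounted at /data inside the container
        argv += ["--mp0", f"{storage}:{data_gb},mp=/data"]
    return argv
```

The `template` argument is a volume ID on a storage that holds container templates, for example something of the form `local:vztmpl/<template-file>`.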

3. Storage Management

Storage Configuration

1. List available storage
2. Add new storage backend (NFS, Ceph, etc.)
3. Configure content types (images, backups, ISOs)
4. Set storage as default for specific content
5. Monitor storage usage

Content Management

1. Upload ISOs/templates to storage
2. Download from URL to storage
3. List storage content
4. Delete unused content
5. Restore files from backups

Backup Management

1. Create backup jobs (manual or scheduled)
2. Configure retention policy
3. Prune old backups automatically
4. Restore from backup
5. Verify backup integrity

Backup best practices:

Use compression to save space
Store backups on separate storage
Test restore procedures regularly
Document backup schedules
Monitor backup job success/failure
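
To make the retention idea concrete, here is a simplified model of a "keep last N, then keep one per day" policy. It is a sketch for reasoning about which backups a policy would retain, not Proxmox's actual prune algorithm; the function name and parameters are hypothetical.

```python
from datetime import datetime


def select_backups_to_keep(timestamps, keep_last=0, keep_daily=0):
    """Return the set of backup timestamps a simplified policy would keep."""
    ordered = sorted(timestamps, reverse=True)   # newest first
    keep = set(ordered[:keep_last])              # always keep the newest N
    covered_days = {ts.date() for ts in keep}
    daily_kept = 0
    for ts in ordered[keep_last:]:
        if daily_kept >= keep_daily:
            break
        if ts.date() in covered_days:
            continue                             # that day already has a kept backup
        keep.add(ts)                             # newest backup of an uncovered day
        covered_days.add(ts.date())
        daily_kept += 1
    return keep


if __name__ == "__main__":
    backups = [datetime(2024, 1, d, 12) for d in (2, 3, 4, 5)]
    backups.append(datetime(2024, 1, 5, 6))
    kept = select_backups_to_keep(backups, keep_last=1, keep_daily=2)
    for ts in sorted(set(backups) - kept):
        print("would prune:", ts.isoformat())
```

Running a policy through a dry model like this before applying it helps avoid pruning more than intended.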

4. High Availability (HA)

HA Configuration

1. Create HA group (define node priorities)
2. Add VM/LXC to HA management
3. Configure HA settings (max relocate, max restart)
4. Monitor HA status
5. Test failover scenarios

HA States:

started: Resource running on assigned node
stopped: Resource intentionally stopped
fence: Node fenced, resource will be restarted elsewhere
error: HA manager encountered an error

When to use HA:

Critical services requiring high uptime
Automatic failover needed
Cluster has 3+ nodes (for quorum)
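
The 3+ node guideline follows from majority voting. The arithmetic can be sketched as below, assuming each node has exactly one vote and no external vote device is used; the function name is illustrative.

```python
def quorum_summary(node_count):
    """Majority-quorum arithmetic, assuming one vote per node."""
    if node_count < 1:
        raise ValueError("a cluster needs at least one node")
    votes_needed = node_count // 2 + 1   # strict majority
    return {
        "nodes": node_count,
        "votes_needed": votes_needed,
        "tolerated_failures": node_count - votes_needed,
    }
```

With two nodes the majority is 2, so losing either node loses quorum; with three nodes the majority is still 2, so one node can fail and HA can still act.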

5. Migration

Live Migration (Online)

1. Verify target node has resources
2. Check shared storage access
3. Initiate migration
4. Monitor migration progress
5. Verify VM running on new node

Requirements:

Shared storage for VM disks (local disks can also be migrated live, but their data must then be copied over the network)
Network connectivity between nodes
Compatible CPU types (or CPU flags masked)
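
The storage check in the steps above can be sketched as a pre-flight helper that also builds the `qm migrate` command. The function name and the way storage names are passed in are assumptions; in practice the lists would come from the VM config and the cluster storage configuration.

```python
def build_migrate_command(vmid, target, disk_storages, shared_storages,
                          running=True):
    """Return (argv, local_storages) for migrating a VM to `target`.

    local_storages lists the storages that are not shared and would
    therefore have their disks copied to the target node.
    """
    argv = ["qm", "migrate", str(vmid), target]
    if running:
        argv.append("--online")               # live migration
    local = sorted(set(disk_storages) - set(shared_storages))
    if local and running:
        argv.append("--with-local-disks")     # live-copy disks on local storage
    return argv, local
```

A non-empty second return value is a hint that the migration will take longer and load the network, which may be a reason to schedule it or fall back to offline migration.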

Offline Migration

1. Stop VM/LXC
2. Migrate to target node
3. Start on new node

Use cases:

No shared storage available
Maintenance on source node
CPU incompatibility

Troubleshooting Guide

Common Issues

1. VM Won't Start

Symptoms: Start operation fails or VM immediately stops

Causes:

Insufficient resources on node
Storage unavailable
Stale lock left on the VM by an interrupted operation
Configuration error

Resolution:

1. Check node resources (memory, CPU)
2. Verify storage is mounted and accessible
3. Clear the lock if it is stale (for example with qm unlock <vmid>)
4. Review VM config for errors
5. Check logs: /var/log/pve/tasks/
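
For the lock check, one option is to look for a `lock:` entry in the VM configuration text before deciding whether to clear it. The parser below is a minimal sketch that assumes the usual `key: value` config layout with snapshot sections introduced by `[name]` headers; the function name is hypothetical.

```python
def find_vm_lock(config_text):
    """Return the lock type from the main config section, or None."""
    for raw in config_text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            break            # snapshot sections start here; ignore their contents
        if line.startswith("lock:"):
            return line.split(":", 1)[1].strip()
    return None


if __name__ == "__main__":
    sample = "cores: 2\nlock: backup\nmemory: 2048\n"
    print(find_vm_lock(sample))
```

A lock should only be cleared after confirming that no backup, migration, or snapshot task is still running for that VM.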

2. Migration Fails

Symptoms: Migration operation errors or times out

Causes:

Network connectivity issues
Storage not shared
CPU incompatibility
Insufficient resources on target

Resolution:

1. Verify network between nodes
2. Check storage is accessible from both nodes
3. Review CPU flags compatibility
4. Ensure target node has capacity
5. Try offline migration if live fails

3. Backup Job Fails

Symptoms: Backup task shows error status

Causes:

Insufficient storage space
VM locked by another operation
Snapshot creation failed
Network timeout (for remote storage)

Resolution:

1. Check storage space availability
2. Verify no other operations running on VM
3. Try manual backup to isolate issue
4. Review backup job logs
5. Prune old backups to free space

4. HA Failover Not Working

Symptoms: VM doesn't restart on another node after failure

Causes:

Cluster quorum lost
HA service not running
Fencing not configured
All nodes in HA group unavailable

Resolution:

1. Check cluster quorum status
2. Verify HA service running on all nodes
3. Review HA group configuration
4. Check fencing configuration
5. Manually start VM if needed

5. Storage Performance Issues

Symptoms: Slow VM performance, high I/O wait

Causes:

Storage backend overloaded
Network bottleneck (for remote storage)
Disk cache settings suboptimal
Too many VMs on same storage

Resolution:

1. Monitor storage backend performance
2. Check network throughput to storage
3. Adjust VM disk cache settings
4. Distribute VMs across multiple storage
5. Consider faster storage tier

More troubleshooting: See proxmox-troubleshooting.md

Security Best Practices

API Token Management

Token creation:

Create dedicated user for automation
Assign minimal required permissions
Generate API token (not password)
Store token securely (environment variables, secrets manager)
Rotate tokens periodically

Permission model:

Use roles to group permissions
Assign roles to users/tokens
Follow principle of least privilege
Audit permission usage regularly
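
As an illustration of keeping the token out of source code, the sketch below builds the API token `Authorization` header from environment variables. The variable names `PVE_TOKEN_ID` and `PVE_TOKEN_SECRET` and both function names are assumptions; the token ID is expected in the `user@realm!tokenname` form.

```python
import os


def build_token_header(token_id, secret):
    """Build the Authorization header for API token authentication."""
    user_part, sep, token_name = token_id.partition("!")
    if not sep or "@" not in user_part or not token_name:
        raise ValueError("token_id must look like user@realm!tokenname")
    return {"Authorization": f"PVEAPIToken={token_id}={secret}"}


def header_from_env(environ=os.environ):
    """Read the token from the environment (hypothetical variable names)."""
    return build_token_header(environ["PVE_TOKEN_ID"],
                              environ["PVE_TOKEN_SECRET"])
```

Passing `environ` as a parameter keeps the function testable without touching the real environment, and a missing variable fails loudly with a `KeyError` instead of sending an empty secret.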

Access Control

User management:

Use realms (PAM, LDAP, AD) for authentication
Create groups for role-based access
Assign users to groups
Review access periodically

Network security:

Restrict API access by IP (firewall rules)
Use SSL/TLS for API connections
Enable two-factor authentication for users
Monitor authentication logs

Performance Optimization

Resource Allocation

CPU:

Don't overcommit CPU cores excessively
Use CPU limits for non-critical VMs
Pin CPUs for latency-sensitive workloads
Monitor CPU steal time

Memory:

Enable ballooning for dynamic allocation
Set appropriate memory limits
Monitor swap usage (should be minimal)
Use hugepages for large memory VMs

Disk I/O:

Use virtio-scsi for best performance
Enable discard/TRIM for SSDs
Configure appropriate I/O scheduler
Monitor disk latency and throughput
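
"Excessive" CPU overcommit is easier to reason about as a ratio of allocated vCPUs to host threads. The helper below is a small sketch of that calculation; what ratio is acceptable depends on the workload, so no limit is built in and the function name is illustrative.

```python
def cpu_overcommit_ratio(vm_vcpus, host_threads):
    """Ratio of vCPUs allocated to guests versus logical CPUs on the node."""
    if host_threads <= 0:
        raise ValueError("host_threads must be positive")
    return sum(vm_vcpus) / host_threads
```

For instance, three guests with 4, 4, and 8 vCPUs on an 8-thread node give a ratio of 2.0; a rising ratio combined with growing steal time inside guests suggests the node is oversubscribed.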

Monitoring Strategy

Key metrics:

Node CPU, memory, disk usage
VM resource consumption
Storage performance (IOPS, latency)
Network throughput
Task completion times

Monitoring tools:

Built-in Proxmox metrics (RRD data)
External monitoring (Prometheus, Grafana)
Log aggregation (syslog, ELK stack)
Alerting for critical thresholds

Operational Workflows

For detailed step-by-step workflows, see:

proxmox-workflows.md - Common operational patterns

For troubleshooting details, see:

proxmox-troubleshooting.md - API quirks and solutions

Quick Reference

VM States

running: VM is powered on
stopped: VM is powered off
paused: VM execution suspended
suspended: VM state saved to disk

Storage Types

dir: Directory-based storage
lvm: LVM volume groups
zfspool: ZFS pools
rbd: Ceph RBD
nfs: NFS shares
iscsi: iSCSI targets

Backup Modes

snapshot: Lowest downtime; live backup for VMs, storage snapshot support needed for containers
suspend: Pauses VM during backup
stop: Stops VM for backup

HA States

started: Running on assigned node
stopped: Intentionally stopped
fence: Node fenced, restarting elsewhere
error: HA manager error

License

MIT License - Part of @bldg-7/proxmox-mcp package
