dt-obs-hosts
Infrastructure Hosts Skill
Monitor and manage host and process infrastructure including CPU, memory, disk, network, and technology inventory.
What This Skill Does
- Discover and inventory hosts across cloud and on-premise environments
- Monitor host resource utilization (CPU, memory, disk, network)
- Track process resource consumption and lifecycle
- Analyze container and Kubernetes infrastructure
- Discover services via listening ports
- Manage technology stack versions and compliance
- Attribute infrastructure costs by cost center and product
- Validate data quality and metadata completeness
- Plan capacity and detect resource saturation
- Correlate infrastructure health across layers
When to Use This Skill
Use this skill when the user needs to:
- Inventory: "Show me all Linux hosts in AWS us-east-1"
- Monitor: "What hosts have high CPU usage?"
- Troubleshoot: "Which processes are consuming the most memory?"
- Discover: "What databases are running in production?"
- Plan: "Track Kubernetes version distribution for upgrade planning"
- Cost: "Calculate infrastructure costs by cost center"
- Security: "Find all processes listening on port 22"
- Compliance: "Identify hosts running EOL Java versions"
- Quality: "Check data completeness for AWS hosts"
- Optimize: "Find rightsizing candidates based on utilization"
Core Concepts
Entities
- HOST - Physical or virtual machines (cloud or on-premise)
- PROCESS - Running processes and process groups
- CONTAINER - Kubernetes containers
- NETWORK_INTERFACE - Host network interfaces
- DISK - Host disk volumes
Metrics Categories
- Host Metrics - dt.host.cpu.*, dt.host.memory.*, dt.host.disk.*, dt.host.net.*
- Process Metrics - dt.process.cpu.*, dt.process.memory.*, dt.process.io.*, dt.process.network.*
- Inventory - OS type, cloud provider, technology stack, versions
- Cost - dt.cost.costcenter, dt.cost.product
- Quality - Metadata completeness, version compliance
Alert Thresholds
- CPU/Memory/Disk: 80% warning, 90% critical
- Network: >70% high, >85% saturated
- Disk Latency: >20ms bottleneck
- Network Errors: Drop rate >1%, error rate >0.1%
- Swap: >30% warning, >50% critical
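The CPU thresholds above can be turned into a severity label directly in a query. The sketch below is an illustration rather than a prescribed alerting rule: avg_cpu and severity are field names made up for this example, and classifying on the window average is an assumption about how the thresholds are meant to be applied.

```dql
// Sketch: label each host by the CPU thresholds listed above.
// avg_cpu and severity are illustrative field names, not part of the skill.
timeseries cpu = avg(dt.host.cpu.usage), by: {dt.smartscape.host}
| fieldsAdd avg_cpu = arrayAvg(cpu)
| fieldsAdd severity = if(avg_cpu > 90, "critical",
    else: if(avg_cpu > 80, "warning", else: "ok"))
| filter severity != "ok"
| sort avg_cpu desc
```

The same shape should carry over to memory or disk by swapping the metric key.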
Key Workflows
1. Host Discovery and Classification
Discover hosts, classify by OS/cloud, inventory resources.
dql
smartscapeNodes "HOST"
| fieldsAdd os.type, cloud.provider, host.logical.cpu.cores, host.physical.memory
| summarize host_count = count(), by: {os.type, cloud.provider}
| sort host_count desc
OS Types: LINUX, WINDOWS, AIX, SOLARIS, ZOS
→ For cloud-specific attributes, see references/inventory-discovery.md
2. Resource Utilization Monitoring
Monitor CPU, memory, disk, network across hosts.
dql
timeseries {
  cpu = avg(dt.host.cpu.usage),
  memory = avg(dt.host.memory.usage),
  disk = avg(dt.host.disk.used.percent)
}, by: {dt.smartscape.host}
| fieldsAdd host_name = getNodeName(dt.smartscape.host)
| filter arrayAvg(cpu) > 80 or arrayAvg(memory) > 80
| sort arrayAvg(cpu) desc
High utilization threshold: 80% warning, 90% critical
→ For detailed CPU analysis, see references/host-metrics.md
→ For memory breakdown, see references/host-metrics.md
3. Process Resource Analysis
Identify top resource consumers at process level.
dql
timeseries {
  cpu = avg(dt.process.cpu.usage),
  memory = avg(dt.process.memory.usage)
}, by: {dt.smartscape.process}
| fieldsAdd process_name = getNodeName(dt.smartscape.process)
| filter arrayAvg(cpu) > 50
| sort arrayAvg(cpu) desc
| limit 20
→ For process I/O analysis, see references/process-monitoring.md
→ For process network metrics, see references/process-monitoring.md
4. Technology Stack Inventory
Discover and track software technologies and versions.
dql
smartscapeNodes "PROCESS"
| fieldsAdd process.software_technologies
| expand tech = process.software_technologies
| fieldsAdd tech_type = tech[type], tech_version = tech[version]
| summarize process_count = count(), by: {tech_type, tech_version}
| sort process_count desc
Common Technologies: Java, Node.js, Python, .NET, databases, web servers, messaging systems
→ For version compliance checks, see references/inventory-discovery.md
5. Service Discovery via Ports
Map listening ports to services for security and inventory.
dql
smartscapeNodes "PROCESS"
| fieldsAdd process.listen_ports, dt.process_group.detected_name
| filter isNotNull(process.listen_ports) and arraySize(process.listen_ports) > 0
| expand port = process.listen_ports
| summarize process_count = count(), by: {port, dt.process_group.detected_name}
| sort toLong(port) asc
| limit 50
Well-known ports: 80 (HTTP), 443 (HTTPS), 22 (SSH), 3306 (MySQL), 5432 (PostgreSQL)
→ For comprehensive port mapping, see references/inventory-discovery.md
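For a request such as "Find all processes listening on port 22", the same pattern can be narrowed to a single port. This is a sketch that reuses only the fields from the query above; the port number is the SSH entry from the well-known ports list.

```dql
// Sketch: process groups with at least one process listening on port 22 (SSH).
smartscapeNodes "PROCESS"
| fieldsAdd process.listen_ports, dt.process_group.detected_name
| filter isNotNull(process.listen_ports) and arraySize(process.listen_ports) > 0
| expand port = process.listen_ports
| filter toLong(port) == 22
| summarize process_count = count(), by: {dt.process_group.detected_name}
| sort process_count desc
```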
6. Container and Kubernetes Monitoring
Track container distribution and K8s workload types.
dql
smartscapeNodes "CONTAINER"
| fieldsAdd k8s.cluster.name, k8s.namespace.name, k8s.workload.kind
| summarize container_count = count(), by: {k8s.cluster.name, k8s.workload.kind}
| sort k8s.cluster.name, container_count desc
Workload Types: deployment, daemonset, statefulset, job, cronjob
Note: Container image names/versions NOT available in smartscape.
→ For K8s version tracking, see references/container-monitoring.md
→ For container lifecycle, see references/container-monitoring.md
7. Cost Attribution and Chargeback
Calculate infrastructure costs by cost center.
dql
smartscapeNodes "HOST"
| fieldsAdd dt.cost.costcenter, host.logical.cpu.cores, host.physical.memory
| filter isNotNull(dt.cost.costcenter)
| fieldsAdd memory_gb = toDouble(host.physical.memory) / 1024 / 1024 / 1024
| summarize
    host_count = count(),
    total_cores = sum(toLong(host.logical.cpu.cores)),
    total_memory_gb = sum(memory_gb),
    by: {dt.cost.costcenter}
| sort total_cores desc
→ For product-level cost tracking, see references/inventory-discovery.md
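The same roll-up can be grouped by product instead of cost center. A minimal sketch, assuming dt.cost.product is populated on hosts in the same way as dt.cost.costcenter:

```dql
// Sketch: host and core totals per product.
// Assumes dt.cost.product is set on HOST nodes; hosts without it are skipped.
smartscapeNodes "HOST"
| fieldsAdd dt.cost.product, host.logical.cpu.cores
| filter isNotNull(dt.cost.product)
| summarize
    host_count = count(),
    total_cores = sum(toLong(host.logical.cpu.cores)),
    by: {dt.cost.product}
| sort total_cores desc
```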
8. Infrastructure Health Correlation
Correlate host and process metrics for cross-layer analysis.
dql
timeseries {
  host_cpu = avg(dt.host.cpu.usage),
  host_memory = avg(dt.host.memory.usage),
  process_cpu = avg(dt.process.cpu.usage)
}, by: {dt.smartscape.host, dt.smartscape.process}
| fieldsAdd
    host_name = getNodeName(dt.smartscape.host),
    process_name = getNodeName(dt.smartscape.process)
| filter arrayAvg(host_cpu) > 70
| sort arrayAvg(host_cpu) desc
Health scoring: Critical if any resource >90%, warning if >80%
→ For multi-resource saturation detection, see references/host-metrics.md
Common Query Patterns
Pattern 1: Smartscape Discovery
Use smartscapeNodes to discover and classify entities.
dql
smartscapeNodes "HOST"
| fieldsAdd <attributes>
| filter <conditions>
| summarize <aggregations>
Pattern 2: Timeseries Performance
Use timeseries to analyze metrics over time.
dql
timeseries metric = avg(dt.host.<metric>), by: {dt.smartscape.host}
| fieldsAdd <calculations>
| filter <thresholds>
Pattern 3: Cross-Layer Correlation
Correlate host and process metrics.
dql
timeseries {
  host_cpu = avg(dt.host.cpu.usage),
  process_cpu = avg(dt.process.cpu.usage)
}, by: {dt.smartscape.host, dt.smartscape.process}
Pattern 4: Entity Enrichment with Lookup
Enrich data with entity attributes. After lookup, reference the joined fields with the lookup. prefix.
dql
timeseries cpu = avg(dt.host.cpu.usage), by: {dt.smartscape.host}
| lookup [
    smartscapeNodes "HOST"
    | fields id, cpuCores, memoryTotal
  ], sourceField:dt.smartscape.host, lookupField:id
| fieldsAdd cores = lookup.cpuCores, mem_gb = lookup.memoryTotal / 1024 / 1024 / 1024
Tags and Metadata
Important Notes
- Generic tags field is NOT populated in smartscape queries
- Use specific tag fields: tags:azure[*], tags:environment
- Use custom metadata: host.custom.metadata[*]
Available Tags
- Azure Tags: tags:azure[dt_owner_team], tags:azure[dt_cloudcost_capability]
- Environment: tags:environment
- Custom Metadata: host.custom.metadata[OperatorVersion], host.custom.metadata[Cluster]
- Cost: dt.cost.costcenter, dt.cost.product
→ For complete tag reference, see references/inventory-discovery.md
Cloud-Specific Attributes
AWS
- cloud.provider == "aws"
- aws.region, aws.availability_zone, aws.account.id
- aws.resource.id, aws.resource.name
- aws.state (running, stopped, terminated)
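As an illustration of how these attributes combine, the request "Show me all Linux hosts in AWS us-east-1" could be sketched as below. It assumes the AWS attributes are available as fields on HOST nodes; the region value is simply the one from the example request.

```dql
// Sketch: Linux hosts in one AWS region, with their instance state.
// Assumes aws.region and aws.state are populated on HOST nodes.
smartscapeNodes "HOST"
| fieldsAdd os.type, cloud.provider, aws.region, aws.state
| filter cloud.provider == "aws" and aws.region == "us-east-1" and os.type == "LINUX"
| fields id, os.type, aws.region, aws.state
```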
Azure
- cloud.provider == "azure"
- azure.location, azure.subscription, azure.resource.group
- azure.status, azure.provisioning_state
- azure.resource.sku.name (VM size)
Kubernetes
- k8s.cluster.name, k8s.cluster.uid
- k8s.namespace.name, k8s.node.name, k8s.pod.name
- k8s.workload.name, k8s.workload.kind
→ For multi-cloud analysis, see references/inventory-discovery.md
Best Practices
Alerting
- Use percentiles (p95, p99) for latency metrics
- Use max() for resource limits
- Use avg() for utilization trends
- Set multi-level thresholds (warning at 80%, critical at 90%)
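The avg() and max() guidance can be combined in one query: the average for the trend, the maximum for the peak. A sketch, with cpu_avg, cpu_max, trend_cpu, and peak_cpu as illustrative names:

```dql
// Sketch: compare average utilization with the observed peak per host.
timeseries {
  cpu_avg = avg(dt.host.cpu.usage),
  cpu_max = max(dt.host.cpu.usage)
}, by: {dt.smartscape.host}
| fieldsAdd trend_cpu = arrayAvg(cpu_avg), peak_cpu = arrayMax(cpu_max)
| filter peak_cpu > 90 or trend_cpu > 80
| sort peak_cpu desc
```

A host with a modest trend_cpu but a high peak_cpu is likely bursting rather than saturated, which is why the two aggregations are kept separate.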
Time Windows
- Real-time: 5-15 minute windows
- Trends: 24 hours to 7 days
- Capacity planning: 30-90 days
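These windows map onto the from: and interval: parameters of timeseries. A sketch for the capacity-planning case, assuming a 30-day look-back with daily buckets (both values are illustrative choices within the range above):

```dql
// Sketch: 30-day CPU view at daily resolution for capacity planning.
timeseries cpu = avg(dt.host.cpu.usage),
  by: {dt.smartscape.host},
  from: now() - 30d,
  interval: 1d
| fieldsAdd avg_cpu = round(arrayAvg(cpu), decimals: 1)
| sort avg_cpu desc
```

For real-time checks, a shorter window such as from: now() - 15m fits the same query shape.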
Query Optimization
- Use filters early in the pipeline
- Limit results with | limit N
- Use specific entity types in smartscapeNodes
- Aggregate before enrichment (lookup)
Data Quality
- Validate metadata completeness (target >90%)
- Check for duplicate host names
- Ensure cost tag coverage
- Monitor data freshness (lifetime.end)
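Cost tag coverage can be expressed as a percentage and compared with the >90% target. A sketch, assuming an unset dt.cost.costcenter shows up as null; total, tagged, and coverage_pct are illustrative names:

```dql
// Sketch: share of hosts that carry a cost center.
smartscapeNodes "HOST"
| fieldsAdd dt.cost.costcenter
| summarize total = count(), tagged = countIf(isNotNull(dt.cost.costcenter))
| fieldsAdd coverage_pct = round(100.0 * tagged / total, decimals: 1)
```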
Limitations and Notes
Smartscape Limitations
- Container image names/versions NOT available in smartscape
- Generic tags field NOT populated (use specific tag namespaces)
- Process metadata varies by process type
Platform-Specific
- dt.host.cpu.iowait available on Linux only
- AIX has specific CPU metrics (entitlement, physc)
- Inode metrics available on Linux only
Best Practices
- Use getNodeName() to get human-readable names
- Convert bytes to GB for readability: / 1024 / 1024 / 1024
- Round aggregated values: round(value, decimals: 1)
- Use isNotNull() checks before array operations
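Several of these tips fit in one short query. The sketch below applies the null check, the bytes-to-GB conversion, and rounding to host memory; memory_gb is an illustrative name, and the null check here guards arithmetic rather than an array operation.

```dql
// Sketch: largest hosts by physical memory, in GB with one decimal.
smartscapeNodes "HOST"
| fieldsAdd host.physical.memory, host.logical.cpu.cores
| filter isNotNull(host.physical.memory)
| fieldsAdd memory_gb = round(toDouble(host.physical.memory) / 1024 / 1024 / 1024, decimals: 1)
| sort memory_gb desc
| limit 20
```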
When to Load References
This skill uses progressive disclosure. Start here for 80% of use cases. Load reference files for detailed specifications when needed.
Load host-metrics.md when:
- Analyzing CPU component breakdown (user, system, iowait, steal)
- Investigating memory pressure and swap usage
- Troubleshooting disk I/O latency
- Diagnosing network packet drops or errors
Load process-monitoring.md when:
- Analyzing process-level I/O patterns
- Investigating TCP connection quality
- Detecting resource exhaustion (file descriptors, threads)
- Tracking GC suspension time
Load container-monitoring.md when:
- Analyzing container lifecycle and churn
- Tracking Kubernetes version distribution
- Managing OneAgent operator versions
- Planning K8s cluster upgrades
Load inventory-discovery.md when:
- Performing security audits via port discovery
- Implementing cost attribution and chargeback
- Validating data quality and metadata completeness
- Managing multi-cloud infrastructure
References
- host-metrics.md - Detailed host CPU, memory, disk, and network monitoring
- process-monitoring.md - Process-level CPU, memory, I/O, and network analysis
- container-monitoring.md - Container inventory, Kubernetes versions, and operator management
- inventory-discovery.md - Host/process discovery, technology inventory, cost attribution, and data quality