dt-obs-hosts


Infrastructure Hosts Skill

Monitor and manage host and process infrastructure including CPU, memory, disk, network, and technology inventory.

What This Skill Does

Discover and inventory hosts across cloud and on-premise environments
Monitor host resource utilization (CPU, memory, disk, network)
Track process resource consumption and lifecycle
Analyze container and Kubernetes infrastructure
Discover services via listening ports
Manage technology stack versions and compliance
Attribute infrastructure costs by cost center and product
Validate data quality and metadata completeness
Plan capacity and detect resource saturation
Correlate infrastructure health across layers

When to Use This Skill

Use this skill when the user needs to:

Inventory: "Show me all Linux hosts in AWS us-east-1"
Monitor: "What hosts have high CPU usage?"
Troubleshoot: "Which processes are consuming the most memory?"
Discover: "What databases are running in production?"
Plan: "Track Kubernetes version distribution for upgrade planning"
Cost: "Calculate infrastructure costs by cost center"
Security: "Find all processes listening on port 22"
Compliance: "Identify hosts running EOL Java versions"
Quality: "Check data completeness for AWS hosts"
Optimize: "Find rightsizing candidates based on utilization"

Core Concepts

Entities

HOST - Physical or virtual machines (cloud or on-premise)
PROCESS - Running processes and process groups
CONTAINER - Kubernetes containers
NETWORK_INTERFACE - Host network interfaces
DISK - Host disk volumes

Metrics Categories

Host Metrics - `dt.host.cpu.*`, `dt.host.memory.*`, `dt.host.disk.*`, `dt.host.net.*`
Process Metrics - `dt.process.cpu.*`, `dt.process.memory.*`, `dt.process.io.*`, `dt.process.network.*`
Inventory - OS type, cloud provider, technology stack, versions
Cost - `dt.cost.costcenter`, `dt.cost.product`
Quality - Metadata completeness, version compliance

Alert Thresholds

CPU/Memory/Disk: 80% warning, 90% critical
Network: >70% high, >85% saturated
Disk Latency: >20ms bottleneck
Network Errors: Drop rate >1%, error rate >0.1%
Swap: >30% warning, >50% critical
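
As a rough illustration of how these thresholds could be applied to a single reading outside of DQL, here is a minimal Python sketch. The names `THRESHOLDS` and `classify` are hypothetical, and treating every limit as a strict "greater than" comparison is an assumption; only the percentage values come from the list above.

```python
# Hypothetical threshold table; the numbers come from the list above.
# Each entry: (lower_level, lower_limit, upper_level, upper_limit).
THRESHOLDS = {
    "cpu": ("warning", 80.0, "critical", 90.0),
    "memory": ("warning", 80.0, "critical", 90.0),
    "disk": ("warning", 80.0, "critical", 90.0),
    "network": ("high", 70.0, "saturated", 85.0),
    "swap": ("warning", 30.0, "critical", 50.0),
}


def classify(kind: str, percent: float) -> str:
    """Return the alert level for a utilization percentage, or "ok"."""
    lower_level, lower_limit, upper_level, upper_limit = THRESHOLDS[kind]
    # Assumption: a value must exceed the limit (strict >) to trigger the level.
    if percent > upper_limit:
        return upper_level
    if percent > lower_limit:
        return lower_level
    return "ok"
```

For example, `classify("network", 72.0)` yields `"high"`, while `classify("swap", 10.0)` yields `"ok"`.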

Key Workflows

1. Host Discovery and Classification

Discover hosts, classify by OS/cloud, inventory resources.

```dql
smartscapeNodes "HOST"
| fieldsAdd os.type, cloud.provider, host.logical.cpu.cores, host.physical.memory
| summarize host_count = count(), by: {os.type, cloud.provider}
| sort host_count desc
```

OS Types: `LINUX`, `WINDOWS`, `AIX`, `SOLARIS`, `ZOS`

→ For cloud-specific attributes, see references/inventory-discovery.md

2. Resource Utilization Monitoring

Monitor CPU, memory, disk, network across hosts.

```dql
timeseries {
  cpu = avg(dt.host.cpu.usage),
  memory = avg(dt.host.memory.usage),
  disk = avg(dt.host.disk.used.percent)
}, by: {dt.smartscape.host}
| fieldsAdd host_name = getNodeName(dt.smartscape.host)
| filter arrayAvg(cpu) > 80 or arrayAvg(memory) > 80
| sort arrayAvg(cpu) desc
```

High utilization threshold: 80% warning, 90% critical

→ For detailed CPU analysis, see references/host-metrics.md
→ For memory breakdown, see references/host-metrics.md

3. Process Resource Analysis

Identify top resource consumers at process level.

```dql
timeseries {
  cpu = avg(dt.process.cpu.usage),
  memory = avg(dt.process.memory.usage)
}, by: {dt.smartscape.process}
| fieldsAdd process_name = getNodeName(dt.smartscape.process)
| filter arrayAvg(cpu) > 50
| sort arrayAvg(cpu) desc
| limit 20
```

→ For process I/O analysis, see references/process-monitoring.md
→ For process network metrics, see references/process-monitoring.md

4. Technology Stack Inventory

Discover and track software technologies and versions.

```dql
smartscapeNodes "PROCESS"
| fieldsAdd process.software_technologies
| expand tech = process.software_technologies
| fieldsAdd tech_type = tech[type], tech_version = tech[version]
| summarize process_count = count(), by: {tech_type, tech_version}
| sort process_count desc
```

Common Technologies: Java, Node.js, Python, .NET, databases, web servers, messaging systems

→ For version compliance checks, see references/inventory-discovery.md

5. Service Discovery via Ports

Map listening ports to services for security and inventory.

```dql
smartscapeNodes "PROCESS"
| fieldsAdd process.listen_ports, dt.process_group.detected_name
| filter isNotNull(process.listen_ports) and arraySize(process.listen_ports) > 0
| expand port = process.listen_ports
| summarize process_count = count(), by: {port, dt.process_group.detected_name}
| sort toLong(port) asc
| limit 50
```

Well-known ports: 80 (HTTP), 443 (HTTPS), 22 (SSH), 3306 (MySQL), 5432 (PostgreSQL)

→ For comprehensive port mapping, see references/inventory-discovery.md
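
When post-processing the query results, the well-known ports listed above could be attached as labels. The sketch below is one possible way to do that in Python; `WELL_KNOWN_PORTS` and `label_ports` are hypothetical names, and the assumption that ports may arrive as strings is inferred only from the `toLong(port)` call in the query.

```python
# Port-to-service labels taken from the "Well-known ports" line above.
WELL_KNOWN_PORTS = {80: "HTTP", 443: "HTTPS", 22: "SSH", 3306: "MySQL", 5432: "PostgreSQL"}


def label_ports(ports):
    """Return (port, label) pairs sorted numerically; unlisted ports get "unknown"."""
    # Convert to int first so "443" sorts after "22" (numeric, not lexical, order).
    numeric = sorted(int(port) for port in ports)
    return [(port, WELL_KNOWN_PORTS.get(port, "unknown")) for port in numeric]
```

Sorting after the integer conversion mirrors the `sort toLong(port) asc` step, which avoids lexical ordering of port strings.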

6. Container and Kubernetes Monitoring

Track container distribution and K8s workload types.

```dql
smartscapeNodes "CONTAINER"
| fieldsAdd k8s.cluster.name, k8s.namespace.name, k8s.workload.kind
| summarize container_count = count(), by: {k8s.cluster.name, k8s.workload.kind}
| sort k8s.cluster.name, container_count desc
```

Workload Types: `deployment`, `daemonset`, `statefulset`, `job`, `cronjob`

Note: Container image names/versions NOT available in smartscape.

→ For K8s version tracking, see references/container-monitoring.md
→ For container lifecycle, see references/container-monitoring.md

7. Cost Attribution and Chargeback

Calculate infrastructure costs by cost center.

```dql
smartscapeNodes "HOST"
| fieldsAdd dt.cost.costcenter, host.logical.cpu.cores, host.physical.memory
| filter isNotNull(dt.cost.costcenter)
| fieldsAdd memory_gb = toDouble(host.physical.memory) / 1024 / 1024 / 1024
| summarize
    host_count = count(),
    total_cores = sum(toLong(host.logical.cpu.cores)),
    total_memory_gb = sum(memory_gb),
    by: {dt.cost.costcenter}
| sort total_cores desc
```

→ For product-level cost tracking, see references/inventory-discovery.md
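
To make the aggregation logic of the query above easier to follow, here is a small Python sketch that performs the same grouping on plain dictionaries. It is not how Dynatrace evaluates the query; the record keys (`costcenter`, `cores`, `memory_bytes`), the sample values, and the function name are all hypothetical.

```python
from collections import defaultdict

BYTES_PER_GB = 1024 ** 3  # same divisor as the "/ 1024 / 1024 / 1024" chain in the query


def summarize_by_cost_center(hosts):
    """Group host records by cost center, skipping hosts without one."""
    totals = defaultdict(lambda: {"host_count": 0, "total_cores": 0, "total_memory_gb": 0.0})
    for host in hosts:
        center = host.get("costcenter")
        if center is None:  # mirrors: filter isNotNull(dt.cost.costcenter)
            continue
        entry = totals[center]
        entry["host_count"] += 1
        entry["total_cores"] += int(host["cores"])
        entry["total_memory_gb"] += host["memory_bytes"] / BYTES_PER_GB
    # Largest core count first, like: sort total_cores desc
    return sorted(totals.items(), key=lambda item: item[1]["total_cores"], reverse=True)


# Made-up sample data for illustration only.
hosts = [
    {"costcenter": "cc-100", "cores": 8, "memory_bytes": 16 * BYTES_PER_GB},
    {"costcenter": "cc-200", "cores": 32, "memory_bytes": 64 * BYTES_PER_GB},
    {"costcenter": "cc-100", "cores": 4, "memory_bytes": 8 * BYTES_PER_GB},
    {"costcenter": None, "cores": 2, "memory_bytes": 4 * BYTES_PER_GB},
]
result = summarize_by_cost_center(hosts)
```

With this sample, `cc-200` sorts first (32 cores) and the host without a cost center is excluded from every total.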

8. Infrastructure Health Correlation

Correlate host and process metrics for cross-layer analysis.

```dql
timeseries {
  host_cpu = avg(dt.host.cpu.usage),
  host_memory = avg(dt.host.memory.usage),
  process_cpu = avg(dt.process.cpu.usage)
}, by: {dt.smartscape.host, dt.smartscape.process}
| fieldsAdd
    host_name = getNodeName(dt.smartscape.host),
    process_name = getNodeName(dt.smartscape.process)
| filter arrayAvg(host_cpu) > 70
| sort arrayAvg(host_cpu) desc
```

Health scoring: Critical if any resource >90%, warning if >80%

→ For multi-resource saturation detection, see references/host-metrics.md
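
The health scoring rule above ("critical if any resource >90%, warning if >80%") can be sketched as a worst-resource check. This is an illustrative Python version, not part of the skill; the function name and the `"healthy"` label for the remaining case are assumptions.

```python
def health_status(resources: dict) -> str:
    """Return "critical", "warning", or "healthy" for a set of utilization percentages."""
    # "Any resource above X" is equivalent to "the highest resource is above X".
    worst = max(resources.values(), default=0.0)
    if worst > 90:
        return "critical"
    if worst > 80:
        return "warning"
    return "healthy"  # assumed label; the source names only the two alert levels
```

For instance, `health_status({"cpu": 85, "memory": 40})` returns `"warning"` because the CPU value alone crosses the 80% line.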

Common Query Patterns

Pattern 1: Smartscape Discovery

Use `smartscapeNodes` to discover and classify entities.

```dql
smartscapeNodes "HOST"
| fieldsAdd <attributes>
| filter <conditions>
| summarize <aggregations>
```

Pattern 2: Timeseries Performance

Use `timeseries` to analyze metrics over time.

```dql
timeseries metric = avg(dt.host.<metric>), by: {dt.smartscape.host}
| fieldsAdd <calculations>
| filter <thresholds>
```

Pattern 3: Cross-Layer Correlation

Correlate host and process metrics.

```dql
timeseries {
  host_cpu = avg(dt.host.cpu.usage),
  process_cpu = avg(dt.process.cpu.usage)
}, by: {dt.smartscape.host, dt.smartscape.process}
```

Pattern 4: Entity Enrichment with Lookup

Enrich data with entity attributes. After `lookup`, reference fields with the `lookup.` prefix.

```dql
timeseries cpu = avg(dt.host.cpu.usage), by: {dt.smartscape.host}
| lookup [
    smartscapeNodes HOST
    | fields id, cpuCores, memoryTotal
  ], sourceField:dt.smartscape.host, lookupField:id
| fieldsAdd cores = lookup.cpuCores, mem_gb = lookup.memoryTotal / 1024 / 1024 / 1024
```

Tags and Metadata

Important Notes

The generic `tags` field is NOT populated in smartscape queries
Use specific tag fields: `tags:azure[*]`, `tags:environment`
Use custom metadata: `host.custom.metadata[*]`

Available Tags

Azure Tags: `tags:azure[dt_owner_team]`, `tags:azure[dt_cloudcost_capability]`
Environment: `tags:environment`
Custom Metadata: `host.custom.metadata[OperatorVersion]`, `host.custom.metadata[Cluster]`
Cost: `dt.cost.costcenter`, `dt.cost.product`

→ For complete tag reference, see references/inventory-discovery.md

Cloud-Specific Attributes

AWS

`cloud.provider == "aws"`
`aws.region`, `aws.availability_zone`, `aws.account.id`
`aws.resource.id`, `aws.resource.name`
`aws.state` (running, stopped, terminated)


Azure

`cloud.provider == "azure"`
`azure.location`, `azure.subscription`, `azure.resource.group`
`azure.status`, `azure.provisioning_state`
`azure.resource.sku.name` (VM size)


Kubernetes

`k8s.cluster.name`, `k8s.cluster.uid`
`k8s.namespace.name`, `k8s.node.name`, `k8s.pod.name`
`k8s.workload.name`, `k8s.workload.kind`

→ For multi-cloud analysis, see references/inventory-discovery.md

Best Practices

Alerting

Use percentiles (p95, p99) for latency metrics
Use `max()` for resource limits
Use `avg()` for utilization trends
Set multi-level thresholds (warning at 80%, critical at 90%)
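
To show why the choice of aggregation matters, the Python sketch below compares the mean, the maximum, and a percentile on a made-up latency sample containing one outlier. The `percentile` helper uses the nearest-rank method, which is an assumption for illustration and not necessarily how DQL computes percentiles; the sample values are invented.

```python
import math
from statistics import mean


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


# Invented latency sample (milliseconds) with a single slow outlier.
latencies_ms = [12, 14, 15, 13, 16, 14, 15, 13, 12, 180]

average = mean(latencies_ms)          # pulled upward by the outlier
median = percentile(latencies_ms, 50)  # typical request
p95 = percentile(latencies_ms, 95)     # tail behaviour
peak = max(latencies_ms)               # worst case, relevant for limits
```

In this sample the average sits well above the median because of one slow request, which is why percentiles tend to describe latency better than `avg()` does.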

Time Windows

Real-time: 5-15 minute windows
Trends: 24 hours to 7 days
Capacity planning: 30-90 days

Query Optimization

Use filters early in the pipeline
Limit results with `| limit N`
Use specific entity types in smartscapeNodes
Aggregate before enrichment (lookup)

Data Quality

Validate metadata completeness (target >90%)
Check for duplicate host names
Ensure cost tag coverage
Monitor data freshness (lifetime.end)
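
The first two checks above can be illustrated with a short Python sketch over exported host records. The record layout (`name`, `costcenter`, `os`), the sample data, and both function names are hypothetical; the 90% target is the only figure taken from the list.

```python
from collections import Counter


def completeness(records, required_fields):
    """Percentage of records in which every required field is present and non-empty."""
    if not records:
        return 0.0
    complete = sum(
        1
        for record in records
        if all(record.get(field) not in (None, "") for field in required_fields)
    )
    return 100.0 * complete / len(records)


def duplicate_names(records):
    """Host names that occur more than once, in sorted order."""
    counts = Counter(record["name"] for record in records)
    return sorted(name for name, count in counts.items() if count > 1)


# Invented sample records for illustration only.
hosts = [
    {"name": "web-1", "costcenter": "cc-100", "os": "LINUX"},
    {"name": "web-2", "costcenter": None, "os": "LINUX"},
    {"name": "db-1", "costcenter": "cc-200", "os": "WINDOWS"},
    {"name": "web-1", "costcenter": "cc-100", "os": "LINUX"},
]
score = completeness(hosts, ["costcenter", "os"])
meets_target = score > 90  # target from the list above
```

Here three of four records are complete, so the sample falls short of the target, and `duplicate_names(hosts)` flags `web-1`.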

Limitations and Notes

Smartscape Limitations

Container image names/versions NOT available in smartscape
The generic `tags` field is NOT populated (use specific tag namespaces)
Process metadata varies by process type

Platform-Specific

`dt.host.cpu.iowait` is available on Linux only
AIX has specific CPU metrics (entitlement, physc)
Inode metrics available on Linux only

Best Practices

Use `getNodeName()` to get human-readable names
Convert bytes to GB for readability: `/ 1024 / 1024 / 1024`
Round aggregated values: `round(value, decimals: 1)`
Use `isNotNull()` checks before array operations

When to Load References

This skill uses progressive disclosure. Start here for 80% of use cases. Load reference files for detailed specifications when needed.

Load host-metrics.md when:

Analyzing CPU component breakdown (user, system, iowait, steal)
Investigating memory pressure and swap usage
Troubleshooting disk I/O latency
Diagnosing network packet drops or errors

Load process-monitoring.md when:

Analyzing process-level I/O patterns
Investigating TCP connection quality
Detecting resource exhaustion (file descriptors, threads)
Tracking GC suspension time

Load container-monitoring.md when:

Analyzing container lifecycle and churn
Tracking Kubernetes version distribution
Managing OneAgent operator versions
Planning K8s cluster upgrades

Load inventory-discovery.md when:

Performing security audits via port discovery
Implementing cost attribution and chargeback
Validating data quality and metadata completeness
Managing multi-cloud infrastructure

References

host-metrics.md - Detailed host CPU, memory, disk, and network monitoring
process-monitoring.md - Process-level CPU, memory, I/O, and network analysis
container-monitoring.md - Container inventory, Kubernetes versions, and operator management
inventory-discovery.md - Host/process discovery, technology inventory, cost attribution, and data quality
