dt-obs-kubernetes

Infrastructure Kubernetes

Monitor and analyze Kubernetes infrastructure using Dynatrace DQL. Query cluster resources, monitor workload health, analyze pod placement, optimize costs, and assess security posture.

When to Use This Skill

Monitoring Kubernetes cluster health and capacity
Analyzing pod and container resource utilization
Investigating pod failures, OOMKills, evictions, or crash loops
Debugging degraded deployments, stuck rollouts, or node pressure
Optimizing Kubernetes resource costs
Assessing security posture and compliance
Troubleshooting workload scheduling and placement
Auditing ingress routing and network policies

Reference Files

| File | Contents |
|------|----------|
| `references/cluster-inventory.md` | Clusters, namespaces, resource distribution |
| `references/labels-annotations.md` | Labels, annotations, k8s.object parsing patterns |
| `references/pod-node-placement.md` | Node selectors, affinity, taints, HA scheduling |
| `references/pod-debugging.md` | Exit codes, pod conditions, init containers, image pull errors, logs, service→pod drill-down |
| `references/workload-health.md` | Degraded deployments, stuck rollouts, node conditions, CPU throttling, HPA, StatefulSet ordering |
| `references/pv-pvc.md` | PVC/PV lifecycle, phase reference, orphaned volumes, StorageClass |
| `references/ingress.md` | Routing rule parsing, TLS audit |
| `references/network-policies.md` | Policy listing, namespace isolation audit |

Key Concepts

Entity Types

Workloads: `K8S_DEPLOYMENT`, `K8S_STATEFULSET`, `K8S_DAEMONSET`, `K8S_JOB`, `K8S_CRONJOB`, `K8S_HORIZONTALPODAUTOSCALER`
Infrastructure: `K8S_CLUSTER`, `K8S_NAMESPACE`, `K8S_NODE`, `K8S_POD`
Configuration: `K8S_SERVICE`, `K8S_CONFIGMAP`, `K8S_SECRET`, `K8S_PERSISTENTVOLUMECLAIM`, `K8S_PERSISTENTVOLUME`, `K8S_INGRESS`, `K8S_NETWORKPOLICY`

Query Types

smartscapeNodes - Query K8s entities:

```dql
smartscapeNodes K8S_POD
| filter k8s.namespace.name == "production"
| fields k8s.cluster.name, k8s.pod.name
```

timeseries - Monitor metrics over time:

```dql
timeseries cpu = sum(dt.kubernetes.container.cpu_usage),
  by: {k8s.pod.name, k8s.namespace.name}
| fieldsAdd avg_cpu = arrayAvg(cpu)
```

fetch logs - Analyze log events:

```dql
fetch logs
| filter k8s.namespace.name == "production" and loglevel == "ERROR"
```

Core Fields

`k8s.cluster.name`, `k8s.namespace.name`, `k8s.pod.name`, `k8s.node.name`
`k8s.workload.name`, `k8s.workload.kind`, `k8s.container.name`
`k8s.object` - Full JSON configuration for deep inspection
`tags[label]` - Access labels and annotations

Available Metrics

CPU: `dt.kubernetes.container.cpu_usage`, `cpu_throttled`, `limits_cpu`, `requests_cpu`
Memory: `dt.kubernetes.container.memory_working_set`, `limits_memory`, `requests_memory`
Operations: `dt.kubernetes.container.restarts`, `oom_kills`
Node: `dt.kubernetes.node.pods_allocatable`, `cpu_allocatable`, `memory_allocatable`, `dt.kubernetes.pods`
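
The metrics listed above can be combined in a single `timeseries` call. The following is a minimal sketch of a memory-pressure check. It assumes that the short name `limits_memory` expands to `dt.kubernetes.container.limits_memory` (the same prefix as `memory_working_set`) and that 90% is a reasonable starting threshold; neither is stated in this skill.

```dql
// Sketch: pods whose average working set is close to their memory limit.
// Assumed: full metric name dt.kubernetes.container.limits_memory, 90% threshold.
timeseries {
  mem_used = avg(dt.kubernetes.container.memory_working_set),
  mem_limit = avg(dt.kubernetes.container.limits_memory)
}, by: {k8s.pod.name, k8s.namespace.name, k8s.cluster.name}
| fieldsAdd mem_pct = (arrayAvg(mem_used) / arrayAvg(mem_limit)) * 100
// Skip pods without a memory limit to avoid dividing by zero
| filter arrayAvg(mem_limit) > 0 and mem_pct > 90
| sort mem_pct desc
```

Pods that show up here are candidates for the OOMKill query in the troubleshooting workflow.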

Entity Disambiguation

`K8S_POD` and `CONTAINER` are different entity types in Dynatrace.

`K8S_POD`: K8s-native entities with `k8s.object` JSON, scheduling state, conditions, and K8s metrics. Use this skill.
`CONTAINER`: Host-level container inventory (image, lifetime, host assignment). Use the `dt-obs-hosts` skill instead.

The Smartscape edge is `CONTAINER --(is_part_of)--> K8S_POD`. Because the edge points from the container to the pod, reaching containers from a pod means traversing it backward:

```dql
smartscapeNodes K8S_POD
| filter k8s.namespace.name == "<namespace>"
| traverse edgeTypes: {is_part_of}, targetTypes: {CONTAINER}, direction: backward, fieldsKeep: {id}
| fields k8s.cluster.name, k8s.namespace.name, k8s.pod.name, container.id=id
```

Service → K8S_POD Correlation

No direct Smartscape edge exists between `SERVICE` and `K8S_POD`. The correlation key is the shared dimension `k8s.workload.name`. See Service → Pod Drill-Down in `references/pod-debugging.md` for the full two-step pattern.
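
As a rough illustration of the pod side of that correlation, the sketch below lists the pods behind a workload once its name is known. It is not the full pattern from `references/pod-debugging.md`; `<workload-name>` and `<namespace>` are placeholders for values taken from the service in the first step.

```dql
// Sketch: second step only, pods that share the workload name with a service.
// <workload-name> and <namespace> are placeholders, not real values.
smartscapeNodes K8S_POD
| filter k8s.workload.name == "<workload-name>" and k8s.namespace.name == "<namespace>"
| fields k8s.cluster.name, k8s.namespace.name, k8s.pod.name, k8s.node.name
```

Filtering on the namespace as well helps when the same workload name exists in more than one namespace.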

Common Workflows

1. Cluster Health Check

List all clusters:

```dql
smartscapeNodes K8S_CLUSTER
| fields k8s.cluster.name, k8s.cluster.version, k8s.cluster.distribution
```

Check node capacity:

```dql
timeseries {
  current_pods = avg(dt.kubernetes.pods),
  max_pods = avg(dt.kubernetes.node.pods_allocatable)
}, by: {k8s.node.name, k8s.cluster.name}
| fieldsAdd pod_capacity_pct = (arrayAvg(current_pods) / arrayAvg(max_pods)) * 100
| filter pod_capacity_pct > 80
```

Identify pods in non-Running state:

```dql
smartscapeNodes K8S_POD
| parse k8s.object, "JSON:config"
| fieldsAdd phase = config[status][phase]
| filter phase != "Running"
| fields k8s.cluster.name, k8s.namespace.name, k8s.pod.name, phase
```

2. Resource Optimization

Find over-provisioned pods (usage < 30%):

```dql
timeseries {
  cpu_usage = sum(dt.kubernetes.container.cpu_usage),
  cpu_requests = avg(dt.kubernetes.container.requests_cpu)
}, by: {k8s.pod.name, k8s.namespace.name, k8s.cluster.name}
| fieldsAdd usage_pct = (arrayAvg(cpu_usage) / arrayAvg(cpu_requests)) * 100
| filter usage_pct < 30 and arrayAvg(cpu_requests) > 0
```

Identify containers without limits:

```dql
smartscapeNodes K8S_POD
| parse k8s.object, "JSON:config"
| expand container = config[spec][containers]
| fieldsAdd
    container_name = container[name],
    cpu_limit = container[resources][limits][cpu],
    memory_limit = container[resources][limits][memory]
| filter isNull(cpu_limit) or isNull(memory_limit)
```

3. Troubleshooting Pod Issues

Find pods with OOMKills:

```dql
timeseries oom_kills = sum(dt.kubernetes.container.oom_kills),
  by: {k8s.pod.name, k8s.namespace.name, k8s.cluster.name}
| fieldsAdd total_oom_kills = arraySum(oom_kills)
| filter total_oom_kills > 0
| sort total_oom_kills desc
```

Analyze pod restart patterns:

```dql
timeseries restarts = sum(dt.kubernetes.container.restarts),
  by: {k8s.pod.name, k8s.namespace.name, k8s.cluster.name}
| fieldsAdd total_restarts = arraySum(restarts)
| filter total_restarts > 5
```
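
Once a pod stands out in the OOMKill or restart results, its recent error logs are a natural next stop. The sketch below reuses the `fetch logs` pattern from Query Types; `<namespace>` and `<pod-name>` are placeholders, and the limit of 50 is an arbitrary choice for exploration.

```dql
// Sketch: most recent error logs for one suspect pod.
// <namespace> and <pod-name> are placeholders; limit 50 is arbitrary.
fetch logs
| filter k8s.namespace.name == "<namespace>" and k8s.pod.name == "<pod-name>"
| filter loglevel == "ERROR"
| sort timestamp desc
| limit 50
```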

4. Security Assessment

Identify privileged containers:

```dql
smartscapeNodes K8S_POD
| parse k8s.object, "JSON:config"
| expand container = config[spec][containers]
| fieldsAdd
    container_name = container[name],
    privileged = container[securityContext][privileged]
| filter privileged == true
```

Find containers running as root:

```dql
smartscapeNodes K8S_POD
| parse k8s.object, "JSON:config"
| expand container = config[spec][containers]
| fieldsAdd
    container_name = container[name],
    run_as_user = container[securityContext][runAsUser],
    run_as_non_root = container[securityContext][runAsNonRoot]
// Check unset values explicitly so containers with no securityContext are not dropped by a null comparison
| filter (isNull(run_as_user) or run_as_user == 0)
    and (isNull(run_as_non_root) or run_as_non_root == false)
```
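
The same parse-and-expand pattern extends to other `securityContext` settings. As a hedged example, the sketch below looks for containers whose root filesystem is writable. `readOnlyRootFilesystem` is a standard Kubernetes container field, but this check is an illustration and not part of the original skill.

```dql
// Sketch: containers without a read-only root filesystem.
// An unset readOnlyRootFilesystem is treated as writable, hence the isNull check.
smartscapeNodes K8S_POD
| parse k8s.object, "JSON:config"
| expand container = config[spec][containers]
| fieldsAdd
    container_name = container[name],
    read_only_fs = container[securityContext][readOnlyRootFilesystem]
| filter isNull(read_only_fs) or read_only_fs == false
| fields k8s.cluster.name, k8s.namespace.name, k8s.pod.name, container_name
```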

5. Scheduling Analysis

Verify pod distribution (HA compliance):

```dql
smartscapeNodes K8S_POD
| filter k8s.workload.kind == "deployment"
| summarize pod_count = count(),
            node_count = countDistinct(k8s.node.name),
            by: {k8s.cluster.name, k8s.namespace.name, k8s.workload.name}
| fieldsAdd ha_compliant = node_count > 1
| filter pod_count >= 2 and not ha_compliant
```
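
A complementary view is how many pods each node carries, which can show whether a non-compliant workload sits on an already crowded node. This is a minimal sketch using only the fields and commands shown above; `<namespace>` is a placeholder.

```dql
// Sketch: pod count per node for one namespace, busiest nodes first.
// <namespace> is a placeholder.
smartscapeNodes K8S_POD
| filter k8s.namespace.name == "<namespace>"
| summarize pod_count = count(), by: {k8s.cluster.name, k8s.node.name}
| sort pod_count desc
```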

6. DAVIS Problems affecting K8s Entities

Find active DAVIS problems affecting K8s entities:

```dql
fetch dt.davis.problems, from:now() - 2h
| filter not(dt.davis.is_duplicate) and event.status == "ACTIVE"
| filter matchesPhrase(smartscape.affected_entity.types, "K8S_")
| fields display_id, event.name, event.category, smartscape.affected_entity.ids
```

Each entry in `smartscape.affected_entity.ids` (an array of Smartscape IDs) identifies one affected entity. Use these IDs to look up the corresponding entities.
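
Because the IDs arrive as an array, it can help to flatten them into one row per affected entity before doing any lookup. The sketch below does this with `expand`, the same command used for container arrays elsewhere in this skill; the alias `entity_id` is a name chosen here for illustration.

```dql
// Sketch: one row per (problem, affected entity) pair.
// entity_id is an illustrative alias, not a built-in field.
fetch dt.davis.problems, from:now() - 2h
| filter not(dt.davis.is_duplicate) and event.status == "ACTIVE"
| filter matchesPhrase(smartscape.affected_entity.types, "K8S_")
| expand entity_id = smartscape.affected_entity.ids
| fields display_id, event.name, event.category, entity_id
```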

Best Practices

Query Performance

Filter early - Apply cluster/namespace filters immediately
Use specific entity types - Avoid wildcards
Limit result sets - Use `limit` for exploration
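
Taken together, the three guidelines above might look like the following sketch: a specific entity type, cluster and namespace filters as the first step, and a small `limit` while exploring. `<cluster>` and `<namespace>` are placeholders, and the limit of 20 is arbitrary.

```dql
// Sketch: specific entity type, early filters, small result set.
// <cluster> and <namespace> are placeholders; limit 20 is arbitrary.
smartscapeNodes K8S_POD
| filter k8s.cluster.name == "<cluster>" and k8s.namespace.name == "<namespace>"
| fields k8s.pod.name, k8s.node.name, k8s.workload.name
| limit 20
```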

Monitoring Recommendations

Set resource limits on all containers
Monitor OOMKills and adjust memory limits
Track CPU throttling and adjust CPU limits
Review resource efficiency regularly (target 70-80% utilization)
Implement security best practices (non-root, read-only filesystem)
Use specific image tags (avoid :latest)
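
To act on the CPU throttling recommendation, a query in the style of the OOMKill workflow could rank pods by throttling. This is a sketch under one assumption: that the short name `cpu_throttled` from Available Metrics expands to `dt.kubernetes.container.cpu_throttled`.

```dql
// Sketch: pods with any CPU throttling in the query window, worst first.
// Assumed: full metric name dt.kubernetes.container.cpu_throttled.
timeseries throttled = sum(dt.kubernetes.container.cpu_throttled),
  by: {k8s.pod.name, k8s.namespace.name, k8s.cluster.name}
| fieldsAdd total_throttled = arraySum(throttled)
| filter total_throttled > 0
| sort total_throttled desc
```

Pods at the top of this list are candidates for a higher CPU limit.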

Configuration Standards

Use labels for organization (app, environment, team)
Set resource requests and limits
Configure health checks (liveness/readiness probes)
Use TLS for all ingress resources
Document with annotations
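
Some of these standards can be audited with the same `k8s.object` parsing used in the workflows. As one hedged example, the sketch below lists containers with no liveness or readiness probe. `livenessProbe` and `readinessProbe` are standard Kubernetes container fields; the query itself is an illustration, not part of the original skill.

```dql
// Sketch: containers missing a liveness or readiness probe.
smartscapeNodes K8S_POD
| parse k8s.object, "JSON:config"
| expand container = config[spec][containers]
| fieldsAdd
    container_name = container[name],
    liveness = container[livenessProbe],
    readiness = container[readinessProbe]
| filter isNull(liveness) or isNull(readiness)
| fields k8s.cluster.name, k8s.namespace.name, k8s.pod.name, container_name
```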

Limitations

Unavailable Metrics:

Pod network metrics (rx_bytes, tx_bytes) are NOT available in Grail
Workaround: Use service mesh metrics or host-level network metrics

Query Considerations:

Minimize result set size: Do not include the `k8s.object` field unless necessary
Keep result set as simple as possible: Parsing k8s.object increases query complexity
Large clusters may require pagination or time-range limits
Some K8s status fields update asynchronously
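
For large clusters, one way to apply these considerations is to narrow both the time range and the scope inside the `timeseries` call itself. The sketch below assumes a one-hour window is enough for the question at hand; `<cluster>` is a placeholder.

```dql
// Sketch: restrict a metric query to one cluster and the last hour.
// <cluster> is a placeholder; the 1h window is an arbitrary choice.
timeseries restarts = sum(dt.kubernetes.container.restarts),
  by: {k8s.pod.name, k8s.namespace.name},
  filter: {k8s.cluster.name == "<cluster>"},
  from: now() - 1h
| fieldsAdd total_restarts = arraySum(restarts)
| sort total_restarts desc
| limit 20
```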
