dt-obs-kubernetes

Infrastructure Kubernetes

Monitor and analyze Kubernetes infrastructure using Dynatrace DQL. Query cluster resources, monitor workload health, analyze pod placement, optimize costs, and assess security posture.

When to Use This Skill

  • Monitoring Kubernetes cluster health and capacity
  • Analyzing pod and container resource utilization
  • Investigating pod failures, OOMKills, evictions, or crash loops
  • Debugging degraded deployments, stuck rollouts, or node pressure
  • Optimizing Kubernetes resource costs
  • Assessing security posture and compliance
  • Troubleshooting workload scheduling and placement
  • Auditing ingress routing and network policies

Reference Files

File | Contents
references/cluster-inventory.md | Clusters, namespaces, resource distribution
references/labels-annotations.md | Labels, annotations, k8s.object parsing patterns
references/pod-node-placement.md | Node selectors, affinity, taints, HA scheduling
references/pod-debugging.md | Exit codes, pod conditions, init containers, image pull errors, logs, service→pod drill-down
references/workload-health.md | Degraded deployments, stuck rollouts, node conditions, CPU throttling, HPA, StatefulSet ordering
references/pv-pvc.md | PVC/PV lifecycle, phase reference, orphaned volumes, StorageClass
references/ingress.md | Routing rule parsing, TLS audit
references/network-policies.md | Policy listing, namespace isolation audit

Key Concepts

Entity Types

Workloads: K8S_DEPLOYMENT, K8S_STATEFULSET, K8S_DAEMONSET, K8S_JOB, K8S_CRONJOB, K8S_HORIZONTALPODAUTOSCALER

Infrastructure: K8S_CLUSTER, K8S_NAMESPACE, K8S_NODE, K8S_POD

Configuration: K8S_SERVICE, K8S_CONFIGMAP, K8S_SECRET, K8S_PERSISTENTVOLUMECLAIM, K8S_PERSISTENTVOLUME, K8S_INGRESS, K8S_NETWORKPOLICY

Query Types

smartscapeNodes - Query K8s entities:
dql
smartscapeNodes K8S_POD
| filter k8s.namespace.name == "production"
| fields k8s.cluster.name, k8s.pod.name
timeseries - Monitor metrics over time:
dql
timeseries cpu = sum(dt.kubernetes.container.cpu_usage),
  by: {k8s.pod.name, k8s.namespace.name}
| fieldsAdd avg_cpu = arrayAvg(cpu)
fetch logs - Analyze log events:
dql
fetch logs
| filter k8s.namespace.name == "production" and loglevel == "ERROR"

Core Fields

  • k8s.cluster.name, k8s.namespace.name, k8s.pod.name, k8s.node.name
  • k8s.workload.name, k8s.workload.kind, k8s.container.name
  • k8s.object - Full JSON configuration for deep inspection
  • tags[label] - Access labels and annotations
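As a rough illustration of the tags[label] accessor, the sketch below groups pods by two label values. The label keys app and team are hypothetical placeholders, and the query assumes that labels are exposed on K8S_POD nodes through the tags field as described above; see references/labels-annotations.md for the patterns this skill actually relies on.

```dql
smartscapeNodes K8S_POD
| fieldsAdd app = tags[app], team = tags[team]
| filter isNotNull(app)
| summarize pod_count = count(), by: {k8s.cluster.name, k8s.namespace.name, app, team}
```

Pods that do not carry the app label are dropped by the isNotNull filter, so the result only counts labeled pods.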

Available Metrics

CPU: dt.kubernetes.container.cpu_usage, cpu_throttled, limits_cpu, requests_cpu

Memory: dt.kubernetes.container.memory_working_set, limits_memory, requests_memory

Operations: dt.kubernetes.container.restarts, oom_kills

Node: dt.kubernetes.node.pods_allocatable, cpu_allocatable, memory_allocatable, dt.kubernetes.pods
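The memory metrics listed above can be combined to spot pods running close to their memory limit. The following is a minimal sketch, not a tuned alerting query: it assumes the short name limits_memory expands to dt.kubernetes.container.limits_memory (the same prefix as memory_working_set), and the 90% threshold is an arbitrary example value.

```dql
timeseries {
  mem_used = sum(dt.kubernetes.container.memory_working_set),
  mem_limit = sum(dt.kubernetes.container.limits_memory)
}, by: {k8s.pod.name, k8s.namespace.name, k8s.cluster.name}
| fieldsAdd mem_limit_avg = arrayAvg(mem_limit)
| filter mem_limit_avg > 0
| fieldsAdd mem_pct = (arrayAvg(mem_used) / mem_limit_avg) * 100
| filter mem_pct > 90
| sort mem_pct desc
```

Both series use sum so that multi-container pods compare total usage against the total limit. Filtering on the limit before dividing keeps pods without a memory limit out of the ratio.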

Entity Disambiguation

K8S_POD vs CONTAINER: these are different entity types in Dynatrace.
  • K8S_POD - K8s-native entities with k8s.object JSON, scheduling state, conditions, and K8s metrics. Use this skill.
  • CONTAINER - Host-level container inventory (image, lifetime, host assignment). Use the dt-obs-hosts skill instead.
The smartscape edge is CONTAINER --(is_part_of)--> K8S_POD. To reach containers from a pod, traverse backward:
dql
smartscapeNodes K8S_POD
| filter k8s.namespace.name == "<namespace>"
| traverse edgeTypes: {is_part_of}, targetTypes: {CONTAINER}, direction: backward, fieldsKeep: {id}
| fields k8s.cluster.name, k8s.namespace.name, k8s.pod.name, container.id=id

Service → K8S_POD Correlation

No direct smartscape edge exists between SERVICE and K8S_POD. The correlation key is the shared dimension k8s.workload.name. See Service → Pod Drill-Down in references/pod-debugging.md for the full two-step pattern.
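For orientation only, the pod side of that correlation might look like the sketch below. It assumes the workload name has already been read from the service side, and "<workload-name>" is a placeholder for that value; the referenced file remains the authoritative description of both steps.

```dql
smartscapeNodes K8S_POD
| filter k8s.workload.name == "<workload-name>"
| fields k8s.cluster.name, k8s.namespace.name, k8s.pod.name, k8s.node.name
```

Workload names are not necessarily unique across namespaces or clusters, so adding a k8s.namespace.name or k8s.cluster.name filter is usually worthwhile.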

Common Workflows

1. Cluster Health Check

List all clusters:
dql
smartscapeNodes K8S_CLUSTER
| fields k8s.cluster.name, k8s.cluster.version, k8s.cluster.distribution
Check node capacity:
dql
timeseries {
  current_pods = avg(dt.kubernetes.pods),
  max_pods = avg(dt.kubernetes.node.pods_allocatable)
}, by: {k8s.node.name, k8s.cluster.name}
| fieldsAdd pod_capacity_pct = (arrayAvg(current_pods) / arrayAvg(max_pods)) * 100
| filter pod_capacity_pct > 80
Identify pods in non-Running state:
dql
smartscapeNodes K8S_POD
| parse k8s.object, "JSON:config"
| fieldsAdd phase = config[status][phase]
| filter phase != "Running"
| fields k8s.cluster.name, k8s.namespace.name, k8s.pod.name, phase

2. Resource Optimization

Find over-provisioned pods (usage < 30%):
dql
timeseries {
  cpu_usage = sum(dt.kubernetes.container.cpu_usage),
  cpu_requests = sum(dt.kubernetes.container.requests_cpu)
}, by: {k8s.pod.name, k8s.namespace.name, k8s.cluster.name}
| fieldsAdd usage_pct = (arrayAvg(cpu_usage) / arrayAvg(cpu_requests)) * 100
| filter usage_pct < 30 and arrayAvg(cpu_requests) > 0
Identify containers without limits:
dql
smartscapeNodes K8S_POD
| parse k8s.object, "JSON:config"
| expand container = config[spec][containers]
| fieldsAdd
    container_name = container[name],
    cpu_limit = container[resources][limits][cpu],
    memory_limit = container[resources][limits][memory]
| filter isNull(cpu_limit) or isNull(memory_limit)

3. Troubleshooting Pod Issues

Find pods with OOMKills:
dql
timeseries oom_kills = sum(dt.kubernetes.container.oom_kills),
  by: {k8s.pod.name, k8s.namespace.name, k8s.cluster.name}
| fieldsAdd total_oom_kills = arraySum(oom_kills)
| filter total_oom_kills > 0
| sort total_oom_kills desc
Analyze pod restart patterns:
dql
timeseries restarts = sum(dt.kubernetes.container.restarts),
  by: {k8s.pod.name, k8s.namespace.name, k8s.cluster.name}
| fieldsAdd total_restarts = arraySum(restarts)
| filter total_restarts > 5

4. Security Assessment

Identify privileged containers:
dql
smartscapeNodes K8S_POD
| parse k8s.object, "JSON:config"
| expand container = config[spec][containers]
| fieldsAdd
    container_name = container[name],
    privileged = container[securityContext][privileged]
| filter privileged == true
Find containers running as root:
dql
smartscapeNodes K8S_POD
| parse k8s.object, "JSON:config"
| expand container = config[spec][containers]
| fieldsAdd
    container_name = container[name],
    run_as_user = container[securityContext][runAsUser],
    run_as_non_root = container[securityContext][runAsNonRoot]
| filter (isNull(run_as_user) or run_as_user == 0) and (isNull(run_as_non_root) or run_as_non_root == false)

5. Scheduling Analysis

Verify pod distribution (HA compliance):
dql
smartscapeNodes K8S_POD
| filter k8s.workload.kind == "deployment"
| summarize pod_count = count(),
            node_count = countDistinct(k8s.node.name),
            by: {k8s.cluster.name, k8s.namespace.name, k8s.workload.name}
| fieldsAdd ha_compliant = node_count > 1
| filter pod_count >= 2 and not ha_compliant

6. DAVIS Problems affecting K8s Entities

Find active DAVIS problems affecting K8s entities:
dql
fetch dt.davis.problems, from:now() - 2h
| filter not(dt.davis.is_duplicate) and event.status == "ACTIVE"
| filter matchesPhrase(smartscape.affected_entity.types, "K8S_")
| fields display_id, event.name, event.category, smartscape.affected_entity.ids
Each entry in smartscape.affected_entity.ids (an array of Smartscape IDs) identifies one affected entity; use that ID to look up the entity.

Best Practices

Query Performance

  1. Filter early - Apply cluster/namespace filters immediately
  2. Use specific entity types - Avoid wildcards
  3. Limit result sets - Use limit for exploration
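Taken together, the three points above suggest a query shape like the following sketch. The "<cluster>" and "<namespace>" values are placeholders, and the limit of 20 is an arbitrary choice for exploratory use.

```dql
smartscapeNodes K8S_POD
| filter k8s.cluster.name == "<cluster>" and k8s.namespace.name == "<namespace>"
| fields k8s.pod.name, k8s.node.name, k8s.workload.name
| limit 20
```

The filter comes first so later steps only see the pods of interest, the entity type is named explicitly, and fields plus limit keep the returned set small.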

Monitoring Recommendations

  1. Set resource limits on all containers
  2. Monitor OOMKills and adjust memory limits
  3. Track CPU throttling and adjust CPU limits
  4. Review resource efficiency regularly (target 70-80% utilization)
  5. Implement security best practices (non-root, read-only filesystem)
  6. Use specific image tags (avoid :latest)
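The image-tag recommendation can be checked with the same k8s.object parsing pattern used in the workflows above. This is a minimal sketch that assumes the container image reference is available at spec.containers[].image in the stored pod JSON, as it is in the Kubernetes pod spec.

```dql
smartscapeNodes K8S_POD
| parse k8s.object, "JSON:config"
| expand container = config[spec][containers]
| fieldsAdd container_name = container[name], image = container[image]
| filter endsWith(image, ":latest")
| fields k8s.cluster.name, k8s.namespace.name, k8s.pod.name, container_name, image
```

Images referenced without any tag also resolve to latest at pull time; this sketch does not catch them, so treat the result as a lower bound.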

Configuration Standards

  1. Use labels for organization (app, environment, team)
  2. Set resource requests and limits
  3. Configure health checks (liveness/readiness probes)
  4. Use TLS for all ingress resources
  5. Document with annotations
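The health-check standard can be audited in the same way as the missing-limits query earlier in this document. The sketch below assumes the livenessProbe and readinessProbe keys from the Kubernetes container spec are present in k8s.object when a probe is configured.

```dql
smartscapeNodes K8S_POD
| parse k8s.object, "JSON:config"
| expand container = config[spec][containers]
| fieldsAdd
    container_name = container[name],
    liveness = container[livenessProbe],
    readiness = container[readinessProbe]
| filter isNull(liveness) or isNull(readiness)
| fields k8s.cluster.name, k8s.namespace.name, k8s.pod.name, container_name
```

Short-lived Job pods often omit probes by design, so a namespace or workload-kind filter may be needed to keep the result actionable.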

Limitations

Unavailable Metrics:
  • Pod network metrics (rx_bytes, tx_bytes) are NOT available in Grail
  • Workaround: Use service mesh metrics or host-level network metrics
Query Considerations:
  • Minimize result set size: do not include the k8s.object field unless it is needed
  • Keep queries as simple as possible: parsing k8s.object increases query complexity
  • Large clusters may require pagination or time-range limits
  • Some K8s status fields update asynchronously
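One way to follow the first two considerations is to extract only the values needed from k8s.object and then drop the raw and parsed JSON before returning results. This is a sketch under the assumption that fieldsRemove is acceptable at this point in the pipeline; "<namespace>" is a placeholder and the limit of 100 is arbitrary.

```dql
smartscapeNodes K8S_POD
| filter k8s.namespace.name == "<namespace>"
| parse k8s.object, "JSON:config"
| fieldsAdd phase = config[status][phase]
| fieldsRemove k8s.object, config
| limit 100
```

An explicit fields list, as used in the workflows above, achieves the same result when the set of wanted columns is known in advance.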