dt-obs-kubernetes
Infrastructure Kubernetes
Monitor and analyze Kubernetes infrastructure using Dynatrace DQL. Query
cluster resources, monitor workload health, analyze pod placement, optimize
costs, and assess security posture.
When to Use This Skill
- Monitoring Kubernetes cluster health and capacity
- Analyzing pod and container resource utilization
- Investigating pod failures, OOMKills, evictions, or crash loops
- Debugging degraded deployments, stuck rollouts, or node pressure
- Optimizing Kubernetes resource costs
- Assessing security posture and compliance
- Troubleshooting workload scheduling and placement
- Auditing ingress routing and network policies
Reference Files
| File | Contents |
|---|---|
| … | Clusters, namespaces, resource distribution |
| … | Labels, annotations, k8s.object parsing patterns |
| … | Node selectors, affinity, taints, HA scheduling |
| references/pod-debugging.md | Exit codes, pod conditions, init containers, image pull errors, logs, service→pod drill-down |
| … | Degraded deployments, stuck rollouts, node conditions, CPU throttling, HPA, StatefulSet ordering |
| … | PVC/PV lifecycle, phase reference, orphaned volumes, StorageClass |
| … | Routing rule parsing, TLS audit |
| … | Policy listing, namespace isolation audit |
Key Concepts
Entity Types
Workloads: K8S_DEPLOYMENT, K8S_STATEFULSET, K8S_DAEMONSET, K8S_JOB, K8S_CRONJOB, K8S_HORIZONTALPODAUTOSCALER
Infrastructure: K8S_CLUSTER, K8S_NAMESPACE, K8S_NODE, K8S_POD
Configuration: K8S_SERVICE, K8S_CONFIGMAP, K8S_SECRET, K8S_PERSISTENTVOLUMECLAIM, K8S_PERSISTENTVOLUME, K8S_INGRESS, K8S_NETWORKPOLICY
Query Types
smartscapeNodes - Query K8s entities:
dql
smartscapeNodes K8S_POD
| filter k8s.namespace.name == "production"
| fields k8s.cluster.name, k8s.pod.name

timeseries - Monitor metrics over time:
dql
timeseries cpu = sum(dt.kubernetes.container.cpu_usage),
  by: {k8s.pod.name, k8s.namespace.name}
| fieldsAdd avg_cpu = arrayAvg(cpu)

fetch logs - Analyze log events:
dql
fetch logs
| filter k8s.namespace.name == "production" and loglevel == "ERROR"
Core Fields
- k8s.cluster.name, k8s.namespace.name, k8s.pod.name, k8s.node.name
- k8s.workload.name, k8s.workload.kind, k8s.container.name
- k8s.object - Full JSON configuration for deep inspection
- tags[label] - Access labels and annotations
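As a rough sketch of how these core fields combine, the query below counts pods per workload. It uses only the fields listed above; the pod_count alias is an illustrative name, not something the skill defines.

```dql
// Count pods per workload, grouped by the core identification fields
smartscapeNodes K8S_POD
| summarize pod_count = count(),
    by: {k8s.cluster.name, k8s.namespace.name, k8s.workload.kind, k8s.workload.name}
| sort pod_count desc
```

Grouping by cluster and namespace as well as workload name avoids merging workloads that happen to share a name.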
Available Metrics
CPU: dt.kubernetes.container.cpu_usage, cpu_throttled, limits_cpu, requests_cpu
Memory: dt.kubernetes.container.memory_working_set, limits_memory, requests_memory
Operations: dt.kubernetes.container.restarts, oom_kills
Node: dt.kubernetes.node.pods_allocatable, cpu_allocatable, memory_allocatable, dt.kubernetes.pods
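The abbreviated names in this list appear to share the prefix of the first metric on each line. Assuming that, a minimal sketch for surfacing throttled pods could look like the following. The full key dt.kubernetes.container.cpu_throttled and the limit of 20 are assumptions to verify in your environment.

```dql
// Assumption: cpu_throttled expands to dt.kubernetes.container.cpu_throttled
timeseries throttled = sum(dt.kubernetes.container.cpu_throttled),
  by: {k8s.pod.name, k8s.namespace.name, k8s.cluster.name}
| fieldsAdd avg_throttled = arrayAvg(throttled)
| filter avg_throttled > 0
| sort avg_throttled desc
| limit 20
```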
Entity Disambiguation
- K8S_POD: K8s-native entities with k8s.object JSON, scheduling state, conditions, and K8s metrics. Use this skill.
- CONTAINER: Host-level container inventory (image, lifetime, host assignment). Use the dt-obs-hosts skill instead.
The smartscape edge is CONTAINER --(is_part_of)--> K8S_POD. To reach containers from a pod, traverse backward:
dql
smartscapeNodes K8S_POD
| filter k8s.namespace.name == "<namespace>"
| traverse edgeTypes: {is_part_of}, targetTypes: {CONTAINER}, direction: backward, fieldsKeep: {id}
| fields k8s.cluster.name, k8s.namespace.name, k8s.pod.name, container.id=id
Service → K8S_POD Correlation
No direct smartscape edge exists between SERVICE and K8S_POD. The correlation key is the shared dimension k8s.workload.name. See Service → Pod Drill-Down in references/pod-debugging.md for the full two-step pattern.
Common Workflows
1. Cluster Health Check
List all clusters:
dql
smartscapeNodes K8S_CLUSTER
| fields k8s.cluster.name, k8s.cluster.version, k8s.cluster.distribution

Check node capacity:
dql
timeseries {
  current_pods = avg(dt.kubernetes.pods),
  max_pods = avg(dt.kubernetes.node.pods_allocatable)
}, by: {k8s.node.name, k8s.cluster.name}
| fieldsAdd pod_capacity_pct = (arrayAvg(current_pods) / arrayAvg(max_pods)) * 100
| filter pod_capacity_pct > 80

Identify pods in non-Running state:
dql
smartscapeNodes K8S_POD
| parse k8s.object, "JSON:config"
| fieldsAdd phase = config[status][phase]
| filter phase != "Running"
| fields k8s.cluster.name, k8s.namespace.name, k8s.pod.name, phase
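A non-Running phase is not always a fault: pods of completed Jobs report Succeeded, for example. As a hedged sketch, summarizing pod counts per phase first can show whether the non-Running pods are mostly Pending, Failed, or simply finished. The pod_count alias is illustrative.

```dql
// Count pods per phase to see which non-Running states dominate
smartscapeNodes K8S_POD
| parse k8s.object, "JSON:config"
| fieldsAdd phase = config[status][phase]
| summarize pod_count = count(), by: {k8s.cluster.name, k8s.namespace.name, phase}
| sort pod_count desc
```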
2. Resource Optimization
Find over-provisioned pods (usage < 30%):
dql
timeseries {
  cpu_usage = sum(dt.kubernetes.container.cpu_usage),
  cpu_requests = avg(dt.kubernetes.container.requests_cpu)
}, by: {k8s.pod.name, k8s.namespace.name, k8s.cluster.name}
| fieldsAdd usage_pct = (arrayAvg(cpu_usage) / arrayAvg(cpu_requests)) * 100
| filter usage_pct < 30 and arrayAvg(cpu_requests) > 0

Identify containers without limits:
dql
smartscapeNodes K8S_POD
| parse k8s.object, "JSON:config"
| expand container = config[spec][containers]
| fieldsAdd
    container_name = container[name],
    cpu_limit = container[resources][limits][cpu],
    memory_limit = container[resources][limits][memory]
| filter isNull(cpu_limit) or isNull(memory_limit)
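The same over-provisioning check can be sketched for memory. This assumes requests_memory expands to dt.kubernetes.container.requests_memory, and it reuses the 30% threshold from the CPU query rather than a value the source gives for memory.

```dql
// Assumption: requests_memory expands to dt.kubernetes.container.requests_memory
timeseries {
  mem_usage = sum(dt.kubernetes.container.memory_working_set),
  mem_requests = sum(dt.kubernetes.container.requests_memory)
}, by: {k8s.pod.name, k8s.namespace.name, k8s.cluster.name}
| fieldsAdd mem_usage_pct = (arrayAvg(mem_usage) / arrayAvg(mem_requests)) * 100
| filter mem_usage_pct < 30 and arrayAvg(mem_requests) > 0
```

This sketch sums both usage and requests so that multi-container pods are compared like with like; adjust the aggregation if your requests are tracked differently.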
3. Troubleshooting Pod Issues
Find pods with OOMKills:
dql
timeseries oom_kills = sum(dt.kubernetes.container.oom_kills),
  by: {k8s.pod.name, k8s.namespace.name, k8s.cluster.name}
| filter arraySum(oom_kills) > 0
| fieldsAdd total_oom_kills = arraySum(oom_kills)
| sort total_oom_kills desc

Analyze pod restart patterns:
dql
timeseries restarts = sum(dt.kubernetes.container.restarts),
  by: {k8s.pod.name, k8s.namespace.name, k8s.cluster.name}
| fieldsAdd total_restarts = arraySum(restarts)
| filter total_restarts > 5
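Once a pod stands out in the OOMKill or restart results, its logs are a natural next step. A minimal sketch, reusing the fetch logs pattern from Query Types: the placeholders, the one-hour window, and the limit of 50 are arbitrary choices, and timestamp and content are assumed to be present on the log records.

```dql
// <namespace> and <pod-name> are placeholders for the pod found above
fetch logs, from:now() - 1h
| filter k8s.namespace.name == "<namespace>" and k8s.pod.name == "<pod-name>"
| filter loglevel == "ERROR"
| fields timestamp, loglevel, content
| sort timestamp desc
| limit 50
```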
4. Security Assessment
Identify privileged containers:
dql
smartscapeNodes K8S_POD
| parse k8s.object, "JSON:config"
| expand container = config[spec][containers]
| fieldsAdd
    container_name = container[name],
    privileged = container[securityContext][privileged]
| filter privileged == true

Find containers running as root:
dql
smartscapeNodes K8S_POD
| parse k8s.object, "JSON:config"
| expand container = config[spec][containers]
| fieldsAdd
    container_name = container[name],
    run_as_user = container[securityContext][runAsUser],
    run_as_non_root = container[securityContext][runAsNonRoot]
| filter (isNull(run_as_user) or run_as_user == 0) and run_as_non_root != true
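The same parse-and-expand pattern can be pointed at other securityContext settings. As a sketch, the query below lists containers that do not set readOnlyRootFilesystem, a standard Kubernetes container field. The read_only_fs alias is illustrative, and only regular containers are inspected, not init containers.

```dql
// List containers whose root filesystem is not declared read-only
smartscapeNodes K8S_POD
| parse k8s.object, "JSON:config"
| expand container = config[spec][containers]
| fieldsAdd
    container_name = container[name],
    read_only_fs = container[securityContext][readOnlyRootFilesystem]
| filter isNull(read_only_fs) or read_only_fs == false
| fields k8s.cluster.name, k8s.namespace.name, k8s.pod.name, container_name
```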
5. Scheduling Analysis
Verify pod distribution (HA compliance):
dql
smartscapeNodes K8S_POD
| filter k8s.workload.kind == "deployment"
| summarize pod_count = count(),
    node_count = countDistinct(k8s.node.name),
    by: {k8s.cluster.name, k8s.namespace.name, k8s.workload.name}
| fieldsAdd ha_compliant = node_count > 1
| filter pod_count >= 2 and not ha_compliant
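To see where pods actually land before judging HA compliance, a per-node count is a simple complement. This is a sketch built only from fields already used in this document; the pod_count alias is illustrative.

```dql
// Count pods per node to spot uneven placement
smartscapeNodes K8S_POD
| summarize pod_count = count(), by: {k8s.cluster.name, k8s.node.name}
| sort pod_count desc
```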
6. DAVIS Problems affecting K8s Entities
Find active DAVIS problems affecting K8s entities:
dql
fetch dt.davis.problems, from:now() - 2h
| filter not(dt.davis.is_duplicate) and event.status == "ACTIVE"
| filter matchesPhrase(smartscape.affected_entity.types, "K8S_")
| fields display_id, event.name, event.category, smartscape.affected_entity.ids

Use the smartscape.affected_entity.ids entries (an array of Smartscape IDs) to look up each affected entity by its Smartscape ID.
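Because smartscape.affected_entity.ids is an array, one way to prepare for a per-entity lookup is to expand it into one row per ID. This sketch reuses the expand pattern from the security queries; the entity_id alias is hypothetical, and the lookup step itself is not shown.

```dql
// One row per affected Smartscape ID; entity_id is an illustrative alias
fetch dt.davis.problems, from:now() - 2h
| filter not(dt.davis.is_duplicate) and event.status == "ACTIVE"
| filter matchesPhrase(smartscape.affected_entity.types, "K8S_")
| expand entity_id = smartscape.affected_entity.ids
| fields display_id, event.name, entity_id
```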
Best Practices
Query Performance
- Filter early - Apply cluster/namespace filters immediately
- Use specific entity types - Avoid wildcards
- Limit result sets - Use limit for exploration
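Put together, the three points above might look like this for an exploratory pod listing. The cluster and namespace placeholders and the limit of 20 are illustrative.

```dql
// Filter on cluster and namespace first, then cap the output while exploring
smartscapeNodes K8S_POD
| filter k8s.cluster.name == "<cluster>" and k8s.namespace.name == "<namespace>"
| fields k8s.pod.name, k8s.node.name, k8s.workload.name
| limit 20
```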
Monitoring Recommendations
- Set resource limits on all containers
- Monitor OOMKills and adjust memory limits
- Track CPU throttling and adjust CPU limits
- Review resource efficiency regularly (target 70-80% utilization)
- Implement security best practices (non-root, read-only filesystem)
- Use specific image tags (avoid :latest)
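The image-tag recommendation can be audited with the same k8s.object parsing used in the security queries. A minimal sketch: container[image] is the standard Kubernetes container image field, and the image alias is illustrative.

```dql
// Find containers whose image reference ends in :latest
smartscapeNodes K8S_POD
| parse k8s.object, "JSON:config"
| expand container = config[spec][containers]
| fieldsAdd container_name = container[name], image = container[image]
| filter endsWith(image, ":latest")
| fields k8s.cluster.name, k8s.namespace.name, k8s.pod.name, container_name, image
```

Images referenced with no tag at all also resolve to latest and are not caught by this filter.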
Configuration Standards
- Use labels for organization (app, environment, team)
- Set resource requests and limits
- Configure health checks (liveness/readiness probes)
- Use TLS for all ingress resources
- Document with annotations
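Some of these standards can be checked from k8s.object. As a sketch, the query below lists containers missing a liveness or readiness probe, using the standard Kubernetes livenessProbe and readinessProbe container fields. The liveness and readiness aliases are illustrative.

```dql
// List containers without a liveness or readiness probe
smartscapeNodes K8S_POD
| parse k8s.object, "JSON:config"
| expand container = config[spec][containers]
| fieldsAdd
    container_name = container[name],
    liveness = container[livenessProbe],
    readiness = container[readinessProbe]
| filter isNull(liveness) or isNull(readiness)
| fields k8s.cluster.name, k8s.namespace.name, k8s.pod.name, container_name
```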
Limitations
Unavailable Metrics:
- Pod network metrics (rx_bytes, tx_bytes) are NOT available in Grail
- Workaround: Use service mesh metrics or host-level network metrics
Query Considerations:
- Minimize result set size: Do not include the k8s.object field if not necessary
- Keep result set as simple as possible: Parsing k8s.object increases query complexity
- Large clusters may require pagination or time-range limits
- Some K8s status fields update asynchronously
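For large clusters, one way to apply these considerations is to scope a metric query to a single cluster and a short time window. The sketch below assumes the filter: and from: parameters of timeseries; the cluster placeholder, the 30-minute window, and the limit of 50 are arbitrary.

```dql
// Scope the query to one cluster and a short window to keep the result small
timeseries restarts = sum(dt.kubernetes.container.restarts),
  by: {k8s.pod.name, k8s.namespace.name},
  filter: {k8s.cluster.name == "<cluster>"},
  from: now() - 30m
| fieldsAdd total_restarts = arraySum(restarts)
| sort total_restarts desc
| limit 50
```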