Kubernetes Monitoring and Alerting
Overview
Monitoring and alerting are key to keeping a Kubernetes cluster stable and its applications highly available. With comprehensive monitoring and timely alerts, potential problems can be found and resolved quickly, protecting business continuity.
Monitoring Architecture
1. Monitoring Components
A typical Kubernetes monitoring architecture includes the following components:
Metrics Server
Metrics Server is the cluster-wide aggregator of resource usage metrics in Kubernetes; it backs kubectl top and the Horizontal Pod Autoscaler.
# Deploy Metrics Server
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
# Verify the deployment
kubectl get deployment metrics-server -n kube-system
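Once the metrics-server Pod is running, the kubectl top commands used later in this guide start returning data, and the Metrics API can also be queried directly as a quick sanity check (the jq pipe below is optional and only for readability):
# Confirm the Metrics API is serving data
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/nodes" | jq .
kubectl top nodes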
Prometheus
Prometheus is an open-source systems monitoring and alerting toolkit.
# Prometheus Deployment manifest
apiVersion: apps/v1
kind: Deployment
metadata:
name: prometheus
namespace: monitoring
spec:
replicas: 1
selector:
matchLabels:
app: prometheus
template:
metadata:
labels:
app: prometheus
spec:
containers:
- name: prometheus
image: prom/prometheus:v2.30.0
ports:
- containerPort: 9090
volumeMounts:
- name: config-volume
mountPath: /etc/prometheus/
- name: storage-volume
mountPath: /prometheus/
volumes:
- name: config-volume
configMap:
name: prometheus-config
- name: storage-volume
emptyDir: {}
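The Deployment above runs under the namespace's default ServiceAccount, which is not allowed to list nodes, pods, or endpoints, so the kubernetes_sd_configs used later in this guide would fail with authorization errors. A minimal RBAC sketch (names are illustrative; also set serviceAccountName: prometheus in the Deployment's Pod spec):
# RBAC for Prometheus service discovery
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups: [""]
  resources: ["nodes", "nodes/metrics", "services", "endpoints", "pods"]
  verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: monitoring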
Grafana
Grafana is an open-source visualization platform used to display monitoring data.
# Grafana Deployment manifest
apiVersion: apps/v1
kind: Deployment
metadata:
name: grafana
namespace: monitoring
spec:
replicas: 1
selector:
matchLabels:
app: grafana
template:
metadata:
labels:
app: grafana
spec:
containers:
- name: grafana
image: grafana/grafana:8.1.0
ports:
- containerPort: 3000
env:
- name: GF_SECURITY_ADMIN_PASSWORD
value: "admin123"
volumeMounts:
- name: grafana-storage
mountPath: /var/lib/grafana
volumes:
- name: grafana-storage
emptyDir: {}
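The manifest above hard-codes the admin password and does not expose Grafana inside the cluster. A minimal sketch of a Secret and a ClusterIP Service (names are illustrative); the GF_SECURITY_ADMIN_PASSWORD variable can then be switched to valueFrom.secretKeyRef instead of a literal value:
# Admin password Secret and Service for Grafana
apiVersion: v1
kind: Secret
metadata:
  name: grafana-admin
  namespace: monitoring
stringData:
  admin-password: "change-me"   # replace before applying
---
apiVersion: v1
kind: Service
metadata:
  name: grafana
  namespace: monitoring
spec:
  selector:
    app: grafana
  ports:
  - port: 3000
    targetPort: 3000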
Core Monitoring Metrics
1. Node-Level Metrics
# View node resource usage
kubectl top nodes
# View detailed node information
kubectl describe node node-name
Key node metrics:
- CPU utilization
- Memory utilization
- Disk utilization
- Network I/O
- Node status and conditions
2. Pod-Level Metrics
# View Pod resource usage
kubectl top pods
# View detailed Pod information
kubectl describe pod pod-name
Key Pod metrics:
- CPU usage
- Memory usage
- Network traffic
- Storage I/O
- Pod status
3. Application-Level Metrics
# View Deployment status
kubectl get deployments
# View Service status
kubectl get services
# View application logs
kubectl logs pod-name
Key application metrics (see the PromQL sketch after this list):
- Request response time
- Error rate
- Throughput
- Availability
- Latency
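These application-level metrics are usually computed with PromQL once the application is instrumented. A sketch assuming the conventional http_requests_total counter and http_request_duration_seconds histogram exposed by most Prometheus client libraries (metric and label names vary with the instrumentation used):
# 5xx error rate over the last 5 minutes
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
# 95th percentile request latency
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))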
Prometheus Monitoring Configuration
1. Prometheus Configuration File
# prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
rule_files:
- "alert.rules"
scrape_configs:
- job_name: 'kubernetes-apiservers'
kubernetes_sd_configs:
- role: endpoints
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: default;kubernetes;https
- job_name: 'kubernetes-nodes'
kubernetes_sd_configs:
- role: node
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
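The configuration above only scrapes the API servers and the nodes. A common addition, sketched here, is an annotation-driven job that scrapes any Pod annotated with prometheus.io/scrape: "true"; these annotations are a community convention, not a built-in Kubernetes feature:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
    - role: pod
    relabel_configs:
    # only keep Pods that opt in via the annotation
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
      action: keep
      regex: "true"
    # let the Pod override the metrics path
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
      action: replace
      target_label: __metrics_path__
      regex: (.+)
    # let the Pod override the scrape port
    - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
      action: replace
      regex: ([^:]+)(?::\d+)?;(\d+)
      replacement: $1:$2
      target_label: __address__
    - action: labelmap
      regex: __meta_kubernetes_pod_label_(.+)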
2. Scrape Target Configuration
# ServiceMonitor definition (requires the Prometheus Operator)
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: myapp-monitor
namespace: monitoring
spec:
selector:
matchLabels:
app: myapp
endpoints:
- port: metrics
interval: 30s
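A ServiceMonitor selects Services rather than Pods, so a Service carrying the app: myapp label and a port named metrics must exist; note also that with an empty namespaceSelector the ServiceMonitor only looks in its own namespace. A minimal matching Service sketch (namespace and names are illustrative):
# Service that the ServiceMonitor above selects
apiVersion: v1
kind: Service
metadata:
  name: myapp
  namespace: monitoring
  labels:
    app: myapp
spec:
  selector:
    app: myapp
  ports:
  - name: metrics
    port: 8080
    targetPort: 8080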
3. Exposing Custom Metrics
# Application Deployment exposing a metrics endpoint
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
      - name: app
        image: myapp:latest
        ports:
        - name: metrics
          containerPort: 8080
        readinessProbe:
          httpGet:
            path: /metrics
            port: metrics
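Before pointing Prometheus at the Deployment, it is worth confirming that the endpoint really returns metrics in the Prometheus text format:
# Forward the metrics port locally (leave running), then in a second terminal:
kubectl port-forward deploy/myapp 8080:8080
curl -s http://localhost:8080/metrics | head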
Alerting Rule Configuration
1. Prometheus Alerting Rules
# alert.rules
groups:
- name: kubernetes-alerts
rules:
- alert: HighCPUUsage
expr: rate(container_cpu_usage_seconds_total[5m]) > 0.8
for: 2m
labels:
severity: warning
annotations:
summary: "High CPU usage detected"
description: "CPU usage is above 80% for more than 2 minutes"
- alert: HighMemoryUsage
expr: (container_memory_usage_bytes / container_spec_memory_limit_bytes) > 0.8
for: 2m
labels:
severity: warning
annotations:
summary: "High memory usage detected"
description: "Memory usage is above 80% for more than 2 minutes"
- alert: PodCrashLooping
expr: rate(kube_pod_container_status_restarts_total[5m]) > 0
for: 10m
labels:
severity: critical
annotations:
summary: "Pod is crash looping"
description: "The Pod's containers have restarted within the last 5 minutes"
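The rules above rely on cAdvisor container metrics. If kube-state-metrics is also installed, object-level conditions can be alerted on as well; a sketch (added to the same rule group) for a node that stops reporting Ready:
  - alert: NodeNotReady
    expr: kube_node_status_condition{condition="Ready",status="true"} == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Node {{ $labels.node }} is not Ready"
      description: "Node {{ $labels.node }} has not reported Ready for 5 minutes"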
2. Alertmanager Configuration
# alertmanager.yml
global:
smtp_smarthost: 'localhost:25'
smtp_from: 'alertmanager@example.com'
route:
group_by: ['alertname']
group_wait: 10s
group_interval: 10s
repeat_interval: 1h
receiver: 'email'
receivers:
- name: 'email'
email_configs:
- to: 'admin@example.com'
send_resolved: true
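Email is only one delivery channel; Alertmanager also supports Slack, generic webhooks, PagerDuty and others. A sketch of an additional Slack receiver (the webhook URL is a placeholder):
receivers:
- name: 'slack'
  slack_configs:
  - api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'  # placeholder incoming-webhook URL
    channel: '#k8s-alerts'
    send_resolved: true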
Log Monitoring
1. Deploying the EFK Stack
# Elasticsearch Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: elasticsearch
namespace: logging
spec:
replicas: 1
selector:
matchLabels:
app: elasticsearch
template:
metadata:
labels:
app: elasticsearch
spec:
containers:
- name: elasticsearch
image: docker.elastic.co/elasticsearch/elasticsearch:7.10.0
env:
- name: discovery.type
value: single-node
ports:
- containerPort: 9200
# Fluentd DaemonSet
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: fluentd
namespace: logging
spec:
selector:
matchLabels:
app: fluentd
template:
metadata:
labels:
app: fluentd
spec:
containers:
- name: fluentd
image: fluent/fluentd-kubernetes-daemonset:v1.12.0-debian-elasticsearch7-1.0
env:
- name: FLUENT_ELASTICSEARCH_HOST
value: "elasticsearch.logging.svc.cluster.local"
- name: FLUENT_ELASTICSEARCH_PORT
value: "9200"
volumeMounts:
- name: varlog
mountPath: /var/log
- name: varlibdockercontainers
mountPath: /var/lib/docker/containers
readOnly: true
volumes:
- name: varlog
hostPath:
path: /var/log
- name: varlibdockercontainers
hostPath:
path: /var/lib/docker/containers
# Kibana Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: kibana
namespace: logging
spec:
replicas: 1
selector:
matchLabels:
app: kibana
template:
metadata:
labels:
app: kibana
spec:
containers:
- name: kibana
image: docker.elastic.co/kibana/kibana:7.10.0
ports:
- containerPort: 5601
env:
- name: ELASTICSEARCH_HOSTS
value: '["http://elasticsearch.logging.svc.cluster.local:9200"]'
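The Fluentd DaemonSet above reaches Elasticsearch via elasticsearch.logging.svc.cluster.local, and Kibana uses the same hostname, so both Deployments need a Service in front of them. A minimal sketch:
# Services for Elasticsearch and Kibana
apiVersion: v1
kind: Service
metadata:
  name: elasticsearch
  namespace: logging
spec:
  selector:
    app: elasticsearch
  ports:
  - port: 9200
    targetPort: 9200
---
apiVersion: v1
kind: Service
metadata:
  name: kibana
  namespace: logging
spec:
  selector:
    app: kibana
  ports:
  - port: 5601
    targetPort: 5601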
2. Log Collection Configuration
# Example Fluentd configuration
<source>
@type tail
path /var/log/containers/*.log
pos_file /var/log/fluentd-containers.log.pos
tag kubernetes.*
read_from_head true
<parse>
@type json
time_format %Y-%m-%dT%H:%M:%S.%NZ
</parse>
</source>
<filter kubernetes.**>
@type kubernetes_metadata
</filter>
<match kubernetes.**>
@type elasticsearch
host elasticsearch.logging.svc.cluster.local
port 9200
logstash_format true
</match>
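The kubernetes_metadata filter enriches each record by querying the API server, so the Fluentd DaemonSet needs a ServiceAccount with read access to Pods and Namespaces (and serviceAccountName: fluentd set in its Pod spec). A minimal RBAC sketch:
# RBAC for the kubernetes_metadata filter
apiVersion: v1
kind: ServiceAccount
metadata:
  name: fluentd
  namespace: logging
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: fluentd
rules:
- apiGroups: [""]
  resources: ["pods", "namespaces"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: fluentd
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: fluentd
subjects:
- kind: ServiceAccount
  name: fluentd
  namespace: logging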
Performance Monitoring
1. Resource Usage Monitoring
# Watch resource usage in real time
watch -n 1 'kubectl top nodes'
watch -n 1 'kubectl top pods'
# View ResourceQuota usage
kubectl describe resourcequota -n namespace-name
# View LimitRange configuration
kubectl describe limitrange -n namespace-name
2. Network Performance Monitoring
# List NetworkPolicies
kubectl get networkpolicies --all-namespaces
# View Service endpoints
kubectl get endpoints --all-namespaces
# Monitor network traffic inside a Pod (iftop must be available in the container image)
kubectl exec -it pod-name -- iftop
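Most production images do not ship iftop. Instead of baking tools into the application image, an ephemeral debug container can be attached to the Pod; the netshoot image below is only an assumption, substitute any image that contains the tools you need:
# Attach a temporary debug container that shares the Pod's network namespace
kubectl debug -it pod-name --image=nicolaka/netshoot -- iftop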
3. Storage Performance Monitoring
# View PersistentVolume and PersistentVolumeClaim status
kubectl get pv,pvc --all-namespaces
# Monitor storage I/O (requires iostat from the sysstat package in the container image)
kubectl exec -it pod-name -- iostat -x 1 5
Monitoring Dashboards
1. Grafana Dashboard Configuration
{
"dashboard": {
"id": null,
"title": "Kubernetes Cluster Monitoring",
"panels": [
{
"title": "Cluster CPU Usage",
"type": "graph",
"targets": [
{
"expr": "sum(rate(container_cpu_usage_seconds_total[5m]))",
"legendFormat": "CPU Usage"
}
]
},
{
"title": "Cluster Memory Usage",
"type": "graph",
"targets": [
{
"expr": "sum(container_memory_usage_bytes)",
"legendFormat": "Memory Usage"
}
]
}
]
}
}
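Rather than importing JSON by hand, dashboards can be loaded at startup through Grafana's file-based provisioning. A rough sketch, assuming the provider file is mounted at /etc/grafana/provisioning/dashboards and the dashboard JSON above (stored in its own ConfigMap) at /var/lib/grafana/dashboards:
# Dashboard provider definition to mount into the Grafana container
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboard-provider
  namespace: monitoring
data:
  provider.yaml: |
    apiVersion: 1
    providers:
    - name: 'kubernetes'
      folder: 'Kubernetes'
      type: file
      options:
        path: /var/lib/grafana/dashboards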
2. Custom Monitoring Panels
# Custom monitoring application
apiVersion: apps/v1
kind: Deployment
metadata:
name: custom-monitor
spec:
replicas: 1
selector:
matchLabels:
app: custom-monitor
template:
metadata:
labels:
app: custom-monitor
spec:
containers:
- name: monitor
image: my-monitor:latest
env:
- name: MONITOR_TARGET
value: "http://myapp-service:8080/health"
ports:
- containerPort: 8080
Alerting Strategy
1. Severity-Tiered Alerts
# Severity-tiered alerting rules
groups:
- name: alert-levels
rules:
- alert: HighCPUWarning
expr: rate(container_cpu_usage_seconds_total[5m]) > 0.7
for: 5m
labels:
severity: warning
annotations:
summary: "CPU usage high"
- alert: HighCPUCritical
expr: rate(container_cpu_usage_seconds_total[5m]) > 0.9
for: 2m
labels:
severity: critical
annotations:
summary: "CPU usage critical"
2. Alert Inhibition
# Alertmanager inhibition rules
inhibit_rules:
- source_match:
severity: 'critical'
target_match:
severity: 'warning'
equal: ['alertname', 'namespace']
3. Alert Routing
# Alertmanager routing configuration
route:
group_by: ['alertname']
receiver: 'default'
routes:
- match:
severity: 'critical'
receiver: 'pager'
- match:
severity: 'warning'
receiver: 'email'
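The route tree above references three receivers, all of which must be declared in the same file or Alertmanager will refuse to load the configuration. A sketch of matching receiver stubs (the webhook URL is a placeholder for an on-call gateway):
receivers:
- name: 'default'
  email_configs:
  - to: 'ops@example.com'
- name: 'email'
  email_configs:
  - to: 'admin@example.com'
- name: 'pager'
  webhook_configs:
  - url: 'http://oncall-gateway.example.com/alert'  # placeholder endpoint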
Monitoring Best Practices
1. Monitoring Coverage
# Make sure monitoring covers all critical components
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: kubernetes-monitoring
spec:
selector:
matchLabels:
k8s-app: kubelet
endpoints:
- port: https-metrics
scheme: https
tlsConfig:
caFile: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
insecureSkipVerify: true
2. Performance Optimization
# Tune the Prometheus configuration
global:
scrape_interval: 30s # adjust the scrape interval to your needs
scrape_timeout: 10s
# Use recording rules to speed up expensive queries
groups:
- name: recording-rules
rules:
- record: job:container_cpu_usage_seconds_total:rate2m
expr: rate(container_cpu_usage_seconds_total[2m])
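Once the recording rule has been evaluated, alerts and dashboard panels can reference the precomputed series instead of recalculating the rate on every query, for example:
  - alert: HighCPUUsage
    expr: job:container_cpu_usage_seconds_total:rate2m > 0.8
    for: 2m
    labels:
      severity: warning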
3. Data Retention Policy
# Prometheus data retention settings
apiVersion: apps/v1
kind: Deployment
metadata:
name: prometheus
spec:
template:
spec:
containers:
- name: prometheus
image: prom/prometheus:v2.30.0
args:
- '--storage.tsdb.retention.time=30d' # keep 30 days of data
- '--storage.tsdb.retention.size=50GB' # cap on-disk usage at 50GB
Troubleshooting
1. Monitoring Component Failures
# Check the status of the monitoring components
kubectl get pods -n monitoring
# View component logs
kubectl logs -n monitoring prometheus-0
# Check Service endpoints
kubectl get endpoints -n monitoring
2. Alerts Not Firing
# Validate the alerting rules
kubectl exec -it prometheus-0 -n monitoring -- promtool check rules /etc/prometheus/alert.rules
# Inspect alert state in the Prometheus UI
kubectl port-forward svc/prometheus 9090:9090 -n monitoring
# then open http://localhost:9090/alerts
3. Performance Issues
# Check resource usage
kubectl top pods -n monitoring
# Inspect Prometheus's own metrics
curl http://prometheus:9090/metrics
# Analyze slow queries
kubectl port-forward svc/prometheus 9090:9090 -n monitoring
# then open http://localhost:9090/graph to inspect query performance
Common Monitoring Commands
| Command | Description |
| --- | --- |
| kubectl top nodes | View node resource usage |
| kubectl top pods | View Pod resource usage |
| kubectl get events | View cluster events |
| kubectl logs pod-name | View Pod logs |
| kubectl describe pod pod-name | View detailed Pod information |
| kubectl get componentstatuses | View control plane component status (deprecated since Kubernetes 1.19) |
| kubectl get deployments --all-namespaces | View Deployments in all namespaces |
Summary
Monitoring and alerting are essential to keeping a Kubernetes cluster running reliably. With a well-built monitoring stack and sensible alerting rules, problems can be detected and handled early, keeping services highly available. In practice, choose monitoring tools and strategies that fit your workloads and requirements, and keep refining the configuration over time.