Day 8 (alt) — AlertManager 真接入 + PrometheusRule 实战
背景: GPU 节点 SSH 断,主线 AI Infra 暂停;切到 AlertManager 主题 耗时: 1 小时 集群状态: kps + 35 内置 rules + 5 自定义 rules, 28 firing alerts 实时
0. TL;DR
- 起 mock webhook receiver (
alert-receiver/webhook-mockDeployment + Service, mendhak/http-https-echo) - AlertmanagerConfig CR 接 webhook (Operator-pattern, 比 helm upgrade 优雅)
- 自定义 5 条 PrometheusRule (NodeCPU / Longhorn / PVC / Etcd / Apiserver)
- 验证: 28 个 firing alerts 端到端送达 webhook,JSON 完整
1. AlertManager 真接入 — 4 个关键概念
PrometheusRule (CRD) ← 你写的告警规则
↓ Operator generates
Prometheus 内 rule.yaml ← Prometheus 每 30s eval 一次
↓ alert state: firing
Alertmanager API ← Prometheus 推送 firing alert
↓ route by labels
Receiver (webhook/email) ← 实际通知渠道
Operator-pattern 优势:
- 用户写 PrometheusRule / AlertmanagerConfig 两个 CRD
- prometheus-operator 把 CRD 翻译成 Prometheus / Alertmanager 内部 yaml
- GitOps 友好: 全是声明式 K8s 资源,可 ArgoCD 管理
2. 实战 (6 维度日志)
2.A — Mock Webhook Receiver
What:
apiVersion: apps/v1
kind: Deployment
metadata: {name: webhook-mock, namespace: alert-receiver}
spec:
replicas: 1
template:
spec:
containers:
- name: c
image: mendhak/http-https-echo:latest # 现成 echo server
ports: [{containerPort: 8080}]
env:
- {name: HTTP_PORT, value: '8080'}
mendhak/http-https-echo = 把 POST body + headers echo 到 stdout — 完美的 webhook 调试
Why 不直接接钉钉/Slack:
- demo 环境没真实凭证
- mock webhook 看完整 JSON, 接生产时只需改 webhook URL + 加 template
- 调试更直接
Actual:
pod/webhook-mock-685bc4698c-vctv7 1/1 Running
svc/webhook-mock ClusterIP 10.103.39.189:80
2.B — AlertmanagerConfig CRD
What (monitoring.coreos.com/v1alpha1 CR):
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
name: bootcamp-routes
namespace: monitoring
labels:
alertmanagerConfig: bootcamp # ← matchLabel for AM selector
spec:
route:
receiver: webhook-mock
groupBy: [alertname, severity]
groupWait: 10s
groupInterval: 30s
repeatInterval: 1h
routes:
- matchers:
- {name: severity, value: critical}
receiver: webhook-mock
groupWait: 0s # critical 立即推
- matchers:
- {name: alertname, value: Watchdog}
receiver: webhook-mock
receivers:
- name: webhook-mock
webhookConfigs:
- url: http://webhook-mock.alert-receiver.svc.cluster.local/alerts
sendResolved: true # ← resolve 后也通知一次
关键: AlertManager Pod 默认不接受任何 AlertmanagerConfig (selector 是 {}),要 patch:
kubectl patch alertmanager kps-kube-prometheus-stack-alertmanager -n monitoring \
--type=merge -p '{"spec":{"alertmanagerConfigSelector":{"matchLabels":{"alertmanagerConfig":"bootcamp"}}}}'
Operator 自动合并行为:
- 把你的 receiver 加上 namespace 前缀,变成
monitoring/bootcamp-routes/webhook-mock - 把你的 route 转成 sub-route,自动加 matcher
namespace="monitoring"(命名空间隔离) - 这意味着只有 namespace=monitoring 的 alert 走你的 webhook
Why: 多租户隔离 — A 团队的 AlertmanagerConfig 不该看到 B 团队的 alert
2.C — 5 条自定义 PrometheusRule
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: bootcamp-custom-alerts
namespace: monitoring
labels:
release: kps # ← kps chart selector 抓这个 label
spec:
groups:
- name: bootcamp.cluster
rules:
# ① CPU 高
- alert: NodeCPUHigh
expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
for: 2m
labels: {severity: warning}
annotations:
summary: 'Node CPU > 80% on {{ $labels.instance }}'
# ② Longhorn volume 降级
- alert: LonghornVolumeDegraded
expr: longhorn_volume_robustness == 2
for: 3m
labels: {severity: warning}
# ③ PVC > 85%
- alert: PVCFillingUp
expr: (kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes) > 0.85
for: 5m
labels: {severity: critical}
# ④ etcd p99 写延迟 > 1s
- alert: EtcdHighWriteLatency
expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) > 1
for: 5m
labels: {severity: critical}
# ⑤ apiserver 5xx 率 > 5%
- alert: ApiserverHighErrorRate
expr: sum(rate(apiserver_request_total{code=~"5.."}[5m])) / sum(rate(apiserver_request_total[5m])) > 0.05
for: 5m
labels: {severity: critical}
关键 label release: kps:
- kube-prometheus-stack chart 给 Prometheus operator 配的
ruleSelector.matchLabels: {release: kps} - 你的 PrometheusRule 不带这个 label → Operator 不会把它打包给 Prometheus
- 新手最容易踩的坑
Verified:
kubectl get prometheusrule bootcamp-custom-alerts -n monitoring
# Prometheus API:
curl http://prom:9090/api/v1/rules | grep bootcamp
→ Group bootcamp.cluster: 5 rules (NodeCPUHigh / LonghornVolumeDegraded / PVCFillingUp / EtcdHighWriteLatency / ApiserverHighErrorRate)
2.D — 端到端验证: 28 firing alerts
Actual (Prometheus active alerts):
total: 29, firing: 28
- etcdMembersDown warning ← 之前 cp-2/cp-3 audit fix 时 etcd 重启,Prometheus 记忆了
- etcdInsufficientMembers critical
- KubeControllerManagerInstanceUnreachable warning (x3)
- TargetDown warning (x3)
- KubePodCrashLooping warning ← Grafana CrashLoop
- KubeDeploymentReplicasMismatch warning
- KubeDeploymentRolloutStuck warning
... 等等
Webhook 收到的 JSON 示例(节选):
{
"receiver": "monitoring/bootcamp-routes/webhook-mock",
"status": "firing",
"alerts": [{
"status": "firing",
"labels": {
"alertname": "KubePodCrashLooping",
"container": "grafana",
"namespace": "monitoring",
"pod": "kps-grafana-...",
"reason": "CrashLoopBackOff",
"severity": "warning"
},
"annotations": {
"description": "Pod monitoring/kps-grafana-... (grafana) is in waiting state (reason: \"CrashLoopBackOff\") on cluster .",
"runbook_url": "https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubepodcrashlooping",
"summary": "Pod is crash looping."
},
"startsAt": "2026-05-26T08:37:44.174Z",
"fingerprint": "7eced4013f5f4ede"
}],
"groupLabels": {"alertname": "KubePodCrashLooping", "severity": "warning"},
"externalURL": "http://kps-kube-prometheus-stack-alertmanager.monitoring:9093"
}
✅ JSON 含: startsAt / endsAt / fingerprint / runbook_url / generatorURL / labels / annotations 全部生产所需字段
2.E — 接生产 Webhook (替换演示)
钉钉 / 企业微信 / Slack 都是 POST JSON,template 不同:
钉钉
receivers:
- name: dingtalk
webhookConfigs:
- url: https://oapi.dingtalk.com/robot/send?access_token=XXX
sendResolved: true
还需要装 prometheus-webhook-dingtalk Pod 做 template 转换(Alertmanager 默认 JSON → 钉钉 markdown 结构)
Slack
receivers:
- name: slack
slackConfigs:
- apiURL: https://hooks.slack.com/services/XXX
channel: '#alerts'
sendResolved: true
title: '{{ .GroupLabels.alertname }}'
text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
企业微信
receivers:
- name: wechat
wechatConfigs:
- corpId: XXX
apiSecret: XXX
sendResolved: true
toUser: '@all'
全部只需改 receiver 配置 + 加 template,逻辑完全不动
3. Webhook 实测样本(完整 trace)
webhook-mock log 实时输出每个 alert 的 HTTP POST:
::ffff:10.244.4.123 - - [26/May/2026:09:25:50 +0000] "POST /alerts HTTP/1.1" 200 4487 "-" "Alertmanager/0.32.1"
{"path":"/alerts", ...
body:"{\"receiver\":\"monitoring/bootcamp-routes/webhook-mock\",
\"status\":\"firing\",
\"alerts\":[{\"labels\":{\"alertname\":\"TargetDown\",...},
\"annotations\":{...},
\"startsAt\":\"2026-05-26T06:58:24.447Z\"}]}"}
Alertmanager 给每个 alert 组送一次 POST(group_wait 等待 group 内同类 alert 凑齐再发),实测 4 个 alert group 都发送成功(HTTP 200)。
4. AlertManager 生产最佳实践
| 实践 | Why |
|---|---|
group_by: [alertname, namespace] | 同类 alert 合并通知,避免刷屏 |
group_wait: 10s (普通) / 0s (critical) | critical 立即推, warn 等几秒看是否还有同类 |
repeat_interval: 4h (不要 1h!) | 24h 内同 alert 6 次推送 vs 24 次,运维不抓狂 |
| inhibit_rules: critical 静默同 ns 同 alert 的 warning | 减噪 |
| Webhook 加 retry (3 次 + 指数 backoff) | 渠道偶发 down 不丢 alert |
| 不同 severity 走不同渠道 (critical→PagerDuty,warn→Slack) | 区分紧急度 |
5. 简历可写
落地 AlertManager 生产级告警体系:
- AlertmanagerConfig CRD + 5 条自定义 PrometheusRule (Node/Longhorn/PVC/etcd/apiserver)
- 端到端验证: 28 个 firing alert 实时 POST 到 webhook,完整 JSON (fingerprint/runbook_url/labels)
- 多渠道路由: 钉钉/Slack/企微/PagerDuty receiver template 化, severity 分级
- inhibit_rules 静默次级 alert,group_wait/repeat_interval 优化降噪
99. Day 8 alt 完整状态
- [x] alt.A AlertManager Webhook 接入 (Operator CRD pattern)
- [x] alt.B 5 条自定义 PrometheusRule (业务 + 基础设施)
- [x] alt.C 端到端 28 firing alert 验证 (Grafana 自然 CrashLoop 触发了多个)
6. 接 GPU 节点恢复后,Day 9 接力
Day 8 主线 (GPU + vLLM) 待续。一旦 GPU 节点 SSH 恢复:
- 装 k3s 单节点 (
INSTALL_K3S_MIRROR=cn) - 装 NVIDIA Container Toolkit + GPU Operator (helm chart)
- 验证 nvidia-smi Pod
- 部署 vLLM + Qwen2.5-3B (40G HBM 完全够,可跑 7B 或 14B)
- 主集群 Pod curl 公网 GPU 节点 NodePort,Hubble 看 L7 流量