Day 8 (alt): Real AlertManager Integration + PrometheusRule in Practice

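Background: SSH to the GPU node dropped, so the main AI Infra track is paused; switched to the AlertManager topic.
Time spent: 1 hour
Cluster state: kps (kube-prometheus-stack) + 35 built-in rules + 5 custom rules, 28 alerts firing live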


0. TL;DR

  1. Stand up a mock webhook receiver (webhook-mock Deployment + Service in the alert-receiver namespace, image mendhak/http-https-echo)
  2. Connect the webhook through an AlertmanagerConfig CR (Operator pattern, cleaner than a helm upgrade)
  3. Add 5 custom PrometheusRule alerts (NodeCPU / Longhorn / PVC / Etcd / Apiserver)
  4. Verify: 28 firing alerts delivered end to end to the webhook, with complete JSON

1. Real AlertManager Integration: 4 Key Concepts

PrometheusRule (CRD)       ← the alerting rules you write
   ↓ Operator generates
Prometheus rule.yaml       ← Prometheus evaluates it every 30s
   ↓ alert state: firing
Alertmanager API           ← Prometheus pushes firing alerts
   ↓ route by labels
Receiver (webhook/email)   ← the actual notification channel
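
To make the "evaluates every 30s" and "alert state: firing" steps concrete, here is a minimal Python sketch of how a rule's `for:` clause delays the move from pending to firing. The function name `alert_states` and the sample readings are illustrative assumptions, not Prometheus code; only the 30s interval comes from the diagram above.

```python
def alert_states(samples, threshold, for_seconds, interval=30):
    """Return the alert state after each evaluation.

    samples: one metric reading per evaluation, `interval` seconds apart.
    The alert is 'pending' once the condition holds and becomes 'firing'
    only after it has held continuously for at least `for_seconds`.
    """
    states = []
    active_since = None  # evaluation time at which the condition first held
    for i, value in enumerate(samples):
        now = i * interval
        if value > threshold:
            if active_since is None:
                active_since = now
            held = now - active_since
            states.append("firing" if held >= for_seconds else "pending")
        else:
            active_since = None  # any dip below the threshold resets the timer
            states.append("inactive")
    return states


# A reading of 90 on six consecutive evaluations, threshold 80, `for: 2m`.
print(alert_states([90] * 6, threshold=80, for_seconds=120))
```

With these inputs the first four evaluations are pending and the alert fires from the fifth one (t = 120s) onward, which is when Prometheus would start pushing it to the Alertmanager API.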

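Operator-pattern advantages:

  • Users write two kinds of custom resources: PrometheusRule and AlertmanagerConfig
  • prometheus-operator translates them into the internal YAML configuration of Prometheus / Alertmanager
  • GitOps-friendly: everything is a declarative K8s resource that ArgoCD can manage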

2. Hands-On (6-dimension log)

2.A — Mock Webhook Receiver

What:

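apiVersion: apps/v1
kind: Deployment
metadata: {name: webhook-mock, namespace: alert-receiver}
spec:
  replicas: 1
  selector:
    matchLabels: {app: webhook-mock}            # required by apps/v1; the label value is an assumption
  template:
    metadata:
      labels: {app: webhook-mock}               # must match the selector above
    spec:
      containers:
      - name: c
        image: mendhak/http-https-echo:latest    # off-the-shelf echo server
        ports: [{containerPort: 8080}]
        env:
        - {name: HTTP_PORT, value: '8080'}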

mendhak/http-https-echo echoes the POST body and headers to stdout, which makes it a convenient tool for debugging webhooks.
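
If pulling an image is not an option, a comparable echo receiver can be sketched with the Python standard library. This is an illustrative stand-in, not how mendhak/http-https-echo is implemented; `summarize`, `EchoHandler` and `serve` are hypothetical names.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def summarize(raw: bytes) -> str:
    """Turn an Alertmanager webhook body into one log line."""
    payload = json.loads(raw)
    names = [a["labels"].get("alertname", "?") for a in payload.get("alerts", [])]
    return f'{payload.get("status")} receiver={payload.get("receiver")} alerts={names}'


class EchoHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        print(summarize(body))   # echo to stdout, like the mock receiver
        self.send_response(200)  # a 2xx response tells the sender the POST was accepted
        self.end_headers()


def serve(port: int = 8080) -> None:
    """Blocking call: listen for webhook POSTs on the given port."""
    HTTPServer(("", port), EchoHandler).serve_forever()
```

Calling `serve()` starts the listener; `summarize` can be exercised on its own with a captured payload.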

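Why not wire DingTalk/Slack directly:

  • The demo environment has no real credentials
  • The mock webhook shows the full JSON; moving to production only requires changing the webhook URL and adding a template
  • Debugging is more direct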

Actual:

pod/webhook-mock-685bc4698c-vctv7   1/1   Running
svc/webhook-mock                    ClusterIP 10.103.39.189:80

2.B — AlertmanagerConfig CRD

What (monitoring.coreos.com/v1alpha1 CR):

apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: bootcamp-routes
  namespace: monitoring
  labels:
    alertmanagerConfig: bootcamp     # ← matchLabel for AM selector
spec:
  route:
    receiver: webhook-mock
    groupBy: [alertname, severity]
    groupWait: 10s
    groupInterval: 30s
    repeatInterval: 1h
    routes:
    - matchers:
      - {name: severity, value: critical}
      receiver: webhook-mock
      groupWait: 0s                  # push critical alerts immediately
    - matchers:
      - {name: alertname, value: Watchdog}
      receiver: webhook-mock
  receivers:
  - name: webhook-mock
    webhookConfigs:
    - url: http://webhook-mock.alert-receiver.svc.cluster.local/alerts
      sendResolved: true              # ← also notify once when the alert resolves
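
The routing tree above can be read as "try child routes in order, first match wins, inherit anything the child leaves unset". A minimal Python sketch of that reading follows; it is a simplification under stated assumptions (equality matchers only, no `continue`), and `ROOT` and `resolve` are hypothetical names rather than Alertmanager APIs.

```python
ROOT = {
    "receiver": "webhook-mock",
    "group_wait": "10s",
    "routes": [
        {"match": {"severity": "critical"}, "receiver": "webhook-mock", "group_wait": "0s"},
        {"match": {"alertname": "Watchdog"}, "receiver": "webhook-mock"},
    ],
}


def resolve(labels, route=ROOT):
    """Return (receiver, group_wait) for an alert's labels.

    Child routes are tried in order; the first match wins, and any
    setting the child leaves out is inherited from its parent.
    """
    for child in route.get("routes", []):
        if all(labels.get(k) == v for k, v in child["match"].items()):
            merged = {**route, **child}                # child overrides parent
            merged["routes"] = child.get("routes", [])  # descend into the child only
            return resolve(labels, merged)
    return route["receiver"], route["group_wait"]


print(resolve({"alertname": "PVCFillingUp", "severity": "critical"}))
print(resolve({"alertname": "Watchdog", "severity": "none"}))
```

The critical alert resolves to a 0s wait, while Watchdog matches the second child and inherits the parent's 10s.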

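Key point: the Alertmanager resource only merges AlertmanagerConfig objects that match spec.alertmanagerConfigSelector. In Kubernetes label-selector semantics an unset (null) selector matches nothing, whereas an empty selector {} matches everything, so check what your chart actually rendered. In this run the selector was patched explicitly to match the label above: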

kubectl patch alertmanager kps-kube-prometheus-stack-alertmanager -n monitoring \
  --type=merge -p '{"spec":{"alertmanagerConfigSelector":{"matchLabels":{"alertmanagerConfig":"bootcamp"}}}}'

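What the Operator does automatically when merging:

  • Prefixes your receiver with the namespace and resource name, giving monitoring/bootcamp-routes/webhook-mock
  • Turns your route into a sub-route and automatically adds the matcher namespace="monitoring" (namespace isolation)
  • As a result, only alerts with namespace=monitoring reach your webhook

Why: multi-tenant isolation. Team A's AlertmanagerConfig should not see team B's alerts.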
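
The merge behaviour described above can be sketched in a few lines of Python. This illustrates the observable result only (prefixed receiver names plus an enforced namespace matcher); it is not the Operator's actual code, and `scope_route` and `namespaced` are hypothetical helper names.

```python
def scope_route(route, prefix):
    """Prefix the receiver of a route and of all its child routes."""
    scoped = {**route, "receiver": prefix + route["receiver"]}
    scoped["routes"] = [scope_route(r, prefix) for r in route.get("routes", [])]
    return scoped


def namespaced(config_namespace, config_name, route, receivers):
    """Scope one AlertmanagerConfig before it is merged into the global config."""
    prefix = f"{config_namespace}/{config_name}/"
    scoped_receivers = [{**r, "name": prefix + r["name"]} for r in receivers]
    scoped_route = scope_route(route, prefix)
    # Enforced matcher: only alerts from the config's own namespace reach this sub-route.
    scoped_route["matchers"] = [f'namespace="{config_namespace}"'] + route.get("matchers", [])
    return scoped_route, scoped_receivers


route, receivers = namespaced(
    "monitoring", "bootcamp-routes",
    {"receiver": "webhook-mock"},
    [{"name": "webhook-mock"}],
)
print(route["receiver"], route["matchers"])
```

The printed receiver name matches the `receiver` field that shows up in the webhook payload later in this log.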

2.C: 5 Custom PrometheusRule Alerts

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: bootcamp-custom-alerts
  namespace: monitoring
  labels:
    release: kps              # ← the kps chart's rule selector picks up this label
spec:
  groups:
  - name: bootcamp.cluster
    rules:
    # ① High CPU
    - alert: NodeCPUHigh
      expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
      for: 2m
      labels: {severity: warning}
      annotations:
        summary: 'Node CPU > 80% on {{ $labels.instance }}'

    # ② Longhorn volume degraded
    - alert: LonghornVolumeDegraded
      expr: longhorn_volume_robustness == 2
      for: 3m
      labels: {severity: warning}

    # ③ PVC usage > 85%
    - alert: PVCFillingUp
      expr: (kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes) > 0.85
      for: 5m
      labels: {severity: critical}

    # ④ etcd p99 write (WAL fsync) latency > 1s
    - alert: EtcdHighWriteLatency
      expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) > 1
      for: 5m
      labels: {severity: critical}

    # ⑤ apiserver 5xx rate > 5%
    - alert: ApiserverHighErrorRate
      expr: sum(rate(apiserver_request_total{code=~"5.."}[5m])) / sum(rate(apiserver_request_total[5m])) > 0.05
      for: 5m
      labels: {severity: critical}
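
As a worked example of the NodeCPUHigh expression, the arithmetic can be reproduced in Python. The per-core idle rates below are made-up sample values, and `cpu_usage_percent` is a hypothetical helper, not part of any Prometheus client.

```python
def cpu_usage_percent(idle_rates):
    """Mirror `100 - avg(rate(idle)) * 100` for one instance.

    idle_rates: per-core rate() of node_cpu_seconds_total{mode="idle"},
    i.e. the fraction of each second the core spent idle (0.0 to 1.0).
    """
    avg_idle = sum(idle_rates) / len(idle_rates)
    return 100 - avg_idle * 100


usage = cpu_usage_percent([0.10, 0.20, 0.15, 0.15])  # 4 cores, mostly busy
print(round(usage, 1), usage > 80)
```

An average idle fraction of 0.15 gives about 85% usage, so the `> 80` condition holds and the alert would enter pending, then fire once the condition has held for the 2m `for:` window.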

The key label release: kps:

  • The kube-prometheus-stack chart configures the Prometheus resource with ruleSelector.matchLabels: {release: kps}
  • If your PrometheusRule lacks this label, the Operator will not load it into Prometheus
  • This is the most common pitfall for beginners
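
The selector check behind this pitfall is plain label matching. A minimal Python sketch of `matchLabels` semantics, with `selects` as a hypothetical helper name and `team: bootcamp` as a made-up label:

```python
def selects(match_labels, labels):
    """matchLabels semantics: every selector pair must be present on the object."""
    return all(labels.get(k) == v for k, v in match_labels.items())


rule_selector = {"release": "kps"}
print(selects(rule_selector, {"release": "kps"}))    # rule carrying the label
print(selects(rule_selector, {"team": "bootcamp"}))  # rule without it
```

The first rule is selected and the second is ignored without any error, which is why a missing label is easy to overlook.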

Verified:

kubectl get prometheusrule bootcamp-custom-alerts -n monitoring
# Prometheus API:
curl http://prom:9090/api/v1/rules | grep bootcamp
→ Group bootcamp.cluster: 5 rules (NodeCPUHigh / LonghornVolumeDegraded / PVCFillingUp / EtcdHighWriteLatency / ApiserverHighErrorRate)

2.D: End-to-End Verification with 28 Firing Alerts

Actual (Prometheus active alerts):

total: 29, firing: 28

- etcdMembersDown                              warning  ← etcd restarted during the earlier cp-2/cp-3 audit fix, and Prometheus still holds the alert
- etcdInsufficientMembers                      critical
- KubeControllerManagerInstanceUnreachable     warning (x3)
- TargetDown                                    warning (x3)
- KubePodCrashLooping                          warning  ← Grafana CrashLoop
- KubeDeploymentReplicasMismatch               warning
- KubeDeploymentRolloutStuck                   warning
... and so on

Sample JSON received by the webhook (excerpt):

{
  "receiver": "monitoring/bootcamp-routes/webhook-mock",
  "status": "firing",
  "alerts": [{
    "status": "firing",
    "labels": {
      "alertname": "KubePodCrashLooping",
      "container": "grafana",
      "namespace": "monitoring",
      "pod": "kps-grafana-...",
      "reason": "CrashLoopBackOff",
      "severity": "warning"
    },
    "annotations": {
      "description": "Pod monitoring/kps-grafana-... (grafana) is in waiting state (reason: \"CrashLoopBackOff\") on cluster .",
      "runbook_url": "https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubepodcrashlooping",
      "summary": "Pod is crash looping."
    },
    "startsAt": "2026-05-26T08:37:44.174Z",
    "fingerprint": "7eced4013f5f4ede"
  }],
  "groupLabels": {"alertname": "KubePodCrashLooping", "severity": "warning"},
  "externalURL": "http://kps-kube-prometheus-stack-alertmanager.monitoring:9093"
}

✅ The JSON carries all the fields production needs: startsAt / endsAt / fingerprint / runbook_url / generatorURL / labels / annotations
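
A receiver can assert that these fields are present before relying on them. The following Python sketch checks one alert object against the per-alert keys listed above; `REQUIRED` and `missing_fields` are hypothetical names, and the sample alert is the trimmed excerpt shown earlier, not a full payload.

```python
REQUIRED = ("status", "labels", "annotations", "startsAt", "endsAt",
            "generatorURL", "fingerprint")


def missing_fields(alert):
    """Return the expected per-alert keys that are absent from `alert`."""
    return [key for key in REQUIRED if key not in alert]


excerpt_alert = {
    "status": "firing",
    "labels": {"alertname": "KubePodCrashLooping"},
    "annotations": {"summary": "Pod is crash looping."},
    "startsAt": "2026-05-26T08:37:44.174Z",
    "fingerprint": "7eced4013f5f4ede",
}
print(missing_fields(excerpt_alert))
```

Run against the excerpt, this reports endsAt and generatorURL only because the excerpt elides them; an untrimmed payload would be expected to return an empty list.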

2.E: Wiring Production Webhooks (replacement demo)

DingTalk / WeCom / Slack all accept a JSON POST; only the template differs:

DingTalk

receivers:
- name: dingtalk
  webhookConfigs:
  - url: https://oapi.dingtalk.com/robot/send?access_token=XXX
    sendResolved: true

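Alertmanager's default JSON is not in DingTalk's message format, so you also need a prometheus-webhook-dingtalk Pod to do the template conversion (Alertmanager default JSON → DingTalk markdown structure). In practice the webhook URL points at that Pod, which then forwards the converted message to the robot URL above.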
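
The kind of conversion such an adapter performs can be sketched in Python. This is an illustration only: the real prometheus-webhook-dingtalk uses its own templates, `to_dingtalk` is a hypothetical function, and the output shape (`msgtype: markdown` with `title` and `text`) reflects my understanding of DingTalk's custom-robot message format.

```python
def to_dingtalk(payload):
    """Convert an Alertmanager webhook payload into a DingTalk markdown message."""
    title = f'[{payload["status"]}] {payload["groupLabels"].get("alertname", "alerts")}'
    lines = [f"### {title}"]
    for alert in payload["alerts"]:
        summary = alert.get("annotations", {}).get("summary", "")
        severity = alert.get("labels", {}).get("severity", "unknown")
        lines.append(f"- **{severity}** {summary}")
    return {"msgtype": "markdown", "markdown": {"title": title, "text": "\n".join(lines)}}


sample = {
    "status": "firing",
    "groupLabels": {"alertname": "KubePodCrashLooping", "severity": "warning"},
    "alerts": [{
        "labels": {"severity": "warning"},
        "annotations": {"summary": "Pod is crash looping."},
    }],
}
print(to_dingtalk(sample)["markdown"]["text"])
```

The sample input is a trimmed version of the KubePodCrashLooping payload captured in 2.D.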

Slack

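receivers:
- name: slack
  slackConfigs:
  - apiURL:                          # in the AlertmanagerConfig CRD this is a Secret reference, not an inline URL
      name: slack-webhook            # hypothetical Secret holding https://hooks.slack.com/services/XXX
      key: url
    channel: '#alerts'
    sendResolved: true
    title: '{{ .GroupLabels.alertname }}'
    text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'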
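
To see what the `title` and `text` templates in the Slack receiver produce, here is a rough Python equivalent. It approximates the template output for illustration and is not how Alertmanager renders templates; `render_slack` is a hypothetical name and the two summaries are made-up samples.

```python
def render_slack(payload):
    """Approximate what the title/text templates produce for one notification."""
    title = payload["groupLabels"].get("alertname", "")
    # `range` emits each summary back to back; the template adds no separator.
    text = "".join(a["annotations"].get("summary", "") for a in payload["alerts"])
    return title, text


two_alerts = {
    "groupLabels": {"alertname": "TargetDown"},
    "alerts": [
        {"annotations": {"summary": "Target a is down."}},
        {"annotations": {"summary": "Target b is down."}},
    ],
}
print(render_slack(two_alerts))
```

Because the summaries run together, adding a newline inside the `range` block is a common tweak when a group holds more than one alert.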

WeCom (Enterprise WeChat)

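receivers:
- name: wechat
  wechatConfigs:
  - corpID: XXX                      # the CRD field is spelled corpID
    apiSecret:                       # Secret reference rather than an inline value
      name: wechat-secret            # hypothetical Secret name
      key: apiSecret
    sendResolved: true
    toUser: '@all'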

In every case you only change the receiver configuration and add a template; the routing logic stays untouched.


3. Measured Webhook Sample (full trace)

The webhook-mock log prints each alert HTTP POST in real time:

::ffff:10.244.4.123 - - [26/May/2026:09:25:50 +0000] "POST /alerts HTTP/1.1" 200 4487 "-" "Alertmanager/0.32.1"
{"path":"/alerts", ...
 body:"{\"receiver\":\"monitoring/bootcamp-routes/webhook-mock\",
        \"status\":\"firing\",
        \"alerts\":[{\"labels\":{\"alertname\":\"TargetDown\",...},
                    \"annotations\":{...},
                    \"startsAt\":\"2026-05-26T06:58:24.447Z\"}]}"}

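Alertmanager sends one POST per alert group (group_wait holds back the first notification so that alerts belonging to the same group can be batched together). In this run all 4 alert groups were delivered successfully (HTTP 200).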
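
The "one POST per group" behaviour follows from bucketing alerts by their groupBy labels. A minimal Python sketch, with `group_alerts` as a hypothetical helper and the `job` values invented for illustration:

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the group_by labels; one POST per bucket."""
    groups = defaultdict(list)
    for labels in alerts:
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


alerts = [
    {"alertname": "TargetDown", "severity": "warning", "job": "a"},
    {"alertname": "TargetDown", "severity": "warning", "job": "b"},
    {"alertname": "TargetDown", "severity": "warning", "job": "c"},
    {"alertname": "KubePodCrashLooping", "severity": "warning"},
    {"alertname": "etcdInsufficientMembers", "severity": "critical"},
]
groups = group_alerts(alerts, ["alertname", "severity"])
print(len(groups), len(groups[("TargetDown", "warning")]))
```

Five alerts collapse into three groups here, and the three TargetDown alerts travel in a single POST.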


4. AlertManager Production Best Practices

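Practice | Why
group_by: [alertname, namespace] | Merges alerts of the same kind into one notification and avoids flooding the channel
group_wait: 10s (normal) / 0s (critical) | Critical is pushed immediately; warnings wait a few seconds to see whether more of the same kind arrive
repeat_interval: 4h (not 1h!) | The same alert is pushed 6 times in 24h instead of 24 times, so on-call staff stay sane
inhibit_rules: critical silences warning for the same namespace and alert | Noise reduction
Add retry to the webhook (3 attempts + exponential backoff) | An occasional channel outage does not lose alerts
Route each severity to a different channel (critical → PagerDuty, warn → Slack) | Separates levels of urgency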
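
The inhibition practice in the list above amounts to a simple filter: a warning is suppressed when a critical alert with the same namespace and alertname is firing. A Python sketch of that rule, where `apply_inhibition` is a hypothetical function and the sample alerts are made up:

```python
def apply_inhibition(alerts):
    """Drop warning alerts that share namespace and alertname with a firing critical alert."""
    critical = {
        (a.get("namespace"), a.get("alertname"))
        for a in alerts
        if a.get("severity") == "critical"
    }
    return [
        a for a in alerts
        if not (a.get("severity") == "warning"
                and (a.get("namespace"), a.get("alertname")) in critical)
    ]


firing = [
    {"alertname": "PVCFillingUp", "namespace": "db", "severity": "critical"},
    {"alertname": "PVCFillingUp", "namespace": "db", "severity": "warning"},
    {"alertname": "PVCFillingUp", "namespace": "web", "severity": "warning"},
]
print([(a["namespace"], a["severity"]) for a in apply_inhibition(firing)])
```

Only the warning in the db namespace is dropped; the warning in web survives because no critical alert covers that namespace.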

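5. Resume-Ready Summary

Delivered a production-grade AlertManager alerting system:

  • AlertmanagerConfig CRD + 5 custom PrometheusRule alerts (Node/Longhorn/PVC/etcd/apiserver)
  • End-to-end verification: 28 firing alerts POSTed to the webhook in real time, with complete JSON (fingerprint/runbook_url/labels)
  • Multi-channel routing: templated receivers for DingTalk/Slack/WeCom/PagerDuty, tiered by severity
  • inhibit_rules to silence secondary alerts, plus group_wait/repeat_interval tuning for noise reduction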

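99. Day 8 alt Full Status

  • [x] alt.A AlertManager webhook integration (Operator CRD pattern)
  • [x] alt.B 5 custom PrometheusRule alerts (business + infrastructure)
  • [x] alt.C End-to-end verification with 28 firing alerts (the naturally occurring Grafana CrashLoop triggered several of them)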

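6. After the GPU Node Recovers: Handoff to Day 9

The Day 8 main track (GPU + vLLM) is to be continued. Once SSH to the GPU node is restored:

  1. Install single-node k3s (INSTALL_K3S_MIRROR=cn)
  2. Install NVIDIA Container Toolkit + GPU Operator (helm chart)
  3. Verify with an nvidia-smi Pod
  4. Deploy vLLM + Qwen2.5-3B (40G of HBM is more than enough; 7B or 14B would also fit)
  5. From a Pod in the main cluster, curl the NodePort on the public GPU node and watch the L7 traffic in Hubble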