AI Infra Bootcamp

Day 11: End-to-End AI Agent Application, Tying Together Everything from Day 1-10

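Goal: a real application (a chainlit chat UI), deployed through a complete GitOps pipeline, calling vLLM on the GPU cluster across the WAN
Time: 3-4 hours
Value: ⭐ the first time every Day 1-10 capability is actually put to use; the most complete story to tell in an interview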


0. TL;DR

Developer writes the chainlit app + Dockerfile
        ↓ git push
Gitea (notes-app repo), set up in the Day 8 bonus
        ↓ Jenkins SCM trigger
Jenkins Pipeline (Kaniko build → Harbor push :git-sha)
        ↓ image tag update
notes-deploy repo (manifest)
        ↓ ArgoCD watch
ArgoCD sync → main cluster, chat namespace
        ├─ chat-ui Deployment (chainlit + Prometheus metrics)
        ├─ vllm-upstream Service + manual Endpoints (pointing to m1:30800)
        └─ ServiceMonitor (Prometheus scrapes the application metrics)
        ↓ across the WAN
m1:30800 SSH tunnel → gpu1 k3s vllm-3b:8000
        ↓
Qwen2.5-3B-Instruct on A800 MIG slice (sliced on Day 10)

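All the Day 1-10 capabilities put to use:

| Day | What was used |
| --- | --- |
| Day 1 (Cilium) | Pod-to-pod networking / Hubble L7 capture of chat → vllm traffic |
| Day 4 (Longhorn) | Sessions survive Pod restarts (optional; simplified to chainlit memory mode) |
| Day 5 (Kyverno/PSA) | baseline enforcement on the chat namespace |
| Day 6 (Prometheus) | chat_requests_total / chat_first_token_latency_seconds application metrics |
| Day 7 (Harbor / ArgoCD) | Image storage / GitOps deployment |
| Day 8 bonus (Gitea / Jenkins) | Source code + CI build |
| Day 8 main track (vLLM on k3s) | LLM inference backend |
| Day 9 (DCGM cross-WAN) | GPU metrics visible across clusters |
| Day 10 (MIG) | A800 slicing; vllm occupies 1 slice |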

1. chainlit chat application code

1.1 app.py (core, ~80 lines)

import os, time, chainlit as cl
from openai import AsyncOpenAI
from prometheus_client import Counter, Histogram, start_http_server

# === Application metrics ===
CHAT_REQUESTS = Counter("chat_requests_total", "Total chat completions", ["model", "status"])
CHAT_TOKENS   = Counter("chat_tokens_total", "Total tokens", ["model", "kind"])  # kind: prompt|completion
CHAT_LATENCY  = Histogram("chat_first_token_latency_seconds", "TTFT", ["model"],
                          buckets=[0.1, 0.25, 0.5, 1, 2, 5, 10])
CHAT_E2E      = Histogram("chat_e2e_latency_seconds", "E2E", ["model"],
                          buckets=[0.5, 1, 2, 5, 10, 30, 60])

start_http_server(9100)   # /metrics on :9100

client = AsyncOpenAI(base_url=os.getenv("VLLM_URL"), api_key="not-needed")
MODEL = os.getenv("MODEL", "qwen2.5-3b")

@cl.on_message
async def on_message(msg: cl.Message):
    response = cl.Message(content="")
    first_token = None
    t0 = time.time()
    prompt_tokens = completion_tokens = 0
    status = "success"
    history = (cl.user_session.get("history") or []) + [{"role": "user", "content": msg.content}]
    try:
        stream = await client.chat.completions.create(
            model=MODEL, messages=history, stream=True,
            stream_options={"include_usage": True},
            max_tokens=512, temperature=0.7,
        )
        async for chunk in stream:
            if chunk.choices and chunk.choices[0].delta.content:
                if first_token is None:
                    first_token = time.time() - t0
                    CHAT_LATENCY.labels(model=MODEL).observe(first_token)
                await response.stream_token(chunk.choices[0].delta.content)
            if chunk.usage:
                prompt_tokens = chunk.usage.prompt_tokens
                completion_tokens = chunk.usage.completion_tokens
    except Exception as e:
        status = "error"
        response.content = f"❌ {e}"
    finally:
        CHAT_REQUESTS.labels(model=MODEL, status=status).inc()
        CHAT_TOKENS.labels(model=MODEL, kind="prompt").inc(prompt_tokens)
        CHAT_TOKENS.labels(model=MODEL, kind="completion").inc(completion_tokens)
        CHAT_E2E.labels(model=MODEL).observe(time.time() - t0)
    cl.user_session.set("history", history + [{"role": "assistant", "content": response.content}])
    await response.update()
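The handler above measures two latencies around the token stream: time to first token (TTFT) and end-to-end time. As a minimal sketch of just that timing logic, assuming a hypothetical `fake_token_stream` that stands in for the vLLM stream (it is not part of the original app), the pattern can be exercised without chainlit or a GPU:

```python
import asyncio
import time
from typing import AsyncIterator, List, Optional, Tuple


async def fake_token_stream(tokens: List[str], delay_s: float) -> AsyncIterator[str]:
    """Hypothetical stand-in for the vLLM streaming response."""
    for tok in tokens:
        await asyncio.sleep(delay_s)  # simulate network + decode time per chunk
        yield tok


async def measure_stream(stream: AsyncIterator[str]) -> Tuple[Optional[float], float, str]:
    """Consume a token stream and return (ttft_seconds, e2e_seconds, full_text)."""
    t0 = time.perf_counter()
    ttft: Optional[float] = None
    parts: List[str] = []
    async for tok in stream:
        if tok and ttft is None:
            ttft = time.perf_counter() - t0  # first non-empty token seen
        parts.append(tok)
    e2e = time.perf_counter() - t0  # whole stream consumed
    return ttft, e2e, "".join(parts)


def run_demo() -> Tuple[Optional[float], float, str]:
    return asyncio.run(measure_stream(fake_token_stream(["Hel", "lo", "!"], 0.01)))
```

`time.perf_counter()` is used here instead of `time.time()` because it is monotonic; that is a choice made for this sketch, not something app.py does. In the real handler these two values would feed `CHAT_LATENCY` and `CHAT_E2E`.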

1.2 Dockerfile (multi-stage, ~140MB final)

FROM python:3.11-slim AS builder
WORKDIR /app
RUN pip config set global.index-url https://mirrors.aliyun.com/pypi/simple/
COPY requirements.txt .
RUN pip install --no-cache-dir --user -r requirements.txt

FROM python:3.11-slim
WORKDIR /app
COPY --from=builder /root/.local /root/.local
ENV PATH=/root/.local/bin:$PATH
COPY app.py .
EXPOSE 8000 9100
CMD ["chainlit", "run", "app.py", "--host", "0.0.0.0", "--port", "8000", "--headless"]

1.3 Jenkinsfile (same Kaniko pattern as the Day 8 bonus)

pipeline {
  agent {
    kubernetes {
      yaml '''
apiVersion: v1
kind: Pod
spec:
  tolerations: [{operator: Exists}]
  containers:
  - name: kaniko
    image: gcr.io/kaniko-project/executor:debug
    command: [/busybox/cat]
    tty: true
    volumeMounts:
    - {name: docker-config, mountPath: /kaniko/.docker}
    resources: {requests: {cpu: 200m, memory: 512Mi}}
  volumes:
  - name: docker-config
    secret:
      secretName: harbor-auth
      items: [{key: .dockerconfigjson, path: config.json}]
'''
    }
  }
  environment {
    REGISTRY = "10.0.24.28:30002"; PROJECT = "bootcamp"; IMAGE = "chat-ui"
  }
  stages {
    stage('Build and Push') {
      steps {
        container('kaniko') {
          sh '''
            GIT_SHA=$(git rev-parse --short HEAD 2>/dev/null || echo dev)
            /kaniko/executor --dockerfile=Dockerfile --context=$(pwd) \
              --destination=${REGISTRY}/${PROJECT}/${IMAGE}:${GIT_SHA} \
              --destination=${REGISTRY}/${PROJECT}/${IMAGE}:latest \
              --insecure --skip-tls-verify
          '''
        }
      }
    }
  }
}
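The shell step in the Jenkinsfile derives the image tag from the short git SHA and falls back to `dev` when `git rev-parse` fails, then pushes the same build under that tag and `latest`. A rough Python equivalent of that tagging logic, offered only as an illustration (the function names `short_git_sha` and `destinations` are made up for this sketch), might look like:

```python
import subprocess
from typing import List


def short_git_sha(repo_dir: str = ".", fallback: str = "dev") -> str:
    """Return `git rev-parse --short HEAD`, or `fallback` if git fails or is missing."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "--short", "HEAD"],
            cwd=repo_dir, capture_output=True, text=True, check=True,
        )
        return out.stdout.strip() or fallback
    except (OSError, subprocess.CalledProcessError):
        # OSError: git binary or directory not found; CalledProcessError: not a repo.
        return fallback


def destinations(registry: str, project: str, image: str, tag: str) -> List[str]:
    """Build the two --destination values used by the Kaniko step."""
    base = f"{registry}/{project}/{image}"
    return [f"{base}:{tag}", f"{base}:latest"]


if __name__ == "__main__":
    for dest in destinations("10.0.24.28:30002", "bootcamp", "chat-ui", short_git_sha()):
        print(dest)
```

The silent fallback keeps the build green when git is unavailable, but it also means every such build overwrites the same `dev` tag, so the tag no longer identifies a commit.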

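2. CI pipeline in practice (reusing the Jenkins job from the Day 8 bonus)

git push to notes-app → Jenkins SCM watch → automatic build:

Build #2 console (excerpt; the tag is dev, which is the Jenkinsfile's fallback when git rev-parse fails):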

INFO[0222] Pushing image to 10.0.24.28:30002/bootcamp/chat-ui:dev
INFO[0351] Pushed 10.0.24.28:30002/bootcamp/chat-ui@sha256:608d0d083ae2188...
INFO[0351] Pushing image to 10.0.24.28:30002/bootcamp/chat-ui:latest
INFO[0353] Pushed 10.0.24.28:30002/bootcamp/chat-ui@sha256:608d0d083ae2188...
✅ Image pushed: 10.0.24.28:30002/bootcamp/chat-ui:dev
Finished: SUCCESS
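The bracketed numbers in the Kaniko log lines can be used to see where the build time goes. The sketch below assumes they are elapsed seconds since the executor started (the usual behaviour of this log format, but not confirmed by the source) and pairs each "Pushing image to" line with the following "Pushed" line:

```python
import re
from typing import List, Tuple

LINE_RE = re.compile(r"^INFO\[(\d+)\]\s+(.*)$")
PUSHING = "Pushing image to "

# Excerpt copied from the Build #2 console output above.
LOG = """\
INFO[0222] Pushing image to 10.0.24.28:30002/bootcamp/chat-ui:dev
INFO[0351] Pushed 10.0.24.28:30002/bootcamp/chat-ui@sha256:608d0d083ae2188...
INFO[0351] Pushing image to 10.0.24.28:30002/bootcamp/chat-ui:latest
INFO[0353] Pushed 10.0.24.28:30002/bootcamp/chat-ui@sha256:608d0d083ae2188...
"""


def parse_offsets(log: str) -> List[Tuple[int, str]]:
    """Extract (offset, message) pairs from INFO[nnnn] lines."""
    events = []
    for line in log.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            events.append((int(m.group(1)), m.group(2)))
    return events


def push_durations(events: List[Tuple[int, str]]) -> List[Tuple[str, int]]:
    """Pair each 'Pushing image to X' with the next 'Pushed' line."""
    durations = []
    started = None
    for offset, msg in events:
        if msg.startswith(PUSHING):
            started = (offset, msg[len(PUSHING):])
        elif msg.startswith("Pushed ") and started is not None:
            durations.append((started[1], offset - started[0]))
            started = None
    return durations


if __name__ == "__main__":
    for ref, seconds in push_durations(parse_offsets(LOG)):
        print(f"{ref}: {seconds}s")
```

Under that assumption the first push takes 129 units and the second only 2, which fits the second tag reusing layers already uploaded; the final offset of 353 is also in line with the roughly 6 min build recorded in section 6.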

Current state of Harbor:

  • bootcamp/chat-ui (new on Day 11)
  • bootcamp/hello-kaniko (used for verification in the Day 8 bonus)
  • bootcamp/nginx (Day 7)

3. K8s deployment manifests (GitOps via the notes-deploy repo)

3.1 namespace + cross-WAN Service

---
apiVersion: v1
kind: Namespace
metadata:
  name: chat
  labels: {pod-security.kubernetes.io/enforce: baseline}
---
# vLLM upstream is reached through an SSH tunnel: Service + manual Endpoints
apiVersion: v1
kind: Service
metadata: {name: vllm-upstream, namespace: chat}
spec:
  type: ClusterIP
  ports:
  - {port: 8000, targetPort: 30800, name: http, protocol: TCP}
---
apiVersion: v1
kind: Endpoints
metadata: {name: vllm-upstream, namespace: chat}
subsets:
- addresses: [{ip: 10.0.24.31}]    # m1 hostIP, SSH tunnel
  ports: [{port: 30800, name: http, protocol: TCP}]

3.2 chat-ui Deployment + Service + ServiceMonitor

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: chat-ui
  namespace: chat
  labels: {app.kubernetes.io/name: chat-ui}
spec:
  replicas: 1
  selector: {matchLabels: {app: chat-ui}}
  template:
    metadata: {labels: {app: chat-ui, app.kubernetes.io/name: chat-ui}}
    spec:
      tolerations: [{operator: Exists}]
      containers:
      - name: chat-ui
        image: 10.0.24.28:30002/bootcamp/chat-ui:latest
        imagePullPolicy: Always
        ports:
        - {containerPort: 8000, name: http}
        - {containerPort: 9100, name: metrics}
        env:
        - {name: VLLM_URL, value: "http://vllm-upstream.chat.svc.cluster.local:8000/v1"}
        - {name: MODEL, value: "qwen2.5-3b"}
        resources:
          requests: {cpu: 100m, memory: 256Mi}
          limits:   {cpu: 1, memory: 1Gi}
        readinessProbe:
          httpGet: {path: /, port: 8000}
          initialDelaySeconds: 20
---
apiVersion: v1
kind: Service
metadata: {name: chat-ui, namespace: chat, labels: {app: chat-ui}}
spec:
  type: NodePort
  selector: {app: chat-ui}
  ports:
  - {port: 80, targetPort: 8000, nodePort: 30810, name: http}
  - {port: 9100, targetPort: 9100, nodePort: 30811, name: metrics}
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: chat-ui
  namespace: chat
  labels: {release: kps}
spec:
  selector: {matchLabels: {app: chat-ui}}
  endpoints:
  - {port: metrics, interval: 10s, path: /metrics}
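To make the selection rules in the ServiceMonitor concrete, here is a simplified model of how it picks targets: Services in its own namespace whose labels contain every `matchLabels` pair, and within those only the port whose name equals `endpoints[].port`. This is a toy illustration of the matching, not the Prometheus Operator's implementation; the dict layout and the function names are invented for the sketch.

```python
from typing import Dict, List, Tuple


def labels_match(selector: Dict[str, str], labels: Dict[str, str]) -> bool:
    """True if every selector key/value pair is present in labels."""
    return all(labels.get(k) == v for k, v in selector.items())


def selected_ports(monitor: dict, services: List[dict]) -> List[Tuple[str, int]]:
    """Return (service name, targetPort) pairs the monitor would scrape."""
    hits = []
    for svc in services:
        if svc["namespace"] != monitor["namespace"]:
            continue  # assumption: no namespaceSelector, so same namespace only
        if not labels_match(monitor["matchLabels"], svc.get("labels", {})):
            continue
        for port in svc["ports"]:
            if port["name"] == monitor["port"]:  # matched by port NAME, not number
                hits.append((svc["name"], port["targetPort"]))
    return hits


MONITOR = {"namespace": "chat", "matchLabels": {"app": "chat-ui"}, "port": "metrics"}
SERVICES = [
    {"name": "chat-ui", "namespace": "chat", "labels": {"app": "chat-ui"},
     "ports": [{"name": "http", "port": 80, "targetPort": 8000},
               {"name": "metrics", "port": 9100, "targetPort": 9100}]},
    {"name": "vllm-upstream", "namespace": "chat", "labels": {},
     "ports": [{"name": "http", "port": 8000, "targetPort": 30800}]},
]

if __name__ == "__main__":
    print(selected_ports(MONITOR, SERVICES))
```

Two details from the manifest matter here: the chat-ui Service itself carries the `app: chat-ui` label (not only its Pods), and its metrics port is named `metrics`. Dropping either would leave the monitor with no targets.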

4. ArgoCD Application + cross-cluster dependency management

4.1 Register the Gitea repository (insecure HTTP)

apiVersion: v1
kind: Secret
metadata:
  name: gitea-repo
  namespace: argocd
  labels: {argocd.argoproj.io/secret-type: repository}
stringData:
  type: git
  url: http://gitea-http.gitea.svc:3000/bootcamp/notes-deploy.git
  username: bootcamp
  password: bootcamp
  insecure: "true"

4.2 Application

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata: {name: chat-ui, namespace: argocd}
spec:
  project: default
  source:
    repoURL: http://bootcamp:bootcamp@gitea-http.gitea.svc:3000/bootcamp/notes-deploy.git
    targetRevision: HEAD
    path: chat-ui
  destination:
    server: https://kubernetes.default.svc
    namespace: chat
  syncPolicy:
    automated: {prune: true, selfHeal: true}
    syncOptions: [CreateNamespace=true, ServerSideApply=true]

5. ⚠️ Real pitfalls (6 today)

Pitfall #1: ArgoCD rejects the HTTP repo (gRPC handshake error)

On the first attempt ArgoCD reported:

transport: authentication handshake failed: context deadline exceeded

Fix: register the repository through a K8s Secret with insecure: "true" (so ArgoCD knows this is HTTP, not HTTPS), then rollout restart argocd-repo-server to refresh its cache.

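Pitfall #2: a headless Service does not map ports

The first version of the manifest used clusterIP: None (headless):

  • chat-ui calls vllm-upstream:8000 → DNS returns the m1 IP 10.0.24.31
  • chat-ui connects directly to 10.0.24.31:8000 → wrong port! The SSH tunnel listens on :30800
  • A headless Service does no port mapping, only DNS

Fix: switch to a ClusterIP Service with targetPort: 30800 so that kube-proxy does the port redirect. Because the Endpoints are written by hand, the backend port that traffic actually lands on is the one in the Endpoints object (30800); keep targetPort consistent with it.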

spec:
  type: ClusterIP
  ports:
  - {port: 8000, targetPort: 30800, name: http}      # ← targetPort is the key
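The difference between the two Service types can be reduced to one question: where does the client's connection finally land? The following is a deliberately tiny model (the function `final_destination` is hypothetical and ignores load balancing across multiple endpoints) that reproduces the wrong-port symptom described above:

```python
from typing import Tuple


def final_destination(headless: bool, requested_port: int,
                      endpoint_ip: str, endpoint_port: int) -> Tuple[str, int]:
    """Return the (ip, port) a client connection ends up on.

    headless:  DNS answers with the endpoint IP and nothing rewrites the port,
               so the client hits endpoint_ip on the port it asked for.
    ClusterIP: the service proxy rewrites virtual-IP:service-port to
               endpoint-IP:endpoint-port.
    """
    if headless:
        return (endpoint_ip, requested_port)
    return (endpoint_ip, endpoint_port)


if __name__ == "__main__":
    # chat-ui always asks for vllm-upstream:8000; the tunnel listens on 30800.
    print("headless :", final_destination(True, 8000, "10.0.24.31", 30800))
    print("ClusterIP:", final_destination(False, 8000, "10.0.24.31", 30800))
```

An alternative that would also work with a headless Service is to point `VLLM_URL` at port 30800 directly, but that leaks the tunnel port into the application config, which the ClusterIP approach avoids.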

Pitfall #3: Endpoints and Service port names must match

K8s matches Service.ports to Endpoints.ports by port name:

# Service port name = "http"
# Endpoints port name MUST = "http"

If they don't match, kube-proxy doesn't know which Endpoint port to forward the Service traffic to.
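A small check like the one below can catch this mismatch before the manifests are pushed. It is a sketch of the name-matching rule only, operating on plain dicts shaped like the Service and Endpoints manifests from section 3.1; it is not how kube-proxy is implemented, and `resolve_backends` is a name chosen for this example.

```python
from typing import Dict, List, Tuple


def resolve_backends(service: dict, endpoints: dict) -> Dict[int, List[Tuple[str, int]]]:
    """Map each Service port number to its (ip, port) backends, matched by port name."""
    result: Dict[int, List[Tuple[str, int]]] = {}
    for svc_port in service["ports"]:
        backends: List[Tuple[str, int]] = []
        for subset in endpoints.get("subsets", []):
            for ep_port in subset.get("ports", []):
                if ep_port.get("name", "") == svc_port.get("name", ""):
                    backends.extend(
                        (addr["ip"], ep_port["port"])
                        for addr in subset.get("addresses", [])
                    )
        result[svc_port["port"]] = backends  # empty list means no usable backend
    return result


SERVICE = {"ports": [{"port": 8000, "targetPort": 30800, "name": "http"}]}
ENDPOINTS = {"subsets": [{"addresses": [{"ip": "10.0.24.31"}],
                          "ports": [{"port": 30800, "name": "http"}]}]}

if __name__ == "__main__":
    print(resolve_backends(SERVICE, ENDPOINTS))
```

Running the same function with the Endpoints port renamed (for example to `web`) yields an empty backend list for port 8000, which is the failure this pitfall describes.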

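Pitfall #4: ArgoCD "excluded resource" warning

By default ArgoCD does not manage Endpoints (it assumes they are generated automatically from a Service). If you write Endpoints by hand, ArgoCD reports:

ExcludedResourceWarning: Resource /Endpoints vllm-upstream is excluded in the settings

The Endpoints object can still be applied, but ArgoCD does not track it. Apply it once by hand with kubectl apply; later ArgoCD syncs will not delete it.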

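Pitfall #5: the SSH tunnel must be running on m1

# Run on m1 (use systemd to make it persistent):
ssh -fN -p 15128 -L 0.0.0.0:30800:<vllm-svc-ip>:8000 root@gpu1

If the tunnel dies, the Endpoints entry pointing at 10.0.24.31:30800 still exists, but every call from chat-ui times out or is refused.

For production: replace the SSH tunnel with WireGuard / Tailscale / Cilium ClusterMesh.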
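Since Kubernetes has no view of the tunnel's health, a small external probe is one way to notice a dead tunnel early. The sketch below only checks that something accepts TCP connections on the tunnel port; the host and port in the `__main__` block are the values from this page, and `tcp_port_open` is a helper written for this example.

```python
import socket


def tcp_port_open(host: str, port: int, timeout_s: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers connection refused, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    ok = tcp_port_open("10.0.24.31", 30800)
    print("tunnel listener:", "up" if ok else "DOWN")
```

A caveat on what this proves: with `ssh -L`, the local listener accepts the connection even when the remote side is unreachable, so a successful connect shows only that the ssh process on m1 is alive. An HTTP request through the tunnel (for instance to the OpenAI-compatible model listing endpoint that vLLM serves) would be a stronger check.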

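Pitfall #6: chainlit uses WebSocket, so the Prometheus metrics only move with a real chat

chat_requests_total is incremented only when the on_message handler runs. We called /v1/chat/completions by hand with curl, which bypasses chainlit, so the metric did not grow.

Ways to really trigger it:

  • Open the chainlit UI in a browser (<node-ip>:30810) and send a message as a user
  • Or connect to the websocket with a chainlit-client Python library (complex)

Note: the metrics endpoint on port 9100 of the chat-ui Pod works; the data just needs a real user to trigger it.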
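To tell "endpoint works but no chat has happened yet" apart from "chats are being counted", it helps to read the counter out of the `/metrics` text directly. The parser below is a minimal sketch for the Prometheus text format that handles only simple sample lines without timestamps; the sample payload uses made-up values and is not output captured from the real Pod.

```python
def sum_counter(metrics_text: str, name: str) -> float:
    """Sum all samples of metric `name` across label sets in Prometheus text format."""
    total = 0.0
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        metric, _, value = line.rpartition(" ")
        base = metric.split("{", 1)[0]  # strip the {label="..."} part
        if base == name:
            total += float(value)
    return total


# Hypothetical payload: values are illustrative only.
SAMPLE = """\
# HELP chat_requests_total Total chat completions
# TYPE chat_requests_total counter
chat_requests_total{model="qwen2.5-3b",status="success"} 3.0
chat_requests_total{model="qwen2.5-3b",status="error"} 1.0
chat_tokens_total{kind="prompt",model="qwen2.5-3b"} 120.0
"""

# Before any chat, a labelled counter typically exposes only HELP/TYPE lines.
EMPTY = """\
# HELP chat_requests_total Total chat completions
# TYPE chat_requests_total counter
"""

if __name__ == "__main__":
    print(sum_counter(SAMPLE, "chat_requests_total"))
    print(sum_counter(EMPTY, "chat_requests_total"))
```

Comparing the sum before and after sending a message through the UI gives a direct confirmation that `on_message` ran, whereas a curl call straight to vLLM leaves it unchanged.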


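6. End-to-end verification ✅

| Check | Status | Evidence |
| --- | --- | --- |
| Gitea push code | ✅ | commit 213ebd9 |
| Jenkins Pipeline automatic build | ✅ | Build #2 SUCCESS, 6 min |
| Harbor receives the image | ✅ | bootcamp/chat-ui:dev / :latest |
| ArgoCD sync manifest | ✅ | Application Synced, Healthy |
| chat-ui Pod Running | ✅ | 1/1 Running on k8s-cp-1 |
| Pod → vllm-upstream Service routing | ✅ | OK: qwen2.5-3b fetched from inside the Pod |
| Cross-WAN call to vLLM | ✅ | 3 responses in Chinese |
| ServiceMonitor scrape | ✅ | Prometheus target chat-ui "up" |
| Application metrics endpoint | ✅ | :9100 inside the Pod returns 45 lines of metrics |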
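The ServiceMonitor row can also be checked from a script rather than the Prometheus UI. The sketch below parses a response shaped like the Prometheus HTTP API's target listing (`data.activeTargets[]` with `labels` and `health`); the payload here is a hand-written example, and the assumption that the job label equals the Service name `chat-ui` is a common default rather than something shown in the source.

```python
import json
from typing import List, Tuple


def target_health(targets_json: str, job: str) -> List[Tuple[str, str]]:
    """Return (instance, health) for every active target whose job label equals `job`."""
    payload = json.loads(targets_json)
    results = []
    for target in payload.get("data", {}).get("activeTargets", []):
        labels = target.get("labels", {})
        if labels.get("job") == job:
            results.append((labels.get("instance", ""), target.get("health", "unknown")))
    return results


# Hypothetical response body: addresses and jobs are illustrative only.
SAMPLE_RESPONSE = json.dumps({
    "status": "success",
    "data": {"activeTargets": [
        {"labels": {"job": "chat-ui", "instance": "10.244.1.23:9100"}, "health": "up"},
        {"labels": {"job": "node-exporter", "instance": "10.0.24.31:9100"}, "health": "up"},
    ]},
})

if __name__ == "__main__":
    for instance, health in target_health(SAMPLE_RESPONSE, "chat-ui"):
        print(instance, health)
```

A target reporting "up" only means the scrape succeeds; as pitfall #6 notes, the application counters stay empty until a real chat goes through the UI.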

7. Résumé highlights

Designed and delivered an end-to-end AI application pipeline:

  • Application layer: chainlit Python chat UI (OpenAI-compatible client + Prometheus application metrics)
  • CI: Gitea git push → Jenkins SCM trigger → Kaniko build → image pushed to Harbor (completes automatically in 6 min)
  • CD: ArgoCD GitOps + K8s manifest repo + automatic sync/selfHeal
  • Cross-cluster: main cluster (5 nodes) ↔ GPU cluster (standalone k3s, A800 MIG) through an SSH tunnel + K8s Service + Endpoints
  • Inference backend: vLLM + Qwen2.5-3B-Instruct on an A800 MIG 2g.10gb slice
  • Observability: application metrics (TTFT / tokens / e2e latency) + GPU metrics (DCGM via cross-WAN ServiceMonitor)
  • Security: Kyverno admission + PSA baseline + Harbor robot account
  • End to end: under 10 min from git commit to a production service

99. Day 11 done

  • [x] A: chainlit app + Dockerfile + Jenkinsfile, complete code
  • [x] B: Gitea push + Jenkins build #2 SUCCESS + image received by Harbor
  • [x] C: ArgoCD Application + cross-WAN Service + Endpoints fully deployed; Pod Running and calls to vLLM verified
  • [x] D: Prometheus scrapes the chat-ui target ("up") + the Pod's 9100 metrics endpoint works (data flows in once real users trigger it)