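Day 11: End-to-End AI Agent Workload (Tying Day 1-10 Together)
Goal: ship a real workload (a chainlit chat UI) through the full GitOps pipeline and have it call vLLM on the GPU cluster across the WAN
Time: 3-4 hours
Value: ⭐ the first time every Day 1-10 capability is actually put to use; the most complete story to tell in an interview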
0. TL;DR
Developer writes chainlit app + Dockerfile
↓ git push
Gitea (notes-app repo), set up in the Day 8 side quest
↓ Jenkins SCM trigger
Jenkins Pipeline (Kaniko build → Harbor push :git-sha)
↓ image tag updated
notes-deploy repo (manifest)
↓ ArgoCD watch
ArgoCD sync → main cluster, chat namespace
├─ chat-ui Deployment (chainlit + Prometheus metrics)
├─ vllm-upstream Service + manual Endpoints (pointing at m1:30800)
└─ ServiceMonitor (Prometheus scrapes the business metrics)
↓ across the WAN
m1:30800 SSH tunnel → gpu1 k3s vllm-3b:8000
↓
Qwen2.5-3B-Instruct on A800 MIG slice (sliced on Day 10)
Every Day 1-10 capability in use:
| Day | What was used |
|---|---|
| Day 1 (Cilium) | Pod-to-Pod networking / Hubble L7 capture of chat → vllm traffic |
| Day 4 (Longhorn) | Sessions survive Pod restarts (optional; simplified here with chainlit's in-memory mode) |
| Day 5 (Kyverno/PSA) | baseline enforcement on the chat namespace |
| Day 6 (Prometheus) | chat_requests_total / chat_first_token_latency_seconds business metrics |
| Day 7 (Harbor / ArgoCD) | Image storage / GitOps deployment |
| Day 8 side quest (Gitea / Jenkins) | Source hosting + CI build |
| Day 8 main line (vLLM on k3s) | LLM inference backend |
| Day 9 (DCGM cross-WAN) | GPU metrics visible across clusters |
| Day 10 (MIG) | A800 slicing; vllm occupies one slice |
1. chainlit chat application code
1.1 app.py (core, ~80 lines)
import os, time, chainlit as cl
from openai import AsyncOpenAI
from prometheus_client import Counter, Histogram, start_http_server

# === Business metrics ===
CHAT_REQUESTS = Counter("chat_requests_total", "Total chat completions", ["model", "status"])
CHAT_TOKENS = Counter("chat_tokens_total", "Total tokens", ["model", "kind"])  # kind: prompt|completion
CHAT_LATENCY = Histogram("chat_first_token_latency_seconds", "TTFT", ["model"],
                         buckets=[0.1, 0.25, 0.5, 1, 2, 5, 10])
CHAT_E2E = Histogram("chat_e2e_latency_seconds", "E2E", ["model"],
                     buckets=[0.5, 1, 2, 5, 10, 30, 60])
start_http_server(9100)  # /metrics on :9100

client = AsyncOpenAI(base_url=os.getenv("VLLM_URL"), api_key="not-needed")
MODEL = os.getenv("MODEL", "qwen2.5-3b")

@cl.on_message
async def on_message(msg: cl.Message):
    response = cl.Message(content="")
    first_token = None
    t0 = time.time()
    prompt_tokens = completion_tokens = 0
    status = "success"
    # Conversation so far plus the new user turn
    history = (cl.user_session.get("history") or []) + [{"role": "user", "content": msg.content}]
    try:
        stream = await client.chat.completions.create(
            model=MODEL, messages=history, stream=True,
            stream_options={"include_usage": True},
            max_tokens=512, temperature=0.7,
        )
        async for chunk in stream:
            if chunk.choices and chunk.choices[0].delta.content:
                if first_token is None:
                    # First content chunk: record time to first token (TTFT)
                    first_token = time.time() - t0
                    CHAT_LATENCY.labels(model=MODEL).observe(first_token)
                await response.stream_token(chunk.choices[0].delta.content)
            if chunk.usage:
                # With include_usage, token counts arrive on a chunk of their own
                prompt_tokens = chunk.usage.prompt_tokens
                completion_tokens = chunk.usage.completion_tokens
    except Exception as e:
        status = "error"
        response.content = f"❌ {e}"
    finally:
        # Metrics are recorded on both the success and the error path
        CHAT_REQUESTS.labels(model=MODEL, status=status).inc()
        CHAT_TOKENS.labels(model=MODEL, kind="prompt").inc(prompt_tokens)
        CHAT_TOKENS.labels(model=MODEL, kind="completion").inc(completion_tokens)
        CHAT_E2E.labels(model=MODEL).observe(time.time() - t0)
    cl.user_session.set("history", history + [{"role": "assistant", "content": response.content}])
    await response.update()
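The handler records two latencies from the same start time: TTFT when the first content chunk arrives, and E2E once the stream is exhausted. The sketch below isolates just that timing logic using only the standard library. `fake_stream` and `timed_consume` are hypothetical names for illustration and are not part of app.py, chainlit, or the OpenAI client.

```python
import asyncio
import time


async def fake_stream(tokens, delay=0.01):
    """Hypothetical stand-in for a streaming completion: one token per `delay` seconds."""
    for tok in tokens:
        await asyncio.sleep(delay)
        yield tok


async def timed_consume(stream):
    """Consume a token stream and return (text, ttft_seconds, e2e_seconds)."""
    t0 = time.perf_counter()
    ttft = None
    parts = []
    async for tok in stream:
        if ttft is None:
            ttft = time.perf_counter() - t0  # time to first token
        parts.append(tok)
    e2e = time.perf_counter() - t0  # time until the stream is exhausted
    return "".join(parts), ttft, e2e


text, ttft, e2e = asyncio.run(timed_consume(fake_stream(["Hel", "lo"])))
```

If the stream yields nothing (for example, the request fails before the first token), `ttft` stays `None`, which mirrors why the TTFT histogram in app.py is only observed inside the loop while the E2E histogram is observed unconditionally.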
1.2 Dockerfile (multi-stage, ~140 MB final image)
FROM python:3.11-slim AS builder
WORKDIR /app
RUN pip config set global.index-url https://mirrors.aliyun.com/pypi/simple/
COPY requirements.txt .
RUN pip install --no-cache-dir --user -r requirements.txt
FROM python:3.11-slim
WORKDIR /app
COPY --from=builder /root/.local /root/.local
ENV PATH=/root/.local/bin:$PATH
COPY app.py .
EXPOSE 8000 9100
CMD ["chainlit", "run", "app.py", "--host", "0.0.0.0", "--port", "8000", "--headless"]
1.3 Jenkinsfile (same Kaniko pattern as the Day 8 side quest)
pipeline {
  agent {
    kubernetes {
      yaml '''
apiVersion: v1
kind: Pod
spec:
  tolerations: [{operator: Exists}]
  containers:
  - name: kaniko
    image: gcr.io/kaniko-project/executor:debug
    command: [/busybox/cat]
    tty: true
    volumeMounts:
    - {name: docker-config, mountPath: /kaniko/.docker}
    resources: {requests: {cpu: 200m, memory: 512Mi}}
  volumes:
  - name: docker-config
    secret:
      secretName: harbor-auth
      items: [{key: .dockerconfigjson, path: config.json}]
'''
    }
  }
  environment {
    REGISTRY = "10.0.24.28:30002"
    PROJECT = "bootcamp"
    IMAGE = "chat-ui"
  }
  stages {
    stage('Build and Push') {
      steps {
        container('kaniko') {
          sh '''
            # Falls back to "dev" when git cannot resolve HEAD in this container
            # (the Build #2 log in section 2 shows the :dev tag)
            GIT_SHA=$(git rev-parse --short HEAD 2>/dev/null || echo dev)
            /kaniko/executor --dockerfile=Dockerfile --context=$(pwd) \
              --destination=${REGISTRY}/${PROJECT}/${IMAGE}:${GIT_SHA} \
              --destination=${REGISTRY}/${PROJECT}/${IMAGE}:latest \
              --insecure --skip-tls-verify
          '''
        }
      }
    }
  }
}
2. CI pipeline, measured run (reusing the Jenkins job from the Day 8 side quest)
git push to notes-app → Jenkins SCM watch → automatic build:
Build #2 console (excerpt):
INFO[0222] Pushing image to 10.0.24.28:30002/bootcamp/chat-ui:dev
INFO[0351] Pushed 10.0.24.28:30002/bootcamp/chat-ui@sha256:608d0d083ae2188...
INFO[0351] Pushing image to 10.0.24.28:30002/bootcamp/chat-ui:latest
INFO[0353] Pushed 10.0.24.28:30002/bootcamp/chat-ui@sha256:608d0d083ae2188...
✅ Image pushed: 10.0.24.28:30002/bootcamp/chat-ui:dev
Finished: SUCCESS
Harbor contents now:
- bootcamp/chat-ui (new on Day 11)
- bootcamp/hello-kaniko (used for verification in the Day 8 side quest)
- bootcamp/nginx (Day 7)
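The TL;DR flow lists an "image tag updated" step between Jenkins and the notes-deploy repo, while the Jenkinsfile shown here stops at the push. As a rough sketch of what such a step might do, the function below rewrites the tag on an `image:` line of a manifest held as a string. `bump_image_tag` is a hypothetical helper, not something from the Day 8 job, and committing the result back to notes-deploy is left out.

```python
import re


def bump_image_tag(manifest: str, image: str, new_tag: str) -> str:
    """Replace the tag on every `image: <image>:<tag>` occurrence in a manifest string."""
    pattern = re.compile(r"(image:\s*" + re.escape(image) + r"):[\w.\-]+")
    # A function replacement avoids backreference ambiguity when the tag starts with a digit.
    return pattern.sub(lambda m: f"{m.group(1)}:{new_tag}", manifest)


before = "        image: 10.0.24.28:30002/bootcamp/chat-ui:latest\n"
after = bump_image_tag(before, "10.0.24.28:30002/bootcamp/chat-ui", "213ebd9")
```

Pinning the Deployment to an immutable tag this way gives ArgoCD a Git diff to sync on, whereas a floating `:latest` tag leaves the manifest unchanged between builds.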
3. K8s deployment manifests (GitOps via the notes-deploy repo)
3.1 namespace + cross-WAN Service
---
apiVersion: v1
kind: Namespace
metadata:
  name: chat
  labels: {pod-security.kubernetes.io/enforce: baseline}
---
# The vLLM upstream is reached through the SSH tunnel: Service + manual Endpoints
apiVersion: v1
kind: Service
metadata: {name: vllm-upstream, namespace: chat}
spec:
  type: ClusterIP
  ports:
  - {port: 8000, targetPort: 30800, name: http, protocol: TCP}
---
apiVersion: v1
kind: Endpoints
metadata: {name: vllm-upstream, namespace: chat}
subsets:
- addresses: [{ip: 10.0.24.31}]  # m1 host IP, SSH tunnel
  ports: [{port: 30800, name: http, protocol: TCP}]
3.2 chat-ui Deployment + Service + ServiceMonitor
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: chat-ui
  namespace: chat
  labels: {app.kubernetes.io/name: chat-ui}
spec:
  replicas: 1
  selector: {matchLabels: {app: chat-ui}}
  template:
    metadata: {labels: {app: chat-ui, app.kubernetes.io/name: chat-ui}}
    spec:
      tolerations: [{operator: Exists}]
      containers:
      - name: chat-ui
        image: 10.0.24.28:30002/bootcamp/chat-ui:latest
        imagePullPolicy: Always
        ports:
        - {containerPort: 8000, name: http}
        - {containerPort: 9100, name: metrics}
        env:
        - {name: VLLM_URL, value: "http://vllm-upstream.chat.svc.cluster.local:8000/v1"}
        - {name: MODEL, value: "qwen2.5-3b"}
        resources:
          requests: {cpu: 100m, memory: 256Mi}
          limits: {cpu: 1, memory: 1Gi}
        readinessProbe:
          httpGet: {path: /, port: 8000}
          initialDelaySeconds: 20
---
apiVersion: v1
kind: Service
metadata: {name: chat-ui, namespace: chat, labels: {app: chat-ui}}
spec:
  type: NodePort
  selector: {app: chat-ui}
  ports:
  - {port: 80, targetPort: 8000, nodePort: 30810, name: http}
  - {port: 9100, targetPort: 9100, nodePort: 30811, name: metrics}
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: chat-ui
  namespace: chat
  labels: {release: kps}
spec:
  selector: {matchLabels: {app: chat-ui}}
  endpoints:
  - {port: metrics, interval: 10s, path: /metrics}
4. ArgoCD Application + cross-cluster dependency management
4.1 Registering the Gitea repository (insecure HTTP)
apiVersion: v1
kind: Secret
metadata:
  name: gitea-repo
  namespace: argocd
  labels: {argocd.argoproj.io/secret-type: repository}
stringData:
  type: git
  url: http://gitea-http.gitea.svc:3000/bootcamp/notes-deploy.git
  username: bootcamp
  password: bootcamp
  insecure: "true"
4.2 Application
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata: {name: chat-ui, namespace: argocd}
spec:
  project: default
  source:
    repoURL: http://bootcamp:bootcamp@gitea-http.gitea.svc:3000/bootcamp/notes-deploy.git
    targetRevision: HEAD
    path: chat-ui
  destination:
    server: https://kubernetes.default.svc
    namespace: chat
  syncPolicy:
    automated: {prune: true, selfHeal: true}
    syncOptions: [CreateNamespace=true, ServerSideApply=true]
5. ⚠️ Real pitfalls (6 on this day)
Pitfall #1: ArgoCD rejects the HTTP repo (gRPC handshake error)
The first ArgoCD attempt reported:
transport: authentication handshake failed: context deadline exceeded
Fix: register the repository through a K8s Secret with insecure: "true" (so ArgoCD knows this is HTTP, not HTTPS), then rollout restart argocd-repo-server to flush its cache
Pitfall #2: a headless Service does not map ports
The first version of the manifest used clusterIP: None (headless):
- chat-ui calls vllm-upstream:8000 → DNS returns the m1 IP 10.0.24.31
- chat-ui connects directly to 10.0.24.31:8000 → wrong port! The SSH tunnel listens on :30800
- a headless Service does no port mapping, only DNS
Fix: switch to a ClusterIP Service with targetPort: 30800 so that kube-proxy does the port redirect
spec:
  type: ClusterIP
  ports:
  - {port: 8000, targetPort: 30800, name: http}  # ← targetPort is the key
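The difference between the two Service types in this pitfall can be written down as a toy model. This is only an illustration of the behavior described above, not how kube-proxy or DNS is implemented; `connect_target` is a hypothetical function.

```python
def connect_target(service_type, service_port, target_port, endpoint_ip, requested_port):
    """Toy model: return the (ip, port) a client connection finally reaches."""
    if service_type == "headless":
        # DNS hands back the endpoint IP itself; the client's port is used unchanged.
        return (endpoint_ip, requested_port)
    if service_type == "ClusterIP":
        if requested_port != service_port:
            raise ValueError("the Service does not expose this port")
        # The Service port is translated to the endpoint's target port.
        return (endpoint_ip, target_port)
    raise ValueError(f"unknown service type: {service_type}")


headless = connect_target("headless", 8000, 30800, "10.0.24.31", 8000)
cluster_ip = connect_target("ClusterIP", 8000, 30800, "10.0.24.31", 8000)
```

Here `headless` ends at 10.0.24.31:8000, where nothing listens, while `cluster_ip` ends at 10.0.24.31:30800, where the SSH tunnel is.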
Pitfall #3: Endpoints and Service port names must match
K8s matches Service.ports to Endpoints.ports by port name:
# Service port name = "http"
# Endpoints port name MUST = "http"
If they do not match, kube-proxy cannot tell which Endpoints port should receive the Service traffic
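A quick pre-commit check for this mismatch might look like the sketch below. It assumes the two `ports` lists have already been loaded from the manifests into plain dicts; `unmatched_port_names` is a hypothetical helper, not a kubectl or ArgoCD feature.

```python
def unmatched_port_names(service_ports, endpoints_ports):
    """Return the Service port names that have no same-named Endpoints port."""
    endpoint_names = {p.get("name") for p in endpoints_ports}
    return [p.get("name") for p in service_ports if p.get("name") not in endpoint_names]


service_ports = [{"port": 8000, "targetPort": 30800, "name": "http"}]
good = unmatched_port_names(service_ports, [{"port": 30800, "name": "http"}])
bad = unmatched_port_names(service_ports, [{"port": 30800, "name": "web"}])
```

An empty result (`good`) means every Service port has a named counterpart; `bad` lists the names that would be left without a backend.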
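Pitfall #4: ArgoCD "excluded resource" warning
By default ArgoCD does not manage Endpoints (it assumes they are generated automatically from a Service). If you write Endpoints by hand, ArgoCD reports:
ExcludedResourceWarning: Resource /Endpoints vllm-upstream is excluded in the settings
The Endpoints object can still be applied, but ArgoCD does not track it. Run kubectl apply by hand once; later ArgoCD syncs will not delete it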
Pitfall #5: the SSH tunnel must be running on m1
# run on m1 (use systemd to make it persistent):
ssh -fN -p 15128 -L 0.0.0.0:30800:<vllm-svc-ip>:8000 root@gpu1
If the tunnel dies, the Endpoints entry pointing at 10.0.24.31:30800 still exists, but every chat-ui call times out or is refused.
For production: replace the SSH tunnel with WireGuard / Tailscale / Cilium ClusterMesh
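Because the Endpoints object outlives a dead tunnel, a small probe against the tunnel port is a cheap early warning. The following is a minimal sketch using only the standard library; `tcp_port_open` is a hypothetical helper, and wiring it into a cron job or an alert is left to the reader.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeouts.
        return False
```

Calling `tcp_port_open("10.0.24.31", 30800)` from the main cluster only shows that the ssh listener on m1 accepts connections. It does not prove that the far side (vLLM on gpu1) is answering, so an HTTP request through the tunnel remains the stronger check.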
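Pitfall #6: chainlit speaks WebSocket, so the Prometheus metrics only move on a real chat
chat_requests_total is incremented only when the on_message handler runs. Our manual curl calls to /v1/chat/completions bypassed chainlit, so the metric did not grow.
Ways to trigger it for real:
- open the chainlit UI in a browser (<node-ip>:30810) and send a message as a user
- or connect over WebSocket with the chainlit-client Python library (complex)
Note for the record: the metrics endpoint on port 9100 of the chat-ui Pod works; it just needs real user traffic to produce data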
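To confirm that a real chat moved the counter, the /metrics text can be read before and after sending a message. The sketch below sums one metric across its label sets from Prometheus text-format output. `counter_total` is a hypothetical helper, the sample text is made up for illustration, and the parser assumes sample lines carry no trailing timestamp.

```python
def counter_total(metrics_text: str, name: str) -> float:
    """Sum all samples of one metric name across label sets (Prometheus text format)."""
    total = 0.0
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # A sample line is `<name>{labels} <value>` or `<name> <value>`.
        series, _, value = line.rpartition(" ")
        if series.split("{", 1)[0] == name:
            total += float(value)
    return total


# Made-up sample, shaped like the output of the Counter defined in app.py.
SAMPLE = """\
# HELP chat_requests_total Total chat completions
# TYPE chat_requests_total counter
chat_requests_total{model="qwen2.5-3b",status="success"} 3.0
chat_requests_total{model="qwen2.5-3b",status="error"} 1.0
chat_requests_created{model="qwen2.5-3b",status="success"} 1.7e+09
"""
total = counter_total(SAMPLE, "chat_requests_total")
```

A metric that has never been observed with any label set does not appear in the output at all, so the function returns 0.0 for it, which matches what the endpoint shows before the first real chat.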
6. End-to-end verification ✅
| Check | Status | Evidence |
|---|---|---|
| Gitea push code | ✅ | commit 213ebd9 |
| Jenkins Pipeline builds automatically | ✅ | Build #2 SUCCESS, 6 min |
| Harbor receives the image | ✅ | bootcamp/chat-ui:dev / :latest |
| ArgoCD syncs the manifests | ✅ | Application Synced, Healthy |
| chat-ui Pod Running | ✅ | 1/1 Running on k8s-cp-1 |
| Pod → vllm-upstream Service routing | ✅ | OK: qwen2.5-3b fetched from inside the Pod |
| Cross-WAN call to vLLM | ✅ | 3 responses in Chinese |
| ServiceMonitor scrape | ✅ | Prometheus target chat-ui "up" |
| Business metrics endpoint | ✅ | :9100 inside the Pod returns 45 lines of metrics |
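The routing check in the table (the model name fetched from inside the Pod) can be scripted by reading the model list that an OpenAI-compatible server returns from `GET /v1/models`. The sketch below covers only the parsing step. `served_model_ids` is a hypothetical helper and the sample body is illustrative, not captured output.

```python
import json


def served_model_ids(models_json: str) -> list:
    """Extract model ids from an OpenAI-style `GET /v1/models` response body."""
    payload = json.loads(models_json)
    return [m["id"] for m in payload.get("data", [])]


# Illustrative response body in the OpenAI list format.
sample_body = '{"object": "list", "data": [{"id": "qwen2.5-3b", "object": "model"}]}'
ids = served_model_ids(sample_body)
```

Inside the Pod, the body could be fetched with `urllib.request.urlopen(os.environ["VLLM_URL"] + "/models")`. Seeing the expected id come back exercises the Service, the manual Endpoints, and the SSH tunnel in one request.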
7. Resume highlights
Designed and delivered an end-to-end AI workload pipeline:
- Application layer: chainlit Python chat UI (OpenAI-compatible client + Prometheus business metrics)
- CI: Gitea git push → Jenkins SCM trigger → Kaniko build → Harbor image push (completes automatically in 6 min)
- CD: ArgoCD GitOps + K8s manifest repo + automated sync/selfHeal
- Cross-cluster: main cluster (5 nodes) ↔ GPU cluster (standalone k3s, A800 MIG) through an SSH tunnel + K8s Service + Endpoints
- Inference backend: vLLM + Qwen2.5-3B-Instruct on an A800 MIG 2g.10gb slice
- Observability: business metrics (TTFT / tokens / e2e latency) + GPU metrics (DCGM via cross-WAN ServiceMonitor)
- Security: Kyverno admission + PSA baseline + Harbor robot account
- End to end: under 10 min from git commit to a production service
99. Day 11 done
- [x] A: chainlit app + Dockerfile + Jenkinsfile, full code
- [x] B: Gitea push + Jenkins build #2 SUCCESS + Harbor received the image
- [x] C: ArgoCD Application + cross-WAN Service + Endpoints fully deployed; Pod Running and verified calling vLLM
- [x] D: Prometheus scrapes the chat-ui target as "up" + the Pod's 9100 metrics endpoint works (data flows in once real users trigger it)