AI Infra 训练营
总览
  • Day 1 · 集群起步 + CNI
  • Day 2 · 控制面 + etcd
  • Day 3 · CRD + Operator + Webhook
  • Day 4 · 存储深度
  • Day 5 · 卷扩容 + 安全
  • Day 6 · 调度 + 可观测
  • Day 7 · Harbor + ArgoCD + Mesh
  • Day 8 · AI Infra
  • Day 9 · Triton + GPU
  • Day 10 · MIG + HPA + 量化
  • Day 11 · AI Agent 端到端
  • Day 12 · 灾备
  • Day 13 · Operator + 联邦 + Mesh + RAG
  • Day 14 · CKA / CKS + 总结
  • LLM 训练手册
  • RAG + Agent 手册
  • 推理优化手册
  • 上下文工程手册
  • Agent 开发手册
  • 面试深度复盘
  • 训练 v2 深度手册
HiHuo 主站
GitHub
总览
  • Day 1 · 集群起步 + CNI
  • Day 2 · 控制面 + etcd
  • Day 3 · CRD + Operator + Webhook
  • Day 4 · 存储深度
  • Day 5 · 卷扩容 + 安全
  • Day 6 · 调度 + 可观测
  • Day 7 · Harbor + ArgoCD + Mesh
  • Day 8 · AI Infra
  • Day 9 · Triton + GPU
  • Day 10 · MIG + HPA + 量化
  • Day 11 · AI Agent 端到端
  • Day 12 · 灾备
  • Day 13 · Operator + 联邦 + Mesh + RAG
  • Day 14 · CKA / CKS + 总结
  • LLM 训练手册
  • RAG + Agent 手册
  • 推理优化手册
  • 上下文工程手册
  • Agent 开发手册
  • 面试深度复盘
  • 训练 v2 深度手册
HiHuo 主站
GitHub
  • Day 0 · 环境与硬件

    • Day 0:5 节点裸 Ubuntu → K8s 装机基线
  • Week 1:K8s 内核 + 周边基础设施

    • Day 1:3 CP HA 集群 + CNI 选型 + DNS 调优
    • Day 2: 控制面 deep dive + etcd 内核 + chaos drill
    • Day 3: CRD + Operator (kubebuilder 从 0 写)
    • Day 4: Storage 主线 + Cilium 二探
    • Day 5: Volume Expansion + 安全主线
    • Day 6: 调度 + 观测主线 + Day 2 遗留修复
    • Day 7: Harbor + ArgoCD + Cilium Service Mesh
  • Week 2:制品 + GitOps + AI Infra + 综合

    • Day 8 主线 — AI Infra: GPU + k3s + vLLM + Qwen2.5
    • Day 8 主线 — AI Infra 尝试 1 (跨 WAN GPU 加入主集群)
    • Day 8 (alt) — AlertManager 真接入 + PrometheusRule 实战
    • Day 8: CI Infrastructure — Gitea + Jenkins + Kaniko
    • Day 9: Triton + GPU Metrics + 推理性能对比
    • Day 10: MIG + 量化 + HPA Custom Metrics
    • Day 11: AI Agent 业务端到端 — 把 Day 1-10 全部串起来
    • Day 12: 灾难恢复 + 生产事故注入
    • Day 13: LLM Operator + 联邦 + Mesh + RAG
    • Day 14: CKA/CKS 真题演练 + 14 天 Bootcamp 终极总结

Day 8: CI Infrastructure — Gitea + Jenkins + Kaniko

目标: 搭一个集群内闭环 CI 流水线基础设施,为 Day 9-11 的业务应用做准备 耗时: 3-4 小时 风险: Jenkins 装的 Pod 多 / Gitea PVC 数据丢 / Kaniko 配置错 build 失败


0. TL;DR (3 节)

  1. A — Gitea: helm 装 集群内 git (Longhorn PVC + NodePort + admin/token),创 notes-app + notes-deploy 2 repo
  2. B — Jenkins + Kaniko: helm 装 Jenkins LTS,K8s agent template 用 Kaniko sidecar,credentials = Gitea token + Harbor robot
  3. C — Dummy Pipeline: 写个 hello-world Jenkinsfile,验证 git clone → kaniko build → push Harbor 全链路

1. 总体架构 — 把 Day 1-7 的组件串起来

┌──────────────────────────────────────────────────────────────┐
│  开发者 / 本机                                                 │
│      ↓ git push                                                │
│                                                                │
│  Gitea (集群内, Longhorn PVC, NodePort 30022/30022)            │
│  ├─ notes-app repo     ← Go/Vue 代码 + Jenkinsfile + Dockerfile│
│  └─ notes-deploy repo  ← K8s manifests (Deployment / Svc / ...)│
│      ↓ webhook 或 polling                                       │
│                                                                │
│  Jenkins (集群内, K8s plugin)                                   │
│  ├─ JenkinsfilePipeline 跑在动态 K8s agent                       │
│  ├─ agent 用 Kaniko 容器 build image (no docker daemon)        │
│  └─ build 完成 push 到 Harbor + 更新 notes-deploy 的 image tag  │
│      ↓                                                          │
│                                                                │
│  Harbor (Day 7)  ← image: bootcamp/notes-api:<git-sha>          │
│      ↓                                                          │
│                                                                │
│  ArgoCD (Day 7) ← watch notes-deploy repo                       │
│  └─ 检测到 image tag 变 → kubectl apply 新 Deployment          │
│      ↓                                                          │
│                                                                │
│  K8s 集群 — Pod 起来                                            │
│  ├─ MySQL StatefulSet (Day 4 Longhorn PVC)                     │
│  ├─ notes-api (Go gin + gorm)                                  │
│  └─ notes-ui (Vue3 nginx)                                      │
│      ↓                                                          │
│                                                                │
│  Prometheus + Loki + Hubble + Grafana (Day 4/6)                │
└──────────────────────────────────────────────────────────────┘

5 分钟内从 git commit 到生产部署 — 这是 DevOps 终极目标


10. 实时执行日志(6 维度)

Day 8.A — 装 Gitea + 创 2 repos

A1. 装 Gitea (helm + Longhorn PVC + NodePort)

What:

helm repo add gitea-charts https://dl.gitea.com/charts
helm install gitea gitea-charts/gitea \
  --namespace gitea --create-namespace \
  --set service.http.type=NodePort \
  --set service.http.nodePort=30022 \
  --set persistence.storageClass=longhorn \
  --set persistence.size=5Gi \
  --set gitea.admin.username=bootcamp \
  --set gitea.admin.password=bootcamp \
  --set postgresql.enabled=true \
  --set valkey.enabled=true            # 替代 redis 的 fork

3 Pod 架构:

  • gitea (主体 Go 进程,git daemon + Web UI)
  • gitea-postgresql (Pg,git 元数据 + 用户 + issue)
  • gitea-valkey (cache,session)

A2. ⚠️ 真坑 #1 — Kyverno require-resources 拦截 chart 默认 Deployment

第一次装失败:

require-resources/require-cpu-mem fail:
  Container must set resources.requests.cpu/memory.
  rule require-cpu-mem failed at path /spec/template/spec/containers/0/resources/requests/

Gitea chart 默认没设 resources(让用户自定义),但 Kyverno enforce 模式拒了

Fix: Kyverno 临时切换 enforce → audit 模式

for p in require-resources disallow-latest-tag require-app-label; do
  kubectl patch clusterpolicy $p --type=merge \
    -p '{"spec":{"validationFailureAction":"Audit"}}'
done

Lesson:

  • Day 5.F 我们建 Kyverno policy 时直接 enforce,没考虑后续 chart 安装会被拦
  • 生产模式: 先 audit 1-2 周 看违规,改业务 chart 加 resources,再 enforce
  • 或者写 PolicyException 给系统级 ns 豁免

A3. 创 2 repo 通过 Gitea API

for repo in notes-app notes-deploy; do
  curl -u bootcamp:bootcamp "http://10.0.24.28:30022/api/v1/user/repos" \
    -X POST -H 'Content-Type: application/json' \
    -d "{\"name\":\"$repo\",\"private\":false,\"auto_init\":true,\"default_branch\":\"main\"}"
done

# 验证
curl -u bootcamp:bootcamp http://10.0.24.28:30022/api/v1/user/repos

Actual:

bootcamp/notes-app    http://10.0.24.28:30022/bootcamp/notes-app.git
bootcamp/notes-deploy http://10.0.24.28:30022/bootcamp/notes-deploy.git

Two repo 模式:

  • notes-app: 业务源码 + Dockerfile + Jenkinsfile (变化频繁)
  • notes-deploy: K8s manifest only (Jenkins 自动改 image tag) → ArgoCD watch

经典 GitOps 代码 ≠ 部署分离


Day 8.B — 装 Jenkins + Harbor Secret + Kaniko 准备

B1. helm 装 Jenkins (LTS)

helm repo add jenkins https://charts.jenkins.io
helm install jenkins jenkins/jenkins \
  --namespace jenkins --create-namespace \
  --set controller.serviceType=NodePort \
  --set controller.nodePort=30808 \
  --set controller.admin.username=admin \
  --set controller.admin.password=bootcamp \
  --set persistence.storageClass=longhorn \
  --set persistence.size=10Gi \
  --set 'controller.tolerations[0].operator=Exists'

关键 Pod:

  • jenkins-0 (2 containers): jenkins (controller) + config-reload (sidecar reload JCasC)
  • Agent Pod (kaniko/maven/... 等)由 K8s plugin 动态创建

B2. ⚠️ 真坑 #2 — 别 pin Jenkins plugin 版本

第一次 install 我加了 --set controller.installPlugins[0]=kubernetes:4329.v260e1b_d20de4...,失败:

Plugin kubernetes:4329 has unresolvable dependencies:
  Plugin git:5.7.0 depends on configuration-as-code:2036.v0b_c2de701dcb_,
  but there is an older version defined - configuration-as-code:1932...

不同 plugin 之间有版本依赖链,我手动 pin 版本反而破坏

Fix: 不 pin,让 chart 用默认版本组合

helm install jenkins jenkins/jenkins \
  --set controller.admin.password=bootcamp \
  ...
  # 不加 installPlugins

B3. 创 Harbor docker-config Secret

kubectl create secret docker-registry harbor-auth \
  --namespace jenkins \
  --docker-server=10.0.24.28:30002 \
  --docker-username=admin \
  --docker-password=bootcamp

Why:

  • Kaniko 用标准 docker config.json 找 registry credential
  • K8s docker-registry Secret 生成 .dockerconfigjson key
  • Jenkinsfile 把这个 Secret 挂到 Kaniko /kaniko/.docker/config.json
volumes:
- name: docker-config
  secret:
    secretName: harbor-auth
    items:
    - key: .dockerconfigjson
      path: config.json     # ← 必须 path=config.json,Kaniko 找这文件

Day 8.C — Dummy Pipeline 端到端验证

C1. Jenkinsfile (Kaniko Pod template)

提交到 notes-app repo:

pipeline {
  agent {
    kubernetes {
      yaml '''
apiVersion: v1
kind: Pod
spec:
  tolerations: [{operator: Exists}]
  containers:
  - name: kaniko
    image: gcr.io/kaniko-project/executor:debug
    command: [/busybox/cat]
    tty: true
    volumeMounts:
    - {name: docker-config, mountPath: /kaniko/.docker}
    resources:
      requests: {cpu: 100m, memory: 256Mi}
      limits:   {cpu: 1,    memory: 1Gi}
  volumes:
  - name: docker-config
    secret:
      secretName: harbor-auth
      items: [{key: .dockerconfigjson, path: config.json}]
'''
    }
  }
  stages {
    stage('Build and Push') {
      steps {
        container('kaniko') {
          sh '''
            GIT_SHA=$(git rev-parse --short HEAD 2>/dev/null || echo dev)
            /kaniko/executor \\
              --dockerfile=Dockerfile \\
              --context=$(pwd) \\
              --destination=10.0.24.28:30002/bootcamp/hello-kaniko:${GIT_SHA} \\
              --destination=10.0.24.28:30002/bootcamp/hello-kaniko:latest \\
              --insecure --skip-tls-verify
          '''
        }
      }
    }
  }
}
  • Dockerfile:
FROM alpine:3.20
RUN echo "Hello from Kaniko build in Jenkins on K8s — Day 8" > /hello.txt
CMD ["cat", "/hello.txt"]

C2. ⚠️ 真坑 #3 — Jenkins API 必须用 crumb + cookie jar 同 session

直接 curl -u admin:bootcamp -X POST 调 createItem API:

HTTP ERROR 403 No valid crumb was included in the request

Jenkins 默认开 CSRF protection,所有 POST 需要 crumb header。但 crumb 跟 同一 session绑定,curl 默认每次新 session

Fix (cookie jar 维持 session):

COOKIE=/tmp/jenkins-cookie

# 1. 拿 crumb (cookie 保存 session)
CRUMB=$(curl -c $COOKIE -b $COOKIE -u admin:bootcamp \
  "http://.../crumbIssuer/api/xml?xpath=concat(//crumbRequestField,%22:%22,//crumb)")
# → "Jenkins-Crumb:abc123..."

# 2. 创 Job (同 cookie + crumb header)
curl -c $COOKIE -b $COOKIE -u admin:bootcamp \
  -X POST -H "$CRUMB" -H "Content-Type:application/xml" \
  --data-binary @/tmp/job.xml \
  "http://.../createItem?name=hello-kaniko"

# 3. Trigger build
curl -c $COOKIE -b $COOKIE -u admin:bootcamp -X POST -H "$CRUMB" \
  "http://.../job/hello-kaniko/build"

Lesson: Jenkins API 自动化要么用 jenkins-cli / python-jenkins library, 要么 cookie jar + crumb 三件套

C3. Job XML 配置 (SCM Pipeline from Gitea)

<flow-definition>
  <definition class="org.jenkinsci.plugins.workflow.cps.CpsScmFlowDefinition">
    <scm class="hudson.plugins.git.GitSCM">
      <userRemoteConfigs>
        <hudson.plugins.git.UserRemoteConfig>
          <url>http://bootcamp:bootcamp@10.0.24.28:30022/bootcamp/notes-app.git</url>
        </hudson.plugins.git.UserRemoteConfig>
      </userRemoteConfigs>
      <branches>
        <hudson.plugins.git.BranchSpec><name>*/main</name></hudson.plugins.git.BranchSpec>
      </branches>
    </scm>
    <scriptPath>Jenkinsfile</scriptPath>     # ← 从 repo 根目录读 Jenkinsfile
  </definition>
</flow-definition>

密码内嵌 URL (bootcamp:bootcamp@...): 学习场景, 生产用 Jenkins Credentials 引用

C4. 完整 build trace — 120s 端到端

Console output 关键片段:

Started by user Jenkins Admin
Obtained Jenkinsfile from git http://bootcamp@10.0.24.28:30022/bootcamp/notes-app.git
[Pipeline] podTemplate
Created Pod: jenkins/hello-kaniko-1-mzb5v-w1khh-3d7z6

[Pipeline] sh
+ git rev-parse --short HEAD
+ GIT_SHA=dev
+ /kaniko/executor --dockerfile=Dockerfile --context=/home/jenkins/agent/workspace/hello-kaniko \
  --destination=10.0.24.28:30002/bootcamp/hello-kaniko:dev \
  --destination=10.0.24.28:30002/bootcamp/hello-kaniko:latest --insecure --skip-tls-verify

INFO[0004] Retrieving image manifest alpine:3.20
INFO[0005] Building stage 'alpine:3.20' [idx: '0', base-idx: '-1']
INFO[0006] RUN echo "Hello from Kaniko build in Jenkins on K8s — Day 8" > /hello.txt
INFO[0006] Taking snapshot of full filesystem...
INFO[0006] CMD ["cat", "/hello.txt"]
INFO[0006] Pushing image to 10.0.24.28:30002/bootcamp/hello-kaniko:dev
INFO[0033] Pushed 10.0.24.28:30002/bootcamp/hello-kaniko@sha256:a3f3e2af...
INFO[0033] Pushing image to 10.0.24.28:30002/bootcamp/hello-kaniko:latest
INFO[0049] Pushed 10.0.24.28:30002/bootcamp/hello-kaniko@sha256:a3f3e2af...

[Pipeline] End of Pipeline
Finished: SUCCESS

时间分解:

  • 0-60s: Pod 调度 + 拉 kaniko-executor:debug image
  • 60-90s: Pod ready + jnlp agent connect
  • 90-95s: git clone notes-app
  • 95-99s: Kaniko 拉 alpine:3.20 + build
  • 99-115s: push hello-kaniko:dev (16s, with 2 layers)
  • 115-129s: push hello-kaniko:latest (16s, 复用 layer 但 manifest 新)
  • Build SUCCESS @ 120s

C5. Harbor 验证

curl -u admin:bootcamp http://10.0.24.28:30002/api/v2.0/projects/bootcamp/repositories

Actual:

bootcamp/hello-kaniko    artifacts=1
bootcamp/nginx            artifacts=1    (Day 7 已 push)

✅ 完整 CI 链路 work: Gitea push → Jenkins SCM trigger → K8s Pod 起 Kaniko → build → push Harbor

C6. 简历可写

落地完整 GitOps CI 流水线 (集群内自包含):

  • Gitea 私有 git (Longhorn PVC + 自管 user / repo)
  • Jenkins master + 动态 K8s Pod agent (no static slave)
  • Kaniko 容器化 build (无 docker daemon,no privileged Pod)
  • Harbor 镜像存储 + 自动 image digest pinning
  • 端到端 120s 从 git commit 到 image 入库

11. Day 8 总结

模块状态端口
Gitea✅ 3 Pod / 5Gi PVC / 2 repohttp://10.0.24.28:30022 admin=bootcamp/bootcamp
Jenkins✅ controller + agent (Kaniko)http://10.0.24.28:30808 admin/bootcamp
Harbor (Day 7)✅ 2 image (nginx + hello-kaniko)http://10.0.24.28:30002 admin/bootcamp

真坑 3 个:

  1. Kyverno enforce → 拦 Gitea Deployment → 临时切 audit
  2. Jenkins plugin pin 死版本 → 依赖冲突 → 让 chart 用默认
  3. Jenkins API CSRF → 必须 cookie jar + crumb 同 session

累积端口:

  • Grafana :32380 / Longhorn :31172 / Hubble UI :30527
  • Harbor :30002 / ArgoCD :30080
  • Gitea :30022 admin=bootcamp/bootcamp (新)
  • Jenkins :30808 admin/bootcamp (新)

99. 当前进度

  • [x] Day 8.A Gitea + notes-app + notes-deploy 2 repo
  • [x] Day 8.B Jenkins + harbor-auth Secret + Kaniko 准备
  • [x] Day 8.C 端到端 build 通 120s (Gitea SCM → Kaniko → Harbor)
在 GitHub 上编辑此页
Prev
Day 8 (alt) — AlertManager 真接入 + PrometheusRule 实战
Next
Day 9: Triton + GPU Metrics + 推理性能对比