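Day 14: CKA/CKS Practice Questions + 14-Day Bootcamp Final Summary
Goal: consolidate 14 days of skills → practice exam-style questions → resume ready to submit. Time: 3-4 hours. Value: turn "I did it" into "I can answer it and write it down".
Part 1: 60 CKA/CKS Practice Questions (grouped by the 14-day skill areas)
1.1 Cluster Installation / Node Management (Day 0-1)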
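Q1. You need to kubeadm join a new worker node, but the join command has been lost. What do you do?
# On any control-plane node, generate a new token and join command:
kubeadm token create --print-join-command
# Default token TTL is 24h. --ttl 0 creates a token that never expires; use it with care in production.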
Q2. The node is NotReady and kubectl describe shows "container runtime is down". How do you debug it?
# Check kubelet
systemctl status kubelet
journalctl -u kubelet --since "5min ago" | tail -50
# Check containerd
systemctl status containerd
journalctl -u containerd --since "5min ago" | tail -50
# Fix: restart both
systemctl restart containerd kubelet
# Permanent fix: check /etc/containerd/config.toml and make sure SystemdCgroup = true
Q3. A worker node has to go offline for maintenance. How do you do it gracefully?
# Step 1: cordon (no new Pods are scheduled onto the node)
kubectl cordon <node>
# Step 2: drain (evict existing Pods, except DaemonSet Pods)
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data --timeout=300s
# Step 3: do the maintenance (reboot, kernel upgrade, etc.)
# Step 4: uncordon (resume scheduling)
kubectl uncordon <node>
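Q4. Upgrading the cluster from K8s 1.30 to 1.31: what is the order?
1. Upgrade the control plane first (one node at a time):
   - apt upgrade kubeadm (kubeadm must be at the target version before planning)
   - kubeadm upgrade plan
   - kubeadm upgrade apply v1.31.0 (first control-plane node only; run kubeadm upgrade node on the others)
   - apt upgrade kubelet kubectl → systemctl daemon-reload && systemctl restart kubelet
2. Then upgrade the workers (one node at a time, drain first):
   - kubectl drain <node> --ignore-daemonsets
   - apt upgrade kubeadm
   - kubeadm upgrade node
   - apt upgrade kubelet kubectl → systemctl daemon-reload && systemctl restart kubelet
   - kubectl uncordon <node>
3. Verify: kubectl get nodes shows v1.31 on every node
Pitfall: Kubernetes can only be upgraded one minor version at a time (1.30→1.31 works; 1.30→1.32 does not).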
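The one-minor-at-a-time rule can be sketched as a small check. This is an illustrative helper, not part of kubeadm; the function names are hypothetical, and patch levels are deliberately ignored.

```python
def parse_minor(version: str) -> tuple[int, int]:
    """Return (major, minor) from a version such as 'v1.31.0' or '1.30'."""
    parts = version.lstrip("v").split(".")
    return int(parts[0]), int(parts[1])


def is_supported_upgrade(current: str, target: str) -> bool:
    """True if target is the same minor or exactly one minor ahead (patch level is not checked)."""
    cur_major, cur_minor = parse_minor(current)
    tgt_major, tgt_minor = parse_minor(target)
    if cur_major != tgt_major:
        return False
    return 0 <= tgt_minor - cur_minor <= 1


def upgrade_path(current: str, target: str) -> list[str]:
    """List the minor versions to step through, one hop at a time."""
    major, cur_minor = parse_minor(current)
    _, tgt_minor = parse_minor(target)
    return [f"{major}.{m}" for m in range(cur_minor + 1, tgt_minor + 1)]


if __name__ == "__main__":
    print(is_supported_upgrade("1.30.2", "v1.31.0"))
    print(upgrade_path("1.30", "1.32"))
```

Going from 1.30 to 1.32 therefore means two full upgrade rounds, each following the control-plane-then-workers order above.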
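1.2 Workloads / Scheduling (Day 3/6)
Q5. A Pod is Pending and kubectl describe shows "0/5 nodes are available: 5 Insufficient cpu". How do you debug it?
# The Pod requests too much
kubectl get pod <pod> -o jsonpath='{.spec.containers[*].resources.requests}'
# Compare with node capacity (the scheduler counts requests, not live usage, so "Allocated resources" is the number that matters)
kubectl top nodes
kubectl describe nodes | grep -A5 "Allocated resources"
# Options:
# 1. Lower the Pod's requests
# 2. Add nodes
# 3. Evict lower-priority Pods to make room (PriorityClass + preemption)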
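To make the "Insufficient cpu" arithmetic concrete, here is a minimal sketch of the comparison the scheduler effectively performs: free CPU is allocatable minus the sum of existing requests, not live usage. The node names and numbers are made up for illustration, and real scheduling also applies other filters.

```python
def parse_cpu(quantity: str) -> int:
    """Convert a Kubernetes CPU quantity ('500m', '2', '1.5') to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(round(float(quantity) * 1000))


def nodes_that_fit(pod_request: str, nodes: dict[str, tuple[str, str]]) -> list[str]:
    """nodes maps name -> (allocatable, already_requested); return the nodes with enough room."""
    need = parse_cpu(pod_request)
    fits = []
    for name, (allocatable, requested) in nodes.items():
        free = parse_cpu(allocatable) - parse_cpu(requested)
        if free >= need:
            fits.append(name)
    return fits


if __name__ == "__main__":
    # Hypothetical cluster state, as read from "Allocated resources"
    cluster = {"worker-1": ("4", "3500m"), "worker-2": ("4", "2")}
    print(nodes_that_fit("1", cluster))
    print(nodes_that_fit("3", cluster))
```

An empty result corresponds to the Pending state in the question: no node has enough unrequested CPU, even if `kubectl top nodes` shows low actual usage.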
Q6. A Deployment must spread its Pods evenly across 3 zones. How do you write it?
spec:
  template:
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule   # strict: the Pod stays Pending if the constraint cannot be met
        labelSelector:
          matchLabels: {app: my-app}
(see Day 6.D)
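A rough sketch of how maxSkew is evaluated for an incoming Pod: the Pod may land in a zone only if that zone's matching-Pod count, after adding the Pod, exceeds the current global minimum by at most maxSkew. The zone names are hypothetical, and the real scheduler also accounts for node eligibility.

```python
def allowed_zones(counts: dict[str, int], max_skew: int = 1) -> list[str]:
    """Return zones where placing one more matching Pod keeps the skew within max_skew."""
    global_min = min(counts.values())
    return [zone for zone, n in counts.items() if (n + 1) - global_min <= max_skew]


if __name__ == "__main__":
    # Two zones already hold a Pod, one is empty: only the empty zone is allowed.
    print(allowed_zones({"zone-a": 1, "zone-b": 1, "zone-c": 0}))
```

With DoNotSchedule, an empty result means the Pod stays Pending; ScheduleAnyway would only use the skew as a scoring preference.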
Q7. A Pod must run on a GPU node, and GPU nodes carry the taint nvidia.com/gpu=present:NoSchedule. How do you write it?
spec:
  template:
    spec:
      nodeSelector:
        accelerator: gpu   # node label
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      containers:
      - resources:
          limits:
            nvidia.com/gpu: 1   # request 1 GPU
Q8. HPA is not working: kubectl get hpa shows TARGETS as "unknown".
# Most likely metrics-server is missing or has a certificate problem (also check that the target Pods set resources.requests)
kubectl get pod -n kube-system | grep metrics-server
kubectl logs -n kube-system <metrics-server-pod>
# Common fix: --kubelet-insecure-tls (acceptable in a lab; in production fix the kubelet serving certificates instead)
kubectl patch deploy metrics-server -n kube-system --type=json -p='[
  {"op": "add", "path": "/spec/template/spec/containers/0/args/-",
   "value": "--kubelet-insecure-tls"}
]'
(see Day 6.E)
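Q9. A namespace runs 10 Pods and their CPU requests must not add up to more than 5 cores. How do you enforce that?
# Once this quota exists, Pods without CPU requests/limits are rejected unless a LimitRange supplies defaults
apiVersion: v1
kind: ResourceQuota
metadata: {name: cpu-limit, namespace: my-ns}
spec:
  hard:
    requests.cpu: "5"
    limits.cpu: "10"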
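1.3 Networking / Service (Day 1/4)
Q10. A Pod's call to a Service fails with "connection refused". The Service has 3 Endpoints behind it, and ping does not work either. How do you troubleshoot?
# Note: a ClusterIP is virtual and often does not answer ICMP (for example with kube-proxy in iptables mode), so test with curl or nc on the Service port rather than ping
# 1. Does the Service really have backend Endpoints?
kubectl get endpoints <svc>
# If empty: Service.spec.selector does not match the Pod labels
# 2. Has the Pod's readiness probe passed? (Pods that fail it are not added to Endpoints)
kubectl describe pod <backend-pod>
# 3. Is a NetworkPolicy blocking the traffic?
kubectl get networkpolicy -n <ns>
# 4. iptables rule for the Service ClusterIP
sudo iptables -t nat -L KUBE-SERVICES -n | grep <svc-cluster-ip>
# 5. Is kube-proxy healthy?
kubectl logs -n kube-system <kube-proxy-pod>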
Q11. After applying a default deny-all NetworkPolicy (Ingress and Egress), Pods cannot reach each other, and DNS (CoreDNS) lookups fail as well. How do you fix DNS?
# With deny-all in place, explicitly allow egress to DNS (kube-system/kube-dns)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata: {name: allow-dns-egress}   # placeholder name; metadata.name is required
spec:
  podSelector: {}
  policyTypes: [Egress]
  egress:
  - to:
    - namespaceSelector:
        matchLabels: {kubernetes.io/metadata.name: kube-system}
      podSelector:
        matchLabels: {k8s-app: kube-dns}
    ports:
    - {protocol: UDP, port: 53}
    - {protocol: TCP, port: 53}
(see Day 4.E: default deny + allowlist)
Q12. Pod-to-Pod traffic across nodes does not work (is Cilium misinstalled?)
# Cilium health
cilium status
cilium connectivity test --test-namespace cilium-test
# Inspect the endpoints
kubectl exec -n kube-system cilium-xxx -- cilium endpoint list
1.4 Storage / Volumes (Day 4-5)
Q13. A PVC is Pending, the StorageClass is Longhorn, and the Pod cannot start. How do you debug it?
kubectl describe pvc <pvc>                                    # check the events
kubectl get sc                                                # does the SC exist, and is it the default?
kubectl get pod -n longhorn-system | grep -E "manager|csi"   # are the Longhorn agents OK?
kubectl get nodes.longhorn.io -n longhorn-system              # are the nodes Schedulable?
(see Day 4.B/C: 6 real Longhorn pitfalls)
Q14. After expanding a PVC by changing spec.resources.requests.storage from 5Gi to 10Gi, df -h still shows 5Gi. What do you do?
# Longhorn has already grown the block device, but the ext4 filesystem has not been resized
kubectl get pvc <pvc> -o yaml | grep -A 2 conditions
# Expect to see FileSystemResizePending: Waiting for user to restart pod
# Restart the Pod so that kubelet triggers resize2fs
kubectl delete pod <pod-using-pvc>
(see Day 5.A for the real run)
Q15. How do you use a VolumeSnapshot?
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata: {name: my-snapshot}
spec:
  volumeSnapshotClassName: longhorn-snapshot-class
  source: {persistentVolumeClaimName: my-pvc}
---
# Restore from the snapshot
apiVersion: v1
kind: PersistentVolumeClaim
metadata: {name: restored-pvc}
spec:
  storageClassName: longhorn
  accessModes: [ReadWriteOnce]              # required field; adjust to match the source PVC
  resources:
    requests: {storage: <size, at least the source PVC size>}
  dataSource:
    name: my-snapshot
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
(see Day 4.C: restored 7B vs 5B, line-count comparison)
1.5 Security (CKS core, Day 5)
Q16. RBAC: create a read-only identity only-view (a ServiceAccount) that can only get/list pods and configmaps in my-ns.
---
apiVersion: v1
kind: ServiceAccount
metadata: {name: only-view, namespace: my-ns}
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata: {name: pod-cm-reader, namespace: my-ns}
rules:
- apiGroups: [""]
  resources: ["pods", "configmaps"]
  verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata: {name: only-view-binding, namespace: my-ns}
subjects:
- kind: ServiceAccount
  name: only-view
  namespace: my-ns
roleRef:
  kind: Role
  name: pod-cm-reader
  apiGroup: rbac.authorization.k8s.io
(see Day 5.C: verified with 7 scenarios)
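As a mental model for what the Role above grants, the sketch below matches a request against a list of rules. It is a deliberately simplified illustration: real RBAC evaluation also handles subresources, resourceNames, non-resource URLs, and aggregation, and the `Rule` class here is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    api_groups: tuple[str, ...]
    resources: tuple[str, ...]
    verbs: tuple[str, ...]


def allows(rules: list[Rule], verb: str, resource: str, api_group: str = "") -> bool:
    """True if any rule matches the API group, resource, and verb ('*' is a wildcard)."""
    for rule in rules:
        group_ok = api_group in rule.api_groups or "*" in rule.api_groups
        resource_ok = resource in rule.resources or "*" in rule.resources
        verb_ok = verb in rule.verbs or "*" in rule.verbs
        if group_ok and resource_ok and verb_ok:
            return True
    return False  # RBAC is additive: anything not granted is denied


# Mirrors the pod-cm-reader Role from the manifest above
POD_CM_READER = [Rule(("",), ("pods", "configmaps"), ("get", "list"))]

if __name__ == "__main__":
    print(allows(POD_CM_READER, "list", "pods"))
    print(allows(POD_CM_READER, "delete", "pods"))
```

Against a live cluster, the same questions are answered by `kubectl auth can-i` with `--as=system:serviceaccount:my-ns:only-view`.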
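Q17. How do you use kubectl auth can-i?
# Can the current user create Deployments?
kubectl auth can-i create deployments
# Impersonate a specific ServiceAccount (add -n so the check runs in its namespace)
kubectl auth can-i create deployments -n my-ns \
  --as=system:serviceaccount:my-ns:only-view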
Q18. After enforcing PSA restricted, a Pod fails to start with the error "violates PodSecurity restricted". How do you change the Pod to make it compliant?
spec:
  securityContext:                 # Pod level
    runAsNonRoot: true
    runAsUser: 65534
    seccompProfile: {type: RuntimeDefault}
  containers:
  - securityContext:               # container level
      allowPrivilegeEscalation: false
      runAsNonRoot: true
      runAsUser: 65534
      readOnlyRootFilesystem: true   # extra hardening; not checked by the restricted profile itself
      capabilities: {drop: [ALL]}
(see Day 5.D for the 5-item checklist)
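Q19. By default a Secret is only base64-encoded in etcd, not encrypted. How do you really encrypt it?
# /etc/kubernetes/encryption-config.yaml
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
- resources: [secrets]
  providers:
  - aescbc:
      keys:
      - name: key1
        secret: <base64-encoded 32-byte key>
  - identity: {}   # fallback so existing unencrypted Secrets can still be read
# Add --encryption-provider-config=... to kube-apiserver and mount the file into the static Pod (hostPath volume)
# The apiserver restarts automatically when its static Pod manifest changes
# Rewrite all Secrets so that encryption actually takes effect
kubectl get secrets -A -o json | kubectl replace -f -
(see Day 5.E: a direct etcdctl read changes from plaintext to ciphertext)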
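The `secret:` field above expects a base64-encoded random key. A minimal sketch of generating one with the Python standard library follows; the function name is hypothetical, and the 32-byte length matches the key size stated in the config.

```python
import base64
import secrets


def new_aescbc_key(num_bytes: int = 32) -> str:
    """Return num_bytes of cryptographically secure random data, base64-encoded."""
    return base64.b64encode(secrets.token_bytes(num_bytes)).decode("ascii")


if __name__ == "__main__":
    # Paste the printed value into the `secret:` field; treat it like any other credential.
    print(new_aescbc_key())
```

Keep the generated key out of Git: anyone holding both the key and an etcd snapshot can decrypt the Secrets.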
Q20. Write a Kyverno policy that rejects image: latest.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata: {name: disallow-latest-tag}
spec:
  validationFailureAction: Enforce
  rules:
  - name: require-image-tag
    match:
      any:
      - resources: {kinds: [Pod]}
    validate:
      message: "Image must have tag (not :latest)"
      pattern:
        spec:
          containers:
          - image: "!*:latest & *:*"
(see Day 5.F)
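The intent of the pattern (a tag must be present, and it must not be latest) can be sketched as a plain function. This is not how Kyverno evaluates its wildcard pattern: the sketch ignores a registry port when looking for the tag, which the simple `*:*` glob does not, and digests get no special handling. Function names are hypothetical.

```python
from typing import Optional


def image_tag(image: str) -> Optional[str]:
    """Return the tag of an image reference, or None if it has no tag."""
    last_segment = image.rsplit("/", 1)[-1]   # skip any registry host:port prefix
    if ":" not in last_segment:
        return None
    return last_segment.rsplit(":", 1)[1]


def violates_policy(image: str) -> bool:
    """True if the image has no tag or uses the mutable 'latest' tag."""
    tag = image_tag(image)
    return tag is None or tag == "latest"


if __name__ == "__main__":
    for ref in ["nginx", "nginx:latest", "nginx:1.27", "registry.local:5000/app"]:
        print(ref, violates_policy(ref))
```

An untagged image is treated as a violation because the runtime resolves it to `latest` implicitly.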
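1.6 etcd / Disaster Recovery (Day 12)
Q21. The etcd data on a single control-plane node is completely lost. How do you recover?
# All etcdctl commands below also need --endpoints, --cacert, --cert and --key
# 1. The cluster still has 2/3 quorum, so remove the dead member first
etcdctl member remove <dead-member-id>
# 2. On the dead control-plane node, move the etcd manifest out of /etc/kubernetes/manifests/ and clear the data dir
mv /var/lib/etcd /var/lib/etcd-broken-backup
# 3. Add the member back
etcdctl member add <cp-name> --peer-urls=https://<cp-ip>:2380
# 4. Edit the etcd manifest: --initial-cluster-state=existing plus the full initial-cluster list
# 5. Start the etcd Pod (mv the manifest back to /etc/kubernetes/manifests/)
# etcd syncs its data from the leader automatically
Total: 2 min recovery (measured in Day 12.A)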
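The reason a 3-member cluster survives this scenario is quorum arithmetic: etcd needs a majority of members to accept writes. A short sketch of the standard majority formula:

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with the given number of members."""
    return members // 2 + 1


def failures_tolerated(members: int) -> int:
    """How many members can fail while the cluster keeps quorum."""
    return members - quorum(members)


if __name__ == "__main__":
    for n in (1, 3, 4, 5):
        print(n, quorum(n), failures_tolerated(n))
```

A 4-member cluster tolerates no more failures than a 3-member one, which is why odd member counts are the usual choice. Losing 2 of 3 members breaks quorum and calls for the snapshot-restore path instead of member remove/add.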
Q22. The whole cluster is dead (data-center power loss) and only a snapshot is left. How do you bring it back?
# 1. Restore on the node that still has the snapshot
ETCDCTL_API=3 etcdctl snapshot restore snapshot.db \
  --name k8s-cp-1 \
  --initial-cluster k8s-cp-1=https://<cp1-ip>:2380 \
  --initial-cluster-token new-token \
  --initial-advertise-peer-urls https://<cp1-ip>:2380 \
  --data-dir /var/lib/etcd-new
# 2. mv /var/lib/etcd-new → /var/lib/etcd
# 3. Edit the etcd manifest: --initial-cluster-state=new (this is a new cluster)
# 4. Start cp1; the other control-plane nodes rejoin with kubeadm reset + kubeadm join
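Q23. The apiserver certificate is about to expire (7 days left). How do you renew it?
kubeadm certs check-expiration
kubeadm certs renew all        # run on every control-plane node
# The static Pods (apiserver / controller-manager / scheduler / etcd) must restart to load the new certs.
# Restarting kubelet alone does not recreate them; move the manifests out, wait, and move them back:
mkdir -p /tmp/manifests && mv /etc/kubernetes/manifests/*.yaml /tmp/manifests/
sleep 20
mv /tmp/manifests/*.yaml /etc/kubernetes/manifests/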
Q24. A worker node runs out of memory and all of its Pods are Evicted. How do you handle it?
# 1. Drain the node (move any remaining Pods away)
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
# 2. Find the culprit
kubectl top pods -A --sort-by=memory | head -10
kubectl delete pod <mem-hog>
# 3. Long-term defenses
# - namespace LimitRange + ResourceQuota
# - Kyverno require-resources policy in enforce mode
# - Prometheus alert on mem_util > 85%
(see Day 12.B)
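Q25. Cilium is down across the whole cluster and new Pods are stuck in Pending. What do you do?
# Check
kubectl get pods -n kube-system -l k8s-app=cilium
# If the list is empty: reinstall with helm upgrade
helm upgrade cilium cilium/cilium --reuse-values -n kube-system
# cilium-operator removes the node taint automatically once the agent is ready
sleep 60
kubectl get nodes -o json | grep -B 2 -A 3 agent-not-ready   # taint key: node.cilium.io/agent-not-ready; no output means the taint is gone
(see Day 12.C)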
1.7 Observability / Monitoring (Day 6/9)
Q26. Prometheus is installed in the monitoring namespace. How do you make it scrape metrics from a newly deployed app?
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: my-app-ns
  labels:
    release: kps   # ⚠️ kube-prometheus-stack selects ServiceMonitors by this label
spec:
  selector:
    matchLabels: {app: my-app}
  endpoints:
  - port: metrics   # must match Service.ports.name
    interval: 15s
    path: /metrics
(see Day 6.F + Day 11.D)
Q27. Write a custom alert with PrometheusRule.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: bootcamp-rules   # placeholder name; metadata.name is required
  labels: {release: kps}
spec:
  groups:
  - name: bootcamp
    rules:
    - alert: NodeCPUHigh
      expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
      for: 2m
      labels: {severity: warning}
      annotations:
        summary: 'Node CPU > 80%'
(see Day 8 alt.B)
Q28. How do you connect AlertManager to DingTalk?
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: dingtalk-routes
  labels: {alertmanagerConfig: my-org}
spec:
  route:
    receiver: dingtalk
    routes:
    - matchers: [{name: severity, value: critical}]
      receiver: dingtalk
  receivers:
  - name: dingtalk
    webhookConfigs:
    - url: http://prometheus-webhook-dingtalk.monitoring:8060/dingtalk/critical/send
      sendResolved: true
(see Day 8 alt.A: verified with webhook-mock)
1.8 AI Infra / GPU (Day 8-10)
Q29. Which components does a node need in order to run LLM inference?
1. NVIDIA driver (on the node)
2. nvidia-container-toolkit (container runtime integration)
3. containerd configured with the nvidia runtime
4. NVIDIA Device Plugin (DaemonSet; makes the K8s scheduler aware of GPUs)
5. (optional) DCGM Exporter (GPU metrics into Prometheus)
6. (optional) GPU Operator (installs all of the above in one step)
(see Day 8.B)
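Q30. How do you partition an A100/A800?
# 1. Enable MIG mode
nvidia-smi -i 0 -mig 1
nvidia-smi --gpu-reset -i 0
# 2. Create the GPU instances (GI) and compute instances (CI)
nvidia-smi mig -cgi 2g.10gb,2g.10gb,2g.10gb -C
# 3. Configure the Device Plugin with MIG_STRATEGY=single
(see Day 10.A: the real pitfall of Xorg holding the GPU)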
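Q31. Which metric should the HPA for a vLLM service be based on?
Not CPU/Memory (CPU utilization of an LLM service is always low; the GPU is the bottleneck)
Not GPU utilization (always 80%+; it stays high even after scaling to max)
Correct: vllm:num_requests_waiting (queue depth), exposed to the HPA through a metrics adapter such as prometheus-adapter or KEDA
(see Day 10.C)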
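To see why queue depth scales sensibly, here is a sketch of the standard HPA calculation, desired = ceil(current × metric / target), with the default 10% tolerance band. The target of 4 waiting requests per Pod and the replica bounds are made-up values for illustration.

```python
import math


def desired_replicas(current: int, metric: float, target: float,
                     min_replicas: int = 1, max_replicas: int = 10,
                     tolerance: float = 0.1) -> int:
    """Replica count the HPA would aim for, given the per-Pod average metric value."""
    ratio = metric / target
    if abs(ratio - 1.0) <= tolerance:
        return current                      # close enough to target: no scaling
    wanted = math.ceil(current * ratio)
    return max(min_replicas, min(max_replicas, wanted))


if __name__ == "__main__":
    # 2 vLLM Pods, each averaging 12 waiting requests, hypothetical target of 4 per Pod
    print(desired_replicas(current=2, metric=12.0, target=4.0, max_replicas=8))
```

Unlike GPU utilization, the waiting-queue metric drops as replicas are added, so the ratio converges toward 1 and scaling settles.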
Q32. Can a 7B model run on a 10GB GPU slice?
Qwen2.5-7B (BF16): 14 GB → does not fit
Qwen2.5-7B-AWQ Q4: 3.6 GB model + ~2 GB KV cache → 5.6 GB ✅
(see Day 10.B)
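The figures above follow from a back-of-the-envelope estimate: weight memory is roughly parameters × bits per parameter / 8. The sketch below uses a round 7B parameter count and the ~2 GB KV cache figure from the text; it ignores activations, quantization metadata, and runtime overhead, so real numbers run slightly higher (the text's measured 3.6 GB versus the raw 3.5 GB estimate).

```python
def weight_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate weight memory in GB (1e9 params x bits / 8 bytes, with 1 GB = 1e9 bytes)."""
    return params_billion * bits_per_param / 8


def fits(params_billion: float, bits_per_param: int,
         kv_cache_gb: float, slice_gb: float) -> bool:
    """True if weights plus KV cache fit in the GPU slice (overhead not modeled)."""
    return weight_gb(params_billion, bits_per_param) + kv_cache_gb <= slice_gb


if __name__ == "__main__":
    print(weight_gb(7, 16), fits(7, 16, kv_cache_gb=2, slice_gb=10))  # BF16
    print(weight_gb(7, 4), fits(7, 4, kv_cache_gb=2, slice_gb=10))    # 4-bit AWQ
```

The same estimate explains why a 2g.10gb MIG slice is enough for the 4-bit model but not for the BF16 one.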
1.9 CI/CD / GitOps (Day 7-8)
Q33. Jenkins runs inside the K8s cluster. How do you make a Pipeline use dynamic K8s agents?
pipeline {
  agent {
    kubernetes {
      yaml '''
apiVersion: v1
kind: Pod
spec:
  containers:
  - name: kaniko
    image: gcr.io/kaniko-project/executor:debug
    command: [/busybox/cat]
    tty: true
'''
    }
  }
  stages {
    stage('build') {
      steps {
        container('kaniko') {
          sh '/kaniko/executor --dockerfile=Dockerfile --context=$(pwd) --destination=...'
        }
      }
    }
  }
}
(see Day 8 extra + Day 11)
Q34. How do you write an ArgoCD Application?
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata: {name: my-app, namespace: argocd}
spec:
  project: default
  source:
    repoURL: http://gitea.gitea.svc:3000/team/manifest.git
    targetRevision: HEAD
    path: prod
  destination:
    server: https://kubernetes.default.svc
    namespace: prod
  syncPolicy:
    automated: {prune: true, selfHeal: true}
    syncOptions: [CreateNamespace=true]
(see Day 7.B + Day 11.C)
Q35. ArgoCD rejects an HTTP repo. What do you do?
apiVersion: v1
kind: Secret
metadata:
  name: gitea-repo
  namespace: argocd
  labels: {argocd.argoproj.io/secret-type: repository}
stringData:
  type: git
  url: http://gitea.gitea.svc:3000/team/repo.git
  username: u
  password: p
  insecure: "true"
(see Day 11, real pitfall #1)
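1.10 Advanced / Integrated (Day 11/13)
Q36-60: see the extended question bank PDF
(The 35 core questions of the 60 CKA/CKS questions are covered above. The 25 supplementary questions cover topics including Webhook / CRD / Operator / Cilium L7 / Hubble / Volume Snapshot and more, and can be reviewed against each Day's mini-book chapters.)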
Part 2: 14-Day Bootcamp Final Summary
2.1 Cluster Capability Map
┌──────────────────────────────────────────────────────────────────────┐
│  5-node K8s 1.30 main cluster (10.0.24.0/24)                         │
│                                                                      │
│  Control Plane (HA):                                                 │
│  ├─ 3× kubeadm cp + per-node HAProxy (Day 1)                         │
│  ├─ etcd 3-member quorum (Day 12, 2-min recovery demo)               │
│  └─ apiserver audit + EncryptionConfig aescbc-256 (Day 5/6)          │
│                                                                      │
│  Network:                                                            │
│  ├─ Cilium 1.16.5 (Day 1) + Hubble Relay/UI (Day 4.D)                │
│  ├─ NetworkPolicy + CiliumNetworkPolicy L7 (Day 4.E/F)               │
│  ├─ WireGuard transparent encryption (Day 7.C)                       │
│  └─ node-local-dns (Day 1)                                           │
│                                                                      │
│  Storage:                                                            │
│  ├─ Longhorn 5-node 634GB pool (Day 4.B; second disk in Day 5.A)     │
│  ├─ VolumeSnapshot CRD (Day 4.C)                                     │
│  └─ MIG 3 slices on the GPU node (Day 10.A)                          │
│                                                                      │
│  Workload Platform:                                                  │
│  ├─ SimpleApp + LLMService Operator (Day 3 + Day 13.A)               │
│  ├─ Harbor private registry (Day 7.A)                                │
│  ├─ ArgoCD GitOps (Day 7.B / Day 11)                                 │
│  ├─ Gitea + Jenkins + Kaniko (Day 8 extra)                           │
│  └─ Kyverno admission + PSA enforce (Day 5)                          │
│                                                                      │
│  Observability:                                                      │
│  ├─ kube-prometheus-stack (Day 6.F): 28 targets up                   │
│  ├─ Loki + Promtail logging (Day 6.G)                                │
│  ├─ AlertManager + 5 custom PrometheusRules (Day 8 alt)              │
│  └─ DCGM Exporter scraped by Prometheus across the WAN (Day 9)       │
│                                                                      │
│  Disaster Recovery (Day 12):                                         │
│  ├─ etcd snapshot save/restore                                       │
│  ├─ Cilium outage recovered via helm upgrade                         │
│  ├─ OOM cascade contained with LimitRange                            │
│  └─ 5 emergency SOP runbooks                                         │
└──────────────────────────────────────────────────────────────────────┘
          ↕ across the WAN (SSH tunnel + Service+Endpoints + public internet)
┌──────────────────────────────────────────────────────────────────────┐
│  GPU node, k3s 1.35 (A800-40G, MIG ×3)                               │
│                                                                      │
│  ├─ NVIDIA Container Toolkit + Device Plugin (MIG single mode)       │
│  ├─ vLLM Qwen2.5-3B + Qwen2.5-7B-AWQ (Day 8/10)                      │
│  ├─ DCGM Exporter, 19 GPU metrics                                    │
│  └─ chat-ui (Day 11) + RAG ChromaDB (Day 13.D design)                │
└──────────────────────────────────────────────────────────────────────┘
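2.2 Key Figures from the 14 Days
| Item | Figure |
|---|---|
| Documents | 14 mini-books (each ~5,000+ characters plus a 6-dimension log) |
| Total document length | ~120,000 characters |
| Real pitfalls hit | 50+ (aggregated across days) |
| Cluster nodes | 5 main + 1 GPU |
| Components installed | 20+ (Cilium / Longhorn / Harbor / ArgoCD / Jenkins / Gitea / Prometheus / Grafana / Loki / Hubble / vLLM / DCGM / Kyverno / KEDA / chainlit / etc.) |
| Capability points | 50+ (5-8 sub-skills per Day) |
| Cross-WAN challenge | 1 GPU node debugged over the public internet + 3 tunnel approaches |
| AI inference | A800-40G + Qwen2.5-3B, 30 concurrent requests, 2901 tok/s, P99 1.3s |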
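2.3 Full Resume (ready to submit)
Personal Technical Project: Kubernetes + AI Infra Platform (self-study project, 2026)
14 days of intensive self-study: built a production-grade K8s platform plus AI Infra from scratch, integrating 20+ mainstream components, with 7 Days of full hands-on labs and 7 Days of integrated delivery.
Core Capabilities (by module)
Cluster and Networking
- HA K8s 1.30 cluster (3 masters + 2 workers + 1 GPU node across the WAN) deployed with kubeadm, with per-node HAProxy in place of a keepalived VIP
- Cilium 1.16 CNI (eBPF + Hubble + L7 NetworkPolicy + WireGuard transparent encryption mesh)
- node-local-dns + CoreDNS + cross-cluster SSH tunnel (in place of ClusterMesh)
Storage and Backup
- Longhorn distributed block storage (5-node 634GB pool, 3-replica anti-affinity, VolumeSnapshots in seconds)
- Online PVC expansion + node drain without data loss
- etcd snapshot save/restore (loss of a single control-plane node recovered in 2 minutes, verified in practice)
Security and Admission
- Least-privilege RBAC ServiceAccounts (verified with 7 SAR scenarios)
- Pod Security Admission restricted enforce
- Secret encryption at rest (EncryptionConfiguration aescbc-256, rolling restart of 3 control-plane nodes with zero downtime)
- Kyverno + PSA + LimitRange + ResourceQuota as layered defenses
Observability
- kube-prometheus-stack (Prometheus / Grafana / AlertManager / 28 targets up)
- Loki + Promtail log aggregation (5 nodes + 7 namespaces)
- DCGM Exporter scraped across the WAN by the main cluster's Prometheus (19 GPU metrics in real time)
- AlertManager wired up for real (AlertmanagerConfig CRD, 28 firing alerts observed)
CI/CD and GitOps
- Harbor private image registry (8 components, Trivy vulnerability scanning)
- Jenkins + Kaniko (no Docker daemon) build → push to Harbor; 6 min from git commit to image
- ArgoCD GitOps (Application + ApplicationSet + App of Apps), selfHeal + prune
- Self-hosted Gitea (closed loop inside the cluster)
AI Infra (GPU + LLM)
- NVIDIA A800-40G + MIG 3×2g.10gb hardware partitioning, Device Plugin MIG_STRATEGY=single
- vLLM serving Qwen2.5-3B (BF16) + Qwen2.5-7B-AWQ (Q4 quantization, only 3.8GB of GPU memory)
- Performance benchmark: 30 concurrent requests, 2901 tok/s, P99 1.3s, GPU util 79%
- Self-built LLM Operator (kubebuilder + controller-runtime): LLMService CRD that creates the Deployment + Service + HPA automatically
- chainlit chat UI end to end + Prometheus business metrics + cross-cluster calls to vLLM
- RAG design (ChromaDB + sentence-transformers + LLM retrieval augmentation)
Disaster Recovery / SRE
- Wrote 5 SOPs for real production incidents (etcd / OOM / Cilium / cert / full-cluster disaster)
- Tested injection and recovery for 3 disaster types (etcd node loss / OOM cascade / Cilium fully down)
- Average RTO < 25 min
Operator / Custom Resources
- kubebuilder v4 + controller-runtime implementing two CRDs, SimpleApp and LLMService
- Reconcile trio (Deployment/Service/HPA) + Finalizer + OwnerReference cascade + status state machine
Technical Debt / Known Limitations
Pitfalls hit, workarounds taken, and items planned but not finished, which I also bring up proactively in interviews:
- Joining a cross-WAN node to the main cluster failed because Cilium VXLAN could not be made to work → switched to a standalone k3s + SSH tunnel; a real deployment should use Karmada / Cilium ClusterMesh
- Karmada could not be fully installed because of mirror restrictions in mainland China, but the full design and a ClusterMesh alternative are documented
- k3s CoreDNS stayed NotReady in this environment for a long time; workload Pods bypass it with dnsConfig (production would need a proper root-cause debug)
- The 16GB Triton image could not be pulled across the WAN; the model repo and full YAML are written, and the actual demo uses vLLM instead
These are not failures; they are **"did it + know why it failed + know how to do it right in production"**.
Project Repository
14 days of documentation (each Day at mini-book level, ~5,000+ characters):
~/Downloads/k8s-2week-bootcamp/Day{0..14}-*.md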
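99. Day 14 Complete + 14-Day Bootcamp Wrap-Up
- [x] Part 1: 35 CKA/CKS practice questions with answers (mapped to the 14-day skill areas)
- [x] Part 2.1: 14-day cluster capability map
- [x] Part 2.3: full resume (ready to submit)
- [x] All 14-day mini-book documents archived
Final results:
- 14 mini-book documents, ~120K characters in total
- 50+ real pitfalls (all documented)
- 20+ mainstream components used hands-on
- 1 complete AI inference platform
- 1 self-built LLM Operator
- 1 set of SRE disaster recovery SOPs
🎉 K8s + AI Infra Bootcamp: 14-day loop complete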