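Day 13: LLM Operator + Federation + Mesh + RAG
Goal: cover four in-depth topics that come up often in interviews, in one day
Time: 5-6 hours (hands-on work is mainly Day 13.A and Day 13.D; B and C are mostly theory)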
0. TL;DR
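| Part | Topic | Hands-on depth |
|---|---|---|
| A: LLM Operator (LLMService CRD) | Evolves the Day 3 SimpleApp | ✅ Real build, CRD installed on the cluster |
| B: Karmada multi-cluster federation | Main cluster + GPU cluster, cross-cluster | 📝 Install blocked (image pulls need a domestic mirror), full concepts covered |
| C: Istio Ambient vs Cilium SM | Choosing a sidecar-less mesh | 📝 Mostly a theoretical comparison |
| D: RAG with ChromaDB + embeddings | Extends chat-ui with retrieval augmentation | ✅ chat-app actually extended, deployment documented |
1. LLM Operator: a natural evolution of the Day 3 SimpleApp ✅
1.1 Design: the LLMService CRD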
apiVersion: apps.bootcamp.local/v1
kind: LLMService
metadata: {name: qwen-3b}
spec:
  model: Qwen/Qwen2.5-3B-Instruct
  quantization: ""              # "" | awq | gptq | fp8
  maxModelLen: 2048
  gpuMemoryUtilization: "0.85"
  replicas: 1
  hfEndpoint: https://hf-mirror.com
  autoscaling:                  # optional
    minReplicas: 1
    maxReplicas: 3
    waitingThreshold: 5         # scale out when vllm:num_requests_waiting > 5
status:
  phase: Serving                # Pending | Loading | Serving | Scaling | Failed
  endpoint: http://qwen-3b.default.svc.cluster.local:8000/v1
  readyReplicas: 1
  message: "All 1 replicas serving Qwen/Qwen2.5-3B-Instruct"
1.2 Reconciler flow
LLMService CR created
  ↓
LLMServiceReconciler.Reconcile
  ├─ Add finalizer (prevents K8s from garbage-collecting the CR immediately)
  ├─ reconcileDeployment (create the vLLM Deployment)
  │   ├─ runtimeClassName: nvidia
  │   ├─ resources.limits: nvidia.com/gpu: 1
  │   ├─ args: --model ... --quantization ... --max-model-len ...
  │   ├─ env: HF_ENDPOINT (domestic mirror)
  │   └─ readinessProbe: /health on :8000
  ├─ reconcileService (ClusterIP :8000)
  ├─ reconcileHPA (based on the vllm:num_requests_waiting external metric)
  └─ updateStatus (write back Endpoint / Phase / ReadyReplicas)
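The updateStatus step can be sketched in a few lines. This is a minimal Python illustration, not the operator's Go code: the function name `compute_status` and the rules that map replica counts to a phase are assumptions. Only the phase names, the endpoint format, and the "All N replicas serving ..." message come from the CRD example above.

```python
from dataclasses import dataclass


@dataclass
class Status:
    phase: str
    endpoint: str
    ready_replicas: int
    message: str


def compute_status(name: str, namespace: str, model: str,
                   desired: int, ready: int, failed: bool = False) -> Status:
    """Map observed Deployment state to an LLMService status (hypothetical rules)."""
    endpoint = f"http://{name}.{namespace}.svc.cluster.local:8000/v1"
    if failed:
        phase, message = "Failed", "Deployment reported a failure"
    elif ready == 0:
        # No Pod is ready yet: vLLM is still pulling the image or loading weights.
        phase, message = "Loading", f"Waiting for {desired} replicas to become ready"
    elif ready < desired:
        phase, message = "Scaling", f"{ready}/{desired} replicas ready"
    else:
        phase, message = "Serving", f"All {ready} replicas serving {model}"
    return Status(phase, endpoint, ready, message)


status = compute_status("qwen-3b", "default", "Qwen/Qwen2.5-3B-Instruct",
                        desired=1, ready=1)
print(status.phase, status.endpoint)
```

Deriving the status purely from observed state keeps the reconciler idempotent: running it again with the same inputs writes the same status.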
1.3 Key code snippets
types.go: Spec fields (with kubebuilder markers):
type LLMServiceSpec struct {
	// +kubebuilder:validation:MinLength=3
	Model string `json:"model"`
	// +kubebuilder:validation:Enum=;awq;gptq;fp8
	Quantization string `json:"quantization,omitempty"`
	// +kubebuilder:default=2048
	// +kubebuilder:validation:Minimum=512
	MaxModelLen int32 `json:"maxModelLen,omitempty"`
	// +kubebuilder:default="0.85"
	GPUMemoryUtilization string `json:"gpuMemoryUtilization,omitempty"`
	// +kubebuilder:default=1
	Replicas int32 `json:"replicas,omitempty"`
	// +kubebuilder:default="docker.m.daocloud.io/vllm/vllm-openai:v0.6.5"
	Image string `json:"image,omitempty"`
	Autoscaling *AutoscalingSpec `json:"autoscaling,omitempty"`
}
reconciler: core of desiredDeployment (excerpt):
func (r *LLMServiceReconciler) desiredDeployment(llm *LLMService) *appsv1.Deployment {
	// servedName and image are resolved elsewhere (omitted from this excerpt).
	args := []string{
		"--model", llm.Spec.Model,
		"--host", "0.0.0.0",
		"--port", "8000",
		"--gpu-memory-utilization", llm.Spec.GPUMemoryUtilization,
		"--max-model-len", fmt.Sprintf("%d", llm.Spec.MaxModelLen),
		"--served-model-name", servedName,
	}
	// Only pass --quantization when the spec sets one.
	if llm.Spec.Quantization != "" {
		args = append(args, "--quantization", llm.Spec.Quantization)
	}
	return &appsv1.Deployment{
		Spec: appsv1.DeploymentSpec{
			// The GPU cannot be shared, so the old Pod must stop before the new one starts.
			Strategy: appsv1.DeploymentStrategy{Type: appsv1.RecreateDeploymentStrategyType},
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RuntimeClassName: ptr("nvidia"),
					DNSPolicy:        corev1.DNSNone, // workaround: bypass the broken K3s CoreDNS
					DNSConfig: &corev1.PodDNSConfig{
						Nameservers: []string{"223.5.5.5", "8.8.8.8"},
					},
					Containers: []corev1.Container{{
						Name:  "vllm",
						Image: image,
						Args:  args,
						Resources: corev1.ResourceRequirements{
							Limits: corev1.ResourceList{
								"nvidia.com/gpu": resource.MustParse("1"),
							},
						},
						ReadinessProbe: ...,
					}},
					Volumes: []corev1.Volume{
						{Name: "hf-cache", HostPath: ...},
						// In-memory /dev/shm volume, capped at 4Gi.
						{Name: "dshm", VolumeSource: corev1.VolumeSource{EmptyDir: &corev1.EmptyDirVolumeSource{
							Medium:    corev1.StorageMediumMemory,
							SizeLimit: ptr(resource.MustParse("4Gi")),
						}}},
					},
				},
			},
		},
	}
}
reconciler: desiredHPA (External metric on vllm:num_requests_waiting):
Metrics: []autoscalingv2.MetricSpec{{
	Type: autoscalingv2.ExternalMetricSourceType,
	External: &autoscalingv2.ExternalMetricSource{
		Metric: autoscalingv2.MetricIdentifier{
			Name: "vllm_num_requests_waiting",
			Selector: &metav1.LabelSelector{
				MatchLabels: map[string]string{"deployment": llm.Name},
			},
		},
		Target: autoscalingv2.MetricTarget{
			Type:         autoscalingv2.AverageValueMetricType,
			AverageValue: &avg,
		},
	},
}}
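To see what an AverageValue target means in practice, here is a small Python sketch of the replica count the HPA would aim for. It follows the documented HPA rule for external metrics with an average-value target (total metric divided by the per-Pod target, rounded up, skipped when within the default 10% tolerance). Stabilization windows and scaling policies are left out, and the function name is hypothetical.

```python
import math


def desired_replicas(total_waiting: float, target_avg: float, current: int,
                     min_replicas: int, max_replicas: int,
                     tolerance: float = 0.1) -> int:
    """Approximate the HPA decision for an External metric with AverageValue."""
    # Ratio of the observed per-Pod average to the target per-Pod average.
    ratio = total_waiting / (target_avg * current)
    if abs(1.0 - ratio) <= tolerance:
        wanted = current  # close enough to target: no scaling
    else:
        wanted = math.ceil(total_waiting / target_avg)
    # Clamp to the autoscaling bounds from the LLMService spec.
    return max(min_replicas, min(max_replicas, wanted))


# 18 waiting requests, threshold 5 per replica, bounds 1..3 as in the CRD example.
print(desired_replicas(18, 5, current=1, min_replicas=1, max_replicas=3))
```

With waitingThreshold 5 and 18 queued requests the raw answer is 4 replicas, which maxReplicas then caps at 3.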
1.4 Verification ✅
# Build
make generate && make manifests && make build
→ bin/manager (73 MB)
# Install the CRD
kubectl apply -f config/crd/bases/apps.bootcamp.local_llmservices.yaml
→ customresourcedefinition/llmservices.apps.bootcamp.local created
# Create an LLMService CR
kubectl apply -f - <<EOF
apiVersion: apps.bootcamp.local/v1
kind: LLMService
metadata: {name: qwen-3b}
spec:
  model: Qwen/Qwen2.5-3B-Instruct
  replicas: 1
EOF
→ llmservice/qwen-3b created
# Print columns render automatically (from //+kubebuilder:printcolumn)
kubectl get llmservices
→ NAME      MODEL                      QUANT   REPLICAS   READY   PHASE   ENDPOINT   AGE
  qwen-3b   Qwen/Qwen2.5-3B-Instruct           1                                     3s
# Verify defaults
kubectl get llmservice qwen-3b -o yaml | grep -E "image:|maxModelLen:|hfEndpoint:"
  hfEndpoint: https://hf-mirror.com   ← default filled in automatically
  image: docker.m.daocloud.io/vllm/vllm-openai:v0.6.5
  maxModelLen: 2048
# Enum validation
kubectl apply ... quantization: invalid_value
→ Error: Unsupported value "invalid_value": supported values "awq", "gptq", "fp8"
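The defaulting and validation verified above are done by the API server from the generated CRD schema. The Python sketch below mirrors the same rules, for example as a pre-commit lint on LLMService YAML. The helper name and error wording are assumptions; the default values and limits are taken from the kubebuilder markers and the kubectl output in this section.

```python
ALLOWED_QUANT = {"", "awq", "gptq", "fp8"}
DEFAULTS = {
    "quantization": "",
    "maxModelLen": 2048,
    "gpuMemoryUtilization": "0.85",
    "replicas": 1,
    "image": "docker.m.daocloud.io/vllm/vllm-openai:v0.6.5",
    "hfEndpoint": "https://hf-mirror.com",
}


def apply_defaults_and_validate(spec: dict) -> dict:
    """Return the spec with defaults filled in, or raise ValueError."""
    out = {**DEFAULTS, **spec}  # user-provided fields win over defaults
    if len(out.get("model", "")) < 3:
        raise ValueError("model: must be at least 3 characters")
    quant = out["quantization"]
    if quant not in ALLOWED_QUANT:
        raise ValueError(f'quantization: unsupported value "{quant}"')
    if out["maxModelLen"] < 512:
        raise ValueError("maxModelLen: must be >= 512")
    return out


filled = apply_defaults_and_validate({"model": "Qwen/Qwen2.5-3B-Instruct", "replicas": 1})
print(filled["image"], filled["maxModelLen"])
```

A check like this only shortens the feedback loop; the CRD schema on the cluster remains the source of truth.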
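1.5 Industrial value of an LLM Operator
| Use case | Explanation |
|---|---|
| PaaS for AI teams | Data scientists who don't know K8s go live with just kubectl apply -f my-model.yaml |
| Multi-tenant LLM | One namespace plus LLMService per team, isolated by RBAC |
| GitOps LLM | ArgoCD watches a repo full of LLMService YAML and deploys and rolls back automatically |
| Federation | Karmada deploys LLMService across clusters onto suitable GPU nodes |
For the resume:
Built an in-house LLM PaaS Operator (LLMService CRD) with kubebuilder and controller-runtime. It manages the vLLM Deployment, Service, and HPA (driven by the vllm:num_requests_waiting external metric), so users bring an LLM inference service online with declarative YAML. The CRD has complete enum, default, and printcolumn support, with cascading deletion via owner references.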
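2. Multi-cluster federation: Karmada vs Cilium ClusterMesh 📝
2.1 Federation use cases
From a single cluster to a multi-cluster federation:
- Cross-AZ disaster recovery (active-standby / active-active)
- Cross-region service calls (serve users from the nearest region)
- Resource pooling (CPU cluster and GPU cluster split the work)
- Multi-tenant isolation (prod, test, and dev share one control plane)
- Gradual cluster upgrades (upgrade one cluster while another serves production)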
2.2 Karmada: K8s-native federation
┌──────────────────────────────────┐
│ Karmada control plane            │
│ ├─ karmada-apiserver             │
│ ├─ karmada-controller-manager    │
│ ├─ karmada-scheduler             │
│ ├─ karmada-webhook               │
│ ├─ karmada-aggregated-apiserver  │
│ └─ karmada-etcd                  │
└────────────────┬─────────────────┘
                 ↓ schedules
         ┌───────┴────────────┐
         ↓                    ↓
┌──────────────────┐ ┌──────────────────┐
│ Cluster A (CPU)  │ │ Cluster B (GPU)  │
│ Karmada-Agent    │ │ Karmada-Agent    │
└──────────────────┘ └──────────────────┘
Core concepts and CRDs:
- ResourceTemplate = an ordinary Deployment / Service / ConfigMap, etc. (stored in the Karmada apiserver)
- PropagationPolicy = which member clusters a resource should be deployed to
- OverridePolicy = per-cluster overrides (e.g. image registry)
- MultiClusterIngress = cross-cluster ingress
Typical YAML:
---
# 1. Submit an ordinary Deployment to Karmada
apiVersion: apps/v1
kind: Deployment
metadata: {name: chat-ui}
spec:
  replicas: 2
  ...
---
# 2. PropagationPolicy: deploy this Deployment to the GPU cluster
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata: {name: chat-ui-prop}
spec:
  resourceSelectors:
    - apiVersion: apps/v1
      kind: Deployment
      name: chat-ui
  placement:
    clusterAffinity:
      clusterNames: [gpu-cluster]
    replicaScheduling:
      replicaSchedulingType: Divided
      replicaDivisionPreference: Weighted
      weightPreference:
        staticWeightList:
          - targetCluster: {clusterNames: [gpu-cluster]}
            weight: 2
          - targetCluster: {clusterNames: [cpu-cluster]}
            weight: 0
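The effect of a static weight list on a Divided deployment can be illustrated with a short Python sketch. This is not Karmada's scheduler code: the function name and the tie-break rule for leftover replicas (highest weight first) are assumptions chosen for illustration. The weights and replica count come from the policy above.

```python
def divide_replicas(replicas: int, weights: dict[str, int]) -> dict[str, int]:
    """Split replicas across clusters in proportion to static weights."""
    total = sum(weights.values())
    if total == 0:
        raise ValueError("at least one cluster needs a positive weight")
    # Integer share per cluster; zero-weight clusters get nothing.
    shares = {c: replicas * w // total for c, w in weights.items()}
    leftover = replicas - sum(shares.values())
    # Hand out rounding leftovers to the highest-weight clusters first (assumed rule).
    for cluster in sorted(weights, key=lambda c: weights[c], reverse=True):
        if leftover == 0:
            break
        if weights[cluster] > 0:
            shares[cluster] += 1
            leftover -= 1
    return shares


print(divide_replicas(2, {"gpu-cluster": 2, "cpu-cluster": 0}))
```

With weights 2 and 0, both chat-ui replicas land on gpu-cluster, which matches the intent stated in the policy comment.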
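2.3 ⚠️ Real pitfalls installing Karmada (this environment)
What was run (helm install):
helm install karmada karmada-charts/karmada \
  --namespace karmada-system --create-namespace \
  --set installMode=host \
  ...
Actual (failed):
- The pre-install Job pulls docker.io/cfssl/cfssl:latest; the pull across the WAN is slow or blocked, so it sat in Pulling for a long time
- pre-install timeout → helm install failed
- Existing Kyverno PolicyViolation warnings (image latest tag + no resources)
Fix path (what production would do):
- Pre-pull images: manually run ctr image pull on the nodes for the cfssl and karmada-* images
- Replace images: helm --set apiServer.image=harbor.local/karmada/... to use a Harbor mirror
- Extend the pre-install timeout: --timeout 30m
- Manually skip the webhook wait: --no-hooks (risky)
Because of time constraints, the demo stops at the YAML level (the PropagationPolicy is fully designed). The install itself is deferred until an enterprise intranet or a prepared mirror is available.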
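2.4 Karmada vs Cilium ClusterMesh
| Dimension | Karmada | Cilium ClusterMesh |
|---|---|---|
| Abstraction layer | K8s resource level (Deployment / Service) | Network level (Endpoint / Service VIP) |
| Cross-cluster service discovery | MultiClusterIngress / ServiceImport | global service (handled by Cilium automatically) |
| Network model | Not handled; assumes the clusters can reach each other | Cilium cross-cluster VXLAN/Geneve + federated identity |
| Data plane | Ordinary K8s service calls (relies on ingress / VPN) | Direct cross-cluster routing via eBPF |
| Learning curve | High (8 CRDs) | Medium (2 annotations) |
| Vendor lock-in | CNI-agnostic | Requires Cilium CNI |
| Tolerance for cluster differences | High (override policy) | Low (assumes every cluster uses cluster.local) |
Selection:
- Simple cross-cluster networking → Cilium ClusterMesh (service discovery only)
- Complex business-level orchestration + cross-cluster placement decisions → Karmada
- Production recommendation: Cilium ClusterMesh for networking + ArgoCD ApplicationSet for multi-cluster GitOps (KISS)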
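2.5 Enabling Cilium ClusterMesh (with Cilium already installed)
# 1. Enable ClusterMesh on the main cluster
cilium clustermesh enable --service-type NodePort --context main-cluster
# 2. Enable it on the GPU cluster
cilium clustermesh enable --context gpu-cluster
# 3. Connect the two clusters (bidirectional)
cilium clustermesh connect --context main-cluster --destination-context gpu-cluster
# 4. Annotate the Service as global
kubectl annotate service vllm-3b -n vllm \
  service.cilium.io/global=true \
  service.cilium.io/affinity=remote
# 5. A Pod in the main cluster calling vllm-3b.vllm.svc is routed across clusters to the GPU cluster automatically
#    (the Service must exist with the same name and namespace in both clusters)
Note: ClusterMesh requires that the two clusters' PodCIDRs do not overlap (in this environment, 10.244.0.0/16 vs k3s 10.42.0.0/16 is OK)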
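The PodCIDR requirement is easy to check before connecting clusters. Below is a minimal Python sketch using the standard ipaddress module. The cluster labels "main" and "k3s" are placeholders, and only the two CIDRs come from the note above.

```python
import ipaddress
from itertools import combinations


def overlapping_pod_cidrs(cidrs: dict[str, str]) -> list[tuple[str, str]]:
    """Return every pair of clusters whose PodCIDRs overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in cidrs.items()}
    return [(a, b) for a, b in combinations(nets, 2) if nets[a].overlaps(nets[b])]


# The two PodCIDRs from this environment do not overlap.
print(overlapping_pod_cidrs({"main": "10.244.0.0/16", "k3s": "10.42.0.0/16"}))  # → []
```

An empty list means the clusters are safe to mesh on this criterion; any returned pair would need one side re-addressed first.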
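3. Istio Ambient vs Cilium Service Mesh 📝
3.1 The "original sin" of the sidecar mesh
Istio 1.0-1.20 (sidecar mode):
  Pod = [app container + Envoy sidecar]
  One extra process per Pod = 100MB+ of overhead × number of Pods
  10K-Pod cluster = 1TB of extra memory
  Sidecar startup-order race (the app is up before the sidecar is ready → traffic is dropped)
  Hard to upgrade (every Pod must restart)
Linkerd (similar sidecar model)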
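The 1TB figure is plain multiplication, shown here as a tiny Python sketch so other cluster sizes can be plugged in. The 100MB per sidecar is the rough number quoted above, not a measurement, and the function name is hypothetical.

```python
def sidecar_memory_tb(pods: int, mb_per_sidecar: float = 100.0) -> float:
    """Total extra sidecar memory in decimal terabytes (1 TB = 1,000,000 MB)."""
    return pods * mb_per_sidecar / 1_000_000


print(sidecar_memory_tb(10_000))  # → 1.0
```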
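3.2 Istio Ambient (proposed 2022, GA 2024)
Ambient = split the sidecar apart:
┌────────────────────────────────────────────────────────┐
│ Node                                                   │
│ ├─ ztunnel (DaemonSet, node-level L4 proxy)            │
│ │  ├─ intercepts Pod egress/ingress traffic            │
│ │  ├─ mTLS (automatic Pod-to-Pod encryption)           │
│ │  └─ HBONE (HTTP-Based Overlay Network Environment)   │
│ └─ Pod (app container, no sidecar needed!)             │
│                                                        │
│ + waypoint proxy (optional, per namespace)             │
│   └─ L7 traffic management (retry / route / authz)     │
└────────────────────────────────────────────────────────┘
L4 always-on (ztunnel)
L7 opt-in on demand (waypoint)
Advantages (vs sidecar):
- Pods carry no sidecar, saving 80%+ of resources
- Transparent to applications (Pods join the mesh without a restart)
- ztunnel is written in Rust and uses less memory than Envoy
Disadvantages:
- GA only since 2024, so the ecosystem is less mature
- L7 traffic must take an extra hop through the waypoint (one more layer of latency)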
3.3 Cilium Service Mesh (installed on Day 4)
Built on eBPF + the cilium-envoy DaemonSet:
┌────────────────────────────────────────────────┐
│ Node                                           │
│ ├─ cilium-agent (eBPF programs in kernel)      │
│ │  ├─ Pod traffic intercepted at kernel level  │
│ │  └─ WireGuard encryption (enabled on Day 7)  │
│ └─ cilium-envoy (node-level DaemonSet)         │
│    └─ L7 traffic redirected to Envoy on demand │
│                                                │
│ Completely sidecar-free                        │
└────────────────────────────────────────────────┘
3.4 Three-way comparison (sidecar vs Istio Ambient vs Cilium SM)
| Dimension | Istio sidecar | Istio Ambient | Cilium SM |
|---|---|---|---|
| L4 data plane | Envoy per Pod | ztunnel DaemonSet | cilium-agent eBPF in kernel |
| L7 data plane | Envoy per Pod | waypoint per namespace | cilium-envoy DaemonSet |
| mTLS | ✅ Automatic (SPIRE / self-signed) | ✅ Automatic (ztunnel) | ❌ (needs external setup, or use WireGuard instead) |
| L7 routing | ✅ HTTPRoute / VirtualService | ✅ HTTPRoute | ⚠️ CiliumNetworkPolicy (limited) |
| Encryption granularity | Pod level (identity) | Pod level (identity) | Node level (WireGuard) |
| Resource overhead | High (sidecar × N) | Low (DS × N nodes) | Very low (eBPF in kernel) |
| Pod startup order | race | OK (no sidecar) | OK |
| Pod restart to join | Required | Not needed | Not needed |
| Vendor lock-in | CNI-agnostic | CNI-agnostic | Requires Cilium |
| Learning curve | ★★★★ | ★★★ | ★★★★ |
| Ecosystem maturity | ★★★★★ (5+ years) | ★★ (GA in 2024) | ★★★★ |
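3.5 Decision tree
Already running the Cilium CNI?
├─ Yes → Cilium SM (zero extra cost, already demoed on Day 4)
└─ No ↓
Need Pod-level mTLS?
├─ Yes, with solid Istio experience → Istio Ambient (new deployments) / Istio sidecar (legacy compatibility)
└─ No (node-level WireGuard is enough) → install Cilium and rebuild on it
More than 1000 Pods?
├─ Yes → Ambient / Cilium SM (sidecars are too expensive)
└─ No → Istio sidecar (most mature ecosystem, problems are the easiest to google)
Strong need for L7 mTLS + SPIFFE identity?
├─ Yes → Istio Ambient
└─ No → Cilium SM is enough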
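3.6 Practical advice
- New production cluster: full Cilium stack (CNI + SM + LB + ClusterMesh), for simplicity and resource efficiency
- Existing Istio sidecar cluster: migrate gradually to Ambient (Istio provides a migration path)
- Multi-cluster mesh: Cilium ClusterMesh > Istio Multi-cluster (the latter has a complex control plane)
These notes use the full Cilium stack. Day 4 and Day 7 already demoed:
- L7 NetworkPolicy (GET allowed / POST denied)
- Hubble L7 traffic observation
- WireGuard transparent encryption (Day 7.C)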
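4. RAG business extension: ChromaDB + Sentence Transformers ✅
4.1 RAG (Retrieval-Augmented Generation) in brief
User asks "What is the difference between a Pod and a Container in K8s?"
  ↓
Embedding model (e.g., bge-small-zh)
  ↓
Query vector [0.12, -0.34, 0.78, ...] (dimension depends on the model: 512 for bge-small-zh, 384 for the MiniLM model used below)
  ↓
Vector DB (ChromaDB) retrieves the top-k most similar documents
  ↓
Returns the 3 most relevant docs (e.g., chunks of the official K8s docs)
  ↓
Prompt augmentation:
  "Context: [doc1] [doc2] [doc3]
  Question: What is the difference between a Pod and a Container in K8s?
  Answer:"
  ↓
LLM (vLLM Qwen2.5-3B) generates
  ↓
Answer with sources
Core components:
- Embedding model: turns text into vectors (a small model, can run on CPU)
- Vector DB: ChromaDB / Qdrant / Milvus / Weaviate
- Re-ranker (optional): second-stage fine ranking
- LLM: the generation stage (served by vLLM)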
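The retrieve-then-augment flow above can be tried without any model or database. The Python sketch below swaps in a toy bag-of-words "embedding" and an in-memory dict as the vector store, purely to show the control flow. The documents, function names, and scoring are illustrative assumptions, not the ChromaDB or sentence-transformers API.

```python
import math
from collections import Counter

DOCS = {
    "k8s-pod-1": "a pod is the smallest schedulable unit in kubernetes",
    "k8s-deploy-1": "a deployment manages pod replicas through a replicaset",
    "k8s-svc-1": "a service exposes a stable virtual ip for a set of pods",
}


def embed(text: str) -> Counter:
    """Toy embedding: bag-of-words term counts (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the ids of the k documents most similar to the question."""
    q = embed(question)
    ranked = sorted(DOCS, key=lambda i: cosine(q, embed(DOCS[i])), reverse=True)
    return ranked[:k]


def build_prompt(question: str, k: int = 2) -> str:
    """Augment the question with numbered context passages."""
    ids = retrieve(question, k)
    context = "\n".join(f"[{n + 1}] {DOCS[i]}" for n, i in enumerate(ids))
    return f"Context:\n{context}\nQuestion: {question}\nAnswer:"


print(build_prompt("what is the smallest schedulable unit", k=1))
```

Replacing `embed` with a real embedding model and `DOCS` with a vector database changes the quality of retrieval, not the shape of the pipeline.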
4.2 Extending chat-ui into RAG-chat-ui
New app.py (adds RAG on top of the Day 11 Chainlit chat-ui):
"""RAG-augmented Chainlit chat.

Vector store: ChromaDB (in-memory or persistent)
Embedding: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 (CPU-friendly, 117MB)
"""
import os

import chainlit as cl
import chromadb
from chromadb.utils import embedding_functions
from openai import AsyncOpenAI

# === 1. Initialize the vector store and embedding model ===
embed_fn = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="paraphrase-multilingual-MiniLM-L12-v2",
    device="cpu",
)
chroma_client = chromadb.PersistentClient(path="/data/chroma")
collection = chroma_client.get_or_create_collection(
    name="bootcamp_docs",
    embedding_function=embed_fn,
)

# === 2. Index the sample docs at startup ===
SAMPLE_DOCS = [
    {"id": "k8s-pod-1", "text": "Pod 是 K8s 最小调度单元,包含 1+ 容器共享 network namespace 和 volume。"},
    {"id": "k8s-container-1", "text": "Container 是 Pod 内的运行单位,有自己 cgroup 和 PID namespace。"},
    {"id": "k8s-deploy-1", "text": "Deployment 通过 ReplicaSet 管理 Pod 副本,支持滚动升级。"},
    # ... (in production: user-uploaded files / crawled wiki / live sync)
]
collection.upsert(
    ids=[d["id"] for d in SAMPLE_DOCS],
    documents=[d["text"] for d in SAMPLE_DOCS],
)

# === 3. vLLM client ===
client = AsyncOpenAI(base_url=os.getenv("VLLM_URL"), api_key="not-needed")


# === 4. on_message: retrieve first, then augment the prompt ===
@cl.on_message
async def on_message(msg: cl.Message):
    # Retrieve the top 3 documents
    results = collection.query(query_texts=[msg.content], n_results=3)
    docs = results["documents"][0]
    sources = results["ids"][0]

    # Build the augmented prompt
    context = "\n\n".join(f"[{i+1}] {d}" for i, d in enumerate(docs))
    augmented_user = f"""请基于以下背景知识回答问题:
{context}
问题: {msg.content}
要求: 答复末尾标注用到的源 [1] [2] [3]。"""

    # Stream the completion from vLLM
    response = cl.Message(content="")
    stream = await client.chat.completions.create(
        model=os.getenv("MODEL", "qwen2.5-3b"),
        messages=[{"role": "user", "content": augmented_user}],
        stream=True,
        max_tokens=512,
    )
    async for chunk in stream:
        # Some chunks carry no choices or an empty delta; skip them.
        if chunk.choices and chunk.choices[0].delta.content:
            await response.stream_token(chunk.choices[0].delta.content)

    # Show the retrieved sources alongside the answer
    response.elements = [
        cl.Text(name=f"Source {i+1}", content=f"id={src}\n{doc}", display="side")
        for i, (src, doc) in enumerate(zip(sources, docs))
    ]
    await response.update()
4.3 Dockerfile changes (add ChromaDB + sentence-transformers)
FROM python:3.11-slim AS builder
WORKDIR /app
RUN pip config set global.index-url https://mirrors.aliyun.com/pypi/simple/
COPY requirements.txt .
RUN pip install --no-cache-dir --user -r requirements.txt
# Pre-download the embedding model (avoids pulling it at runtime)
RUN python -c "from sentence_transformers import SentenceTransformer; \
SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')"
FROM python:3.11-slim
WORKDIR /app
# Copy installed packages and the pre-downloaded model from the builder stage
COPY --from=builder /root/.local /root/.local
# sentence-transformers 3.x stores downloaded models in the Hugging Face cache
COPY --from=builder /root/.cache/huggingface /root/.cache/huggingface
ENV PATH=/root/.local/bin:$PATH
COPY app.py .
EXPOSE 8000 9100
CMD ["chainlit", "run", "app.py", "--host", "0.0.0.0", "--port", "8000", "--headless"]
requirements.txt additions:
--extra-index-url https://download.pytorch.org/whl/cpu
chromadb==0.5.20
sentence-transformers==3.3.1
torch==2.5.1+cpu
4.4 K8s changes: add a PVC to persist ChromaDB
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata: {name: chroma-data, namespace: chat}
spec:
  accessModes: [ReadWriteOnce]
  storageClassName: longhorn   # installed on Day 4
  resources:
    requests: {storage: 5Gi}
---
apiVersion: apps/v1
kind: Deployment
metadata: {name: rag-chat-ui, namespace: chat}
spec:
  template:
    spec:
      containers:
        - name: chat-ui
          image: 10.0.24.28:30002/bootcamp/rag-chat-ui:latest
          volumeMounts:
            - {name: chroma, mountPath: /data/chroma}
      volumes:
        - name: chroma
          persistentVolumeClaim: {claimName: chroma-data}
4.5 Advanced: turning RAG into an Operator too
The LLM Operator (Day 13.A) can be extended with a RAGService CRD:
apiVersion: apps.bootcamp.local/v1
kind: RAGService
metadata: {name: bootcamp-rag}
spec:
  llmRef:
    name: qwen-3b
    namespace: default   # points at the Day 13.A LLMService
  embeddingModel: paraphrase-multilingual-MiniLM-L12-v2
  vectorStore:
    type: chromadb
    storage: 5Gi
    storageClass: longhorn
  documents:
    - source: gitea
      repoURL: http://gitea-http.gitea.svc:3000/bootcamp/docs.git
      sync: 5m
The Operator builds the RAG pipeline automatically and tracks changes to the docs.
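One way such an Operator could track document changes on each sync is to compare content hashes against what was indexed last time. The Python sketch below is a hypothetical illustration of that idea. The RAGService controller is not implemented in these notes, and the function names and data shapes are assumptions.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_sync(indexed: dict[str, str],
              current_docs: dict[str, str]) -> tuple[list[str], list[str]]:
    """Decide which doc ids to (re)embed and which to delete.

    indexed: doc id -> hash recorded at the last sync
    current_docs: doc id -> text currently in the repo
    """
    to_upsert = sorted(i for i, text in current_docs.items()
                       if indexed.get(i) != content_hash(text))
    to_delete = sorted(i for i in indexed if i not in current_docs)
    return to_upsert, to_delete


indexed = {"a": content_hash("pod basics"), "b": content_hash("old text")}
current = {"a": "pod basics", "b": "new text", "c": "brand new doc"}
print(plan_sync(indexed, current))  # → (['b', 'c'], [])
```

Only changed or new documents are re-embedded, which matters because embedding is the expensive step of each sync.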
4.6 Production-grade RAG enhancements (to be covered in the Day 14 docs)
| Enhancement | Explanation |
|---|---|
| Chunking strategy | recursive / semantic / by-document chunking |
| Hybrid retrieval | BM25 sparse + embedding dense + reciprocal rank fusion |
| Re-ranker | bge-reranker-large second-stage fine ranking |
| Hypothetical doc embedding | HyDE: have the LLM generate a fake answer first, then embed it |
| Tool calling | The LLM decides between "use RAG vs answer directly vs use a calculator vs query a DB" |
| Multi-modal | Separate embeddings for images / tables / code blocks |
| Evaluation | Ragas (faithfulness / context_precision / answer_relevancy) |
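Reciprocal rank fusion from the hybrid retrieval row is small enough to sketch in full. The Python example below merges a sparse ranking and a dense ranking using the usual score of 1 / (k + rank) summed across lists, with the commonly used constant k = 60. The two input rankings are made-up sample data.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of doc ids into one list, best first."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # A doc earns more score the higher it ranks in each list.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)


bm25 = ["k8s-pod-1", "k8s-deploy-1", "k8s-container-1"]   # sparse ranking (sample)
dense = ["k8s-container-1", "k8s-pod-1", "k8s-deploy-1"]  # dense ranking (sample)
print(reciprocal_rank_fusion([bm25, dense]))
```

RRF uses only ranks, so BM25 scores and cosine similarities never need to be normalized onto a common scale.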
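For the resume:
Implemented an end-to-end RAG-augmented LLM service: Chainlit chat UI + ChromaDB (persisted on a Longhorn PVC) + sentence-transformers (CPU embedding) + vLLM (GPU inference), with a streamed retrieve→augment→generate response and source citations shown automatically. Planned next: a RAGService CRD and GitOps sync of the knowledge base.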
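99. Day 13 progress
- [x] A LLM Operator (LLMService CRD, build OK, CRD installed and CR verified)
- [x] B Karmada deployment blocked (domestic mirror needed); full comparison plus ClusterMesh as the documented alternative
- [x] C Istio Ambient vs Cilium SM three-way comparison + decision tree
- [x] D RAG design + chat-ui extension code + production enhancement checklist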