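Ingress Traffic, End to End

This part traces Kubernetes traffic from its source (an external client or a pod), through each layer of rewriting and encapsulation, to the destination pod. Every hop maps to one component (DNS, iptables, IPVS, CNI, kube-proxy, ingress controller). Once you can see those hops, "why is this Service unreachable?" immediately has a direction to investigate.

What this part answers

A pod runs curl http://my-svc: how many hops does the packet actually take?
How does external traffic get in through a NodePort?
Why does a LoadBalancer Service get a public IP as soon as it is created on a cloud?
How do Ingress and Gateway API differ?
How do kube-proxy's iptables and IPVS modes each work?
How is NetworkPolicy enforced?
What are the usual causes when traffic "looks right but just doesn't work"?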

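1. The four "network ranges" in K8s

A K8s cluster has four kinds of addresses with different meanings. Knowing the role of each is the first step to understanding the network.

Range	Example	Used for
Node network	10.0.24.0/24	Node-to-node traffic (node eth0)
Pod CIDR	10.244.0.0/16	IPs of all pods
Service CIDR	10.96.0.0/12	Service ClusterIPs (virtual IPs)
Cluster DNS	10.96.0.10	CoreDNS's Service IP (a fixed address inside the Service CIDR)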
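When reading packet captures it helps to tell at a glance which range an address belongs to. The sketch below is only an illustration: it hard-codes the example ranges from the table above (a real cluster may use different ones), and the helper name classify is made up for this example.

```python
import ipaddress

# Example ranges copied from the table above; adjust them to your cluster.
NETWORKS = {
    "node network": ipaddress.ip_network("10.0.24.0/24"),
    "Pod CIDR": ipaddress.ip_network("10.244.0.0/16"),
    "Service CIDR": ipaddress.ip_network("10.96.0.0/12"),
}


def classify(ip: str) -> str:
    """Return the name of the range containing ip, or 'outside the cluster'."""
    addr = ipaddress.ip_address(ip)
    for name, net in NETWORKS.items():
        if addr in net:
            return name
    return "outside the cluster"


for ip in ("10.0.24.28", "10.244.1.6", "10.96.1.5", "10.96.0.10", "8.8.8.8"):
    print(f"{ip:<12} -> {classify(ip)}")
```

Note that 10.96.0.0/12 spans 10.96.0.0 through 10.111.255.255, so a ClusterIP does not have to start with 10.96.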

Inspecting this cluster's configuration

# Service CIDR
$ kubectl cluster-info dump | grep -m 1 service-cluster-ip-range
"--service-cluster-ip-range=10.96.0.0/12"

# Pod CIDR (per node)
$ kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.podCIDR}{"\n"}{end}'
m1    10.244.0.0/24
m2    10.244.1.0/24
m3    10.244.2.0/24

# Cluster DNS IP
$ kubectl get svc -n kube-system kube-dns -o jsonpath='{.spec.clusterIP}'
10.96.0.10

Packet flow

graph TB
    subgraph NodeA[Node A: 10.0.24.28]
        PA[Pod A: 10.244.0.5]
    end
    subgraph NodeB[Node B: 10.0.24.29]
        PB[Pod B: 10.244.1.6]
    end
    subgraph K8sAbstraction[K8s abstraction]
        SVC[Service: 10.96.1.5]
    end

    PA -->|1. DNS resolves the Service name| SVC
    PA -->|2. Real traffic, after DNAT| PB

    style NodeA fill:#e1f5ff
    style NodeB fill:#e1f5ff
    style K8sAbstraction fill:#fff4e1

2. Pod → Service ClusterIP: the full path (the most common case)

Pod A (10.244.0.5) calls my-svc.default.svc.cluster.local, which is backed by Pod B (10.244.1.6).

Step 0: the application makes the call

resp, _ := http.Get("http://my-svc:8080/api")

The Go runtime resolves the name first (with its built-in resolver by default on Linux, or libc's getaddrinfo when the cgo resolver is selected; both read /etc/resolv.conf), then opens a TCP connection through the netpoller.

Step 1: DNS resolution

sequenceDiagram
    participant App as App in Pod A
    participant Resolver as Pod A resolver (glibc or Go)
    participant CoreDNS as CoreDNS pod
    participant API as kube-apiserver

    App->>Resolver: getaddrinfo("my-svc")
    Note over Resolver: reads /etc/resolv.conf<br>nameserver 10.96.0.10<br>search default.svc.cluster.local ...<br>ndots:5
    Note over Resolver: "my-svc" has 0 dots < 5<br>→ try the search domains first
    Resolver->>CoreDNS: my-svc.default.svc.cluster.local. A?
    Note over CoreDNS: kubernetes plugin answers from its in-memory cache<br>kept current by watching kube-apiserver (it never reads etcd directly)
    CoreDNS-->>Resolver: 10.96.1.5 (TTL=30)
    Resolver-->>App: 10.96.1.5

For the ndots:5 pitfalls in detail, see 04-dns-deep.md.
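The search-list behaviour in the diagram can be sketched in a few lines. This is a simplified model of a glibc-style resolver, not its real implementation; the function name candidate_names is invented here, and the second and third search domains are an assumption (the diagram only shows the first one followed by "...").

```python
def candidate_names(name, search, ndots=5):
    """Return the fully qualified names a glibc-style resolver tries, in order."""
    if name.endswith("."):
        return [name]  # already fully qualified: the search list is skipped
    expanded = [f"{name}.{domain}." for domain in search]
    as_is = name + "."
    if name.count(".") >= ndots:
        return [as_is] + expanded  # enough dots: try the name as written first
    return expanded + [as_is]      # too few dots: walk the search list first


# Assumed search list for a pod in the "default" namespace.
SEARCH = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]

for query in ("my-svc", "svc-in-ns-b.ns-b", "github.com", "github.com."):
    print(query, "->", candidate_names(query, SEARCH))
```

Under these assumptions "github.com" (one dot, fewer than five) is first tried as github.com.default.svc.cluster.local. and only reaches github.com. after every search domain has failed, which is the extra-query cost that ndots:5 is known for.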

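Step 2: TCP connect to the ClusterIP

The application calls connect(10.96.1.5, 8080). What the kernel does:

sequenceDiagram
    autonumber
    participant App
    participant Kernel as Kernel (pods share the node's kernel)
    participant Iptables as iptables / kube-proxy rules
    participant PodB as Pod B

    App->>Kernel: connect(10.96.1.5:8080)
    Kernel->>Iptables: packet leaves the pod over its veth → node's nat PREROUTING chain
    Note over Iptables: KUBE-SERVICES chain<br>matches dst=10.96.1.5:8080
    Note over Iptables: jump to KUBE-SVC-MYAPP<br>pick an endpoint by probability
    Note over Iptables: jump to KUBE-SEP-XXX<br>DNAT: 10.96.1.5:8080 → 10.244.1.6:8080
    Note over Kernel: packet now has dst=10.244.1.6<br>route lookup: via 10.0.24.29 dev eth0
    Kernel->>PodB: SYN
    PodB-->>Kernel: SYN-ACK<br>(conntrack reverse lookup rewrites src back to 10.96.1.5)
    Kernel-->>App: handshake complete

Traffic from a pod arrives at the node over a veth, so it is handled in PREROUTING. The OUTPUT chain carries the same KUBE-SERVICES jump, but only for traffic generated by the node itself (host processes and hostNetwork pods).

Key point: the DNAT decision is made only on the first packet

The first SYN walks the iptables nat table, is DNATed, and the mapping is recorded in conntrack. Later packets of the same connection do not walk the nat rules again: the kernel applies the translation stored in the conntrack entry, and the reverse translation on replies.

That is why a large rule set costs something only for new connections; established connections are unaffected.

When the conntrack table is full, however, packets of new connections are dropped. This is the real reason the conntrack table size needs tuning (see 00-mental-model.md §5).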
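A toy model makes the "first packet only" behaviour concrete. It is a sketch under simplifying assumptions, not how netfilter is implemented: the class name Conntrack, the counter nat_lookups, and the second endpoint 10.244.2.7 are illustrative, and real conntrack keys also include the protocol and track the reply direction.

```python
import random

# Service VIP -> backend pods (illustrative values).
ENDPOINTS = {("10.96.1.5", 8080): [("10.244.1.6", 8080), ("10.244.2.7", 8080)]}


class Conntrack:
    """Toy connection-tracking table: DNAT is decided once per flow."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self.table = {}       # original tuple -> translated destination
        self.nat_lookups = 0  # how many times the nat rules were walked

    def send(self, src, sport, dst, dport):
        key = (src, sport, dst, dport)
        if key in self.table:
            return self.table[key]   # established flow: reuse the stored mapping
        if len(self.table) >= self.max_entries:
            return None              # table full: the new flow's packet is dropped
        self.nat_lookups += 1        # only a new flow walks the nat rules
        backends = ENDPOINTS.get((dst, dport))
        target = random.choice(backends) if backends else (dst, dport)
        self.table[key] = target
        return target


ct = Conntrack(max_entries=1)
first = ct.send("10.244.0.5", 40000, "10.96.1.5", 8080)
for _ in range(9):
    assert ct.send("10.244.0.5", 40000, "10.96.1.5", 8080) == first
print("backend:", first, "nat lookups:", ct.nat_lookups)
print("second flow:", ct.send("10.244.0.5", 40001, "10.96.1.5", 8080))
```

Ten packets of one flow cost a single rule walk and always reach the same backend, while a second flow is dropped because the one-entry table is already full.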

Tracing the real iptables chains, step by step

# 1. Entry point: the nat table's PREROUTING chain on the node
#    (pod traffic arrives over a veth; OUTPUT carries the same jump
#    for traffic generated by the node itself)
$ iptables -t nat -L PREROUTING -n -v | head
Chain PREROUTING (policy ACCEPT 1234 packets)
 pkts bytes target           prot opt in     out     source       destination
 1234  56789 KUBE-SERVICES   all  --  *      *       0.0.0.0/0    0.0.0.0/0   /* kubernetes service portals */

# 2. Into the KUBE-SERVICES chain
$ iptables -t nat -L KUBE-SERVICES -n -v | grep 10.96.1.5
  100  6000 KUBE-SVC-XYZ123  tcp  --  *  *  0.0.0.0/0  10.96.1.5  /* default/my-svc */ tcp dpt:8080

# 3. Into the KUBE-SVC-XYZ123 chain
$ iptables -t nat -L KUBE-SVC-XYZ123 -n -v
 pkts bytes target          prot opt ...
   30  1800 KUBE-MARK-MASQ   all  --  ... !10.244.0.0/16  10.96.1.5  /* default/my-svc */
                                       ^^^^^^^^^^^^^^^^^
                                       source outside the pod CIDR (external): mark for MASQUERADE
   33  1980 KUBE-SEP-AAA     all  --  ...  /* probability 0.5 */ statistic mode random probability 0.5
   33  1980 KUBE-SEP-BBB     all  --  ...  /* */                                    ← no probability: catch-all

# 4. Into the KUBE-SEP-AAA chain (one concrete endpoint)
$ iptables -t nat -L KUBE-SEP-AAA -n -v
 pkts bytes target  prot opt ...
   30  1800 KUBE-MARK-MASQ  all  --  ... 10.244.1.6                                  ← source is the endpoint itself (hairpin): mark for MASQUERADE
   33  1980 DNAT             tcp  --  ...                                            ← the actual DNAT
                                      to:10.244.1.6:8080

Each KUBE-SEP-xxx chain corresponds to one endpoint pod. With several endpoints, the statistic match splits traffic by probability.

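The probability math (why the values are not simply equal)

With three endpoints the rules carry:

SEP-AAA: probability 0.333    (1/3)
SEP-BBB: probability 0.5      (1/2 of the remaining 2/3 = 1/3)
SEP-CCC: (no probability)     (the rest = 1/3)

iptables evaluates rules in order, so each probability is conditional: "given that the packet reached this rule, the chance that it matches". For n endpoints, rule i (counting from 0) gets 1/(n-i), which gives every endpoint the same overall share of 1/n.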
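The arithmetic can be checked with exact fractions. This is a sketch of the math only, not kube-proxy's code; both function names are invented for the example.

```python
from fractions import Fraction


def rule_probabilities(n):
    """Conditional probability on each of n sequential rules: 1/n, 1/(n-1), ..., 1."""
    return [Fraction(1, n - i) for i in range(n)]


def effective_shares(probs):
    """Overall share of traffic each rule receives when rules are tried in order."""
    shares, remaining = [], Fraction(1)
    for p in probs:
        shares.append(remaining * p)  # reached this rule AND matched it
        remaining *= 1 - p            # fell through to the next rule
    return shares


probs = rule_probabilities(3)
print([str(p) for p in probs])                    # ['1/3', '1/2', '1']
print([str(s) for s in effective_shares(probs)])  # ['1/3', '1/3', '1/3']
```

The last rule has probability 1, which is why iptables prints it without a statistic match: it simply catches everything that fell through.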

3. kube-proxy modes: iptables vs IPVS

graph LR
    subgraph iptables mode
        T1[One rule per Service<br>+ one per endpoint] --> T2["O(N) linear matching"]
        T2 --> T3[1000 Services<br>= several thousand rules]
    end

    subgraph IPVS mode
        I1[In-kernel LB module<br>hash lookup] --> I2["O(1) lookup"]
        I2 --> I3[Stable performance with tens of thousands of Services]
    end

    style T3 fill:#ffe1e1
    style I3 fill:#e1ffe1

Checking the current mode

$ kubectl get cm -n kube-system kube-proxy -o jsonpath='{.data.config\.conf}' | grep mode
mode: ""              # empty = default (iptables)
# or
mode: ipvs            # IPVS

iptables mode (default)

Every Service and every endpoint generates iptables rules
Matching is linear, O(N) in the number of rules
With Services in the tens of thousands, each new connection walks thousands of rules, and the added latency becomes visible

IPVS mode

Built on the kernel's IPVS module, which was designed for load balancing
Hash-table lookup, O(1)
Viewing the IPVS rules:

$ ipvsadm -L -n
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  10.96.1.5:8080 rr
  -> 10.244.0.5:8080              Masq    1      10         0
  -> 10.244.1.6:8080              Masq    1      8          1
  -> 10.244.2.7:8080              Masq    1      12         0

rr = round-robin scheduling. Other schedulers include lc (least connection) and sh (source hash).
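To make the difference between rr and lc concrete, here is a deliberately simplified sketch using the ActiveConn numbers from the ipvsadm output above. It is not the kernel's algorithm: the real lc scheduler also counts inactive connections, and weighted variants exist. The function names are illustrative.

```python
import itertools

# ActiveConn per backend, taken from the ipvsadm output above.
ACTIVE = {
    "10.244.0.5:8080": 10,
    "10.244.1.6:8080": 8,
    "10.244.2.7:8080": 12,
}


def round_robin(backends):
    """rr: hand out backends in a fixed rotation, ignoring their load."""
    return itertools.cycle(backends)


def least_connection(active):
    """lc (simplified): pick the backend with the fewest active connections."""
    return min(active, key=active.get)


rr = round_robin(list(ACTIVE))
print("rr picks:", [next(rr) for _ in range(4)])
print("lc picks:", least_connection(ACTIVE))
```

rr cycles through the three backends and wraps around on the fourth pick, while lc chooses 10.244.1.6:8080 because it currently holds the fewest connections.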

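Cilium kube-proxy replacement

Cilium can replace kube-proxy entirely with eBPF (see ../commands/cilium.md). Service lookups become eBPF map lookups instead of rule walks, and Hubble adds per-flow observability.

Switching the kube-proxy mode:

$ kubectl edit cm -n kube-system kube-proxy
# data.config.conf:
#   mode: "ipvs"

$ kubectl rollout restart ds -n kube-system kube-proxy

Note: IPVS mode needs the IPVS kernel modules (ip_vs and its schedulers) loaded on every node. After the restart kube-proxy builds the IPVS rules, but do not count on it removing the KUBE-* iptables chains left by the previous mode; clean them up per node (kube-proxy --cleanup, or a node reboot) if they linger. Expect a brief disruption to Service traffic during the switch.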

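4. End-to-end traffic for each Service type

Service has four types: ClusterIP, NodePort, LoadBalancer, and ExternalName. A headless Service is a ClusterIP Service with clusterIP: None and gets its own subsection below.

ClusterIP (reachable only inside the cluster)

graph LR
    PodA[Pod A] -->|10.96.1.5:8080| Iptables[iptables DNAT]
    Iptables -->|10.244.1.6:8080| PodB[Pod B]

See §2 for details.

NodePort (a port opened on every node IP)

Every node accepts traffic on the <NodePort> port (default range 30000-32767). An external client can reach the Service through any node's IP:NodePort.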

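sequenceDiagram
    participant Ext as External client
    participant M1 as Node m1<br>10.0.24.28
    participant Iptables as m1 iptables
    participant Pod as Pod B (10.244.1.6 on m2)

    Ext->>M1: TCP 10.0.24.28:30080
    Note over Iptables: enters PREROUTING<br>KUBE-SERVICES → KUBE-NODEPORTS<br>matches dpt:30080<br>jumps to KUBE-SVC-XXX
    Note over Iptables: DNAT: 10.0.24.28:30080 → 10.244.1.6:8080
    Note over Iptables: source is still the external client, dst is now the pod IP<br>SNAT is needed so the reply comes back through m1<br>KUBE-MARK-MASQ + SNAT in POSTROUTING
    Iptables->>Pod: src=m1 IP, dst=pod IP
    Pod-->>Iptables: reply goes to m1
    Iptables-->>Ext: NAT reversed, reply sent to the client

`externalTrafficPolicy`: Local vs Cluster

A key setting: it decides whether the client's real IP is preserved.

Cluster (default)

spec:
  type: NodePort
  externalTrafficPolicy: Cluster

Traffic entering any node is SNATed and forwarded to a pod, possibly on another node
Pro: every node can serve the Service, and load is spread across all pods
Con: the pod sees a node IP as the client address (the real IP is lost)

Local (real IP preserved)

spec:
  type: NodePort
  externalTrafficPolicy: Local

Traffic is forwarded only to pods on the node that received it
Pro: the pod sees the real client IP (no SNAT)
Con: a node with no local pod drops the traffic, so the LB's health checks must steer around such nodes
Use cases: logging, auditing, or per-IP rate limiting that needs the real client IP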
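The trade-off between the two policies can be summarised as a small decision function. This is a sketch of the behaviour described above, not kube-proxy code; the function name, the client address 203.0.113.7, and the pod placement are assumptions for illustration. In a real cluster the SNAT address seen in Cluster mode may be the node's address on the pod network rather than its eth0 IP.

```python
def source_ip_seen_by_pod(client_ip, entry_node, policy, pod_nodes, node_ips):
    """Return the source IP the backend pod observes, or None if the packet is dropped."""
    if policy == "Local":
        # Only pods on the entry node are eligible, and no SNAT is applied.
        return client_ip if entry_node in pod_nodes else None
    if policy == "Cluster":
        # Any pod in the cluster may be picked; the entry node masquerades the packet.
        return node_ips[entry_node] if pod_nodes else None
    raise ValueError(f"unknown externalTrafficPolicy: {policy}")


NODE_IPS = {"m1": "10.0.24.28", "m2": "10.0.24.29"}
POD_NODES = {"m2"}        # assume the only backend pod runs on m2
CLIENT = "203.0.113.7"    # hypothetical external client

for policy in ("Cluster", "Local"):
    for node in ("m1", "m2"):
        seen = source_ip_seen_by_pod(CLIENT, node, policy, POD_NODES, NODE_IPS)
        print(f"{policy:<8} via {node}: pod sees {seen}")
```

With Local, entering through m1 yields None (dropped) because m1 hosts no backend pod, which is exactly the case the LB health check has to avoid.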

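LoadBalancer (one-step exposure on a cloud)

graph LR
    Net[External network] -->|public IP:80| CloudLB[Cloud LB<br>AWS NLB / Aliyun SLB]
    CloudLB --> N1[Node m1:30080]
    CloudLB --> N2[Node m2:30080]
    CloudLB --> N3[Node m3:30080]
    N1 --> Pod[Pod]

The cloud provider's cloud-controller-manager provisions an LB and an IP for the Service
The LB listens on public IP:80 and forwards to the NodePort on every node
The node then DNATs to a pod

A self-hosted cluster has no cloud LB, so MetalLB fills the gap. In L2 mode one elected node answers ARP requests for the Service IP and so attracts its traffic; in BGP mode the nodes advertise the IP to upstream routers.

Headless Service (clusterIP: None)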

spec:
  clusterIP: None
  selector: {...}

No ClusterIP is allocated
A DNS query returns the list of pod IPs directly (A records)
The client does its own load balancing

$ dig my-headless.default.svc.cluster.local +short
10.244.0.5
10.244.1.6
10.244.2.7

A good fit for StatefulSets, Cassandra / Kafka clients, gRPC with client-side load balancing, and similar cases.

ExternalName (DNS CNAME)

spec:
  type: ExternalName
  externalName: db.production.example.com

CoreDNS answers my-svc.default.svc.cluster.local with a CNAME to db.production.example.com. No iptables is involved; it is pure DNS.

5. Endpoints vs EndpointSlice

A Service picks its pods with a selector. The concrete list of backing pods is recorded in Endpoints (legacy) or EndpointSlice (stable since K8s 1.21 and the recommended API) objects.

# Legacy Endpoints (one object per Service)
$ kubectl get endpoints my-svc
NAME      ENDPOINTS                                       AGE
my-svc    10.244.0.5:8080,10.244.1.6:8080,10.244.2.7:8080  1h

# EndpointSlice (several objects per Service; scales better in large clusters)
$ kubectl get endpointslice -l kubernetes.io/service-name=my-svc
NAME             ADDRESSTYPE   PORTS   ENDPOINTS                                AGE
my-svc-abc123    IPv4          8080    10.244.0.5,10.244.1.6,10.244.2.7         1h

Common reasons the endpoints are empty

$ kubectl get endpoints my-svc
NAME      ENDPOINTS   AGE
my-svc    <none>      1h

→ The Service selected no pods. Possible causes:

Wrong selector: the Service says app=foo but the pod label is app: bar
Pod not Ready: the endpoints controller only lists Ready pods (readinessProbe fails → pod not Ready → not in the endpoints)
Port name or number mismatch: the Service has targetPort: http but the pod declares no port named http
Wrong namespace: a Service only selects pods in its own namespace
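The four causes above are exactly the four filters a pod has to pass before it is listed. The sketch below models that selection logic; it is not the endpoints controller's code, and the Pod class, the endpoints function, and the sample pods are all invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Pod:
    name: str
    namespace: str
    ip: str
    labels: dict
    ready: bool = True
    ports: dict = field(default_factory=dict)  # port name -> containerPort


def endpoints(pods, namespace, selector, target_port):
    """Return 'ip:port' for every pod that passes all four filters."""
    result = []
    for pod in pods:
        if pod.namespace != namespace:
            continue  # a Service only sees pods in its own namespace
        if any(pod.labels.get(k) != v for k, v in selector.items()):
            continue  # every selector label must match
        if not pod.ready:
            continue  # not Ready (e.g. failing readinessProbe)
        # A named targetPort must exist on the pod; a numeric one is used as-is.
        port = pod.ports.get(target_port) if isinstance(target_port, str) else target_port
        if port is None:
            continue
        result.append(f"{pod.ip}:{port}")
    return result


PODS = [
    Pod("ok", "default", "10.244.0.5", {"app": "foo"}, True, {"http": 8080}),
    Pod("not-ready", "default", "10.244.1.6", {"app": "foo"}, False, {"http": 8080}),
    Pod("other-ns", "staging", "10.244.2.7", {"app": "foo"}, True, {"http": 8080}),
    Pod("wrong-label", "default", "10.244.2.8", {"app": "bar"}, True, {"http": 8080}),
]

print(endpoints(PODS, "default", {"app": "foo"}, "http"))     # only the first pod survives
print(endpoints(PODS, "default", {"app": "foo"}, "metrics"))  # no pod has a port named metrics
```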

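# Show the pods' actual labels (no -l filter: a wrong selector would hide the very pods you are looking for)
$ kubectl get pod --show-labels

# Show pod readiness
$ kubectl get pod -o wide

# Find out why the readinessProbe fails
$ kubectl describe pod ⟨pod⟩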

6. NetworkPolicy: the K8s firewall

By default K8s is open: any pod can reach any other pod. NetworkPolicy is how you restrict that.

Minimal example: default deny

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: default
spec:
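  podSelector: {}                # selects every pod in the namespace
  policyTypes:
    - Ingress
  # no ingress rules = deny all inbound

Once applied, every pod in the default namespace rejects all inbound traffic.

Adding an allowlist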

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080

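Meaning: "backend pods accept only TCP:8080 from frontend pods". A bare podSelector under from matches pods in the policy's own namespace only, and because policies are additive, this rule opens a hole in the default-deny policy rather than replacing it.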
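How the default-deny and allowlist policies combine can be modelled in a few lines. This is a simplified sketch of the evaluation semantics for a single namespace, not any CNI's implementation: policies are plain dicts, ports are bare numbers, and namespaceSelector / ipBlock peers are left out.

```python
def matches(selector, labels):
    """An empty selector matches everything; otherwise every key must match."""
    return all(labels.get(k) == v for k, v in selector.items())


def ingress_allowed(policies, dst_labels, src_labels, port):
    """Decide whether src may reach dst on port under the given ingress policies."""
    selecting = [p for p in policies if matches(p["podSelector"], dst_labels)]
    if not selecting:
        return True  # no policy selects the pod: it stays open
    for policy in selecting:
        for rule in policy.get("ingress", []):
            from_ok = any(matches(peer, src_labels) for peer in rule["from"])
            if from_ok and port in rule["ports"]:
                return True  # policies are additive: one matching rule is enough
    return False


DEFAULT_DENY = {"podSelector": {}, "ingress": []}
ALLOW_FRONTEND = {
    "podSelector": {"app": "backend"},
    "ingress": [{"from": [{"app": "frontend"}], "ports": [8080]}],
}

backend, frontend, other = {"app": "backend"}, {"app": "frontend"}, {"app": "random"}
both = [DEFAULT_DENY, ALLOW_FRONTEND]
print(ingress_allowed([], backend, other, 8080))       # True: no policy at all
print(ingress_allowed(both, backend, frontend, 8080))  # True: allowlisted
print(ingress_allowed(both, backend, other, 8080))     # False: denied
```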

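Who enforces NetworkPolicy

K8s itself does not implement NetworkPolicy; the CNI plugin has to support it:

CNI	NetworkPolicy support
Cilium	✅ (eBPF)
Calico	✅ (iptables / eBPF)
Weave	✅
Flannel	❌ (pair it with Calico for NetworkPolicy)

The CNI translates NetworkPolicy objects into iptables rules or eBPF programs. With Cilium, hubble shows the DROPPED verdicts: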

$ hubble observe --verdict DROPPED --pod default/backend
... Policy denied (L3-L4) from default/random-pod to default/backend

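See ../commands/cilium.md for details.

Anti-pattern

A NetworkPolicy that has no effect

$ kubectl get netpol
NAME    POD-SELECTOR   AGE
deny    <none>         1h

Yet traffic still gets through.

Possible causes:

The CNI does not support NetworkPolicy (plain flannel, for example)
podSelector matches no pod
policyTypes does not list the direction you meant to restrict (when omitted, Ingress is always assumed and Egress only if the policy has egress rules)
Wrong namespace: a NetworkPolicy only affects pods in its own namespace

# Check whether a policy-capable CNI is running
$ kubectl get pods -n kube-system | grep -iE 'cilium|calico|weave'

# Test whether the deny takes effect
$ kubectl exec test-pod -- curl --max-time 3 http://backend
# should time out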

7. Ingress: layer-7 HTTP routing

A Service (type LoadBalancer or NodePort) works at layer 4 (TCP/UDP). Ingress works at layer 7: it routes to different Services based on the Host header and URL path.

graph LR
    Net[External] -->|HTTPS 443| Ingress[Ingress Controller<br>nginx / traefik / istio]
    Ingress -->|host: a.com| SvcA[Service A]
    Ingress -->|host: b.com path /api| SvcB[Service B]
    Ingress -->|default backend| SvcDefault[Default backend]

    SvcA --> PodA[Pod A]
    SvcB --> PodB[Pod B]

What an Ingress really is

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-ingress
spec:
  rules:
    - host: a.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: svc-a
                port:
                  number: 80
    - host: b.example.com
      http:
        paths:
          - path: /api
            pathType: Prefix
            backend:
              service:
                name: svc-b
                port:
                  number: 80
  tls:
    - hosts: [a.example.com, b.example.com]
      secretName: my-tls

K8s only defines the Ingress resource; it does not implement the HTTP routing. You must install an Ingress Controller:

ingress-nginx (the most widely used)
Traefik
HAProxy
Istio Gateway
Contour / Envoy
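Whatever controller you pick, the routing decision for the manifest above comes down to "match the host, then take the longest matching path prefix". The sketch below models that for pathType: Prefix only; it is not any controller's implementation, and the RULES table and function names are illustrative.

```python
# (host, path prefix, backend Service), mirroring the Ingress manifest above.
RULES = [
    ("a.example.com", "/", "svc-a"),
    ("b.example.com", "/api", "svc-b"),
]


def prefix_match(rule_path, request_path):
    """pathType: Prefix matches whole path elements, so /api does not match /apiv1."""
    if rule_path == "/":
        return True
    rule_parts = rule_path.strip("/").split("/")
    req_parts = request_path.strip("/").split("/")
    return req_parts[:len(rule_parts)] == rule_parts


def route(host, path):
    """Return the backend Service name, or None (the default backend would answer)."""
    candidates = [(p, svc) for h, p, svc in RULES if h == host and prefix_match(p, path)]
    if not candidates:
        return None
    return max(candidates, key=lambda c: len(c[0]))[1]  # longest prefix wins


for host, path in [
    ("a.example.com", "/anything"),
    ("b.example.com", "/api/v1/users"),
    ("b.example.com", "/apiv1"),
    ("c.example.com", "/"),
]:
    print(f"{host}{path} -> {route(host, path)}")
```

The third request falls through to the default backend: /apiv1 shares a string prefix with /api but not a path element.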

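Ingress Controller traffic, end to end

sequenceDiagram
    participant Ext as External client
    participant LB as Cloud LB (LoadBalancer Service)
    participant IngressPod as ingress-nginx pod
    participant Svc as Service B
    participant Pod as Pod B

    Ext->>LB: HTTPS GET https://b.example.com/api
    LB->>IngressPod: forwarded via a node's NodePort
    Note over IngressPod: terminates TLS, reads Host header + path<br>finds the matching Ingress rule
    IngressPod->>Svc: HTTP GET /api (adds X-Forwarded-For and related headers)
    Note over Svc: ingress-nginx by default proxies straight to pod IPs<br>taken from the Service's endpoints, skipping the ClusterIP
    Svc->>Pod: DNAT to a pod
    Pod-->>Svc: HTTP 200
    Svc-->>IngressPod: response
    IngressPod-->>LB: response
    LB-->>Ext: returned to the client

Checking the actual Ingress state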

$ kubectl get ingress
NAME         CLASS   HOSTS                          ADDRESS       PORTS     AGE
my-ingress   nginx   a.example.com,b.example.com    1.2.3.4       80, 443   1d

$ kubectl describe ingress my-ingress
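# check events / backends

# Show the configuration the ingress-nginx pod actually runs (the generated nginx.conf)
$ kubectl exec -n ingress-nginx ingress-nginx-xxx -- cat /etc/nginx/nginx.conf | head -100

Gateway API (the successor to Ingress)

Gateway API ships as CRDs that you install separately from core K8s; its v1 API (Gateway, GatewayClass, HTTPRoute) reached GA in late 2023. It is more capable than Ingress: multiple protocols, cross-namespace routing, and finer-grained rules.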

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: my-route
spec:
  parentRefs:
    - name: my-gateway
  hostnames:
    - "a.example.com"
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /
      backendRefs:
        - name: svc-a
          port: 80

It is gradually replacing Ingress, but Ingress is still the mainstream choice.

8. Real-world troubleshooting: putting it together

Scenario 1: the Service is unreachable

$ kubectl exec test-pod -- curl http://my-svc:8080
# Connection timed out
# (or "Connection refused": in iptables mode kube-proxy installs a REJECT rule for a Service with no endpoints)

# Step 1: are there endpoints?
$ kubectl get endpoints my-svc
NAME      ENDPOINTS   AGE
my-svc    <none>      1h        ← empty!

# Step 2: do the selector and the pod labels agree?
$ kubectl get svc my-svc -o jsonpath='{.spec.selector}'
{"app":"my-app"}

$ kubectl get pod -l app=my-app
No resources found.              ← wrong pod label

$ kubectl get pod --show-labels | grep my-app
my-pod    1/1     Running   ...   application=my-app    ← the key should be app

Fix: change the pod's label or the Service's selector.

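Scenario 2: endpoints exist, but connection refused

$ kubectl get endpoints my-svc
NAME      ENDPOINTS               AGE
my-svc    10.244.0.5:8080         5m

$ kubectl exec test-pod -- nc -zv 10.244.0.5 8080
# Connection refused

# Check listening sockets inside the pod
$ kubectl exec my-pod -- ss -lntp
LISTEN 0 128 127.0.0.1:8080    process=app    ← listening on 127.0.0.1 only!

The application bound to 127.0.0.1 instead of 0.0.0.0, so other pods in the cluster cannot reach it.

Fix: make the application listen on 0.0.0.0 (for example with a --bind 0.0.0.0 startup option, if the application offers one).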

Scenario 3: intermittent timeouts

$ kubectl exec test-pod -- curl http://my-svc:8080
# Sometimes 200 OK, sometimes timeout

Some of the endpoints are broken. iptables spreads connections by probability, so a request times out whenever a bad endpoint is picked.

$ kubectl get endpoints my-svc
NAME      ENDPOINTS                                       AGE
my-svc    10.244.0.5:8080,10.244.1.6:8080,10.244.2.7:8080  1h

$ for ep in 10.244.0.5 10.244.1.6 10.244.2.7; do
    echo "=== $ep ==="
    kubectl exec test-pod -- curl -m 3 http://$ep:8080
  done

# See which one fails

If only one or two fail, the application in those pods is not up or not ready. A readinessProbe should have removed them from the endpoints, so yours is either missing or misconfigured.

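Scenario 4: NodePort works, ClusterIP does not

# External access through the NodePort
$ curl http://⟨node-ip⟩:30080      # works

# In-cluster access through the ClusterIP
$ kubectl exec test-pod -- curl http://my-svc:8080   # fails

Both paths enter KUBE-SERVICES (from PREROUTING for traffic arriving at the node, pod traffic included; from OUTPUT for traffic the node itself generates), but they match different rules. A ClusterIP is matched by destination IP and port inside KUBE-SERVICES; a NodePort is matched by port in KUBE-NODEPORTS, which KUBE-SERVICES jumps to last.

So one can work while the other fails. Check the ClusterIP rule:

$ iptables -t nat -L KUBE-SERVICES -n -v | grep ⟨cluster-ip⟩
# pkts = 0?           ← traffic never matched the ClusterIP rule

Possible causes:

kube-proxy is unhealthy and did not generate the rule: kubectl logs -n kube-system kube-proxy-xxx
The ClusterIP is not what you assumed (not in 10.96.x.x but in another range)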

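Scenario 5: cross-namespace access fails

$ kubectl exec pod-in-ns-a -- curl http://svc-in-ns-b
# Could not resolve host

The search list only covers the pod's own namespace, so a bare name resolves to svc-in-ns-b.ns-a.svc.cluster.local, which does not exist. Across namespaces, qualify the name at least as ⟨svc⟩.⟨ns⟩, or use the full domain:

$ kubectl exec pod-in-ns-a -- curl http://svc-in-ns-b.ns-b.svc.cluster.local
# works

See 04-dns-deep.md for details.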

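9. Anti-pattern collection

Anti-pattern 1: treating hostPort as NodePort

# Pod spec
ports:
  - containerPort: 8080
    hostPort: 30080            # ← this is hostPort

hostPort binds the pod directly to port 30080 on its node, bypassing Services. It is the equivalent of docker run -p 30080:8080.

NodePort is a Service type managed by kube-proxy.

The difference:

hostPort occupies the port on one node only, and scheduling fails if the port is already taken there
NodePort goes through iptables and is exposed on every node

K8s discourages hostPort unless it is really necessary.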

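Anti-pattern 2: treating the Service IP as a floating IP

# Expectation
$ ping 10.96.1.5     # expected: replies

# Reality
# no reply (in iptables mode a ClusterIP does not answer ICMP)

In iptables mode a ClusterIP is only a match key in the rules; no interface owns it and nothing listens on it. (In IPVS mode the address is bound to the kube-ipvs0 dummy interface, so ping may answer, but that still says nothing about the backends.) Test with curl or nc against the Service port.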

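Anti-pattern 3: the Service is deleted but its endpoints remain

$ kubectl delete svc my-svc
$ kubectl get endpoints my-svc      # still there?

Endpoints is a separate object from the Service, and in older K8s versions its cleanup could lag behind the Service deletion. Newer versions (1.21+) use EndpointSlice, which is owned by the Service and removed with it.

If a leftover Endpoints object gets in the way of a new Service:

$ kubectl delete endpoints my-svc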

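Anti-pattern 4: using hostNetwork to escape "network complexity"

spec:
  hostNetwork: true

The pod shares the node's network namespace and uses the node IP. The problems:

Pod ports collide with node ports
The Service abstraction is lost
Scheduling across nodes gets awkward
NetworkPolicy does not apply (policies are enforced on the pod network)

Use it only for special pods that must see the node's network: DNS, CNI, and monitoring agents. Application pods should always keep hostNetwork=false.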

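Anti-pattern 5: expecting fast external lookups with ClusterFirst

spec:
  dnsPolicy: ClusterFirst     # the default

ClusterFirst = ask CoreDNS first; names that CoreDNS cannot answer from the cluster domain are forwarded to the upstream DNS.

For an external domain such as github.com:

pod → CoreDNS (10.96.0.10)
CoreDNS forwards → the upstream from the node's /etc/resolv.conf
upstream DNS → github.com IP

That is one more hop than using the node's DNS directly. The payoff is consistent resolution of in-cluster Services.

To skip CoreDNS:

dnsPolicy: Default          # use the node's /etc/resolv.conf

Despite its name, Default is not the default policy, and with it in-cluster Service names no longer resolve. Use with care.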

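Anti-pattern 6: expecting a LoadBalancer IP on a self-hosted cluster

$ kubectl get svc my-svc
NAME      TYPE           CLUSTER-IP   EXTERNAL-IP   PORT(S)
my-svc    LoadBalancer   10.96.1.5    <pending>      8080:30080/TCP
                                       ^^^^^^^^^
                                       pending forever

<pending> appears because no cloud controller exists to provision an LB. Options for a self-hosted cluster:

MetalLB: announces the Service IP over ARP or BGP
kube-vip: similar
NodePort plus an external load balancer (a self-managed haproxy, for example)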

10. Troubleshooting cheatsheet

"Service unreachable" in 6 steps

flowchart TD
    Start[Service unreachable] --> S1{1. Service exists?}
    S1 -->|no| F1[kubectl create / check the manifest]
    S1 -->|yes| S2{2. Endpoints have IPs?}
    S2 -->|no| F2[check selector / pod readiness / port name]
    S2 -->|yes| S3{3. Pod IP reachable directly?}
    S3 -->|no| F3[application not listening / blocked by NetworkPolicy]
    S3 -->|yes| S4{4. nc -zv ClusterIP port works?}
    S4 -->|no| F4[iptables rules missing / kube-proxy down]
    S4 -->|yes| S5{5. curl works at L7?}
    S5 -->|no| F5[application-layer problem / wrong path]
    S5 -->|yes| S6{6. Cross-namespace? Full name used?}
    S6 -->|no| F6[svc.ns.svc.cluster.local]
    S6 -->|yes| OK[✓ application problem]

7 key commands

# 1. Service status
kubectl get svc ⟨name⟩

# 2. Endpoint list
kubectl get endpoints ⟨name⟩
kubectl get endpointslice -l kubernetes.io/service-name=⟨name⟩

# 3. Pod readiness
kubectl get pod -l ⟨selector⟩ -o wide

# 4. What the application listens on
kubectl exec ⟨pod⟩ -- ss -lntp

# 5. iptables rules
iptables -t nat -L KUBE-SERVICES -n -v | grep ⟨cluster-ip⟩

# 6. DNS resolution
kubectl exec ⟨pod⟩ -- dig ⟨svc-name⟩.⟨ns⟩.svc.cluster.local

# 7. NetworkPolicy check
kubectl get netpol -A

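11. Advanced: what a service mesh changes

Plain K8s:

Pod A → iptables DNAT → Pod B

With Istio / Linkerd installed:

Pod A → istio-proxy (sidecar) → iptables → remote istio-proxy → Pod B

Every pod gains an Envoy or linkerd-proxy sidecar that intercepts all of its traffic. This brings:

mTLS (automatic TLS between pods)
Traffic management (canary / blue-green)
Observability (distributed tracing)

But it also brings:

Complexity
Resource overhead
Harder troubleshooting (one more hop)

A service mesh makes troubleshooting harder

Once a mesh is installed, pod-to-pod traffic passes through the sidecars:

application → local sidecar → node network → peer sidecar → peer application

Any of these hops can fail. When troubleshooting, remember to check kubectl logs ⟨pod⟩ -c istio-proxy as well.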

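12. Next steps

Document	Content
00-mental-model.md	Foundations of the networking mental model
01-output-reading.md	Reading command output
02-namespaces.md	netns / veth / bridge internals
This part	K8s Service / Ingress, end to end
04-dns-deep.md	CoreDNS / ndots / resolution pitfalls
05-troubleshooting.md	Troubleshooting methodology

Command documentation

For each command's full options and pitfalls, see ../commands/. Commands used in this part: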