VXLAN

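This article shows how the K8s pod network is built from the ground up. Without K8s, we hand-build a "two pods talking on the same node" network on a single machine, using nothing but ip netns commands.
Once these primitives are clear, the K8s pod network stops looking like magic.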

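Questions this article answers

How does K8s give every pod its own network?
Is eth0 inside a pod the same thing as eth0 on the node?
What exactly do veth pairs, bridges, and netns each do?
How does VXLAN encapsulate cross-node pod traffic?
How do you enter a pod's netns from the node to debug it (with nsenter)?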

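1. Linux Namespaces: the root of container "isolation"

The isolation that containers (Docker / K8s pods) provide is not VM-style hardware virtualization; it comes from kernel namespaces.

8 namespace types

Namespace	What it isolates
net	Network stack: interfaces / IPs / routes / iptables / sockets
mnt	Filesystem mount points
pid	Process IDs (a container sees its own process as PID 1)
user	UID / GID mappings
uts	hostname / domainname
ipc	System V IPC / POSIX message queues
cgroup	cgroup view
time	Clocks (Linux 5.6+)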

See which namespaces a process is in

$ ls -l /proc/<PID>/ns/
total 0
lrwxrwxrwx 1 root root cgroup -> 'cgroup:[4026531835]'
lrwxrwxrwx 1 root root ipc    -> 'ipc:[4026531839]'
lrwxrwxrwx 1 root root mnt    -> 'mnt:[4026532149]'      ← the container has its own mnt
lrwxrwxrwx 1 root root net    -> 'net:[4026532152]'      ← the container has its own net (the key one)
lrwxrwxrwx 1 root root pid    -> 'pid:[4026532151]'
lrwxrwxrwx 1 root root user   -> 'user:[4026531837]'     ← most containers share this with the host
lrwxrwxrwx 1 root root uts    -> 'uts:[4026532150]'

The number (for example 4026531835) is the namespace's inode number: identical numbers mean the same namespace.
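To make the "same number, same namespace" rule concrete, here is a minimal sketch that compares two processes by resolving their /proc/<PID>/ns/<type> links. The helper names ns_id and same_ns are made up for this illustration, and reading the links of another user's process usually requires root.

```bash
#!/bin/sh
# ns_id: extract the inode number from a namespace link target
# such as "net:[4026532152]". Prints nothing if the text does not match.
ns_id() {
  printf '%s\n' "$1" | sed -n 's/^[a-z_]*:\[\([0-9][0-9]*\)]$/\1/p'
}

# same_ns PID1 PID2 TYPE: succeed if both processes are in the same
# namespace of the given type (net, mnt, pid, ...).
same_ns() {
  a=$(readlink "/proc/$1/ns/$3") || return 2
  b=$(readlink "/proc/$2/ns/$3") || return 2
  [ "$(ns_id "$a")" = "$(ns_id "$b")" ]
}

# Example (hypothetical PID): is process 12345 in the host's netns?
#   same_ns 1 12345 net && echo shared || echo isolated
```

Comparing the inode numbers rather than the PIDs is what makes this work across containers: two processes with unrelated PIDs still resolve to the same link target when they share a namespace.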

See which processes share a netns

$ ls -l /proc/*/ns/net 2>/dev/null | awk '{print $NF}' | sort | uniq -c | head
    ... net:[4026531840]      ← host netns (every non-containerized process)
      5 net:[4026532152]
      5 net:[4026532153]

A count of 5 means 5 processes share that netns (for example, the processes of several containers in one pod).

Key K8s pod design: all containers in a pod share the net and ipc namespaces, so containers in the same pod can reach each other over localhost.

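2. net namespace: the star of this article

Each net namespace is an independent network stack. It has:

its own network interfaces
its own routing table
its own iptables / nftables rules
its own ARP table
its own sockets (listening ports are isolated)
its own /proc/net/* view

A process placed in a new netns cannot see the host's network at all, and it has no connectivity until you give the netns an interface.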

Hand-build two netns

# 1. Create two netns
ip netns add ns1
ip netns add ns2

# 2. List them
ip netns list
# ns1
# ns2

# 3. Run a command inside ns1
ip netns exec ns1 ip addr
# 1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN
#     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
# Only loopback exists, and even that is DOWN

A new netns has only a loopback interface, and it starts DOWN. Bring it up first:

ip netns exec ns1 ip link set lo up
ip netns exec ns1 ping -c 1 127.0.0.1
# replies arrive

At this point ns1 is still completely cut off from the outside; it needs an interface to reach anything. That is where the veth pair comes in.

3. veth pair: a "virtual cable" between two netns

A veth (Virtual Ethernet) pair is two virtual interfaces created together: a packet sent into one end comes out of the other, like a direct cable.

flowchart LR
    A[netns 1<br>veth-ns1] <-.veth pair.-> B[netns 2<br>veth-ns2]

    style A fill:#e1f5ff
    style B fill:#ffe1f5

Create a veth pair and connect two netns

# 1. Create a veth pair
ip link add veth-a type veth peer name veth-b

# Both ends start out in the root netns
ip link show veth-a
# veth-a@veth-b: ...

# 2. Move veth-a into ns1
ip link set veth-a netns ns1

# 3. Move veth-b into ns2
ip link set veth-b netns ns2

# 4. Configure veth-a inside ns1
ip netns exec ns1 ip link set veth-a up
ip netns exec ns1 ip addr add 10.1.1.1/24 dev veth-a

# 5. Configure veth-b inside ns2
ip netns exec ns2 ip link set veth-b up
ip netns exec ns2 ip addr add 10.1.1.2/24 dev veth-b

# 6. Test connectivity
ip netns exec ns1 ping -c 1 10.1.1.2
# 64 bytes from 10.1.1.2: ...

This is the smallest prototype of the K8s pod network: two netns plus one veth pair equals two "pods" talking.

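But how do N netns talk to each other? You need a bridge

One veth pair between every two netns? N namespaces would need N(N-1)/2 pairs, which grows as N² and is not practical.

A bridge is the answer: a virtual "switch". Each netns gets one veth pair, and the host-side end of every pair plugs into the bridge.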
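The scaling argument can be checked with a few lines of shell arithmetic. This is only an illustrative sketch; the function names mesh_pairs and bridge_pairs are invented here.

```bash
#!/bin/sh
# Veth pairs needed to connect n namespaces pairwise (full mesh): n*(n-1)/2.
mesh_pairs() { echo $(( $1 * ($1 - 1) / 2 )); }

# Veth pairs needed when every namespace plugs into one bridge: n.
bridge_pairs() { echo "$1"; }

for n in 3 10 100; do
  echo "n=$n mesh=$(mesh_pairs "$n") bridge=$(bridge_pairs "$n")"
done
# n=3 mesh=3 bridge=3
# n=10 mesh=45 bridge=10
# n=100 mesh=4950 bridge=100
```

With around a hundred pods per node, the full mesh already needs thousands of veth pairs, while the bridge design needs one per pod.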

4. bridge: a virtual switch

flowchart TB
    subgraph A[netns A]
        EA[veth-a-1<br>10.1.1.10/24]
    end
    subgraph B[netns B]
        EB[veth-b-1<br>10.1.1.20/24]
    end
    subgraph C[netns C]
        EC[veth-c-1<br>10.1.1.30/24]
    end

    Bridge[br0 bridge<br>10.1.1.1/24]
    EA -.veth pair.-> H1[veth-a-2]
    EB -.veth pair.-> H2[veth-b-2]
    EC -.veth pair.-> H3[veth-c-2]
    H1 --> Bridge
    H2 --> Bridge
    H3 --> Bridge

    style A fill:#e1f5ff
    style B fill:#e1f5ff
    style C fill:#e1f5ff
    style Bridge fill:#ffe1c4

Hand-build a bridge + 3 netns

# 1. Create the bridge
ip link add br0 type bridge
ip link set br0 up
ip addr add 10.1.1.1/24 dev br0

# 2. For each netns:
#    a) create a veth pair
#    b) move one end into the netns and give it an IP
#    c) attach the other end to the bridge

for ns in nsA nsB nsC; do
  ip netns add $ns
  ip netns exec $ns ip link set lo up

  # Create the veth pair
  ip link add veth-$ns-host type veth peer name veth-$ns-ns

  # netns end: move it in and assign an IP
  ip link set veth-$ns-ns netns $ns
  case $ns in
    nsA) IP=10.1.1.10 ;;
    nsB) IP=10.1.1.20 ;;
    nsC) IP=10.1.1.30 ;;
  esac
  ip netns exec $ns ip link set veth-$ns-ns up
  ip netns exec $ns ip addr add $IP/24 dev veth-$ns-ns
  ip netns exec $ns ip route add default via 10.1.1.1

  # host end: bring it up and attach it to the bridge
  ip link set veth-$ns-host up
  ip link set veth-$ns-host master br0
done

# 3. Verify that the three netns can reach each other
ip netns exec nsA ping -c 1 10.1.1.20
ip netns exec nsB ping -c 1 10.1.1.30
ip netns exec nsC ping -c 1 10.1.1.10
# all succeed

Congratulations: you have just hand-built a "single-node K8s pod network".

Inspect the bridge

$ ip link show master br0
3: veth-nsA-host@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master br0
5: veth-nsB-host@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master br0
7: veth-nsC-host@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master br0

$ bridge link show
3: veth-nsA-host@if2: ... master br0 state forwarding ...

A bridge maintains a MAC table internally (the FDB, Forwarding Database) that records which port each MAC address is behind:

$ bridge fdb show br br0
33:33:00:00:00:01 dev br0 self permanent
ee:ee:01:02:03:04 dev veth-nsA-host master br0
ff:ff:02:03:04:05 dev veth-nsB-host master br0
...

It works the same way as a hardware switch: receive a frame → read the destination MAC → look it up in the FDB → forward to the matching port.
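The lookup step can be imitated on saved bridge fdb show output. The sketch below is not how the kernel implements it; it only mirrors the decision: a learned MAC goes to one port, and an unknown unicast MAC is flooded to every other port. The function name fdb_port and the sample MAC addresses are made up for this illustration.

```bash
#!/bin/sh
# fdb_port MAC: read `bridge fdb show br <bridge>` text on stdin and print
# the port a frame for MAC would be forwarded to, or "flood" if the MAC
# has not been learned.
fdb_port() {
  awk -v mac="$1" '
    $1 == mac && $2 == "dev" { print $3; found = 1; exit }
    END { if (!found) print "flood" }
  '
}

# Sample FDB text (invented addresses).
fdb='ee:ee:01:02:03:04 dev veth-nsA-host master br0
ee:ee:02:03:04:05 dev veth-nsB-host master br0'

printf '%s\n' "$fdb" | fdb_port ee:ee:02:03:04:05   # veth-nsB-host
printf '%s\n' "$fdb" | fdb_port ee:ee:09:09:09:09   # flood
```

On a live node the input would come from bridge fdb show br br0 instead of the sample string.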

5. What a real single-node K8s pod network looks like

After K8s is installed with the Flannel CNI, the node automatically has:

$ ip link
1: lo: <LOOPBACK,UP,LOWER_UP> ...
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 ...     # physical NIC
3: cni0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 ...     # bridge created by the CNI
4: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 ...# Flannel overlay device
5: vethABCD1234@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> ... master cni0   # host end for pod A
6: vethEFGH5678@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> ... master cni0   # host end for pod B

Each vethXXXX@if3 is the node-side end of the "virtual cable" to one pod; master cni0 means it is attached to the bridge cni0.

The view from inside the pod

$ kubectl exec my-pod -- ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> ...
3: eth0@if5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 ...
    link/ether 0a:58:0a:f4:00:05 brd ff:ff:ff:ff:ff:ff
    inet 10.244.0.5/24 brd 10.244.0.255 scope global eth0

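Note eth0@if5: it means the other end of the pod's eth0 is the host interface with ifindex=5 (the vethABCD1234 shown above).

How to find "which pod does this veth belong to" from the node

# List all veths on the node
$ ip -o link | grep '@if'
5: vethABCD1234@if3: ...
6: vethEFGH5678@if3: ...

# Enter the pod's netns and read the peer ifindex of its eth0
$ kubectl get pod my-pod -o jsonpath='{.metadata.uid}'
abc123-...
# crictl's --name filters by container name; this assumes the container is also called my-pod
$ PID=$(crictl inspect $(crictl ps -q --name my-pod) | jq '.info.pid')
$ nsenter -t $PID -n ip link
3: eth0@if5: ...                         # ← @if5 means host-side ifindex=5

# Find the interface with ifindex=5 on the node
$ ip link show | awk -F': ' '/^5:/{print $2}'
vethABCD1234@if3                          # ← this is my-pod's veth

Handy when debugging pod packet loss or CNI problems.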
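The manual lookup above can be wrapped in two small text-processing helpers. This is a sketch under stated assumptions: the names peer_ifindex and name_by_ifindex are invented here, and the sample lines stand in for real ip -o link output so the logic can be tried without a cluster.

```bash
#!/bin/sh
# peer_ifindex LINE: from one `ip -o link` line such as
# "3: eth0@if5: <...>", print the peer's ifindex (5).
peer_ifindex() {
  printf '%s\n' "$1" | sed -n 's/^[0-9]*: [^@:]*@if\([0-9][0-9]*\):.*/\1/p'
}

# name_by_ifindex IDX: read `ip -o link` text on stdin and print the
# interface name (without the @ifN suffix) whose index is IDX.
name_by_ifindex() {
  awk -F': ' -v idx="$1" '$1 == idx { sub(/@.*/, "", $2); print $2 }'
}

# On a node this would be driven by real commands, for example:
#   idx=$(peer_ifindex "$(nsenter -t "$PID" -n ip -o link show eth0)")
#   ip -o link | name_by_ifindex "$idx"
pod_line='3: eth0@if5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450'
host_links='2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500
5: vethABCD1234@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 master cni0'

idx=$(peer_ifindex "$pod_line")
printf '%s\n' "$host_links" | name_by_ifindex "$idx"   # vethABCD1234
```

The peer index is parsed from ip output rather than read from /sys/class/net, because entering only the net namespace with nsenter -n leaves the host's sysfs mount in place.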

6. Cross-node: VXLAN overlay

A single node is done; what about across nodes? How do pod IPs on two different nodes (10.244.0.5 / 10.244.1.6) talk to each other?

Overlay approach (the default)

sequenceDiagram
    autonumber
    participant PA as Pod A<br>10.244.0.5<br>(node m1)
    participant CN1 as m1 cni0
    participant FL1 as m1 flannel.1<br>(VTEP)
    participant E1 as m1 eth0<br>10.0.24.28
    participant E2 as m2 eth0<br>10.0.24.29
    participant FL2 as m2 flannel.1
    participant CN2 as m2 cni0
    participant PB as Pod B<br>10.244.1.6<br>(node m2)

    PA->>CN1: src=10.244.0.5<br>dst=10.244.1.6
    Note over CN1: Route lookup:<br>10.244.1.0/24 → flannel.1
    CN1->>FL1: hand off to flannel.1
    Note over FL1: VXLAN encapsulation:<br>outer src=10.0.24.28<br>outer dst=10.0.24.29<br>VNI=1<br>inner = original IP packet
    FL1->>E1: UDP 4789
    E1->>E2: across nodes (eth0 to eth0)
    Note over FL2: Decapsulate
    FL2->>CN2: inner packet<br>src=10.244.0.5 dst=10.244.1.6
    CN2->>PB: deliver

The flannel.1 device on each node is a VTEP (VXLAN Tunnel Endpoint). The flannel daemon maintains the mapping from "peer pod CIDR" to "peer node IP".
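To get a feel for what a VTEP is, a point-to-point VXLAN device can be created by hand with iproute2. The sketch below only prints the commands so they can be reviewed first. It is a simplification and not what Flannel does: Flannel creates its device without a fixed remote and programs forwarding entries per peer node. The device name vxlan-demo, the overlay subnet 10.200.0.0/24, and the function name vxlan_cmds are assumptions made for this illustration; the node IPs are the ones from the diagram.

```bash
#!/bin/sh
# vxlan_cmds LOCAL_IP REMOTE_IP VNI OVERLAY_ADDR [UNDERLAY_DEV]
# Print (not run) the commands for a point-to-point VXLAN device.
vxlan_cmds() {
  dev=${5:-eth0}
  echo "ip link add vxlan-demo type vxlan id $3 local $1 remote $2 dstport 4789 dev $dev"
  echo "ip addr add $4 dev vxlan-demo"
  echo "ip link set vxlan-demo mtu 1450 up"
}

# On m1; on m2 swap the two node IPs and use 10.200.0.2/24.
vxlan_cmds 10.0.24.28 10.0.24.29 1 10.200.0.1/24
```

After reviewing the output, piping it to sh as root on both nodes should let 10.200.0.1 and 10.200.0.2 ping each other, with tcpdump -i eth0 udp port 4789 showing the encapsulated traffic.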

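Underlay / Native routing

No encapsulation: the underlying network routes pod IPs directly. Performance is better, but the network has to support it:

Same subnet (L2-adjacent nodes): the route for a pod CIDR points at the owning node, so the next hop resolves directly to that node's MAC
Across subnets: BGP lets routers exchange pod CIDR routes (Calico BGP / Cilium native routing)

On a Calico BGP node

$ ip route | grep 10.244
10.244.1.0/24 via 10.0.24.32 dev eth0 proto bird   # ← bird is the BGP daemon
10.244.2.0/24 via 10.0.24.33 dev eth0 proto bird

proto bird means the route was installed by the BGP daemon (bird). There is no overlay device.

Cilium native routing

Similar, but the datapath is handled directly in eBPF and does not depend on iptables.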


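VXLAN packet format (MTU accounting)

[Outer Ethernet (14)] [Outer IP (20)] [UDP (8)] [VXLAN (8)] [Inner Ethernet (14)] [Inner IP (20)] [TCP (20)] [Data]
└──────────── outer headers added by the node ─────────────┘└─────────── the frame as the pod sees it ────────────┘

Bytes placed in front of the pod's IP packet: 14 + 20 + 8 + 8 + 14 = 64
The MTU limits the IP packet and does not count the outer Ethernet header (14), so the overhead that counts against the MTU is 20 + 8 + 8 + 14 = 50 bytes

Physical MTU 1500 → the inner MTU must be ≤ 1450 (50 bytes are reserved for VXLAN encapsulation).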
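The same arithmetic as a small sketch. The overhead that counts against the underlay MTU is the outer IP header, UDP, the VXLAN header, and the inner Ethernet header; the outer Ethernet header is outside the MTU. The function names are invented here, and the IPv6 case assumes a plain 40-byte IPv6 header with no extension headers.

```bash
#!/bin/sh
# vxlan_overhead [4|6]: bytes of encapsulation counted against the
# underlay MTU: outer IP header + UDP (8) + VXLAN (8) + inner Ethernet (14).
vxlan_overhead() {
  case "${1:-4}" in
    4) echo $(( 20 + 8 + 8 + 14 )) ;;   # IPv4 underlay
    6) echo $(( 40 + 8 + 8 + 14 )) ;;   # IPv6 underlay
    *) return 1 ;;
  esac
}

# inner_mtu PHYS_MTU [4|6]: largest MTU the pod-side interfaces may use.
inner_mtu() {
  oh=$(vxlan_overhead "${2:-4}") || return 1
  echo $(( $1 - oh ))
}

inner_mtu 1500      # 1450
inner_mtu 1500 6    # 1430
inner_mtu 9000      # 8950
```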

This is why K8s CNIs automatically set the MTU of cni0 / flannel.1 / cilium_host to 1450:

$ ip link show flannel.1 | grep mtu
4: flannel.1: ... mtu 1450 ...

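MTU mismatch = intermittent packet loss

The physical MTU is 1500, but cni0 was also configured with 1500 (the 50 bytes were not subtracted):

A packet leaves the pod at 1500 bytes
flannel.1 on the node adds the VXLAN headers → 1550 bytes → larger than the physical MTU of 1500
The kernel has to fragment it; with DF set (typical for TCP) or on IPv6 it is dropped instead

Symptom: small packets are fine, large packets are lost. curl http://x works for short responses, but large downloads hang.

Diagnose:

# Inside the pod
$ ping -s 1472 -M do ⟨pod-IP⟩             # -s = payload size, -M do = forbid fragmentation
# 1472 + 8 ICMP + 20 IP = 1500
# failure → path MTU < 1500

Fix: set the MTU of cni0 / flannel.1 correctly (see the CNI configuration).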
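When probing for the real path MTU, the payload size passed to ping -s has to be recomputed for every candidate MTU. A minimal helper, with the invented name ping_payload, might look like this:

```bash
#!/bin/sh
# ping_payload MTU [4|6]: ICMP payload size (-s) that yields a packet of
# exactly MTU bytes: subtract the IP header and the 8-byte ICMP header.
ping_payload() {
  case "${2:-4}" in
    4) echo $(( $1 - 20 - 8 )) ;;
    6) echo $(( $1 - 40 - 8 )) ;;
    *) return 1 ;;
  esac
}

ping_payload 1500   # 1472
ping_payload 1450   # 1422

# Usage inside a pod netns (⟨pod-IP⟩ is a placeholder):
#   ping -c 3 -M do -s "$(ping_payload 1450)" ⟨pod-IP⟩
```

If the 1450-byte probe succeeds between pods on different nodes and the 1500-byte probe fails, the overlay is behaving as expected and the pod-side MTU should not exceed 1450.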

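7. nsenter: the entry point for troubleshooting

nsenter lets you "enter" the namespaces of a process and run a command as if you were inside that container, without needing a shell in the container.

nsenter -t <PID> <namespace flags> <command>

Common invocations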

# Run commands in the pod's network namespace
nsenter -t <PID> -n ip addr
nsenter -t <PID> -n ip route
nsenter -t <PID> -n ss -lntp
nsenter -t <PID> -n tcpdump -i any -nn

# Enter the mount namespace to see the filesystem as the pod sees it
nsenter -t <PID> -m ls /

# Enter several namespaces at once
nsenter -t <PID> -n -m -p bash
#                ^^ ^^ ^^
#                net mnt pid

# Enter all of them
nsenter -t <PID> -a bash

Flag reference:

flag	Namespace entered
`-n`	net
`-m`	mnt
`-p`	pid
`-u`	uts (hostname)
`-i`	ipc
`-C`	cgroup
`-U`	user
`-T`	time
`-a`	all
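In scripts it can be easier to spell out namespace names and translate them into flags. The helper below is a small sketch of the table above; the function name nsenter_flags is invented here.

```bash
#!/bin/sh
# nsenter_flags NAME...: translate namespace names into nsenter flags.
nsenter_flags() {
  out=""
  for ns in "$@"; do
    case "$ns" in
      net)    f=-n ;;
      mnt)    f=-m ;;
      pid)    f=-p ;;
      uts)    f=-u ;;
      ipc)    f=-i ;;
      cgroup) f=-C ;;
      user)   f=-U ;;
      time)   f=-T ;;
      all)    f=-a ;;
      *) echo "unknown namespace: $ns" >&2; return 1 ;;
    esac
    out="${out:+$out }$f"
  done
  # printf instead of echo: a lone "-n" would be eaten as an echo option.
  printf '%s\n' "$out"
}

nsenter_flags net mnt pid   # -n -m -p

# Usage (needs root; $PID is the container's PID):
#   nsenter -t "$PID" $(nsenter_flags net mnt) ip addr
```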

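The golden routine: capture packets inside a pod

# 1. Find the PID of the pod's container
PID=$(crictl inspect $(crictl ps -q --name my-pod) | jq '.info.pid')

# 2. Capture packets in the pod's netns
nsenter -t $PID -n tcpdump -i any -nn -c 30

# 3. Test connectivity from the pod's netns
nsenter -t $PID -n curl -v http://10.96.0.10

# 4. Check socket state in the pod's netns
nsenter -t $PID -n ss -tnp

What if the pod image has no curl / tcpdump? nsenter runs the node's own tools inside the pod's netns, which sidesteps the "no tools in the pod image" limitation.

See ../commands/nsenter.md for details.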

8. Hands-on: build "two pods talking" end to end

#!/bin/bash
set -e

# Clean up the previous run (tolerate missing objects)
ip netns del nsA 2>/dev/null || true
ip netns del nsB 2>/dev/null || true
ip link del br-demo 2>/dev/null || true

echo "=== 1. Create the bridge (stands in for cni0) ==="
ip link add br-demo type bridge
ip link set br-demo up
ip addr add 10.99.0.1/24 dev br-demo

echo "=== 2. Create podA's netns + veth ==="
ip netns add nsA
ip netns exec nsA ip link set lo up

ip link add veth-a type veth peer name veth-a-host
ip link set veth-a netns nsA
ip link set veth-a-host master br-demo
ip link set veth-a-host up

ip netns exec nsA ip link set veth-a up
ip netns exec nsA ip addr add 10.99.0.10/24 dev veth-a
ip netns exec nsA ip route add default via 10.99.0.1

echo "=== 3. Create podB ==="
ip netns add nsB
ip netns exec nsB ip link set lo up

ip link add veth-b type veth peer name veth-b-host
ip link set veth-b netns nsB
ip link set veth-b-host master br-demo
ip link set veth-b-host up

ip netns exec nsB ip link set veth-b up
ip netns exec nsB ip addr add 10.99.0.20/24 dev veth-b
ip netns exec nsB ip route add default via 10.99.0.1

echo "=== 4. Verify ==="
ip netns exec nsA ping -c 2 10.99.0.20
ip netns exec nsB ping -c 2 10.99.0.10

echo "=== 5. Give podA access to the outside ==="
# Enable forwarding on the host
sysctl -w net.ipv4.ip_forward=1
# SNAT outbound traffic (add the rule only if it is not already there)
iptables -t nat -C POSTROUTING -s 10.99.0.0/24 ! -o br-demo -j MASQUERADE 2>/dev/null || \
  iptables -t nat -A POSTROUTING -s 10.99.0.0/24 ! -o br-demo -j MASQUERADE
# If the FORWARD chain policy is DROP (for example with Docker installed),
# an ACCEPT rule for 10.99.0.0/24 is needed as well.

ip netns exec nsA ping -c 2 8.8.8.8
# replies arrive

echo "=== Done ==="

Run this script and you have two "pods" on one node that can reach each other and the outside network. A K8s CNI does a more elaborate version of the same thing.

Tear down

ip netns del nsA
ip netns del nsB
ip link del br-demo
iptables -t nat -D POSTROUTING -s 10.99.0.0/24 ! -o br-demo -j MASQUERADE

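9. Common misconceptions

❌ Misconception 1: eth0 in the pod is the host's eth0

ip link inside the pod shows:
3: eth0@if5: ...

ip link on the host shows:
2: eth0: ...
3: cni0: ...
5: vethABCD@if3: ...

They are entirely different things. The pod's eth0 is one end of a veth pair whose other end is vethABCD on the host. The physical eth0 is the node's NIC. The @if5 suffix is what links the two: it says the peer of the pod's eth0 is host ifindex=5.

❌ Misconception 2: deleting one end of a veth leaves the other end in place

ip link del veth-a            # delete one end
ip link show veth-b            # no longer exists

A veth pair is inseparable: delete one end and the peer disappears automatically.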

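❌ Misconception 3: every netns shows up in `ip netns list`

Pod netns created by K8s often do not appear in ip netns list:

$ ip netns list
# empty / only the ones you added by hand

The ip netns command only looks in /var/run/netns/. A netns created by a container runtime is not necessarily registered there (Docker, for example, keeps its handles under /var/run/docker/netns/); it may exist only as an nsfs reference held by the process. Some runtimes, including recent containerd versions, do place pod netns under /var/run/netns/ with names like cni-..., in which case they are listed.

To inspect a pod netns regardless of the runtime, use nsenter -t <PID> -n rather than ip netns exec ⟨name⟩.

Exposing a pod netns to ip netns:

PID=$(crictl inspect ... | jq '.info.pid')
mkdir -p /var/run/netns
ln -s /proc/$PID/ns/net /var/run/netns/mypod-ns

# ip netns works now
ip netns exec mypod-ns ip addr
ip netns list                      # mypod-ns is listed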

10. The full picture of K8s networking on a node (wrap-up)

# Almost every network element on the node
$ ip link show | grep -E "^[0-9]+:"
1: lo: ...                                                      # loopback
2: eth0: ...                                                    # physical NIC
3: docker0: ...                                                 # docker (possibly a leftover)
4: cni0: ...                                                    # K8s pod bridge
5: flannel.1: ...                                               # overlay device (if any)
6: vxlan.calico: ...                                            # Calico VXLAN
7: cilium_host@cilium_net: ...                                  # Cilium
8: cilium_net@cilium_host: ...
9: vethABCD@if3: ... master cni0                                # host end for pod A
10: vethEFGH@if3: ... master cni0                                # host end for pod B
...

# How the K8s pod CIDRs are allocated
$ kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.podCIDR}{"\n"}{end}'
m1    10.244.0.0/24
m2    10.244.1.0/24
m3    10.244.2.0/24
m4    10.244.3.0/24
m5    10.244.4.0/24

# The CNI configuration on the node
$ cat /etc/cni/net.d/*
{
  "cniVersion": "0.3.1",
  "name": "cbr0",
  "plugins": [
    { "type": "flannel", ... },
    { "type": "portmap", ... }
  ]
}

# Pod network state kept by the CNI
$ ls /var/lib/cni/                       # some CNIs store state here

# The netns each running container actually uses (crictl ps lists containers)
$ for pod in $(crictl ps -q); do
    pid=$(crictl inspect $pod | jq -r '.info.pid')
    echo "pod=$pod pid=$pid netns=$(readlink /proc/$pid/ns/net)"
  done

11. Troubleshooting cheatsheet

Pod is up but its IP is wrong

# Is the pod IP 0.0.0.0 / empty?
$ kubectl get pod my-pod -o jsonpath='{.status.podIP}'

# Check the CNI configuration on the node
$ ls /etc/cni/net.d/                       # should contain *.conf or *.conflist
$ cat /etc/cni/net.d/*

# Check the CNI logs
$ ls /var/log/containers/ | grep cni        # depends on the CNI
$ kubectl logs -n kube-system ⟨cni-pod⟩     # e.g. flannel / calico-node / cilium

# kubelet logs on the node
$ journalctl -u kubelet --since "5 min ago" | grep -i 'cni\|network'

Pods cannot reach each other (same node)

PID_A=$(crictl inspect $(crictl ps -q --name pod-a) | jq '.info.pid')
PID_B=$(crictl inspect $(crictl ps -q --name pod-b) | jq '.info.pid')

# Check both pods' IPs
nsenter -t $PID_A -n ip addr
nsenter -t $PID_B -n ip addr

# Has the bridge learned their MACs?
bridge fdb show br cni0 | head

# Capture pod A → pod B traffic
nsenter -t $PID_A -n ping -c 3 ⟨pod-b-ip⟩ &
tcpdump -i cni0 -nn host ⟨pod-b-ip⟩ -c 10

# Is a NetworkPolicy blocking it?
kubectl get netpol -A

Pods cannot reach each other (across nodes)

# 1. Can the nodes reach each other?
ssh m1 'ping -c 3 ⟨m2-IP⟩'
ssh m1 'mtr -rn -c 30 -u -P 4789 ⟨m2-IP⟩'   # UDP probes to the VXLAN port (VXLAN runs over UDP 4789)

# 2. Is the route right?
ssh m1 'ip route | grep 10.244.1.0'
# expected with host-gw / BGP:  10.244.1.0/24 via ⟨m2-IP⟩ ...
# expected with Flannel VXLAN:  10.244.1.0/24 via 10.244.1.0 dev flannel.1 onlink

# 3. Is the VXLAN device UP?
ssh m1 'ip link show flannel.1'              # or cilium_vxlan, etc.

# 4. Capture on both ends
ssh m1 'tcpdump -i any -nn host ⟨pod-b-ip⟩' &
ssh m2 'tcpdump -i any -nn host ⟨pod-a-ip⟩' &

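12. Next steps

Document	Contents
00-mental-model.md	Networking mental model basics
01-output-reading.md	Reading command output
This article	netns / veth / bridge / VXLAN internals
03-k8s-network-deep.md	Service / NodePort / Ingress: the full traffic path
04-dns-deep.md	DNS / CoreDNS in depth
05-troubleshooting.md	Troubleshooting methodology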