# Kubernetes in Practice: IPAM in a Cilium Cluster

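## Background

While troubleshooting the earlier issue [**Taking a node offline in a Cilium cluster breaks container networking on another online node**](/blog/k8s/k8s-shi-jian-cilium-ji-qun-xia-xian-jie-dian-dao-zhi-qi-ta-jie-dian-rong-qi-wang-luo-bu-tong.md), we established that the `spec.podCIDRs` of the Kubernetes Node resources and the `spec.ipam.podCIDRs` of the CiliumNode resources disagree in this cluster. The Pod subnet actually used on each node follows the CiliumNode `spec.ipam.podCIDRs`, but the BGP routes that Cilium announces to the network devices follow the Kubernetes Node `spec.podCIDRs`. The routes the network devices receive therefore do not match the Pod subnets the nodes really use, so taking one node offline can break Pod connectivity on another node. The environment looks like this: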

```shell
# kubectl  get node  -o custom-columns=NAME:.metadata.name,PodCIDR:.spec.podCIDRs
NAME      PodCIDR
master1   [10.5.112.0/24]
master2   [10.5.114.0/24]
master3   [10.5.113.0/24]
node01    [10.5.115.0/24]

# kubectl  get ciliumnode  -o custom-columns=NAME:.metadata.name,PodCIDR:.spec.ipam.podCIDRs
NAME      PodCIDR
master1   [10.5.113.0/24]
master2   [10.5.115.0/24]
master3   [10.5.114.0/24]
node01    [10.5.112.0/24]
```
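The mismatch in the two listings can be checked mechanically. The following is a minimal sketch, not part of Cilium: the helper names `mismatchedNodes` and `actualOwner` are made up for illustration, and the CIDR values are hard-coded from the output above rather than read from the API server.

```go
package main

import (
	"fmt"
	"sort"
)

// mismatchedNodes returns, in sorted order, the nodes whose Kubernetes Node
// podCIDR differs from their CiliumNode podCIDR.
func mismatchedNodes(k8s, cilium map[string]string) []string {
	var names []string
	for name, cidr := range k8s {
		if cilium[name] != cidr {
			names = append(names, name)
		}
	}
	sort.Strings(names)
	return names
}

// actualOwner returns the node whose CiliumNode podCIDR equals cidr, if any.
func actualOwner(cidr string, cilium map[string]string) (string, bool) {
	for name, c := range cilium {
		if c == cidr {
			return name, true
		}
	}
	return "", false
}

func main() {
	// Values copied from the two kubectl listings above.
	k8sNode := map[string]string{
		"master1": "10.5.112.0/24", "master2": "10.5.114.0/24",
		"master3": "10.5.113.0/24", "node01": "10.5.115.0/24",
	}
	ciliumNode := map[string]string{
		"master1": "10.5.113.0/24", "master2": "10.5.115.0/24",
		"master3": "10.5.114.0/24", "node01": "10.5.112.0/24",
	}
	for _, name := range mismatchedNodes(k8sNode, ciliumNode) {
		owner, ok := actualOwner(k8sNode[name], ciliumNode)
		if !ok {
			owner = "no node"
		}
		fmt.Printf("%s announces %s, which is actually used by %s\n", name, k8sNode[name], owner)
	}
}
```

With the data above, every node is reported. For example, master1 announces 10.5.112.0/24, the range that node01 actually uses, which is why removing one node can affect another.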

Is this simply a bug in Cilium, or is it caused by a misconfiguration on our side? Let's dig deeper with that question in mind.

## Problem Analysis

### Environment

The cluster's current Cilium configuration (unrelated settings omitted):

```shell
# kubectl  -n kube-system get cm cilium-config -oyaml
apiVersion: v1
data:
  bgp-announce-lb-ip: "true"
  bgp-announce-pod-cidr: "true"
  cluster-name: default
  cluster-pool-ipv4-cidr: 10.5.112.0/21
  cluster-pool-ipv4-mask-size: "24"
  enable-ipv4: "true"
  enable-ipv4-masquerade: "false"
  identity-allocation-mode: crd
  install-iptables-rules: "true"
  install-no-conntrack-iptables-rules: "false"
  ipam: cluster-pool
  kube-proxy-replacement: strict
  ...
kind: ConfigMap
metadata:
  name: cilium-config
  namespace: kube-system
```

The IPAM mode configured here is `cluster-pool`.
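To make the pool settings concrete, here is a small sketch that enumerates the per-node subnets implied by `cluster-pool-ipv4-cidr: 10.5.112.0/21` and `cluster-pool-ipv4-mask-size: "24"`. It only does the subnet arithmetic with the standard library; `splitPool` is a hypothetical helper and does not reflect how the Cilium Operator allocates or orders subnets.

```go
package main

import (
	"fmt"
	"net/netip"
)

// splitPool lists every subnet of length maskSize inside an IPv4 pool.
func splitPool(pool string, maskSize int) ([]netip.Prefix, error) {
	p, err := netip.ParsePrefix(pool)
	if err != nil {
		return nil, err
	}
	p = p.Masked()
	if !p.Addr().Is4() || maskSize < p.Bits() || maskSize > 32 {
		return nil, fmt.Errorf("invalid pool %q or mask size %d", pool, maskSize)
	}
	count := 1 << (maskSize - p.Bits())      // number of subnets in the pool
	step := uint32(1) << (32 - maskSize)     // addresses per subnet
	b := p.Addr().As4()
	base := uint32(b[0])<<24 | uint32(b[1])<<16 | uint32(b[2])<<8 | uint32(b[3])

	subnets := make([]netip.Prefix, 0, count)
	for i := 0; i < count; i++ {
		v := base + uint32(i)*step
		addr := netip.AddrFrom4([4]byte{byte(v >> 24), byte(v >> 16), byte(v >> 8), byte(v)})
		subnets = append(subnets, netip.PrefixFrom(addr, maskSize))
	}
	return subnets, nil
}

func main() {
	subnets, err := splitPool("10.5.112.0/21", 24)
	if err != nil {
		panic(err)
	}
	for _, s := range subnets {
		fmt.Println(s)
	}
}
```

The pool yields eight /24 subnets, 10.5.112.0/24 through 10.5.119.0/24, so all the CIDRs seen on both the Node and CiliumNode resources fall inside the same pool.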

### Cilium IPAM Mode

According to the [Cilium documentation](https://docs.cilium.io/en/v1.11/concepts/networking/ipam/), the commonly used IPAM modes include Cluster Scope (Default) and Kubernetes Host Scope.

#### Kubernetes Host Scope

To enable Kubernetes host-scope IPAM, set `ipam: kubernetes`. In this mode Cilium uses the `podCIDRs` of the Kubernetes Node directly as the node's Pod subnet and allocates Pod IPs from it.

**Architecture**

<figure><img src="/files/qxJ5v81MSZZZUL7WlV7J" alt=""><figcaption></figcaption></figure>

> Note: this mode requires `kube-controller-manager` to run with the `--allocate-node-cidrs` flag so that podCIDRs are allocated to every node automatically.

#### Cluster Scope (Default)

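Cluster-scope IPAM works much like Kubernetes Host Scope: each node is also assigned its own PodCIDRs. The difference is that the Cilium Operator manages the per-node PodCIDRs and stores them in the `CiliumNode` custom resource. The advantage of this mode is that it does not depend on the PodCIDRs Kubernetes assigns to each node.

**Architecture**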

<figure><img src="/files/J6tj1ecOrU1n8YiMWz1X" alt=""><figcaption></figcaption></figure>

### BGP Source Code Analysis

#### Cilium 1.11.9

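In this version, the podCIDRs announced by the Cilium bgpSpeaker can only come from the Kubernetes Node `podCIDRs`. For a more detailed walkthrough, see the [source code trace](/blog/k8s/k8s-shi-jian-cilium-ji-qun-xia-xian-jie-dian-dao-zhi-qi-ta-jie-dian-rong-qi-wang-luo-bu-tong.md#yuan-ma-zhui-zong).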

```go
// pkg/bgp/speaker/speaker.go
// OnAddNode notifies the Speaker of a new node.
func (s *MetalLBSpeaker) OnAddNode(node *v1.Node, swg *lock.StoppableWaitGroup) error {
	...
	// On a Node Add event, a nodeEvent is built and pushed onto the bgpSpeaker queue.
	// Note that podCIDRs is read from the Spec.PodCIDRs field of the k8s Node resource.
	s.queue.Add(nodeEvent{
		Meta:     meta,
		op:       Add,
		labels:   nodeLabels(node.Labels),
		podCIDRs: podCIDRs(node),
	})
	return nil
}

func podCIDRs(node *v1.Node) *[]string {
	...
	podCIDRs := make([]string, 0, len(node.Spec.PodCIDRs))
	if pc := node.Spec.PodCIDR; pc != "" {
		if len(node.Spec.PodCIDRs) > 0 && pc != node.Spec.PodCIDRs[0] {
			podCIDRs = append(podCIDRs, pc)
		}
	}
	podCIDRs = append(podCIDRs, node.Spec.PodCIDRs...)
	return &podCIDRs
}
```
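The merge logic in `podCIDRs` is easier to follow with concrete inputs. Below is a standalone sketch of the same logic: `nodeSpec` is a hypothetical stand-in for the two relevant fields of `v1.NodeSpec`, and `announcedCIDRs` is an illustrative name, so the example runs without the Kubernetes API packages.

```go
package main

import "fmt"

// nodeSpec is a hypothetical stand-in for the PodCIDR and PodCIDRs fields
// of v1.NodeSpec; it is not the real Kubernetes type.
type nodeSpec struct {
	PodCIDR  string
	PodCIDRs []string
}

// announcedCIDRs mirrors the merge logic of podCIDRs shown above:
// PodCIDR is prepended only when PodCIDRs is non-empty and its first
// entry differs from PodCIDR; then all of PodCIDRs is appended.
func announcedCIDRs(spec nodeSpec) []string {
	cidrs := make([]string, 0, len(spec.PodCIDRs)+1)
	if pc := spec.PodCIDR; pc != "" {
		if len(spec.PodCIDRs) > 0 && pc != spec.PodCIDRs[0] {
			cidrs = append(cidrs, pc)
		}
	}
	return append(cidrs, spec.PodCIDRs...)
}

func main() {
	// Usual case: PodCIDR equals PodCIDRs[0], so it is not duplicated.
	fmt.Println(announcedCIDRs(nodeSpec{
		PodCIDR:  "10.5.112.0/24",
		PodCIDRs: []string{"10.5.112.0/24"},
	}))
	// PodCIDR differs from PodCIDRs[0], so both are announced.
	fmt.Println(announcedCIDRs(nodeSpec{
		PodCIDR:  "10.5.112.0/24",
		PodCIDRs: []string{"10.5.113.0/24"},
	}))
}
```

In both cases the input is the Kubernetes Node spec only; nothing from the CiliumNode resource is consulted.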

#### Cilium 1.14.2

In this version, the bgpSpeaker can obtain the podCIDRs it announces from either the Kubernetes Node or the CiliumNode:

```go
// newDaemon creates and returns a new Daemon with the parameters set in c.
func newDaemon(ctx context.Context, cleaner *daemonCleanup, params *daemonParams) (*Daemon, *endpointRestoreState, error) {
	...
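	d.k8sWatcher.RegisterNodeSubscriber(d.endpointManager)
	// The bgpSpeaker subscribes to a different resource depending on the IPAM mode,
	// so podCIDRs come from the k8s Node or from the CiliumNode accordingly.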
	if option.Config.BGPAnnounceLBIP || option.Config.BGPAnnouncePodCIDR {
		switch option.Config.IPAMMode() {
		case ipamOption.IPAMKubernetes:
			d.k8sWatcher.RegisterNodeSubscriber(d.bgpSpeaker)
		case ipamOption.IPAMClusterPool:
			d.k8sWatcher.RegisterCiliumNodeSubscriber(d.bgpSpeaker)
		}
	}
	...
}

```

The handlers that process node events under the two IPAM modes are shown below, using only the Add event as an example:

```go
// OnAddNode notifies the Speaker of a new node.
func (s *MetalLBSpeaker) OnAddNode(node *slim_corev1.Node, swg *lock.StoppableWaitGroup) error {
	return s.notifyNodeEvent(Add, node, nodePodCIDRs(node), false)
}
// OnAddCiliumNode notifies the Speaker of a new CiliumNode.
func (s *MetalLBSpeaker) OnAddCiliumNode(node *ciliumv2.CiliumNode, swg *lock.StoppableWaitGroup) error {
	return s.notifyNodeEvent(Add, node, ciliumNodePodCIDRs(node), false)
}

// notifyNodeEvent notifies the speaker of a node (K8s Node or CiliumNode) event
func (s *MetalLBSpeaker) notifyNodeEvent(op Op, nodeMeta metaGetter, podCIDRs *[]string, withDraw bool) error {
	...

	l.Debug("adding event to queue")
	s.queue.Add(nodeEvent{
		Meta:     meta,
		op:       op,
		labels:   nodeLabels(nodeMeta.GetLabels()),
		podCIDRs: podCIDRs,
		withDraw: withDraw,
	})
	return nil
}
```

The two ways of obtaining podCIDRs are implemented as follows:

```go
// nodePodCIDRs reads the PodCIDRs from a k8s Node.
func nodePodCIDRs(node *slim_corev1.Node) *[]string {
	if node == nil {
		return nil
	}
	podCIDRs := make([]string, 0, len(node.Spec.PodCIDRs))
	// this bit of code extracts the pod cidr block the node will
	// use per: https://github.com//cilium/cilium/blob/8cb6ca42179a0e325131a4c95b14291799d22e5c/vendor/k8s.io/api/core/v1/types.go#L4600
	// read the above comments to understand this access pattern.
	if pc := node.Spec.PodCIDR; pc != "" {
		if len(node.Spec.PodCIDRs) > 0 && pc != node.Spec.PodCIDRs[0] {
			podCIDRs = append(podCIDRs, pc)
		}
	}
	podCIDRs = append(podCIDRs, node.Spec.PodCIDRs...)
	return &podCIDRs
}

// ciliumNodePodCIDRs reads the PodCIDRs from a CiliumNode.
func ciliumNodePodCIDRs(node *ciliumv2.CiliumNode) *[]string {
	if node == nil {
		return nil
	}
	podCIDRs := make([]string, 0, len(node.Spec.IPAM.PodCIDRs))
	podCIDRs = append(podCIDRs, node.Spec.IPAM.PodCIDRs...)
	return &podCIDRs
}
```
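Put together, the 1.14.2 behavior amounts to choosing the CIDR source by IPAM mode. The sketch below is a simplified model under stated assumptions: `nodeView` and `announced` are hypothetical names, the mode strings are the `ipam` values from the ConfigMap, and the real code registers subscribers rather than calling a function like this.

```go
package main

import "fmt"

const (
	ipamKubernetes  = "kubernetes"   // value of `ipam` for Kubernetes Host Scope
	ipamClusterPool = "cluster-pool" // value of `ipam` for Cluster Scope
)

// nodeView is a hypothetical pairing of the CIDRs recorded for one node
// on its k8s Node and on its CiliumNode.
type nodeView struct {
	k8sPodCIDRs    []string
	ciliumPodCIDRs []string
}

// announced models the switch in newDaemon: the source of the announced
// CIDRs depends on the IPAM mode, and other modes get no BGP subscriber.
func announced(mode string, n nodeView) ([]string, bool) {
	switch mode {
	case ipamKubernetes:
		return n.k8sPodCIDRs, true
	case ipamClusterPool:
		return n.ciliumPodCIDRs, true
	default:
		return nil, false
	}
}

func main() {
	// master1, with values taken from the listings in the Background section.
	master1 := nodeView{
		k8sPodCIDRs:    []string{"10.5.112.0/24"},
		ciliumPodCIDRs: []string{"10.5.113.0/24"},
	}
	for _, mode := range []string{ipamKubernetes, ipamClusterPool} {
		cidrs, ok := announced(mode, master1)
		fmt.Println(mode, cidrs, ok)
	}
}
```

Under `cluster-pool`, master1 would announce 10.5.113.0/24, the range it really uses, instead of the 10.5.112.0/24 taken from the Kubernetes Node in 1.11.9.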

The change that lets the bgpSpeaker take podCIDRs from either the Kubernetes Node or the CiliumNode was made in [this commit](https://github.com/cilium/cilium/commit/3f8e54ee315573509aa095d335b9908cfb4d9140); every release from 1.12 onward includes it.

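## Conclusion

The analysis above leads to two conclusions:

* In Cilium versions earlier than 1.12, the bgpSpeaker can only read podCIDRs from the Kubernetes Node, so if BGP route announcement is enabled on these versions, the IPAM mode must be `Kubernetes Host Scope`.
* In Cilium 1.12 and later, the bgpSpeaker can read podCIDRs from either the Kubernetes Node or the CiliumNode, so with BGP route announcement enabled, both `Cluster Scope (Default)` and `Kubernetes Host Scope` are valid choices.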

