k8s in Practice: Decommissioning a Node in a Cilium Cluster Breaks Container Networking on Other Nodes

Background

An internal Kubernetes cluster used in production runs the Cilium CNI with BGP enabled, peering with the network devices so that the Pod subnets are routable and Pod services can be reached by Pod IP from outside the cluster. After the cluster had been running normally for a while, we needed to decommission some of its nodes and move them to other clusters because its load was low. After taking node1 offline, we found that Pods on master2 could no longer communicate with the network outside the cluster, while Pod networking on master1 and master3 remained normal.

The cluster configuration is as follows:

# kubectl  get node -owide
NAME      STATUS                     ROLES                  AGE    VERSION    INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION             CONTAINER-RUNTIME
master1   Ready                      control-plane,master   376d   v1.22.10   10.4.94.32    <none>        Ubuntu 18.04.6 LTS   5.10.0-051000   containerd://1.6.8
master2   Ready                      control-plane,master   376d   v1.22.10   10.4.94.21    <none>        Ubuntu 18.04.6 LTS   5.10.0-051000   containerd://1.6.8
master3   Ready                      control-plane,master   376d   v1.22.10   10.4.94.31    <none>        Ubuntu 18.04.6 LTS   5.10.0-051000   containerd://1.6.8
node1     Ready                      control-plane,master   376d   v1.22.10   10.4.94.23    <none>        Ubuntu 18.04.6 LTS   5.10.0-051000   containerd://1.6.8

# helm -n kube-system list
NAME          	NAMESPACE  	REVISION	UPDATED                                	STATUS  	CHART               	APP VERSION
cilium        	kube-system	3       	2022-12-28 17:18:33.518302411 +0800 CST	deployed	cilium-1.11.9       	1.11.9

# kubectl -n kube-system exec -it cilium-z486s bash
root@l004094032:/home/cilium# cilium status
KVStore:                Ok   Disabled
Kubernetes:             Ok   1.22 (v1.22.10) [linux/amd64]
Kubernetes APIs:        ["cilium/v2::CiliumClusterwideNetworkPolicy", "cilium/v2::CiliumEndpoint", "cilium/v2::CiliumNetworkPolicy", "cilium/v2::CiliumNode", "core/v1::Namespace", "core/v1::Node", "core/v1::Pods", "core/v1::Service", "discovery/v1::EndpointSlice", "networking.k8s.io/v1::NetworkPolicy"]
KubeProxyReplacement:   Strict   [bond0 10.4.94.32 (Direct Routing)]
Host firewall:          Disabled
Cilium:                 Ok   1.11.9 (v1.11.9-4409e95)
NodeMonitor:            Listening for events on 48 CPUs with 64x4096 of shared memory
Cilium health daemon:   Ok
IPAM:                   IPv4: 16/254 allocated from 10.5.113.0/24,
BandwidthManager:       Disabled
Host Routing:           BPF
Masquerading:           Disabled
Controller Status:      75/75 healthy
Proxy Status:           OK, ip 10.5.113.75, 0 redirects active on ports 10000-20000
Hubble:                 Ok   Current/Max Flows: 4095/4095 (100.00%), Flows/s: 38.78   Metrics: Disabled
Encryption:             Disabled
Cluster health:         4/4 reachable   (2023-10-11T02:06:57Z)

root@l004094032:/home/cilium# cilium node list
Name      IPv4 Address   Endpoint CIDR   IPv6 Address   Endpoint CIDR
master1   10.4.94.32     10.5.113.0/24
master2   10.4.94.21     10.5.115.0/24
master3   10.4.94.31     10.5.114.0/24
node1     10.4.94.23     10.5.112.0/24

The BGP configuration is as follows:

# kubectl -n kube-system get cm bgp-config -oyaml
apiVersion: v1
data:
  config.yaml: |
    peers:
      - peer-address: 10.4.36.239
        peer-asn: 64524
        my-asn: 64513
      - peer-address: 10.4.36.238
        peer-asn: 64524
        my-asn: 64513
    address-pools:
      - name: default
        protocol: bgp
        addresses:
          - 10.5.112.0/21
kind: ConfigMap
metadata:
  name: bgp-config
  namespace: kube-system

Locating the Problem

After node1 was taken offline, Pods on master2 could not reach the network outside the cluster, and external hosts could not reach Pods on that node either. We pinged the external IP 10.9.25.120 from a Pod with IP 10.5.115.93 on master2 and captured packets on both the master2 host and the 10.9.25.120 machine, as shown below:

We asked the network team to check the BGP routes the network devices had received from the three nodes, which looked like this:

The routing tables revealed three problems:

  1. The Pod CIDR each Cilium node announced to the network devices was not necessarily that node's own Pod CIDR;

  2. After node1 was taken offline, the BGP routes on the network devices were never re-announced, so the devices no longer had a route for 10.5.115.0/24, the Pod subnet of master2;

  3. Even after restarting every cilium-agent Pod and the host nodes themselves, the stale route 10.5.112.0/24 belonging to the already-decommissioned node1 remained on the network devices and was never updated.

Hypothesis 1: the routes each cilium-agent announces to the network devices are persisted in a local configuration file or in etcd.

Result: we went through every configuration file mounted into cilium-agent as well as etcd and found nothing of the kind.

Source Code Analysis

The analysis below focuses on how the Cilium 1.11.9 branch implements its BGP routing functionality.

A quick walkthrough of the Cilium startup flow; the relevant source follows:

// daemon/cmd/daemon_main.go
// Entry point for the core command-line logic
func runDaemon() {
	...
	d, restoredEndpoints, err := NewDaemon(ctx, cancel,
		WithDefaultEndpointManager(ctx, endpoint.CheckHealth),
		linuxdatapath.NewDatapath(datapathConfig, iptablesManager, wgAgent))

	...
}

// daemon/cmd/daemon.go
// NewDaemon creates and returns a new Daemon with the parameters set in c.
func NewDaemon(ctx context.Context, cancel context.CancelFunc, epMgr *endpointmanager.EndpointManager, dp datapath.Datapath) (*Daemon, *endpointRestoreState, error) {
  ...
  // Initialize the bgpSpeaker and the k8sWatcher
  if option.Config.BGPAnnounceLBIP || option.Config.BGPAnnouncePodCIDR {
		d.bgpSpeaker, err = speaker.New(ctx, speaker.Opts{
			LoadBalancerIP: option.Config.BGPAnnounceLBIP,
			PodCIDR:        option.Config.BGPAnnouncePodCIDR,
		})
		if err != nil {
			log.WithError(err).Error("Error creating new BGP speaker")
			return nil, nil, err
		}
	}
	d.k8sWatcher = watchers.NewK8sWatcher(
		d.endpointManager,
		d.nodeDiscovery.Manager,
		&d,
		d.policy,
		d.svc,
		d.datapath,
		d.redirectPolicyManager,
		d.bgpSpeaker,
		d.egressGatewayManager,
		option.Config,
	)
  ...
  // Register with d.k8sWatcher.NodeChain every struct that needs to receive Node events; each of them must implement the OnAddNode/OnUpdateNode/OnDeleteNode methods.
  // d.bgpSpeaker is the one that synchronizes BGP routes with the network devices.
  d.k8sWatcher.NodeChain.Register(d.endpointManager)
	if option.Config.BGPAnnounceLBIP || option.Config.BGPAnnouncePodCIDR {
		d.k8sWatcher.NodeChain.Register(d.bgpSpeaker)
	}
	if option.Config.EnableServiceTopology {
		d.k8sWatcher.NodeChain.Register(&d.k8sWatcher.K8sSvcCache)
	}
  ...
  if k8s.IsEnabled() {
		...

		// Launch the K8s node watcher so we can start receiving node events.
		// Launching the k8s node watcher at this stage will prevent all agents
		// from performing Gets directly into kube-apiserver to get the most up
		// to date version of the k8s node. This allows for better scalability
		// in large clusters.
		d.k8sWatcher.NodesInit(k8s.Client())
  }
}
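
For context, this is roughly what a NodeChain subscriber looks like: anything passed to d.k8sWatcher.NodeChain.Register only needs the three node-event handlers. The sketch below is not Cilium code; it is a hypothetical observer whose OnAddNode signature mirrors MetalLBSpeaker's handler shown later, with OnUpdateNode/OnDeleteNode assumed by analogy, and it imports Cilium's pkg/lock package for the StoppableWaitGroup type.

// A minimal sketch of a hypothetical NodeChain subscriber (illustration only).
package main

import (
	"fmt"

	"github.com/cilium/cilium/pkg/lock"
	v1 "k8s.io/api/core/v1"
)

// nodeLogger just prints the Node events it receives.
type nodeLogger struct{}

func (n *nodeLogger) OnAddNode(node *v1.Node, swg *lock.StoppableWaitGroup) error {
	fmt.Printf("add: %s spec.podCIDRs=%v\n", node.Name, node.Spec.PodCIDRs)
	return nil
}

func (n *nodeLogger) OnUpdateNode(oldNode, newNode *v1.Node, swg *lock.StoppableWaitGroup) error {
	fmt.Printf("update: %s spec.podCIDRs=%v\n", newNode.Name, newNode.Spec.PodCIDRs)
	return nil
}

func (n *nodeLogger) OnDeleteNode(node *v1.Node, swg *lock.StoppableWaitGroup) error {
	fmt.Printf("delete: %s\n", node.Name)
	return nil
}

func main() {
	// Inside the daemon this would be registered next to the bgpSpeaker:
	//   d.k8sWatcher.NodeChain.Register(&nodeLogger{})
	_ = &nodeLogger{}
}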

// pkg/k8s/watchers/node.go
// Initialize the node informer (the handlers of every object registered on k.NodeChain are wired into it) and start the nodeController
func (k *K8sWatcher) NodesInit(k8sClient *k8s.K8sClient) {
	onceNodeInitStart.Do(func() {
		swg := lock.NewStoppableWaitGroup()

		nodeStore, nodeController := informer.NewInformer(
			cache.NewListWatchFromClient(k8sClient.CoreV1().RESTClient(),
				"nodes", v1.NamespaceAll, fields.ParseSelectorOrDie("metadata.name="+nodeTypes.GetName())),
			&v1.Node{},
			0,
			cache.ResourceEventHandlerFuncs{
				AddFunc: func(obj interface{}) {
					var valid bool
					if node := k8s.ObjToV1Node(obj); node != nil {
						valid = true
						errs := k.NodeChain.OnAddNode(node, swg)
						k.K8sEventProcessed(metricNode, metricCreate, errs == nil)
					}
					k.K8sEventReceived(metricNode, metricCreate, valid, false)
				},
				UpdateFunc: func(oldObj, newObj interface{}) {
					var valid, equal bool
					if oldNode := k8s.ObjToV1Node(oldObj); oldNode != nil {
						valid = true
						if newNode := k8s.ObjToV1Node(newObj); newNode != nil {
							equal = nodeEventsAreEqual(oldNode, newNode)
							if !equal {
								errs := k.NodeChain.OnUpdateNode(oldNode, newNode, swg)
								k.K8sEventProcessed(metricNode, metricUpdate, errs == nil)
							}
						}
					}
					k.K8sEventReceived(metricNode, metricUpdate, valid, equal)
				},
				DeleteFunc: func(obj interface{}) {
				},
			},
			nil,
		)

		k.nodeStore = nodeStore

		k.blockWaitGroupToSyncResources(wait.NeverStop, swg, nodeController.HasSynced, k8sAPIGroupNodeV1Core)
		go nodeController.Run(wait.NeverStop)
		k.k8sAPIGroups.AddAPI(k8sAPIGroupNodeV1Core)
	})
}
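
The detail worth highlighting in NodesInit is the field selector: each agent watches only its own Node object (metadata.name=<its own node name>) instead of listing every node, which is what the comment about scalability refers to. Below is a minimal, standalone client-go sketch of the same pattern; the kubeconfig path and the node name master2 are placeholders chosen for illustration.

// Standalone sketch of a single-node watch using a field selector,
// mirroring the pattern used by NodesInit (illustration only).
package main

import (
	"fmt"
	"time"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes a kubeconfig at the default location; in an agent the node
	// name would normally come from the downward API / NODE_NAME.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)
	nodeName := "master2"

	// Watch only this one Node object rather than the whole node list.
	lw := cache.NewListWatchFromClient(client.CoreV1().RESTClient(),
		"nodes", v1.NamespaceAll, fields.OneTermEqualSelector("metadata.name", nodeName))

	_, controller := cache.NewInformer(lw, &v1.Node{}, 0, cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			node := obj.(*v1.Node)
			fmt.Printf("add: %s spec.podCIDRs=%v\n", node.Name, node.Spec.PodCIDRs)
		},
		UpdateFunc: func(oldObj, newObj interface{}) {
			node := newObj.(*v1.Node)
			fmt.Printf("update: %s spec.podCIDRs=%v\n", node.Name, node.Spec.PodCIDRs)
		},
	})

	stop := make(chan struct{})
	go controller.Run(stop)
	time.Sleep(30 * time.Second)
	close(stop)
}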

The bgpSpeaker controller

Controller initialization and startup

// pkg/bgp/speaker/speaker.go
// New creates a new MetalLB BGP speaker controller. Options are provided to
// specify what the Speaker should announce via BGP.
func New(ctx context.Context, opts Opts) (*MetalLBSpeaker, error) {
  // Create a MetalLB BGP speaker controller from the BGP configuration
	ctrl, err := newMetalLBSpeaker(ctx)
	spkr := &MetalLBSpeaker{
		Fencer:  fence.Fencer{},
		speaker: ctrl,

		announceLBIP:    opts.LoadBalancerIP,
		announcePodCIDR: opts.PodCIDR,

		queue: workqueue.New(),
		services: make(map[k8s.ServiceID]*slim_corev1.Service),
	}

  // Start the controller
	go spkr.run(ctx)
	return spkr, nil
}

Handler implementation

Here we walk through the handling of the Add event; the other events are handled in much the same way.

The part to pay attention to is where podCIDRs comes from: the podCIDRs helper shows that the podCIDRs field placed on the queue is taken from the Spec.PodCIDRs field of the Kubernetes Node resource.

// pkg/bgp/speaker/speaker.go
// OnAddNode notifies the Speaker of a new node.
func (s *MetalLBSpeaker) OnAddNode(node *v1.Node, swg *lock.StoppableWaitGroup) error {
	...
	// On a Node Add event a nodeEvent is built and pushed onto the bgpSpeaker's queue. Note that podCIDRs is taken from the Spec.PodCIDRs field of the Kubernetes Node resource.
	s.queue.Add(nodeEvent{
		Meta:     meta,
		op:       Add,
		labels:   nodeLabels(node.Labels),
		podCIDRs: podCIDRs(node),
	})
	return nil
}

func podCIDRs(node *v1.Node) *[]string {
	...
	podCIDRs := make([]string, 0, len(node.Spec.PodCIDRs))
	if pc := node.Spec.PodCIDR; pc != "" {
		if len(node.Spec.PodCIDRs) > 0 && pc != node.Spec.PodCIDRs[0] {
			podCIDRs = append(podCIDRs, pc)
		}
	}
	podCIDRs = append(podCIDRs, node.Spec.PodCIDRs...)
	return &podCIDRs
}
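
As a quick sanity check, the snippet below runs a standalone copy of the helper above against a Node whose spec carries 10.5.114.0/24 (master2's value in this cluster, shown again in the conclusion). Whatever ends up on the queue is exactly what the Node spec says.

// Standalone copy of the podCIDRs helper above, exercised with example
// values taken from this cluster (illustration only).
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

func podCIDRs(node *v1.Node) *[]string {
	podCIDRs := make([]string, 0, len(node.Spec.PodCIDRs))
	if pc := node.Spec.PodCIDR; pc != "" {
		if len(node.Spec.PodCIDRs) > 0 && pc != node.Spec.PodCIDRs[0] {
			podCIDRs = append(podCIDRs, pc)
		}
	}
	podCIDRs = append(podCIDRs, node.Spec.PodCIDRs...)
	return &podCIDRs
}

func main() {
	node := &v1.Node{}
	node.Name = "master2"
	node.Spec.PodCIDR = "10.5.114.0/24"
	node.Spec.PodCIDRs = []string{"10.5.114.0/24"}

	// Prints [10.5.114.0/24]: the CIDR queued for announcement is whatever
	// the Node spec carries, not what Cilium allocated to this node.
	fmt.Println(*podCIDRs(node))
}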

Controller run loop

The controller's main loop consumes events from the queue, checks whether an event is a node event, and announces the Pod CIDR(s) carried by that event.

// pkg/bgp/speaker/events.go
// run runs the reconciliation loop, fetching events off of the queue to
// process. The events supported are svcEvent, epEvent, and nodeEvent. This
// loop is only stopped (implicitly) when the Agent is shutting down.
//
// Adapted from go.universe.tf/metallb/pkg/k8s/k8s.go.
func (s *MetalLBSpeaker) run(ctx context.Context) {
	...
	for {
		...
    // Consume an item from the queue
		key, quit := s.queue.Get()
		...
    // Process the item
		st := s.do(key)
    // Re-queue items whose processing returned an error
		switch st {
		case types.SyncStateError:
			s.queue.Add(key)
			// done must be called to requeue event after add.
		case types.SyncStateSuccess, types.SyncStateReprocessAll:
			// SyncStateReprocessAll is returned in MetalLB when the
			// configuration changes. However, we are not watching for
			// configuration changes because our configuration is static and
			// loaded once at Cilium start time.
		}
		// if queue.Add(key) is called previous to this invocation the event
		// is requeued, else it is discarded from the queue.
		s.queue.Done(key)
	}
}

// do performs the appropriate action depending on the event type. For example,
// if it is a service event (svcEvent), then it will call into MetalLB's
// SetService() to perform BGP announcements.
func (s *MetalLBSpeaker) do(key interface{}) types.SyncState {
	...
	switch k := key.(type) {
	case svcEvent:
		...
	case epEvent:
		...
	case nodeEvent:
		if s.Fence(k.Meta) {
			l.WithFields(logrus.Fields{
				"uuid":     k.Meta.UUID,
				"type":     "node",
				"revision": k.Meta.Rev,
			}).Debug("Encountered stale event, will not process")
			return types.SyncStateSuccess
		}
		st := s.handleNodeEvent(k)
		return st
	default:
		l.Debugf("Encountered an unknown key type %T in BGP speaker", k)
		return types.SyncStateSuccess
	}
}

func (s *MetalLBSpeaker) handleNodeEvent(k nodeEvent) types.SyncState {
	...

	l.Debug("syncing bgp sessions")
	...

  // Announce the Pod CIDR(s) to the network devices
	if s.announcePodCIDR {
		l.Debug("announcing pod CIDR(s)")
		if err := s.announcePodCIDRs(*k.podCIDRs); err != nil {
			l.WithError(err).WithField("CIDRs", k.podCIDRs).Error("Failed to announce pod CIDRs")
			return types.SyncStateError
		}
	}

	return types.SyncStateSuccess
}
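
The requeue-on-error behaviour in run() is the standard client-go workqueue pattern: re-Add the key before calling Done so that a failed announcement is retried. Here is a minimal standalone sketch of just that pattern (announce is a made-up stand-in, not a Cilium function):

// Standalone sketch of the consume / requeue-on-error pattern used by
// MetalLBSpeaker.run (illustration only).
package main

import (
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

func announce(key string, attempt int) error {
	if attempt == 0 {
		return fmt.Errorf("transient BGP session error")
	}
	fmt.Println("announced", key)
	return nil
}

func main() {
	q := workqueue.New()
	q.Add("nodeEvent:master2")

	for attempt := 0; attempt < 2; attempt++ {
		key, quit := q.Get()
		if quit {
			return
		}
		if err := announce(key.(string), attempt); err != nil {
			// Re-add before Done so the event is processed again,
			// mirroring the SyncStateError branch above.
			q.Add(key)
		}
		q.Done(key)
	}
}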

Conclusion

From the source code above we can see that the BGP routes Cilium announces to the network devices come from the Spec.PodCIDRs field of the Kubernetes Node resource, which explains all three problems we observed earlier.

Here are the PodCIDRs recorded on every Node in the Kubernetes cluster alongside the Pod CIDR Cilium actually allocated to each node:

# kubectl  get node  -o custom-columns=NAME:.metadata.name,PodCIDR:.spec.podCIDRs
NAME      PodCIDR
master1   [10.5.112.0/24]
master2   [10.5.114.0/24]
master3   [10.5.113.0/24]
node1     [10.5.115.0/24]

root@master1:/home/cilium# cilium node list
Name      IPv4 Address   Endpoint CIDR   IPv6 Address   Endpoint CIDR
master1   10.4.94.32     10.5.113.0/24
master2   10.4.94.21     10.5.115.0/24
master3   10.4.94.31     10.5.114.0/24
node1     10.4.94.23     10.5.112.0/24

This finally explains the problem: the route each node announces to the network devices differs from that node's actual Pod subnet because the PodCIDRs field on the Kubernetes Node objects does not match the Pod CIDRs Cilium allocated to the nodes.
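
To spot this kind of mismatch in one pass, here is a hedged verification sketch using client-go's dynamic client; it assumes, as in Cilium 1.11, that the per-node allocation is recorded on the CiliumNode object under spec.ipam.podCIDRs.

// Hedged verification sketch: compare node.spec.podCIDRs (what the BGP
// speaker announces) with ciliumnode.spec.ipam.podCIDRs (what Cilium uses).
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	k8sClient := kubernetes.NewForConfigOrDie(config)
	dynClient := dynamic.NewForConfigOrDie(config)

	ciliumNodeGVR := schema.GroupVersionResource{
		Group: "cilium.io", Version: "v2", Resource: "ciliumnodes",
	}

	nodes, err := k8sClient.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, n := range nodes.Items {
		cn, err := dynClient.Resource(ciliumNodeGVR).Get(context.TODO(), n.Name, metav1.GetOptions{})
		if err != nil {
			panic(err)
		}
		// Assumed field path for the Cilium-allocated pod CIDRs.
		ciliumCIDRs, _, _ := unstructured.NestedStringSlice(cn.Object, "spec", "ipam", "podCIDRs")
		fmt.Printf("%-10s node.spec.podCIDRs=%v ciliumnode.spec.ipam.podCIDRs=%v\n",
			n.Name, n.Spec.PodCIDRs, ciliumCIDRs)
	}
}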

Why the PodCIDRs recorded on the Kubernetes Node objects diverge from the Pod CIDRs Cilium allocates to each node is something we will dig into in a follow-up post.
