.------------------------------------. .------------------------------------.
| openstack cluster | | k8s cluster |
| tt-ost2-ko | | tt-k8st2-ko |
| worker | | worker |
| tt-lab3.ko.iszn.cz | | tt-lab9.ko.iszn.cz |
| .-------------. .-------------. | | .---------------..----------------.|
| | vm1 | | vm2 | | | | pod1 || pod2 ||
| | 10.247.2.19 | | 10.247.2.20 | | | | 10.247.144.77 || 10.247.144.152 ||
| '------|------' '------|------' | | '------|--------''-------|--------'|
|.-------v--------..-------v--------.| |.-------v--------..-------v--------.|
|| tape4f9e219-e1 || tap853025d2-98 || || lxc5b048b77e28 || lxca302b9e65ee ||
|'-------|--------''-------|--------'| |'-------|--------''-------|--------'|
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | .-----. .-----. | | |
| '-----------------'---->| NIC |-----overlay----| NIC <-----------------------' |
|"powered" by cilium '-----' '-----' powered by cilium |
'------------------------------------' '------------------------------------'
root@tt-lab3:~# cilium status --verbose | more
KVStore: Ok etcd: 1/1 connected, lease-ID=34f27840cc3acfa9, lock lease-ID=34f27840cc3acfab, has-quorum=true: https://10.248.14.20:4379 - 3.3.12 (Leader)
Kubernetes: Ok 1.19 (v1.19.7+k3s1) [linux/amd64]
Kubernetes APIs: ["cilium/v2::CiliumClusterwideNetworkPolicy", "cilium/v2::CiliumNetworkPolicy", "core/v1::Namespace", "core/v1::Node", "core/v1::Pods", "core/v1::Service", "discovery/v1beta1::EndpointSlice", "networking.k8s.io/v1::NetworkPolicy"]
KubeProxyReplacement: Partial
Cilium: Ok OK
NodeMonitor: Disabled
Cilium health daemon: Ok
IPAM: IPv4: 2/255 allocated from 10.247.2.0/24,
Allocated addresses:
10.247.2.183 (health)
10.247.2.95 (router)
ClusterMesh: 1/1 clusters ready, 0 global-services
tt-k8st2-ko.conf: ready, 0 nodes, 11 identities, 0 services, 0 failures (last: never)
└ etcd: 3/3 connected, lease-ID=158c77439c7b75a0, lock lease-ID=2ed1771796079eaf, has-quorum=true: https://tt-lab11.ko.iszn.cz:2479 - 3.4.3 (Leader); https://tt-lab9.ko.iszn.cz:2479 - 3.4.3; https://tt-lab10.ko.iszn.cz:2479 - 3.4.3
BandwidthManager: Disabled
Masquerading: Disabled
Controller Status: 62/62 healthy
Proxy Status: No managed proxy redirect
Hubble: Disabled
KubeProxyReplacement Details:
Status: Partial
Protocols: TCP, UDP
Session Affinity: Disabled
Services:
- ClusterIP: Enabled
- NodePort: Disabled
- LoadBalancer: Disabled
- externalIPs: Disabled
- HostPort: Disabled
BPF Maps: dynamic sizing: off
Name Size
Non-TCP connection tracking 262144
TCP connection tracking 524288
Endpoint policy 65535
Events 40
IP cache 512000
IP masquerading agent 16384
IPv4 fragmentation 8192
IPv4 service 65536
IPv6 service 65536
IPv4 service backend 65536
IPv6 service backend 65536
IPv4 service reverse NAT 65536
IPv6 service reverse NAT 65536
Metrics 1024
NAT 524288
Neighbor table 524288
Global policy 16384
Per endpoint policy 65536
Session affinity 65536
Signal 40
Sockmap 65535
Sock reverse NAT 262144
Tunnel 65536
Cluster health: 4/4 reachable (2021-03-23T14:13:11Z)
Name IP Node Endpoints
tt-ost2-ko/tt-lab3.ko.iszn.cz (localhost) 10.248.14.26 reachable reachable
tt-ost2-ko/tt-lab4.ko.iszn.cz 10.248.14.24 reachable reachable
tt-ost2-ko/tt-lab5.ko.iszn.cz 10.248.14.22 reachable reachable
tt-ost2-ko/tt-lab6.ko.iszn.cz 10.248.14.20 reachable reachable
ExecStart=/opt/cilium/sbin/cilium-agent \
--enable-l7-proxy=false \
--disable-envoy-version-check=true \
--enable-remote-node-identity \
--k8s-kubeconfig-path /opt/cilium/conf/kubeconfig.yaml \
--enable-ipv6=false \
--prometheus-serve-addr=":9099" \
--enable-host-reachable-services=true \
--enable-endpoint-routes=true \
--enable-local-node-route=false \
--masquerade=false \
--kvstore etcd \
--kvstore-opt etcd.config=/opt/cilium/conf/etcd.config \
--clustermesh-config /opt/cilium/conf/clusters \
--cluster-id {{ cluster_id }} \
--cluster-name {{ cluster_name }}
There are currently 4 nodes in the OpenStack cluster, each with an InternalIP from the range 10.248.14.0/24. OpenStack networking (including IPAM) is for now provided by Calico, so this works as in chaining mode. BIRD is used for BGP announcements, since each VM has a routable IP address.
The problem with Calico in this setup is that it does not use a per-worker CIDR, or anything else Cilium usually relies on. It allocates IPs from one network (the selected subnet) to every VM spawned anywhere in the cluster. This becomes a problem here when cilium-agent tries to insert the node CIDR of each remote node into the routing table:
func (n *linuxNodeHandler) updateNodeRoute(prefix *cidr.CIDR, addressFamilyEnabled bool, isLocalNode bool) error {
	if prefix == nil || !addressFamilyEnabled {
		return nil
	}

	// The route spec is what gets upserted below, so it must be kept.
	nodeRoute, err := n.createNodeRouteSpec(prefix, isLocalNode)
	if err != nil {
		return err
	}
	if _, err := route.Upsert(nodeRoute); err != nil {
		log.WithError(err).WithFields(nodeRoute.LogFields()).Warning("Unable to update route")
		return err
	}
	return nil
}
This results in something like
10.247.2.0/24 via 10.247.2.95 dev cilium_host src 10.247.2.95 mtu 1450
in the routing table of each node.
root@tt-lab3:~# ip r
default via 10.248.14.1 dev eth0 onlink
10.247.2.0/24 via 10.247.2.95 dev cilium_host src 10.247.2.95 mtu 1450
10.247.2.19 dev tape4f9e219-e1 scope link
10.247.2.20 dev tap853025d2-98 scope link
10.247.2.95 dev cilium_host scope link
10.247.2.183 dev lxc_health scope link
10.248.14.0/24 dev eth0 proto kernel scope link src 10.248.14.26
Each node also contains direct routes (above) for each local instance, like 10.247.2.19 dev tape4f9e219-e1 scope link,
which are inserted into the routing table by calico-felix (a daemon running on every node).
All this becomes a problem when a remote instance (vm1) tries to talk to another node itself (an HTTP server running directly on the host tt-lab5.ko.iszn.cz), as in this picture:
.---------------------------------------------------------------------------------------------------.
| openstack cluster |
| tt-ost2-ko |
| |
| .---------------------------------. .---------------------------------. |
| .-------------. .-------------. | .-------------. .-------------. | |
| | vm1 | | vm2 | | | vmX | | vmY | | |
| | 10.247.2.19 | | 10.247.2.20 | | | | | | | |
| '------|------' '------|------' | '------|------' '------|------' | |
|.-------v--------..-------v--------. .-------v--------..-------v--------. |
|| tape4f9e219-e1 || tap853025d2-98 |   | tap4355f2bb-33 || tap8cd570f1-67 |          |
|'-------|--------''----------------' '----------------''----------------' |
| | | | | | |
| | | | | .---------------------.| |
| | | | | | http server || |
| | | | | .-->| running on the host || |
| | | .-----. .-----. | '---------------------'| |
| | '---------------------->| NIC |------------->| NIC |---- | |
| |node "tt-lab3.ko.iszn.cz" '-----' '-----' node "tt-lab5.ko.iszn.cz"| |
| '---------------------------------' '---------------------------------' |
| |
| root@tt-lab3:~# ip r root@tt-lab5:~# ip r |
| default via 10.248.14.1 dev eth0 onlink default via 10.248.14.1 dev eth0 onlink |
| 10.247.2.0/24 via 10.247.2.95 dev cilium_host 10.247.1.27 dev tap535fa509-a6 scope link |
| src 10.247.2.95 mtu 1450 10.247.2.0/24 via 10.247.2.135 dev cilium_host|
| 10.247.2.19 dev tape4f9e219-e1 scope link src 10.247.2.135 mtu 1450 |
| 10.247.2.20 dev tap853025d2-98 scope link 10.247.2.11 dev tap4355f2bb-33 scope link |
| 10.247.2.95 dev cilium_host scope link 10.247.2.12 dev tap8cd570f1-67 scope link |
| 10.247.2.183 dev lxc_health scope link 10.247.2.29 dev tapbee1339f-84 scope link |
| 10.248.14.0/24 dev eth0 proto kernel scope 10.247.2.135 dev cilium_host scope link |
| link src 10.248.14.26 10.247.2.218 dev lxc_health scope link |
| 10.248.14.0/24 dev eth0 proto kernel scope |
| link src 10.248.14.22 |
'---------------------------------------------------------------------------------------------------'
Anyway, if the updateNodeRoute function is modified not to add the CIDR route (10.247.2.0/24 via 10.247.2.95), everything starts to work like a charm.
With this I would like to open a discussion on what the right way to make this work should be.