Cilium deployment fails to pass conn test and sonobuoy #8546

Closed
ShiroDN opened this issue Feb 15, 2022 · 12 comments

@ShiroDN commented Feb 15, 2022

Hi everyone,

I deployed a cluster with kubespray (master branch) using the Cilium CNI; the only change I made to group_vars was kube_network_plugin: cilium.

I am not sure whether some additional configuration must be set in group_vars for kubespray to work with Cilium, but I have three problems with the default kubespray Cilium deployment. I tested it in our DC and locally with Vagrant for verification, and both setups fail on the same three problems.

  1. The cilium-cli connectivity test is failing (see the diagnostic sketch after this list):
📋 Test Report
❌ 3/11 tests failed (5/118 actions), 0 tests skipped, 0 scenarios skipped:
Test [to-entities-world]:
  ❌ to-entities-world/pod-to-world/http-to-one-one-one-one-0: cilium-test/client-7568bc7f86-fw9qz (10.233.66.23) -> one-one-one-one-http (one.one.one.one:80)
  ❌ to-entities-world/pod-to-world/http-to-one-one-one-one-1: cilium-test/client2-686d5f784b-lz2nn (10.233.66.184) -> one-one-one-one-http (one.one.one.one:80)
Test [client-egress-l7]:
  ❌ client-egress-l7/pod-to-world/http-to-one-one-one-one-1: cilium-test/client2-686d5f784b-lz2nn (10.233.66.184) -> one-one-one-one-http (one.one.one.one:80)
Test [to-fqdns]:
  ❌ to-fqdns/pod-to-world/http-to-one-one-one-one-0: cilium-test/client-7568bc7f86-fw9qz (10.233.66.23) -> one-one-one-one-http (one.one.one.one:80)
  ❌ to-fqdns/pod-to-world/http-to-one-one-one-one-1: cilium-test/client2-686d5f784b-lz2nn (10.233.66.184) -> one-one-one-one-http (one.one.one.one:80)
Connectivity test failed: 3 tests failed

Every failed test fails on the same curl call:

  ❌ command "curl -w %{local_ip}:%{local_port} -> %{remote_ip}:%{remote_port} = %{response_code} --silent --fail --show-error --connect-timeout 5 --output /dev/null http://one.one.one.one:80" failed: command terminated with exit code 28
  ℹ️  curl output:
  curl: (28) Resolving timed out after 5000 milliseconds
:0 -> :0 = 000
  2. The Cilium Hubble UI does not work; the hubble-relay log shows:
2022-02-14T11:56:47+01:00 level=warning msg="Failed to create gRPC client" address="192.168.56.103:4244" error="connection error: desc = \"transport: error while dialing: dial tcp 192.168.56.103:4244: connect: connection refused\"" hubble-tls=true next-try-in=10s peer=k8s-3 subsys=hubble-relay
2022-02-14T11:56:47+01:00 level=warning msg="Failed to create gRPC client" address="192.168.56.102:4244" error="connection error: desc = \"transport: error while dialing: dial tcp 192.168.56.102:4244: connect: connection refused\"" hubble-tls=true next-try-in=10s peer=k8s-2 subsys=hubble-relay
2022-02-14T11:56:47+01:00 level=warning msg="Failed to create gRPC client" address="192.168.56.101:4244" error="connection error: desc = \"transport: error while dialing: dial tcp 192.168.56.101:4244: connect: connection refused\"" hubble-tls=true next-try-in=10s peer=k8s-1 subsys=hubble-relay

  3. Sonobuoy fails the following test:
    [sig-network] HostPort validates that there is no conflict between pods with same hostPort but different hostIP and protocol [LinuxOnly] [Conformance]
    Sonobuoy focus log: https://gist.github.com/ShiroDN/6c1790f52fad9f4235579a05e8be6e05
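
For anyone trying to reproduce this, a minimal diagnostic sketch (not part of the original report): the cilium-test deployment names are taken from the failures above, everything else is a generic kubectl/cilium-agent check, so adjust the names to your cluster.

    # Problem 1 is a DNS timeout, so exercise resolution from the failing client
    # deployment directly (deployment name taken from the report above).
    kubectl -n cilium-test exec deploy/client -- curl -sv --max-time 5 http://one.one.one.one
    kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide

    # Problem 2 is "connection refused" on node port 4244, so check whether the
    # Cilium agents actually expose a Hubble listener on that port.
    kubectl -n kube-system exec ds/cilium -- cilium status | grep -i hubble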

Steps to reproduce (condensed into a command sketch below this list):

  1. get kubespray
  2. in the inventory file k8s-cluster.yml, set kube_network_plugin: cilium
  3. run cluster.yml
  4. test with https://github.com/cilium/cilium-cli#connectivity-check; the test will fail
  5. enable hubble:
    $ cilium hubble enable --ui
  6. check hubble-relay pod logs
  7. run sonobuoy test
    sonobuoy run --mode=certified-conformance
    or you can run the failed test directly:
sonobuoy run --e2e-focus "HostPort validates that there is no conflict between pods with same hostPort but different hostIP and protocol" --wait
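
Condensing the steps above into commands (as referenced in the list heading): the inventory paths are assumptions based on the sample layout shipped with kubespray and may differ in your copy; the playbook invocation mirrors the one given later in this report.

    # Steps 1-3: get kubespray, switch the network plugin, deploy.
    git clone https://github.com/kubernetes-sigs/kubespray.git && cd kubespray
    cp -r inventory/sample inventory/kube01
    sed -i 's/^kube_network_plugin:.*/kube_network_plugin: cilium/' \
        inventory/kube01/group_vars/k8s_cluster/k8s-cluster.yml
    ansible-playbook -i inventory/kube01/hosts.yaml --become --become-user=root -u ubuntu cluster.yml
    # Steps 4-6: connectivity test, then hubble.
    cilium connectivity test
    cilium hubble enable --ui
    kubectl -n kube-system logs deploy/hubble-relay
    # Step 7: sonobuoy, focused on the failing HostPort conformance test.
    sonobuoy run --e2e-focus "HostPort validates that there is no conflict between pods with same hostPort but different hostIP and protocol" --wait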

Environment:

  • Cloud provider or hardware configuration: KVM VPS in our DC and Vagrant for verification

  • OS (printf "$(uname -srm)\n$(cat /etc/os-release)\n"):
    Ubuntu 20.04.3 LTS with Linux 5.4.0-97-generic x86_64
  • Version of Ansible (ansible --version):
    ansible 2.10.15

  • Version of Python (python --version):
    Python 3.10.2

Kubespray version (commit) (git rev-parse --short HEAD):
da8522af

Network plugin used:
cilium

Full inventory with variables (ansible -i inventory/sample/inventory.ini all -m debug -a "var=hostvars[inventory_hostname]"):

Command used to invoke ansible:

ansible-playbook -i inventory/kube01/hosts.yaml --become --become-user=root -K -u ubuntu cluster.yml

Output of ansible run:

The playbook completes without any failures.

Anything else do we need to know:

@ShiroDN ShiroDN added the kind/bug Categorizes issue or PR as related to a bug. label Feb 15, 2022
@necatican (Contributor)

Hello,
Thanks for bringing this to our attention. I will try to narrow the problem down. ❤️

@necatican (Contributor)

/assign

@necatican (Contributor)

Sorry for going dark for a while. I had a busy week. :) There's a similar issue on Cilium's issue board. However, disabling the host firewall didn't help me at all.

I've seen an iptables rule on the KUBE-FIREWALL chain flare up when running the Sonobuoy tests. The kubelet service adds this rule automatically, and it adds the same rule when using flannel/calico; however, those setups pass the same test without a problem.

Chain KUBE-FIREWALL (2 references)
 pkts bytes target     prot opt in     out     source               destination         
    15     900 DROP       all  --  any    any    !localhost/8          localhost/8          /* block incoming localnet connections */ !

I've also tried other Cilium versions but got the same results. I will try to install Cilium manually and work out the issue that way. Our Cilium files in Kubespray are somewhat dated; I will handle the necessary updates if I manage to find the problem.
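
A small sketch of how to watch that rule during a test run (generic iptables commands run on a cluster node; not part of the original comment, and only useful to confirm whether the DROP counters increase while the HostPort test executes):

    # On a node, while the Sonobuoy HostPort test is running:
    # show the kubelet-managed chain with packet counters and rule numbers.
    sudo iptables -L KUBE-FIREWALL -n -v --line-numbers
    # Dump the full rule text, which the counter view above truncates.
    sudo iptables-save | grep KUBE-FIREWALL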

@ShiroDN (Author) commented Feb 24, 2022

It's actually even worse with a manual install (kube_network_plugin: cni, with Cilium installed via the cilium CLI): more tests fail in both the cilium connectivity test and sonobuoy. Sorry, I forgot to mention it here; the manual install was the second thing I tried.

That issue on the Cilium issue board seems to be caused by a firewall; in this case, I had the host firewall disabled, of course.

Test Report
❌ 5/11 tests failed (9/142 actions), 0 tests skipped, 0 scenarios skipped:
Test [no-policies]:
  ❌ no-policies/pod-to-local-nodeport/curl-0: cilium-test/client-7568bc7f86-gbwm9 (10.0.5.28) -> cilium-test/echo-other-node (echo-other-node:8080)
  ❌ no-policies/pod-to-local-nodeport/curl-2: cilium-test/client2-686d5f784b-8mkz7 (10.0.5.55) -> cilium-test/echo-other-node (echo-other-node:8080)
Test [allow-all]:
  ❌ allow-all/pod-to-local-nodeport/curl-0: cilium-test/client2-686d5f784b-8mkz7 (10.0.5.55) -> cilium-test/echo-other-node (echo-other-node:8080)
  ❌ allow-all/pod-to-local-nodeport/curl-2: cilium-test/client-7568bc7f86-gbwm9 (10.0.5.28) -> cilium-test/echo-other-node (echo-other-node:8080)
Test [to-entities-world]:
  ❌ to-entities-world/pod-to-world/http-to-one-one-one-one-0: cilium-test/client-7568bc7f86-gbwm9 (10.0.5.28) -> one-one-one-one-http (one.one.one.one:80)
  ❌ to-entities-world/pod-to-world/http-to-one-one-one-one-1: cilium-test/client2-686d5f784b-8mkz7 (10.0.5.55) -> one-one-one-one-http (one.one.one.one:80)
Test [client-egress-l7]:
  ❌ client-egress-l7/pod-to-world/http-to-one-one-one-one-1: cilium-test/client2-686d5f784b-8mkz7 (10.0.5.55) -> one-one-one-one-http (one.one.one.one:80)
Test [to-fqdns]:
  ❌ to-fqdns/pod-to-world/http-to-one-one-one-one-0: cilium-test/client-7568bc7f86-gbwm9 (10.0.5.28) -> one-one-one-one-http (one.one.one.one:80)
  ❌ to-fqdns/pod-to-world/http-to-one-one-one-one-1: cilium-test/client2-686d5f784b-8mkz7 (10.0.5.55) -> one-one-one-one-http (one.one.one.one:80)
Connectivity test failed: 5 tests failed

I'm just running sonobuoy on the manual install now, and there are already 3 failures (86/346 specs so far).

@ShiroDN (Author) commented Feb 24, 2022

Sonobuoy results with manual install:

Summarizing 6 Failures:

[Fail] [sig-network] Services [It] should be able to switch session affinity for NodePort service [LinuxOnly] [Conformance] 
/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/network/service.go:2960

[Fail] [sig-network] HostPort [It] validates that there is no conflict between pods with same hostPort but different hostIP and protocol [LinuxOnly] [Conformance] 
/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/onsi/ginkgo/internal/leafnodes/runner.go:113

[Fail] [sig-network] Services [It] should have session affinity work for service with type clusterIP [LinuxOnly] [Conformance] 
/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/network/service.go:209

[Fail] [sig-network] Services [It] should have session affinity work for NodePort service [LinuxOnly] [Conformance] 
/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/network/service.go:2960

[Fail] [sig-network] Services [It] should be able to switch session affinity for service with type clusterIP [LinuxOnly] [Conformance] 
/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/network/service.go:209

[Fail] [sig-network] Services [It] should have session affinity timeout work for service with type clusterIP [LinuxOnly] [Conformance] 
/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/network/service.go:209

Ran 346 of 7042 Specs in 6776.440 seconds
FAIL! -- 340 Passed | 6 Failed | 0 Pending | 6696 Skipped
--- FAIL: TestE2E (6779.09s)
FAIL

@necatican (Contributor)

Hello,
Just a quick update, we've been working on this issue (and some other cases) with @eminaktas.

We've managed to pass all tests with our manual installation. However, there are way too many variables flying around, so we don't know the minimum required change.

Some of the variables aren't present in Kubespray yet; we will add them.

@eminaktas (Contributor)

/assign

@eminaktas (Contributor)

We (@necatican and I) applied these steps to do a clean install. With the steps below, you can install Cilium without kube-proxy and with eBPF host routing.

  • Install the cluster without a CNI and without kube-proxy:
    • Set kube_network_plugin: cni
    • Set kube_proxy_remove: true
  • When the cluster installation is done, you can install Cilium with the commands below.
    • PodCIDR is located here
    • MaskSize is here
    helm repo add cilium https://helm.cilium.io/
    export REPLACE_WITH_API_SERVER_IP=<api-server-ip-address>
    export REPLACE_WITH_API_SERVER_PORT=<api-server-port-number>
    helm install cilium cilium/cilium --version 1.11.2 \
        --namespace kube-system \
        --set kubeProxyReplacement=strict \
        --set k8sServiceHost=$REPLACE_WITH_API_SERVER_IP \
        --set k8sServicePort=$REPLACE_WITH_API_SERVER_PORT \
        --set bpf.masquerade=true \
        --set ipam.mode=cluster-pool \
        --set ipam.operator.clusterPoolIPv4PodCIDR=<PodCIDR> \
        --set ipam.operator.clusterPoolIPv4MaskSize=<MaskSize>
  • Then, you can run Cilium's connectivity test (a verification sketch follows this list); see the cilium-cli connectivity test docs for details.
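
A possible verification pass after the helm install above (plain cilium-cli/kubectl commands, not part of the original steps; assumes the default agent DaemonSet name "cilium"):

    # Wait for the agents and the operator to report ready.
    cilium status --wait
    # Confirm the kube-proxy replacement and eBPF host routing settings took
    # effect inside an agent (both lines appear in the agent's status output).
    kubectl -n kube-system exec ds/cilium -- cilium status | grep -E 'KubeProxyReplacement|Host Routing'
    # Re-run the connectivity test that was failing before.
    cilium connectivity test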

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 2, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 2, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot (Contributor)

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
