Cilium deployment fails to pass conn test and sonobuoy #8546

Closed
ShiroDN opened this issue Feb 15, 2022 · 12 comments

@ShiroDN commented Feb 15, 2022

Hi everyone,

I deployed a cluster with kubespray (master branch) using the Cilium CNI; the only change I made to group_vars was kube_network_plugin: cilium.

I am not sure whether some additional configuration must be set in group_vars for kubespray to work with Cilium, but I have three problems with the default kubespray Cilium deployment. I tested it in our DC and locally with Vagrant for verification, and both setups fail on the same three problems.

  1. The cilium-cli connectivity test is failing (see the diagnostic sketch after this list):
📋 Test Report
❌ 3/11 tests failed (5/118 actions), 0 tests skipped, 0 scenarios skipped:
Test [to-entities-world]:
  ❌ to-entities-world/pod-to-world/http-to-one-one-one-one-0: cilium-test/client-7568bc7f86-fw9qz (10.233.66.23) -> one-one-one-one-http (one.one.one.one:80)
  ❌ to-entities-world/pod-to-world/http-to-one-one-one-one-1: cilium-test/client2-686d5f784b-lz2nn (10.233.66.184) -> one-one-one-one-http (one.one.one.one:80)
Test [client-egress-l7]:
  ❌ client-egress-l7/pod-to-world/http-to-one-one-one-one-1: cilium-test/client2-686d5f784b-lz2nn (10.233.66.184) -> one-one-one-one-http (one.one.one.one:80)
Test [to-fqdns]:
  ❌ to-fqdns/pod-to-world/http-to-one-one-one-one-0: cilium-test/client-7568bc7f86-fw9qz (10.233.66.23) -> one-one-one-one-http (one.one.one.one:80)
  ❌ to-fqdns/pod-to-world/http-to-one-one-one-one-1: cilium-test/client2-686d5f784b-lz2nn (10.233.66.184) -> one-one-one-one-http (one.one.one.one:80)
Connectivity test failed: 3 tests failed

Every failed test fails on the same curl call:

  ❌ command "curl -w %{local_ip}:%{local_port} -> %{remote_ip}:%{remote_port} = %{response_code} --silent --fail --show-error --connect-timeout 5 --output /dev/null http://one.one.one.one:80" failed: command terminated with exit code 28
  ℹ️  curl output:
  curl: (28) Resolving timed out after 5000 milliseconds
:0 -> :0 = 000
  2. The Cilium Hubble UI does not work; the hubble-relay log shows:
2022-02-14T11:56:47+01:00 level=warning msg="Failed to create gRPC client" address="192.168.56.103:4244" error="connection error: desc = \"transport: error while dialing: dial tcp 192.168.56.103:4244: connect: connection refused\"" hubble-tls=true next-try-in=10s peer=k8s-3 subsys=hubble-relay
2022-02-14T11:56:47+01:00 level=warning msg="Failed to create gRPC client" address="192.168.56.102:4244" error="connection error: desc = \"transport: error while dialing: dial tcp 192.168.56.102:4244: connect: connection refused\"" hubble-tls=true next-try-in=10s peer=k8s-2 subsys=hubble-relay
2022-02-14T11:56:47+01:00 level=warning msg="Failed to create gRPC client" address="192.168.56.101:4244" error="connection error: desc = \"transport: error while dialing: dial tcp 192.168.56.101:4244: connect: connection refused\"" hubble-tls=true next-try-in=10s peer=k8s-1 subsys=hubble-relay

  3. Sonobuoy fails the following test:
    [sig-network] HostPort validates that there is no conflict between pods with same hostPort but different hostIP and protocol [LinuxOnly] [Conformance]
    Sonobuoy focus log: https://gist.github.com/ShiroDN/6c1790f52fad9f4235579a05e8be6e05
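
For anyone trying to reproduce this, a minimal diagnostic sketch (not part of the original report): the cilium-test deployment names are taken from the failures above, everything else is a generic kubectl/cilium-agent check, so adjust the names to your cluster.

    # Problem 1 is a DNS timeout, so exercise resolution from the failing client
    # deployment directly (deployment name taken from the report above).
    kubectl -n cilium-test exec deploy/client -- curl -sv --max-time 5 http://one.one.one.one
    kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide

    # Problem 2 is "connection refused" on node port 4244, so check whether the
    # Cilium agents actually expose a Hubble listener on that port.
    kubectl -n kube-system exec ds/cilium -- cilium status | grep -i hubble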

Steps to reproduce (condensed into a command sketch below this list):

  1. get kubespray
  2. in the inventory file k8s-cluster.yml, set kube_network_plugin: cilium
  3. run cluster.yml
  4. test with https://github.com/cilium/cilium-cli#connectivity-check; the test will fail
  5. enable hubble:
    $ cilium hubble enable --ui
  6. check hubble-relay pod logs
  7. run sonobuoy test
    sonobuoy run --mode=certified-conformance
    or you can run the failed test directly:
sonobuoy run --e2e-focus "HostPort validates that there is no conflict between pods with same hostPort but different hostIP and protocol" --wait
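
Condensing the steps above into commands (as referenced in the list heading): the inventory paths are assumptions based on the sample layout shipped with kubespray and may differ in your copy; the playbook invocation mirrors the one given later in this report.

    # Steps 1-3: get kubespray, switch the network plugin, deploy.
    git clone https://github.com/kubernetes-sigs/kubespray.git && cd kubespray
    cp -r inventory/sample inventory/kube01
    sed -i 's/^kube_network_plugin:.*/kube_network_plugin: cilium/' \
        inventory/kube01/group_vars/k8s_cluster/k8s-cluster.yml
    ansible-playbook -i inventory/kube01/hosts.yaml --become --become-user=root -u ubuntu cluster.yml
    # Steps 4-6: connectivity test, then hubble.
    cilium connectivity test
    cilium hubble enable --ui
    kubectl -n kube-system logs deploy/hubble-relay
    # Step 7: sonobuoy, focused on the failing HostPort conformance test.
    sonobuoy run --e2e-focus "HostPort validates that there is no conflict between pods with same hostPort but different hostIP and protocol" --wait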

Environment:

  • Cloud provider or hardware configuration: KVM VPS in our DC and Vagrant for verification

  • OS (printf "$(uname -srm)\n$(cat /etc/os-release)\n"):
    Ubuntu 20.04.3 LTS with Linux 5.4.0-97-generic x86_64
  • Version of Ansible (ansible --version):
    ansible 2.10.15

  • Version of Python (python --version):
    Python 3.10.2

Kubespray version (commit) (git rev-parse --short HEAD):
da8522af

Network plugin used:
cilium

Full inventory with variables (ansible -i inventory/sample/inventory.ini all -m debug -a "var=hostvars[inventory_hostname]"):

Command used to invoke ansible:

ansible-playbook -i inventory/kube01/hosts.yaml --become --become-user=root -K -u ubuntu cluster.yml

Output of ansible run:

The playbook completes without any failures.

Anything else do we need to know:

@ShiroDN ShiroDN added the kind/bug Categorizes issue or PR as related to a bug. label Feb 15, 2022
@necatican (Contributor)

Hello,
Thanks for bringing this to our attention. I will try to narrow the problem down. ❤️

@necatican (Contributor)

/assign

@necatican (Contributor)

Sorry for going dark for a while. I had a busy week. :) There's a similar issue on Cilium's issue board. However, disabling the host firewall didn't help me at all.

I've seen an iptables rule on the KUBE-FIREWALL chain flare up when running the Sonobuoy tests. The kubelet service adds this rule automatically, and it adds the same rule when using flannel/calico; however, those setups pass the same test without a problem.

Chain KUBE-FIREWALL (2 references)
 pkts bytes target     prot opt in     out     source               destination         
    15     900 DROP       all  --  any    any    !localhost/8          localhost/8          /* block incoming localnet connections */ !

I've also tried other Cilium versions but got the same results. I will try to install Cilium manually and work out the issue that way. Our Cilium files in Kubespray are somewhat dated; I will handle the necessary updates if I manage to find the problem.
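
A small sketch of how to watch that rule during a test run (generic iptables commands run on a cluster node; not part of the original comment, and only useful to confirm whether the DROP counters increase while the HostPort test executes):

    # On a node, while the Sonobuoy HostPort test is running:
    # show the kubelet-managed chain with packet counters and rule numbers.
    sudo iptables -L KUBE-FIREWALL -n -v --line-numbers
    # Dump the full rule text, which the counter view above truncates.
    sudo iptables-save | grep KUBE-FIREWALL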

@ShiroDN (Author) commented Feb 24, 2022

It's actually even worse with a manual install (kube_network_plugin: cni, with Cilium installed via the cilium CLI): more tests fail in both the cilium connectivity test and sonobuoy. Sorry, I forgot to mention it here; the manual install was the second thing I tried.

That issue on the Cilium issue board seems to be caused by a firewall; in this case, I had the host firewall disabled, of course.

Test Report
❌ 5/11 tests failed (9/142 actions), 0 tests skipped, 0 scenarios skipped:
Test [no-policies]:
  ❌ no-policies/pod-to-local-nodeport/curl-0: cilium-test/client-7568bc7f86-gbwm9 (10.0.5.28) -> cilium-test/echo-other-node (echo-other-node:8080)
  ❌ no-policies/pod-to-local-nodeport/curl-2: cilium-test/client2-686d5f784b-8mkz7 (10.0.5.55) -> cilium-test/echo-other-node (echo-other-node:8080)
Test [allow-all]:
  ❌ allow-all/pod-to-local-nodeport/curl-0: cilium-test/client2-686d5f784b-8mkz7 (10.0.5.55) -> cilium-test/echo-other-node (echo-other-node:8080)
  ❌ allow-all/pod-to-local-nodeport/curl-2: cilium-test/client-7568bc7f86-gbwm9 (10.0.5.28) -> cilium-test/echo-other-node (echo-other-node:8080)
Test [to-entities-world]:
  ❌ to-entities-world/pod-to-world/http-to-one-one-one-one-0: cilium-test/client-7568bc7f86-gbwm9 (10.0.5.28) -> one-one-one-one-http (one.one.one.one:80)
  ❌ to-entities-world/pod-to-world/http-to-one-one-one-one-1: cilium-test/client2-686d5f784b-8mkz7 (10.0.5.55) -> one-one-one-one-http (one.one.one.one:80)
Test [client-egress-l7]:
  ❌ client-egress-l7/pod-to-world/http-to-one-one-one-one-1: cilium-test/client2-686d5f784b-8mkz7 (10.0.5.55) -> one-one-one-one-http (one.one.one.one:80)
Test [to-fqdns]:
  ❌ to-fqdns/pod-to-world/http-to-one-one-one-one-0: cilium-test/client-7568bc7f86-gbwm9 (10.0.5.28) -> one-one-one-one-http (one.one.one.one:80)
  ❌ to-fqdns/pod-to-world/http-to-one-one-one-one-1: cilium-test/client2-686d5f784b-8mkz7 (10.0.5.55) -> one-one-one-one-http (one.one.one.one:80)
Connectivity test failed: 5 tests failed

I'm just running sonobuoy on the manual install now, and there are already 3 failures (86/346 specs so far).

@ShiroDN (Author) commented Feb 24, 2022

Sonobuoy results with manual install:

Summarizing 6 Failures:

[Fail] [sig-network] Services [It] should be able to switch session affinity for NodePort service [LinuxOnly] [Conformance] 
/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/network/service.go:2960

[Fail] [sig-network] HostPort [It] validates that there is no conflict between pods with same hostPort but different hostIP and protocol [LinuxOnly] [Conformance] 
/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/onsi/ginkgo/internal/leafnodes/runner.go:113

[Fail] [sig-network] Services [It] should have session affinity work for service with type clusterIP [LinuxOnly] [Conformance] 
/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/network/service.go:209

[Fail] [sig-network] Services [It] should have session affinity work for NodePort service [LinuxOnly] [Conformance] 
/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/network/service.go:2960

[Fail] [sig-network] Services [It] should be able to switch session affinity for service with type clusterIP [LinuxOnly] [Conformance] 
/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/network/service.go:209

[Fail] [sig-network] Services [It] should have session affinity timeout work for service with type clusterIP [LinuxOnly] [Conformance] 
/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/network/service.go:209

Ran 346 of 7042 Specs in 6776.440 seconds
FAIL! -- 340 Passed | 6 Failed | 0 Pending | 6696 Skipped
--- FAIL: TestE2E (6779.09s)
FAIL

@necatican (Contributor)

Hello,
Just a quick update, we've been working on this issue (and some other cases) with @eminaktas.

We've managed to pass all tests with our manual installation. However, there are way too many variables flying around, so we don't know the minimum required change.

Some of the variables aren't present in Kubespray yet; we will add them.

@eminaktas (Contributor)

/assign

@eminaktas (Contributor)

We (@necatican and I) applied these steps to do a clean install. With the steps below, you can install Cilium without kube-proxy and with eBPF host routing.

  • Install the cluster without a CNI and without kube-proxy:
    • Set kube_network_plugin: cni
    • Set kube_proxy_remove: true
  • When the cluster installation is done, you can install Cilium with the commands below.
    • PodCIDR is located here
    • MaskSize is here
    helm repo add cilium https://helm.cilium.io/
    export REPLACE_WITH_API_SERVER_IP=<api-server-ip-address>
    export REPLACE_WITH_API_SERVER_PORT=<api-server-port-number>
    helm install cilium cilium/cilium --version 1.11.2 \
        --namespace kube-system \
        --set kubeProxyReplacement=strict \
        --set k8sServiceHost=$REPLACE_WITH_API_SERVER_IP \
        --set k8sServicePort=$REPLACE_WITH_API_SERVER_PORT \
        --set bpf.masquerade=true \
        --set ipam.mode=cluster-pool \
        --set ipam.operator.clusterPoolIPv4PodCIDR=<PodCIDR> \
        --set ipam.operator.clusterPoolIPv4MaskSize=<MaskSize>
  • Then, you can run Cilium's connectivity test (a verification sketch follows this list); see the cilium-cli connectivity test docs for details.
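
A possible verification pass after the helm install above (plain cilium-cli/kubectl commands, not part of the original steps; assumes the default agent DaemonSet name "cilium"):

    # Wait for the agents and the operator to report ready.
    cilium status --wait
    # Confirm the kube-proxy replacement and eBPF host routing settings took
    # effect inside an agent (both lines appear in the agent's status output).
    kubectl -n kube-system exec ds/cilium -- cilium status | grep -E 'KubeProxyReplacement|Host Routing'
    # Re-run the connectivity test that was failing before.
    cilium connectivity test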

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 2, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 2, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot (Contributor)

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
