
[occm] dnsPolicy default value recent change blocks occm / cluster from coming up. #2611

Closed
ericgraf opened this issue May 31, 2024 · 9 comments · Fixed by #2621
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@ericgraf

[occm] Recent change #2594 changed the default behaviour of dnsPolicy, breaking the start-up process of occm when CoreDNS has not yet been started.

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened:

The change to the default value of dnsPolicy broke the bootstrapping of occm on new clusters.

The default value of dnsPolicy was Default; change #2594 sets it to ClusterFirstWithHostNet.

If CoreDNS is pending, waiting for occm to mark nodes as initialized, occm fails to start, blocking the creation of new clusters.

F0531 14:46:10.225237      10 main.go:71] Cloud provider could not be initialized: could not init cloud provider "openstack": Get "https://<cloud provider dns>:5000/": dial tcp: lookup <cloud provider dns> on 10.96.0.10:53: write udp 10.6.0.240:32899->10.96.0.10:53: write: operation not permitted

What you expected to happen:

The default behaviour should not have changed.

How to reproduce it:
Create a new cluster using the default values of the occm chart or the manifest. The rendered pod spec then contains roughly the fields sketched below.
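
For illustration only, a sketch of the relevant fields, not the exact rendered chart output (hostNetwork per the discussion below; the previous value per the report above):

```yaml
# Sketch of the relevant occm pod spec fields, not the exact manifest.
spec:
  hostNetwork: true
  # Previously "Default" (use the node's resolv.conf); #2594 changed this to
  # "ClusterFirstWithHostNet", which sends lookups to cluster DNS even though
  # CoreDNS cannot be scheduled until occm initializes the nodes.
  dnsPolicy: ClusterFirstWithHostNet
```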

Anything else we need to know?:

Environment:

  • openstack-cloud-controller-manager(or other related binary) version:
  • OpenStack version:
  • Others:
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label May 31, 2024
@mdbooth
Contributor

mdbooth commented May 31, 2024

The documentation of this field is quite spectacularly bad: https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#pod-s-dns-policy

"ClusterFirstWithHostNet": For Pods running with hostNetwork, you should explicitly set its DNS policy to "ClusterFirstWithHostNet". Otherwise, Pods running with hostNetwork and "ClusterFirst" will fallback to the behavior of the "Default" policy.

To my eyes that clearly says that pods running with host networking should set ClusterFirstWithHostNet. CCM runs with host networking, so we'd seem to be covered. It doesn't actually say what it does, though.

Some RTFS:
https://github.com/kubernetes/kubernetes/blob/790dfdbe386e4a115f41d38058c127d2dd0e6f44/pkg/kubelet/network/dns/dns.go#L303-L323

For reasons I don't understand, most likely legacy API compatibility, ClusterFirst sets the DNS policy to Default if the pod uses host networking. Internally that is mapped to a podDNSType of podDNSHost, which is a much better name. i.e. ClusterFirst means:

Use cluster DNS unless the pod uses host networking, in which case use the host's DNS.

ClusterFirstWithHostNet removes the fallthrough and uses ClusterFirst in all cases.
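
A minimal standalone paraphrase of that kubelet decision table (simplified to plain strings; the function name and shape here are mine, not the upstream code verbatim):

```go
package main

import "fmt"

// podDNSType mirrors the switch in kubelet's getPodDNSType
// (pkg/kubelet/network/dns/dns.go), reduced to plain strings.
func podDNSType(dnsPolicy string, hostNetwork bool) string {
	switch dnsPolicy {
	case "None":
		return "podDNSNone" // only the pod's explicit dnsConfig
	case "ClusterFirstWithHostNet":
		return "podDNSCluster" // cluster DNS, even with host networking
	case "ClusterFirst":
		if !hostNetwork {
			return "podDNSCluster"
		}
		// Host-network pods silently fall through to Default.
		fallthrough
	case "Default":
		return "podDNSHost" // the node's own resolv.conf
	}
	return "podDNSCluster" // unreachable: the apiserver rejects other values
}

func main() {
	// occm before #2594 (dnsPolicy: Default): the node's resolv.conf.
	fmt.Println(podDNSType("Default", true)) // podDNSHost
	// occm after #2594: cluster DNS, which needs CoreDNS to be up.
	fmt.Println(podDNSType("ClusterFirstWithHostNet", true)) // podDNSCluster
}
```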

As an early bootstrap service, I don't think CCM can rely on cluster DNS being up. I suspect it is correct to revert this.

@xinity, what was the issue you were hitting which caused you to change it? It's not clear to me from reading #2594 or #2592. I appreciate that CCM is not able to resolve service names internal to the cluster, but why did that matter?

@xinity
Contributor

xinity commented May 31, 2024

@mdbooth occm wasn't able to query the internal CoreDNS instance without this new value.

It matters because of a specific internal DNS zone, behind a Squid proxy, that needs to be resolved from occm.

@mdbooth
Contributor

mdbooth commented May 31, 2024

> @mdbooth occm wasn't able to query the internal CoreDNS instance without this new value.
>
> It matters because of a specific internal DNS zone, behind a Squid proxy, that needs to be resolved from occm.

Right, but why? What was the internal DNS zone, and why was it important that CCM could resolve it?

@jichenjc
Contributor

jichenjc commented Jun 3, 2024

our CI passed, so it should be a smaller portion of error cases

and I am also curious why the internal DNS zone is needed here

@mdbooth
Contributor

mdbooth commented Jun 3, 2024

> our CI passed, so it should be a smaller portion of error cases

I also wondered about that. Does that mean CNI comes up on an uninitialized node, and CoreDNS tolerates uninitialized nodes?

@yankcrime
Contributor

yankcrime commented Jun 21, 2024

I've just tested the new release of OCCM on a 1.30 cluster and have hit this issue as well. For anyone else struggling to trace the root cause back to this change, the nondescript error from the CCM Pod is:

Error from server: no preferred addresses found; known addresses: []

I only found the underlying error when I SSH'd onto the node where the CCM had been scheduled and looked at the container logs in /var/log/containers.
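
Concretely, something like this after SSHing onto the node (the log file name pattern is an assumption and will vary by deployment):

```sh
# "kubectl logs" fails with the "no preferred addresses found" error above,
# so read the kubelet-written log files directly on the node.
sudo tail -n 50 /var/log/containers/*openstack-cloud-controller-manager*.log
```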

@zetaab
Member

zetaab commented Jul 2, 2024

@jichenjc @mdbooth the occm test was not actually executed in #2594, which is why it did not fail; only the Helm chart tests were executed. As we can now see, our CI is basically broken. So yes, this change should be reverted.

Testing CI in #2620; PR #2621 will set the default value back to what it was.
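
Until #2621 merges, a possible stopgap is to pin the old value in your chart values; this assumes the chart exposes the dnsPolicy knob that #2594 introduced:

```yaml
# Hypothetical values.yaml override; the key name is assumed from #2594.
dnsPolicy: Default
```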

@zetaab
Member

zetaab commented Jul 2, 2024

blah, at least our CI does work with this new value... should we still use the old default value just in case?

@jichenjc
Contributor

jichenjc commented Jul 3, 2024

> blah, at least our CI does work with this new value... should we still use the old default value just in case?

yes, I think we should still use the old default value.
