
[occm] dnsPolicy default value recent change blocks occm / cluster from coming up. #2611

Closed
ericgraf opened this issue May 31, 2024 · 9 comments · Fixed by #2621
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@ericgraf

[occm] Recent change #2594 changed the default behaviour of dnsPolicy, breaking the start-up process of occm when CoreDNS has not yet been started.

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened:

The change to the default value of dnsPolicy broke the bootstrapping of occm on new clusters.

The default value of dnsPolicy was Default; change #2594 sets it to ClusterFirstWithHostNet.

If CoreDNS is pending, waiting for occm to mark nodes as initialized, occm fails to start, blocking the creation of new clusters.

F0531 14:46:10.225237      10 main.go:71] Cloud provider could not be initialized: could not init cloud provider "openstack": Get "https://<cloud provider dns>:5000/": dial tcp: lookup <cloud provider dns> on 10.96.0.10:53: write udp 10.6.0.240:32899->10.96.0.10:53: write: operation not permitted

What you expected to happen:

The default behaviour should not have changed.

How to reproduce it:
Create a new cluster using the default values of the occm chart or the manifest. The rendered pod spec then contains roughly the fields sketched below.
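
For illustration only, a sketch of the relevant fields, not the exact rendered chart output (hostNetwork per the discussion below; the previous value per the report above):

```yaml
# Sketch of the relevant occm pod spec fields, not the exact manifest.
spec:
  hostNetwork: true
  # Previously "Default" (use the node's resolv.conf); #2594 changed this to
  # "ClusterFirstWithHostNet", which sends lookups to cluster DNS even though
  # CoreDNS cannot be scheduled until occm initializes the nodes.
  dnsPolicy: ClusterFirstWithHostNet
```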

Anything else we need to know?:

Environment:

  • openstack-cloud-controller-manager(or other related binary) version:
  • OpenStack version:
  • Others:
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label May 31, 2024
@mdbooth
Contributor

mdbooth commented May 31, 2024

The documentation of this field is quite spectacularly bad: https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#pod-s-dns-policy

"ClusterFirstWithHostNet": For Pods running with hostNetwork, you should explicitly set its DNS policy to "ClusterFirstWithHostNet". Otherwise, Pods running with hostNetwork and "ClusterFirst" will fallback to the behavior of the "Default" policy.

To my eyes that clearly says that pods running with host networking should set ClusterFirstWithHostNet. CCM runs with host networking, so we'd seem to be covered. It doesn't actually say what it does, though.

Some RTFS:
https://github.com/kubernetes/kubernetes/blob/790dfdbe386e4a115f41d38058c127d2dd0e6f44/pkg/kubelet/network/dns/dns.go#L303-L323

For reasons I don't understand, most likely legacy API compatibility, ClusterFirst sets the DNS policy to Default if the pod uses host networking. Internally that is mapped to a podDNSType of podDNSHost, which is a much better name. i.e. ClusterFirst means:

Use cluster DNS unless the pod uses host networking, in which case use the host's DNS.

ClusterFirstWithHostNet removes the fallthrough and uses ClusterFirst in all cases.
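
A minimal standalone paraphrase of that kubelet decision table (simplified to plain strings; the function name and shape here are mine, not the upstream code verbatim):

```go
package main

import "fmt"

// podDNSType mirrors the switch in kubelet's getPodDNSType
// (pkg/kubelet/network/dns/dns.go), reduced to plain strings.
func podDNSType(dnsPolicy string, hostNetwork bool) string {
	switch dnsPolicy {
	case "None":
		return "podDNSNone" // only the pod's explicit dnsConfig
	case "ClusterFirstWithHostNet":
		return "podDNSCluster" // cluster DNS, even with host networking
	case "ClusterFirst":
		if !hostNetwork {
			return "podDNSCluster"
		}
		// Host-network pods silently fall through to Default.
		fallthrough
	case "Default":
		return "podDNSHost" // the node's own resolv.conf
	}
	return "podDNSCluster" // unreachable: the apiserver rejects other values
}

func main() {
	// occm before #2594 (dnsPolicy: Default): the node's resolv.conf.
	fmt.Println(podDNSType("Default", true)) // podDNSHost
	// occm after #2594: cluster DNS, which needs CoreDNS to be up.
	fmt.Println(podDNSType("ClusterFirstWithHostNet", true)) // podDNSCluster
}
```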

As an early bootstrap service, I don't think CCM can rely on cluster DNS being up. I suspect it is correct to revert this.

@xinity, what was the issue you were hitting which caused you to change it? It's not clear to me from reading #2594 or #2592. I appreciate that CCM is not able to resolve service names internal to the cluster, but why did that matter?

@xinity
Contributor

xinity commented May 31, 2024

@mdbooth occm wasn't able to query the internal CoreDNS instance without this new value.

It matters because of a specific internal DNS zone, behind a Squid proxy, that needs to be resolved from occm.

@mdbooth
Contributor

mdbooth commented May 31, 2024

> @mdbooth occm wasn't able to query the internal CoreDNS instance without this new value.
>
> It matters because of a specific internal DNS zone, behind a Squid proxy, that needs to be resolved from occm.

Right, but why? What was the internal DNS zone, and why was it important that CCM could resolve it?

@jichenjc
Contributor

jichenjc commented Jun 3, 2024

our CI passed, so it should be a smaller portion of error cases

and I am also curious why the internal DNS zone is needed here

@mdbooth
Contributor

mdbooth commented Jun 3, 2024

> our CI passed, so it should be a smaller portion of error cases

I also wondered about that. Does that mean CNI comes up on an uninitialized node, and CoreDNS tolerates uninitialized nodes?

@yankcrime
Contributor

yankcrime commented Jun 21, 2024

I've just tested the new release of OCCM on a 1.30 cluster and have hit this issue as well. For anyone else struggling to trace the root cause back to this change, the nondescript error from the CCM Pod is:

Error from server: no preferred addresses found; known addresses: []

I only found the underlying error when I SSH'd onto the node where the CCM had been scheduled and looked at the container logs in /var/log/containers.
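
Concretely, something like this after SSHing onto the node (the log file name pattern is an assumption and will vary by deployment):

```sh
# "kubectl logs" fails with the "no preferred addresses found" error above,
# so read the kubelet-written log files directly on the node.
sudo tail -n 50 /var/log/containers/*openstack-cloud-controller-manager*.log
```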

@zetaab
Member

zetaab commented Jul 2, 2024

@jichenjc @mdbooth the occm test was not actually executed in #2594, which is why it did not fail; only the Helm chart tests were executed. As we can now see, our CI is basically broken. So yes, this change should be reverted.

Testing CI in #2620; PR #2621 will set the default value back to what it was.
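
Until #2621 merges, a possible stopgap is to pin the old value in your chart values; this assumes the chart exposes the dnsPolicy knob that #2594 introduced:

```yaml
# Hypothetical values.yaml override; the key name is assumed from #2594.
dnsPolicy: Default
```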

@zetaab
Member

zetaab commented Jul 2, 2024

blah, at least our CI does work with this new value... should we still use the old default value just in case?

@jichenjc
Contributor

jichenjc commented Jul 3, 2024

> blah, at least our CI does work with this new value... should we still use the old default value just in case?

yes, I think we should still use the old default value.
