[cinder-csi-plugin] Apparently-completed connections to Keystone and Cinder fail with "i/o timeout" #1387
relates to #1207
Proposal:
Just to confirm my understanding: the normal code works fine, but with debug logging enabled the timeout issue occurs from time to time and leads to retries? So the suggestion is to make the timeout of the transport etc. configurable?
No, the code produces the frequent-failure condition at all times, and will return the error shown in the issue description. I'm hopeful that @kayrus's suggestion of improving visibility into the HTTP client and allowing tuning of the transport timeouts will let me both see in more detail what's causing this error and tune the plugin to avoid it.
ok, let's wait for additional help and comments from @kayrus :), thanks for the info~
@jralbert Thanks for reporting the issue.
Hi @ramineni, thanks for following up! Answering in order:
@jralbert Thanks for the info.
@ramineni We're installing from the Helm chart, and the livenessprobe is not configurable from values, so we've been running with the settings defined in the chart (https://github.com/kubernetes/cloud-provider-openstack/blob/master/charts/cinder-csi-plugin/templates/controllerplugin-statefulset.yaml#L114 and https://github.com/kubernetes/cloud-provider-openstack/blob/master/charts/cinder-csi-plugin/templates/nodeplugin-daemonset.yaml#L81). I pulled a local copy of the chart to install from and manually updated the templates to add a zero after each of the livenessprobe timeout values; with those changes in place, the errors stopped.

If you're willing to indulge me a bit further, I'd love to understand why that happened: my understanding is that the livenessprobe checks whether the CSI socket is ready, and if not triggers a restart of the container; but what I saw was the plugin program itself throwing errors and eventually exiting with a non-zero RC. Having changed the livenessprobe settings, I no longer see the errors, and the plugin runs without exiting. Am I misunderstanding the relationship between these components?

As a separate point, it would be wonderful to be able to configure the livenessprobe timeouts from Helm chart values; if that's something the project maintainers would consider, I'd be happy to create a pull request to implement it.
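For context, the probe stanza I edited looks roughly like this (a sketch only; the numbers are illustrative, not the chart's exact defaults):

```yaml
# livenessProbe on the cinder-csi containers (illustrative values)
livenessProbe:
  failureThreshold: 5
  httpGet:
    path: /healthz
    port: healthz
  initialDelaySeconds: 10
  timeoutSeconds: 30     # "adding a zero" to the hard-coded timeouts is the change that made the errors stop
  periodSeconds: 60
```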
I think it's reasonable to make them configurable, thanks for taking this on.
https://github.com/kubernetes-csi/livenessprobe is the livenessprobe check. I am curious as well: it checks the CSI container itself, so why it's related to an OpenStack call seems weird to me.
@jralbert That's a miss from our side then. Values should be configurable.
@jralbert In addition, it also checks whether the plugin is able to connect to OpenStack Cinder with a basic API call. If Cinder is not reachable, the plugin will be considered not-ready/unhealthy.
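For reference, the livenessprobe sidecar is wired to the plugin roughly like this (a sketch; the image tag and timeout value are illustrative):

```yaml
- name: liveness-probe
  image: registry.k8s.io/sig-storage/livenessprobe:v2.9.0   # illustrative tag
  args:
    - --csi-address=/csi/csi.sock   # the plugin's CSI socket
    - --probe-timeout=3s            # Probe() includes a basic OpenStack API call, so a slow or unreachable cloud fails the probe
  volumeMounts:
    - name: socket-dir
      mountPath: /csi
```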
ya, please do :) Appreciate it.
@ramineni Just planning for the PR, is it appropriate to add these configurable values to the csi.livenessprobe dict in values.yml, or would you prefer they be organized elsewhere in the file? This will be my first contribution to a Kubernetes project, and I understand there's a process to follow, so please excuse what will no doubt be many questions. :)
I think using values is a good way, as you can see
@jralbert Feel free to propose a PR in whatever way you think is best. I'm ok with adding them to csi.livenessprobe, but you will receive comments from everyone once you propose the PR.
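A rough sketch of what the csi.livenessprobe section of values.yml might look like once the settings are exposed (key names and defaults are hypothetical until the PR settles them):

```yaml
csi:
  livenessprobe:
    failureThreshold: 5
    initialDelaySeconds: 30
    timeoutSeconds: 30
    periodSeconds: 60
```

The statefulset and daemonset templates would then reference these values instead of hard-coded numbers.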
@jralbert welcome to the community :) no issue, please reach out to us with any query. You could also reach out to us on the #provider-openstack Slack channel.
Hi team, I'm getting the below error in the cinder-csi-plugin container. Can you please help troubleshoot why I'm getting it?
I0712 04:27:58.811070 1 driver.go:74] Driver: cinder.csi.openstack.org
try
Hi team! I set the "dnsPolicy" of the pod in cinder-csi-controllerplugin.yaml to "Default". Default: the pod inherits the worker node's DNS configuration. PS: the tricky thing about this parameter is that the "Default" value is not the default...
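In manifest terms, the change amounts to something like this (a minimal excerpt):

```yaml
spec:
  template:
    spec:
      dnsPolicy: Default   # inherit the worker node's resolv.conf instead of using cluster DNS
```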
@molimolinari that is an interesting finding. In general I think it would be fair to say that the OpenStack cloud is always external to the k8s cluster, so using cluster DNS will not be helpful. Opening an issue or MR to change the CSI pod to use "Default" could be a good idea.
This is a good suggestion. I read the doc and it seems
@jichenjc So it sounds like in practice, due to using hostNetwork, the current behaviour would have been the same as "Default". But wouldn't that mean the pods could not resolve k8s services? I am not sure about the details of why hostNetwork is needed/used, but that being the case, it seems like ClusterFirstWithHostNet could be better. But this doesn't explain how the recent comment #1387 (comment) could fix a problem, so I don't know.
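For comparison, a pod that runs on the host network but still wants to resolve in-cluster Services would be declared roughly like this:

```yaml
spec:
  template:
    spec:
      hostNetwork: true
      dnsPolicy: ClusterFirstWithHostNet   # use cluster DNS even though the pod shares the node's network namespace
```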
um... I don't know either :(, maybe there is some magic between
I'm guessing it all depends on the configuration of the nodes? If nodes can reach the OpenStack APIs, then inheriting configuration from them by setting dnsPolicy to "Default" should work.
@molimolinari can you help check the #2574 comments? Thanks
What happened:
Some of the time, requests to the OpenStack APIs are successful, and with plugin debug logging turned way up I can see that the plugin is sometimes able to connect to Keystone, get a token, and then connect to Cinder to list and create volumes. However, most of the time connections to either Keystone or Cinder will appear to retry several times at 10-second intervals, and eventually fail with a message like:
I0120 17:39:17.616376 1 openstack.go:581] OpenStack connection error, retries exhausted. Aborting
W0120 17:39:17.616511 1 main.go:108] Failed to GetOpenStackProvider: Post "https://<CLOUD_DOMAIN>:5000/v3/auth/tokens": OpenStack connection error, retries exhausted. Aborting. Last error was: dial tcp <CLOUD_IP>:5000: i/o timeout
What you expected to happen:
For connections to the OpenStack API services to succeed, as they do for other OpenStack clients, including cURL run from inside the cinder-csi-plugin containers.
How to reproduce it:
Just starting the plugin will cause the failure condition almost immediately. This will result in containers exiting and going into CrashLoopBackOff, although enough connections succeed that, if left to restart itself for long enough, volume actions will eventually get performed.
Anything else we need to know?:
This result occurs with both POSTs and GETs, and requests that often fail still do sometimes succeed.
Since I'm also in control of the OpenStack cloud infrastructure, I've been able to confirm in the API server logs that the requests which are being retried have in fact been received and responded to by the API servers, all with 2xx or 3xx response codes and with response times well under 10 seconds (most GETs are under 1 second), which leads me to think that there's not a genuine timeout condition here.
Basic connectivity testing with cURL inside the container where the plugin is running confirms that there's no difficulty or delay in querying the API server endpoints; tcpdump on the Kubernetes worker node hosting the container shows complete TCP conversations between the container and the API servers, with no indication of RSTs or broken connections.
Because I initially thought this must be an issue with the underlying gophercloud implementation that cinder-csi-plugin relies on, I first reported this bug on that project (gophercloud/gophercloud#2100) and had some good conversation with the project team about where this problem might arise, which ultimately ended up in the suggestion to open an issue on this repository.
Environment: