increase docker-healthcheck response timeout #5644
Conversation
In case of increased I/O load, the 10-second timeout is not enough on small or heavily loaded systems, so I propose 60 seconds. The kubelet's own timeout for detecting health problems is 2 minutes (120 seconds) by default. Secondly, a docker restart can heavily load the host OS, even on large systems, because many pods initialize at the same time; a continuous dockerd restart loop, effectively a deadlock of the node, has been observed. Thirdly, because of forcibly closed sockets and the kernel TCP TIME_WAIT value, TCP sockets are not usable immediately after a "restart"; waiting for FIN_TIMEOUT is necessary before starting services. Workaround kubernetes#1 for: kubernetes#5434
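For context, here is a minimal sketch of the kind of change this PR makes, assuming the `docker-healthcheck` helper wraps `docker ps` in coreutils `timeout` and restarts dockerd via systemd on failure (the real script in kops may differ in its details):

```bash
#!/bin/bash
# Sketch of a docker healthcheck helper, not the exact kops script.
# The PR's idea: give dockerd up to 60s (instead of 10s) to answer
# `docker ps` before declaring it unhealthy, since heavy I/O or a node
# full of initializing pods can make the daemon slow without it being dead.

HEALTHCHECK_TIMEOUT=60   # previously 10; the kubelet's own check allows ~120s

if timeout "${HEALTHCHECK_TIMEOUT}" docker ps > /dev/null 2>&1; then
  echo "docker is healthy"
  exit 0
fi

echo "docker did not respond within ${HEALTHCHECK_TIMEOUT}s, restarting dockerd"
systemctl restart docker

# Sockets that were forcibly closed may linger in TIME_WAIT, so give the
# restarted daemon and dependent services some time before probing again.
sleep 60
```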
Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA. It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Resolved review comments on ...models/nodeup/docker/_systemd/_debian_family/files/opt/kubernetes/helpers/docker-healthcheck
Also @tatobi, be sure to sign the CLA. /ok-to-test
/lgtm
Can we make this timeout dynamic instead? I can see people not wanting to wait 60 seconds here and preferring to get a failure earlier.
We could, but we would need to pin down exactly which metrics should drive the dynamic behaviour. The current check is admittedly a bit rough; we could check the statuses of the running containers instead, maybe in a next version. Do you think monitoring every pod's status after a stop/start operation would be more appropriate?
I wouldn't say we need to go down to that level, as that would be quite costly in compute. But making the timeout dynamic would allow tuning it to specific environments and their characteristics.
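As a rough illustration of the dynamic-timeout idea discussed above (not something this PR implements), the limit could be read from an environment variable or a nodeup-rendered setting, falling back to 60 seconds; `DOCKER_HEALTHCHECK_TIMEOUT` below is a hypothetical name, not an existing kops option:

```bash
#!/bin/bash
# Hypothetical sketch: per-environment healthcheck timeout instead of a
# hard-coded value. DOCKER_HEALTHCHECK_TIMEOUT is an invented variable name.

TIMEOUT="${DOCKER_HEALTHCHECK_TIMEOUT:-60}"

if ! timeout "${TIMEOUT}" docker ps > /dev/null 2>&1; then
  echo "docker did not respond within ${TIMEOUT}s, restarting dockerd"
  systemctl restart docker
fi
```

Environments that prefer to fail fast could then set a lower value without changing the script itself.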
/retest
I think we should get this in; the behaviour here is much better - fewer false positives and likely much more visible because of the
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: ihoegen, justinsb, tatobi. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
The errors Travis is picking up are the old misspellings, because Travis doesn't test the code after rebasing. Force-merging.