Use machine type n1-standard-2 to avoid OOM killing #17743

Closed

bart0sh
Contributor

@bart0sh bart0sh commented May 28, 2020

Jobs that create 105 pods on COS regularly trigger the
kernel OOM killer, which causes job failures.

Use the n1-standard-2 instance type with 7.5 GB of RAM to give
the test processes more memory.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels May 28, 2020
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: bart0sh
To complete the pull request process, please assign derekwaynecarr
You can assign the PR to them by writing /assign @derekwaynecarr in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@bart0sh
Contributor Author

bart0sh commented May 28, 2020

@MHBauer
Contributor

MHBauer commented May 28, 2020

What does that look like in the logs? I'm looking at https://testgrid.k8s.io/sig-node-kubelet#node-kubelet-benchmark which uses this config.

@bart0sh
Contributor Author

bart0sh commented May 28, 2020

Here is what I could find in the latest logs:

./_artifacts/n1-standard-1-cos-dev-83-13020-12-0-e2996693-system.log:May 28 11:51:24 n1-standard-1-cos-dev-83-13020-12-0-e2996693 kernel:  oom_kill_process+0xb1/0x280
./_artifacts/n1-standard-1-cos-dev-83-13020-12-0-e2996693-system.log:May 28 11:51:24 n1-standard-1-cos-dev-83-13020-12-0-e2996693 kernel:  ? oom_evaluate_task+0x137/0x160
...
./_artifacts/n1-standard-1-cos-dev-83-13020-12-0-e2996693-system.log:May 28 11:51:25 n1-standard-1-cos-dev-83-13020-12-0-e2996693 kernel: oom_reaper: reaped process 1860 (e2e_node.test), now anon-rss:0kB, file-rss:0kB, shmem-rss:768kB
...
./_artifacts/n1-standard-1-cos-dev-83-13020-12-0-e2996693-system.log:May 28 11:51:25 n1-standard-1-cos-dev-83-13020-12-0-e2996693 kubelet[1645]: I0528 11:51:18.295644    1645 event.go:278] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"n1-standard-1-cos-dev-83-13020-12-0-e2996693", UID:"n1-standard-1-cos-dev-83-13020-12-0-e2996693", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'SystemOOM' System OOM encountered, victim process: e2e_node.test, pid: 1860

This is an extreme case, as the test process itself was killed. Sometimes it's less obvious: the OOM killer kills runc and even seemingly unrelated processes.
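
To see which processes the OOM killer picked across a run, the `oom_reaper` lines in the system log can be collected programmatically. The following is a minimal sketch, not part of the test tooling: the function names are hypothetical, and the regular expression assumes the kernel message format shown in the log excerpt above.

```python
import re
from collections import Counter

# Matches kernel messages such as:
#   "... kernel: oom_reaper: reaped process 1860 (e2e_node.test), now anon-rss:0kB, ..."
# The format is taken from the log excerpt in this comment.
OOM_REAPER_RE = re.compile(r"oom_reaper: reaped process (\d+) \(([^)]+)\)")


def oom_victims(lines):
    """Return a list of (pid, process name) for every oom_reaper line."""
    victims = []
    for line in lines:
        match = OOM_REAPER_RE.search(line)
        if match:
            victims.append((int(match.group(1)), match.group(2)))
    return victims


def count_by_name(victims):
    """Count how often each process name was reaped."""
    return Counter(name for _pid, name in victims)
```

Feeding it the lines of a `*-system.log` file (or `dmesg` output) gives a quick summary of whether the victims are test processes, runc, or something unrelated.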

@spiffxp
Member

spiffxp commented May 28, 2020

/hold
I'm wary of "just increase resources" fixes; it could be that we're hiding a legitimate performance or resource-usage regression.

/cc @karan
I would be curious to get some input from folks with COS expertise

@k8s-ci-robot k8s-ci-robot requested a review from karan May 28, 2020 19:45
@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 28, 2020
@spiffxp
Member

spiffxp commented May 28, 2020

For example, I might be ok with this as a temporary / unblocking fix if there is a commitment to get back under the threshold. But I don't think we should just bump resources and never look back.

@spiffxp
Member

spiffxp commented May 28, 2020

/cc @bsdnet

@k8s-ci-robot
Contributor

@spiffxp: GitHub didn't allow me to request PR reviews from the following users: bsdnet.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @bsdnet

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@karan
Contributor

karan commented May 28, 2020

+1 to what @spiffxp said.

What jobs are scheduled on the node? What is their resource consumption? Can we instead tune them rather than double the machine size itself?

@bsdnet
Contributor

bsdnet commented May 29, 2020

We need to investigate this issue further: why 105 was picked, and whether the memory use of system components such as systemd, containerd, or runc has increased or there is a memory leak. If there are specific steps to run this test, I can help debug in the background.

@bart0sh
Contributor Author

bart0sh commented May 29, 2020

Sure, I'll try to investigate this further.

Just want to point out that the n1-standard-2 machine type is not new here. It started being used in this config 2.5 years ago.
Currently 2 configurations use it:

Does anybody know what was the reason for this?

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 29, 2020
@bsdnet
Contributor

bsdnet commented May 29, 2020

Does anybody know what was the reason for this?

I do not know, but while reading the code I came across the following comment:
https://github.com/kubernetes/kubernetes/blob/b0c1fd19fcb6cc508bb6aa461594eac9e456960a/test/e2e_node/runner/remote/run_remote.go#L132
Only the benchmark is supposed to use the machine type.

@MHBauer
Contributor

MHBauer commented May 29, 2020

I'm wondering if these need to run at all anymore.

Tracing through history, the original proto-KEP, pre-KEP: kubernetes/enhancements#83
In order to help out with max-pods-per-node defaults kubernetes/kubernetes#23349 (comment)

The tests are in this file, https://github.com/kubernetes/kubernetes/blame/master/test/e2e_node/density_test.go#L118-L156

It looks like these results fill in http://node-perf-dash.k8s.io/#/builds
It seems retired https://github.com/kubernetes-retired/contrib/blob/master/node-perf-dash/README.md

It looks like maxpods was last updated to 110 https://github.com/kubernetes/kubernetes/pull/21361/files a long time ago.

@bsdnet
Contributor

bsdnet commented May 30, 2020

@MHBauer This is good info. Unfortunately, I asked around and it is hard to find out today why those numbers were chosen. The OOM killer kicking in is normal when the system is under memory pressure. My concern is whether runc should be the process that gets picked.

@bart0sh
Contributor Author

bart0sh commented Jun 2, 2020

@bsdnet I've investigated it a bit further. One test (--focus="create 105 pods with 0s? interval [Benchmark]") runs more or less OK on cos-69-10895-385-0 and fails on cos-81-12871-119-0.

I ran this test on n1-standard instances with cos-69 and cos-81 and watched the free -h output during the run.

On cos-69 the minimum free memory was 938Mi:

ed@n1-standard-1-cos-69-10895-385-0-856f4264 /tmp/node-e2e-20200602T115711 $ free -h
              total        used        free      shared  buff/cache   available
Mem:          3.6Gi       1.1Gi       938Mi       972Mi       1.6Gi       1.4Gi
Swap:            0B          0B          0B

On cos-81 it was 112Mi, after which the instance hung and I couldn't type anything.

ed@n1-standard-1-cos-81-12871-119-0-d0fc1801 ~ $ free -h
              total        used        free      shared  buff/cache   available
Mem:          3.6Gi       2.9Gi       112Mi       532Mi       619Mi        41Mi
Swap:            0B          0B          0B

After some time the instance became available again, and it turned out that the kernel OOM killer had killed the cadvisor and e2e_node.test processes:

ed@n1-standard-1-cos-81-12871-119-0-d0fc1801 ~ $ dmesg |grep -i oom_reaper
[  508.340021] oom_reaper: reaped process 1978 (cadvisor), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[  563.531990] oom_reaper: reaped process 1802 (e2e_node.test), now anon-rss:0kB, file-rss:0kB, shmem-rss:2296kB

I used the master branch for this test. The issue reproduces almost 100% of the time.

Any suggestions on how to continue? I can find the minimum number of pods that triggers this issue on cos-81 if that helps.

@bart0sh
Contributor Author

bart0sh commented Jun 2, 2020

Lists of the most memory-consuming processes on both instances:

cos-81:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                   
  50280 root      20   0   11.1g 415796  84460 S   1.7  11.0   0:05.83 e2e_node.test                                                                                             
    310 root      20   0 1944200 139496   2160 S  10.0   3.7   1:21.60 dockerd                                                                                                   
  50150 root      20   0 1623228 104092  63032 S   5.3   2.7   0:04.91 kubelet                                                                                                   
  49992 root      20   0  844920  91156  69480 S   0.0   2.4   0:00.40 e2e_node.test                                                                                             
    304 root      20   0 2522160  40716      0 S   1.7   1.1   0:09.16 containerd                                                                                                
     94 root      20   0  177084  30312  29548 S   0.7   0.8   0:06.56 systemd-journal                                                                                           
  54279 root      20   0  744364  28104  17132 S   1.7   0.7   0:00.34 cadvisor           

cos-69:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                   
 181663 root      20   0   10.8g 406892  86520 S   2.0  10.7   0:05.81 e2e_node.test                                                                                             
     91 root      20   0  319452 210280 209788 S   0.0   5.6   0:55.59 systemd-journal                                                                                           
      1 root      20   0  218364 121224   5096 S   0.0   3.2   3:26.80 systemd                                                                                                   
 181469 root      20   0  517168  93324  69268 S   0.0   2.5   0:00.36 e2e_node.test                                                                                             
 181590 root      20   0  737092  85860  64240 S   0.7   2.3   0:00.68 kubelet                                                                                                   
    306 root      20   0 1444100  61816  22000 S   0.3   1.6   2:48.58 dockerd                                                                                                   
 181730 root      20   0  728228  30364  17484 S   1.7   0.8   0:00.41 cadvisor         
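
The two listings are easier to compare once resident memory is summed per command. The sketch below is one hypothetical way to do that; it assumes the default `top` column layout shown above (12 columns, RES in the sixth, plain numbers in KiB) and the helper names are made up for illustration.

```python
def parse_res_kib(field):
    """Convert a top RES field to KiB.

    Plain numbers are KiB; an m or g suffix (as top prints for large
    values) is treated as MiB or GiB.
    """
    units = {"m": 1024, "g": 1024 * 1024}
    suffix = field[-1].lower()
    if suffix in units:
        return float(field[:-1]) * units[suffix]
    return float(field)


def res_by_command(top_lines):
    """Sum RES (KiB) per COMMAND over process rows of top output."""
    totals = {}
    for line in top_lines:
        fields = line.split()
        # A process row has 12 columns and starts with a numeric PID;
        # header and summary lines are skipped.
        if len(fields) < 12 or not fields[0].isdigit():
            continue
        command = fields[11]
        totals[command] = totals.get(command, 0.0) + parse_res_kib(fields[5])
    return totals


def res_delta(before, after):
    """Per-command change in RES (KiB) between two snapshots."""
    commands = set(before) | set(after)
    return {c: after.get(c, 0.0) - before.get(c, 0.0) for c in commands}
```

Calling `res_delta(res_by_command(cos69_lines), res_by_command(cos81_lines))` on the two listings shows which commands grew between the images. Note that these are single snapshots of the top few processes, so the result is only a hint, not a full memory accounting.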

@bart0sh
Contributor Author

bart0sh commented Jun 3, 2020

I've tested this with different COS images. It looks like the test starts failing on cos-dev-73-11636-0-0:

Here is a list of images I've tested:

  • cos-69-10895-385-0 works
  • cos-73-11647-534-0 doesn't work
  • cos-stable-71-11151-71-0 works
  • cos-stable-72-11316-171-0 works
  • cos-dev-73-11391-0-0 works
  • cos-dev-73-11517-0-0 works
  • cos-dev-73-11553-0-0 works
  • cos-dev-73-11636-0-0 doesn't work
  • cos-dev-73-11647-18-0 doesn't work
  • cos-beta-73-11647-35-0 doesn't work
  • cos-73-11647-112-0 doesn't work
  • cos-73-11647-559-0 doesn't work

release notes for cos-dev-73-11636-0-0 (taken from Container-Optimized OS - Release Notes):

Date:           Jan 24, 2019
Kernel:         ChromiumOS-4.14
Kubernetes:     v1.13.2
Docker:         v18.09.0
Changelog (vs 73-11553-0-0):
    * Made containerd run as a standalone systemd service.
    * Updated the built-in kubelet to 1.13.2.
    * Reenabled kernel.softlockup_all_cpu_backtrace sysctl.
    * Disabled the CONFIG_DEVMEM configuration option in the kernel.
    * Enabled kernel module signing.
    * Installed a new package keyutils.
    * Updated mdadm to 4.1.
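
The pass/fail list earlier in this comment can be ordered mechanically to confirm where the regression starts. This is a small sketch under one assumption that the thread does not state explicitly: that the build, patch, and sub-version numbers at the end of each image name order the images chronologically across the stable, beta, and dev channels. The function names are hypothetical.

```python
import re

# Pass/fail results copied from the list above (True = benchmark passed).
RESULTS = {
    "cos-69-10895-385-0": True,
    "cos-73-11647-534-0": False,
    "cos-stable-71-11151-71-0": True,
    "cos-stable-72-11316-171-0": True,
    "cos-dev-73-11391-0-0": True,
    "cos-dev-73-11517-0-0": True,
    "cos-dev-73-11553-0-0": True,
    "cos-dev-73-11636-0-0": False,
    "cos-dev-73-11647-18-0": False,
    "cos-beta-73-11647-35-0": False,
    "cos-73-11647-112-0": False,
    "cos-73-11647-559-0": False,
}


def build_key(image):
    """Sort key (build, patch, sub) taken from the last three numbers of the name."""
    numbers = [int(n) for n in re.findall(r"\d+", image)]
    return tuple(numbers[-3:])


def regression_window(results):
    """Return (last passing image, first failing image) in build order.

    Raises ValueError if a passing image sorts after a failing one, because
    the results would then not point to a single regression.
    """
    last_good, first_bad = None, None
    for image in sorted(results, key=build_key):
        if results[image]:
            if first_bad is not None:
                raise ValueError(f"{image} passes after {first_bad} fails")
            last_good = image
        elif first_bad is None:
            first_bad = image
    return last_good, first_bad


print(regression_window(RESULTS))
# prints ('cos-dev-73-11553-0-0', 'cos-dev-73-11636-0-0')
```

Under that ordering assumption the results are consistent with a single regression between builds 11553 and 11636, which matches the changelog range quoted above.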

@MHBauer
Contributor

MHBauer commented Jun 4, 2020

I don't know if it's the root cause, but the containerd shim has gotten a little bit fatter over time. Maybe just enough to throw it over the edge.

I think we need to take a step back and look at the contents and users of the file a bit deeper. I see duplication now that the image references are all updated to the most up to date. I also think we could probably modify the caller to reduce the duplication.

I'm not sure if whoever relies on these outputs is paying attention. @lorqor

@bsdnet
Contributor

bsdnet commented Jun 5, 2020

Thanks @bart0sh.
The following change looks suspicious:

  • Made containerd run as a standalone systemd service.

@bart0sh
Contributor Author

bart0sh commented Jun 5, 2020

@bsdnet

The following change looks suspicious:

  • Made containerd run as a standalone systemd service.

What can we do about it?

In my opinion we still have two choices in the short term:

  • decrease the number of pods created in the test
  • increase the amount of memory in the instance (this PR)

Any other ideas?

@bsdnet
Contributor

bsdnet commented Jun 5, 2020

I think for now we need to "decrease amount of pods created in the test".
Containerd will be independent in the future. However, I am surprised to see
that there is a roughly 15% impact, (105-90)/105.

@bart0sh
Contributor Author

bart0sh commented Jun 5, 2020

@bsdnet

However, I am surprised to see that there is a roughly 15% impact, (105-90)/105.

It didn't work with 100 pods. I thought that a number about 10% lower would give us enough memory plus a safety buffer. I can check whether it works with 95 if that matters.

@bart0sh
Contributor Author

bart0sh commented Jun 5, 2020

@bsdnet I've tried increasing the number of pods to 95. That triggered the OOM killer, which killed cadvisor and e2e_node.test.

Here is a snapshot of memory consumption around its peak, just before the OOM killer starts its job. Note that 98.7% of memory has been consumed.


top - 09:26:17 up 7 min,  1 user,  load average: 71.86, 23.30, 8.49
Tasks: 474 total,   1 running, 469 sleeping,   0 stopped,   4 zombie
%Cpu(s): 30.3 us, 32.4 sy,  0.3 ni,  0.0 id, 37.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem : 98.7/3697.2   [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ]
MiB Swap:  0.0/0.0      [                                                                                                    ]

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                   
   1814 root      20   0   11.2g 432152  88004 S   2.4  11.4   0:12.46 e2e_node.test                                                                                             
    298 root      20   0 1950748 149912      0 S   0.2   4.0   0:42.91 dockerd                                                                                                   
   1600 root      20   0 1625568 133092  65312 S   4.4   3.5   0:16.31 kubelet                                                                                                   
    756 root      20   0  853820 125136  70884 S   0.6   3.3   0:02.28 e2e_node.test                                                                                             
   1976 root      20   0  778108  43952     64 S   1.7   1.2   0:08.62 cadvisor                                                                                                  
    292 root      20   0 2333396  33328      0 S   0.2   0.9   0:02.01 containerd                                                                                                
  15211 root      20   0  599944  20528   2180 D   0.8   0.5   0:00.08 containerd                                                                                                
  15159 root      20   0  599944  20460   2108 D   0.6   0.5   0:00.08 containerd                                                                                                
  15174 root      20   0  599944  20316   1976 D   0.6   0.5   0:00.08 containerd                                                                                                
  15167 root      20   0  599944  20312   1968 D   0.8   0.5   0:00.08 containerd                                                                                                
  15136 root      20   0  599944  20276   1944 D   0.9   0.5   0:00.09 containerd                                                                                                
  15219 root      20   0  599944  20240   1888 D   0.6   0.5   0:00.08 containerd                                                                                                
  15201 root      20   0  599944  20228   1880 D   0.8   0.5   0:00.08 containerd                                                                                                
  15150 root      20   0  525892  19892   1600 D   0.3   0.5   0:00.05 containerd                                                                                                
  15181 root      20   0  525892  19848   1560 D   0.3   0.5   0:00.05 containerd                                                                                                
  15214 root      20   0  525892  19792   1500 D   0.2   0.5   0:00.04 containerd                                                                                                
  15192 root      20   0  599944  19748   1408 D   0.6   0.5   0:00.07 containerd                                                                                                
  15198 root      20   0  598536  19568   1308 D   0.8   0.5   0:00.08 containerd                                                                                                
  15157 root      20   0  599944  19560   1216 D   0.9   0.5   0:00.09 containerd                                                                                                
  15163 root      20   0  599944  19496   1852 D   0.6   0.5   0:00.07 containerd                                                                                                
  15199 root      20   0  599944  19348   1092 D   1.5   0.5   0:00.14 containerd                                                                                                
  15180 root      20   0  598536  19332   1080 D   0.8   0.5   0:00.08 containerd                                                                                                
  15160 root      20   0  599944  19284   1024 D   0.6   0.5   0:00.07 containerd                                                                                                
  15173 root      20   0  599880  19264   1004 D   1.7   0.5   0:00.14 containerd                                                                                                
  15212 root      20   0  599944  19196   1568 D   0.6   0.5   0:00.07 containerd                                                                                                
  15166 root      20   0  599944  19180    920 D   0.3   0.5   0:00.05 containerd                                                                                                
  15178 root      20   0  599944  19124    824 D   0.6   0.5   0:00.07 containerd                                                                                                
  15179 root      20   0  599944  19116   1544 D   0.5   0.5   0:00.07 containerd                                                                                                
  15188 root      20   0  599944  19012    720 D   0.6   0.5   0:00.07 containerd                                                                                                
  15128 root      20   0  599944  18996    720 D   0.6   0.5   0:00.08 containerd                                                                                                
  15162 root      20   0  599944  18940   1328 D   0.5   0.5   0:00.07 containerd                                                                                                
  15218 root      20   0  599944  18880   1264 D   0.6   0.5   0:00.07 containerd                                                                                                
  15158 root      20   0  599944  18868   1248 D   0.6   0.5   0:00.07 containerd                                                                                                
  15204 root      20   0  599944  18848    952 D   0.6   0.5   0:00.07 containerd                                                                                                
  15016 root      20   0  599944  18840   1012 D   0.3   0.5   0:00.08 containerd                                                                                                
...

I agree with you that containerd is a culprit here.

With 90 pods the peak memory consumption is around 95%. That is enough to avoid triggering the OOM killer, but it's still quite high in my opinion.
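
As a rough cross-check, the two peaks mentioned here (98.7% of 3697.2 MiB at 95 pods, and around 95% at 90 pods) imply an approximate per-pod memory cost. The sketch below assumes that memory use grows linearly with pod count and that the two runs are comparable; neither is established in this thread, and the 95% figure is itself approximate.

```python
TOTAL_MIB = 3697.2  # "MiB Mem" total from the top snapshot above


def used_mib(percent_used, total_mib=TOTAL_MIB):
    """Convert a percentage of memory used into MiB."""
    return total_mib * percent_used / 100.0


def per_pod_mib(pods_a, pct_a, pods_b, pct_b, total_mib=TOTAL_MIB):
    """Estimate MiB per pod from two (pod count, % used) observations."""
    delta_mib = used_mib(pct_b, total_mib) - used_mib(pct_a, total_mib)
    return delta_mib / (pods_b - pods_a)


estimate = per_pod_mib(90, 95.0, 95, 98.7)
headroom_90 = TOTAL_MIB - used_mib(95.0)
print(f"~{estimate:.1f} MiB per pod, ~{headroom_90:.0f} MiB free at 90 pods")
# prints "~27.4 MiB per pod, ~185 MiB free at 90 pods"
```

Under those assumptions the estimate is of the same order of magnitude as the roughly 19 to 20 MiB resident size of each per-container containerd process in the listing, and it suggests that the headroom at 90 pods covers only a handful of additional pods.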

@vpickard
Contributor

vpickard commented Jun 5, 2020

@bart0sh I think the preferred approach to fixing the benchmark test is your other PR, 91813, which reduces the number of test pods. That means this PR can be closed now.

kubernetes/kubernetes#91813

I also opened this issue for tracking the root cause of the increase in memory consumption.
#17853

@bsdnet
Contributor

bsdnet commented Jun 6, 2020

Thanks @bart0sh for the detailed info.
From the snapshot you posted, it is about 15% of memory.
Sorry, I have been busy this week and did not have time to respond promptly.

@bart0sh
Contributor Author

bart0sh commented Jun 6, 2020

@vpickard Closing as suggested. I'll submit another PR to change job yaml.

@bart0sh
Contributor Author

bart0sh commented Jun 16, 2020

Reopening, as decreasing the number of pods to 90 is not an option: 100 is the official maximum.

Note that machine types can be changed after #17853 is fixed. However, we shouldn't wait for that. We need to fix broken tests.

@bart0sh bart0sh reopened this Jun 16, 2020
@k8s-ci-robot k8s-ci-robot added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Jun 16, 2020
@bart0sh bart0sh force-pushed the PR0007-n1-standard-2 branch from 794a164 to 737da54 Compare June 16, 2020 12:09
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 16, 2020
@bart0sh
Contributor Author

bart0sh commented Jun 23, 2020

Closing, as kubernetes/kubernetes#91813 has been merged. Since we decreased the number of pods, there is no need to use n1-standard-2 instances.

Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
7 participants