
Some datanode is killed for unknown reason #1658

Closed
mzmssg opened this issue Nov 2, 2018 · 9 comments


mzmssg commented Nov 2, 2018

171 node
[screenshot]

173 node
[screenshot]

175 node
[screenshot]

Not sure what killed them, but it might result in HDFS access failures.
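(For reference, one way to check whether the kernel's OOM killer was the one doing the killing, assuming shell access to the affected nodes; these are standard commands, not anything specific to this deployment:)

# Run on the affected node; OOM kills show up in the kernel log with the victim process and its cgroup
journalctl -k | grep -i -e "out of memory" -e "oom"
# or, on hosts without journald:
dmesg -T | grep -i -e "out of memory" -e "killed process"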


fanyangCS commented Nov 2, 2018

I see the "cannot allocate memory". Is it due to OOM?
@hao1939, do we have memory constraints for DataNode Pods?
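(A quick way to see what memory limit docker is actually enforcing on the datanode container, assuming the datanode runs from the openpai/hadoop-run image; the container id below is a placeholder:)

# List datanode containers, then check the cgroup memory limit docker applies (0 means unlimited)
docker ps --filter ancestor=openpai/hadoop-run --format '{{.ID}} {{.Names}}'
docker inspect --format '{{.Name}}: {{.HostConfig.Memory}}' <container-id>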

mzmssg self-assigned this Nov 7, 2018

mzmssg commented Nov 7, 2018

For a failed datanode, it was killed because of a readiness probe failure: the container couldn't correctly respond to the exec command.
[screenshot]
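(The same probe failures can also be read without a screenshot, assuming kubectl access and a systemd-managed kubelet; the pod name below is a placeholder:)

# Probe failures are recorded as events on the pod
kubectl describe pod <hadoop-data-node-pod> | grep -A 10 Events
# kubelet on the node logs the failure as well
journalctl -u kubelet | grep -i "readiness probe"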


mzmssg commented Nov 8, 2018

It seems to be due to our resource limits; docker events show the container is in oom status:

Type=container  Status=exec_create: cat /jobstatus/jobok  ID=edf6b4b172edb0b46f36e2d5caa2b89e5bdd6b8403661593d9a9d1e50ca5a28f
Type=container  Status=exec_start: cat /jobstatus/jobok  ID=edf6b4b172edb0b46f36e2d5caa2b89e5bdd6b8403661593d9a9d1e50ca5a28f
Type=container  Status=oom  ID=2dc13c49a71342b0826940f6a9247e8120dba06f97a98d96117117226bf754c8
Type=container  Status=die  ID=2dc13c49a71342b0826940f6a9247e8120dba06f97a98d96117117226bf754c8
Type=container  Status=destroy  ID=8fc03a7ab925eaf703302af02caf218041b0dcdc2d7fbf357b6b0eb079840ebb
Type=container  Status=exec_create: cat /jobstatus/jobok  ID=edf6b4b172edb0b46f36e2d5caa2b89e5bdd6b8403661593d9a9d1e50ca5a28f
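(To double-check that this was docker's own cgroup limit rather than a host-level OOM kill, the dead container's state can be inspected while it still exists; the id below is the oom/die container from the events above:)

# OOMKilled is set by docker when the container hit its own memory limit
docker inspect --format '{{.State.OOMKilled}} {{.State.ExitCode}}' 2dc13c49a713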

fanyangCS commented

@mzmssg, can you explain it in detail? Whose readiness probe? I am not aware that we have a readiness check on the DataNode. Will k8s kill a pod when it is not ready? I think k8s will just show that the pod is not ready, but not kill it?
Where is this log from, kubelet? Also, where is the oom log from?


mzmssg commented Nov 8, 2018

@fanyangCS
To correct my previous response: the kill behavior comes from docker, rather than k8s.

So the story should be:

  1. The data-node container hits its memory limit.
  2. The docker daemon destroys the data-node container.
  3. kubelet finds the data-node container dead and restarts it.

Other details to answer your questions:

  1. We have a simple readiness probe for the datanode, but it shouldn't matter for the data-node restarts. https://github.com/Microsoft/pai/blob/498b1cefb7c2987c371e38cda8223b673ed2b3e8/src/hadoop-data-node/deploy/hadoop-data-node.yaml.template#L46
  2. The log showing the readiness probe failure comes from kubelet.
  3. The oom log is the docker events record; you can get it with sudo docker events --filter image=openpai/hadoop-run --format 'Type={{.Type}} Status={{.Status}} ID={{.ID}}'


hao1939 commented Nov 8, 2018

For the 175 node, it should be a cgroup oom, which means the container's memory usage exceeded the limit.

We observed 351 restarts since 11/1 and counted the matching 'oom killing' entries in the system log. (The log below shows 353, which includes kills from before 11/1.)

...
Nov 08 02:04:36 paigcr-a-gpu-1037 kernel: Memory cgroup out of memory: Kill process 3231 (java) score 1330 or sacrifice child
Nov 08 02:09:48 paigcr-a-gpu-1037 kernel: Memory cgroup out of memory: Kill process 51419 (java) score 1335 or sacrifice child
core@paigcr-a-gpu-1037:~$ journalctl -k | grep -i -e memory -e oom |grep "Memory cgroup out of memory" |wc
353 6707 44884
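(For completeness, the restart count itself comes from kubernetes; assuming the datanode pods carry "data-node" in their names, the RESTARTS column shows it directly:)

kubectl get pods --all-namespaces -o wide | grep -i data-node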

fanyangCS commented

Here is the fix: #1689

@mzmssg, does yarn exclude the NM's and the data node's memory and CPU resources when allocating resources to jobs?


mzmssg commented Nov 12, 2018

@fanyangCS
Now we reserve 40G (single box) or 12G (cluster) for our services; the remaining memory goes to yarn.
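(Illustrative arithmetic only, not a config from this repo: in cluster mode the memory handed to yarn on a node is simply its total minus the 12G reservation.)

# Example on a node: total physical memory in GB minus the 12 GB service reservation
total_kb=$(grep MemTotal /proc/meminfo | awk '{print $2}')
echo "$(( total_kb / 1024 / 1024 - 12 )) GB left for yarn"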

@hao1939 Should we also increase the value in the fix?


hao1939 commented Nov 16, 2018

Fixed in PR #1689
