
Some datanode is killed for unknown reason #1658

Closed
mzmssg opened this issue Nov 2, 2018 · 9 comments


mzmssg commented Nov 2, 2018

171 node
[screenshot]

173 node
[screenshot]

175 node
[screenshot]

Not sure what killed them, but it might result in HDFS access failures.
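(For reference, one way to check whether the kernel's OOM killer was the one doing the killing, assuming shell access to the affected nodes; these are standard commands, not anything specific to this deployment:)

# Run on the affected node; OOM kills show up in the kernel log with the victim process and its cgroup
journalctl -k | grep -i -e "out of memory" -e "oom"
# or, on hosts without journald:
dmesg -T | grep -i -e "out of memory" -e "killed process"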


fanyangCS commented Nov 2, 2018

I see the "cannot allocate memory". Is it due to OOM?
@hao1939, do we have memory constraints for DataNode Pods?
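(A quick way to see what memory limit docker is actually enforcing on the datanode container, assuming the datanode runs from the openpai/hadoop-run image; the container id below is a placeholder:)

# List datanode containers, then check the cgroup memory limit docker applies (0 means unlimited)
docker ps --filter ancestor=openpai/hadoop-run --format '{{.ID}} {{.Names}}'
docker inspect --format '{{.Name}}: {{.HostConfig.Memory}}' <container-id>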

mzmssg self-assigned this Nov 7, 2018

mzmssg commented Nov 7, 2018

For a failed datanode, it was killed because of a readiness probe failure: the container couldn't correctly respond to the exec command.
[screenshot]
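(The same probe failures can also be read without a screenshot, assuming kubectl access and a systemd-managed kubelet; the pod name below is a placeholder:)

# Probe failures are recorded as events on the pod
kubectl describe pod <hadoop-data-node-pod> | grep -A 10 Events
# kubelet on the node logs the failure as well
journalctl -u kubelet | grep -i "readiness probe"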


mzmssg commented Nov 8, 2018

It seems to be due to our resource limits; docker events show the container is in oom status:

Type=container  Status=exec_create: cat /jobstatus/jobok  ID=edf6b4b172edb0b46f36e2d5caa2b89e5bdd6b8403661593d9a9d1e50ca5a28f
Type=container  Status=exec_start: cat /jobstatus/jobok  ID=edf6b4b172edb0b46f36e2d5caa2b89e5bdd6b8403661593d9a9d1e50ca5a28f
Type=container  Status=oom  ID=2dc13c49a71342b0826940f6a9247e8120dba06f97a98d96117117226bf754c8
Type=container  Status=die  ID=2dc13c49a71342b0826940f6a9247e8120dba06f97a98d96117117226bf754c8
Type=container  Status=destroy  ID=8fc03a7ab925eaf703302af02caf218041b0dcdc2d7fbf357b6b0eb079840ebb
Type=container  Status=exec_create: cat /jobstatus/jobok  ID=edf6b4b172edb0b46f36e2d5caa2b89e5bdd6b8403661593d9a9d1e50ca5a28f
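(To double-check that this was docker's own cgroup limit rather than a host-level OOM kill, the dead container's state can be inspected while it still exists; the id below is the oom/die container from the events above:)

# OOMKilled is set by docker when the container hit its own memory limit
docker inspect --format '{{.State.OOMKilled}} {{.State.ExitCode}}' 2dc13c49a713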

fanyangCS commented

@mzmssg, can you explain it in detail? Whose readiness probe? I am not aware that we have a readiness check on the DataNode. Will k8s kill a pod when it is not ready? I think k8s will just show that the pod is not ready, but not kill it?
Where is this log from, kubelet? Also, where is the oom log from?


mzmssg commented Nov 8, 2018

@fanyangCS
To correct my previous response: the kill behavior comes from docker, rather than k8s.

So the story should be:

  1. The data-node container hits its memory limit.
  2. The docker daemon destroys the data-node container.
  3. kubelet finds the data-node container dead and restarts it.

Other details to answer your questions:

  1. We have a simple readiness probe for the datanode, but it shouldn't matter for the data-node restarts. https://github.com/Microsoft/pai/blob/498b1cefb7c2987c371e38cda8223b673ed2b3e8/src/hadoop-data-node/deploy/hadoop-data-node.yaml.template#L46
  2. The log showing the readiness probe failure comes from kubelet.
  3. The oom log is the docker events record; you can get it with sudo docker events --filter image=openpai/hadoop-run --format 'Type={{.Type}} Status={{.Status}} ID={{.ID}}'


hao1939 commented Nov 8, 2018

For the 175 node, it should be a cgroup oom, which means the container's memory usage exceeded the limit.

We observed 351 restarts since 11/1 and counted the matching 'oom killing' entries in the system log. (The log below shows 353, which includes kills from before 11/1.)

...
Nov 08 02:04:36 paigcr-a-gpu-1037 kernel: Memory cgroup out of memory: Kill process 3231 (java) score 1330 or sacrifice child
Nov 08 02:09:48 paigcr-a-gpu-1037 kernel: Memory cgroup out of memory: Kill process 51419 (java) score 1335 or sacrifice child
core@paigcr-a-gpu-1037:~$ journalctl -k | grep -i -e memory -e oom |grep "Memory cgroup out of memory" |wc
353 6707 44884
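(For completeness, the restart count itself comes from kubernetes; assuming the datanode pods carry "data-node" in their names, the RESTARTS column shows it directly:)

kubectl get pods --all-namespaces -o wide | grep -i data-node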

fanyangCS commented

Here is the fix: #1689

@mzmssg, does yarn exclude the NM's and the data node's memory and CPU resources when allocating resources to jobs?


mzmssg commented Nov 12, 2018

@fanyangCS
Now we reserve 40G (single box) or 12G (cluster) for our services; the remaining memory goes to yarn.
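(Illustrative arithmetic only, not a config from this repo: in cluster mode the memory handed to yarn on a node is simply its total minus the 12G reservation.)

# Example on a node: total physical memory in GB minus the 12 GB service reservation
total_kb=$(grep MemTotal /proc/meminfo | awk '{print $2}')
echo "$(( total_kb / 1024 / 1024 - 12 )) GB left for yarn"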

@hao1939 Should we also increase the value in the fix?


hao1939 commented Nov 16, 2018

Fixed in PR #1689
