
job-exporter refactor proposal #1764

Closed
xudifsd opened this issue Nov 27, 2018 · 6 comments

@xudifsd
Member

xudifsd commented Nov 27, 2018

Background

Job exporter is essentially a metrics collector. To fulfill its purpose, it needs to call multiple external commands, which makes job exporter vulnerable to command hangs. Following are the commands it calls:

  • docker stats --no-stream --format "table {{.Container}},{{.Name}},{{.CPUPerc}},{{.MemUsage}},{{.NetIO}},{{.BlockIO}},{{.MemPerc}}"
  • docker inspect $container_id
  • docker logs --tail all $container_id
  • systemctl is-active docker | if [ $? -eq 0 ]; then echo "active"; else exit 1; fi
  • nvidia-smi -q -x
  • iftop -t -P -s 1 -L 10000 -B -n -N
  • infilter $pid /usr/bin/lsof -I -n -P

Except for the docker logs and infilter commands, we have experienced hangs with all of the other commands.
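
For illustration, a minimal sketch of the current synchronous pattern (function name and arguments are just examples, not the actual code): a plain check_output call has no timeout, so a stuck docker daemon blocks the whole collection loop.

```python
# Illustrative only: a blocking call with no timeout.
import subprocess

def collect_docker_stats():
    # If the docker daemon is wedged, this call never returns and the
    # single collection loop stalls, so every other metric goes stale too.
    return subprocess.check_output(
        ["docker", "stats", "--no-stream",
         "--format", "table {{.Container}},{{.Name}},{{.CPUPerc}}"])
```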

Problems

  • Commands hang, which causes some metrics not to show up, or even worse: metrics stay flat and the user/admin doesn't know it.
  • The single-threaded model is sensitive to any command hanging: if one command hangs, all other metrics stop updating. For example, the docker stats command has nothing to do with the iftop command, so a hang in iftop should only affect network-related metrics, not GPU utilization.
  • We have no idea how much time we spend on commands. This info is useful for debugging metrics-related problems and for alerting.

Previously we designed some mitigations tailored to nvidia-smi hangs, but that approach lacks some features we require:

  • It did not solve the problem of the single-threaded model.
  • It did not record the time spent on command calls.

Solution

  • Use subprocess32, a backport of the Python 3 subprocess module to Python 2, which provides the timeout functionality we require; alternatively, simply upgrade job-exporter to Python 3. (See the sketch after this list.)
  • Refactor job-exporter into a multi-threaded model.
  • Alert on absurdly large command durations (we can tighten the threshold to a reasonable value once we have gathered enough data).
  • Expose metrics via a port instead of the current file-based approach.
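
A minimal sketch of how these pieces could fit together, assuming Python 2 with the subprocess32 backport and the prometheus_client library; collector names, port, timeouts, and metric names here are illustrative, not the final design:

```python
# Sketch only, assuming Python 2 + subprocess32 and prometheus_client.
import threading
import time
import subprocess32 as subprocess
from prometheus_client import Histogram, start_http_server

# Record how long each external command takes so we can alert on
# absurdly large durations later.
CMD_HISTOGRAM = Histogram("cmd_running_seconds",
                          "time spent running external commands",
                          ["cmd"])

def run_with_timeout(name, args, timeout):
    """Run an external command, giving up after `timeout` seconds."""
    with CMD_HISTOGRAM.labels(cmd=name).time():
        try:
            return subprocess.check_output(args, timeout=timeout)
        except subprocess.TimeoutExpired:
            return None  # skip this round instead of hanging forever

def collect_loop(name, args, timeout, interval):
    """One collector per thread, so a hanging iftop does not block
    the docker or nvidia-smi collectors."""
    while True:
        output = run_with_timeout(name, args, timeout)
        if output is not None:
            pass  # parse output and update the corresponding gauges here
        time.sleep(interval)

if __name__ == "__main__":
    start_http_server(9100)  # expose metrics on a port, not a file
    for name, args in [
            ("docker_stats", ["docker", "stats", "--no-stream"]),
            ("nvidia_smi", ["nvidia-smi", "-q", "-x"])]:
        t = threading.Thread(target=collect_loop, args=(name, args, 10, 30))
        t.daemon = True
        t.start()
    threading.Event().wait()  # keep the main thread alive
```

With one thread per collector, a hang in one command only delays its own metrics while the other collectors keep updating, and the duration histogram gives us the per-command timing data needed for alerting.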
@xudifsd xudifsd added this to the current release milestone Nov 27, 2018
@xudifsd xudifsd self-assigned this Nov 27, 2018
@fanyangCS
Contributor

Sounds good. Since we will be able to figure out the timeout, what will Grafana show in this case? An intermittent line instead of a continuous line?

@xudifsd
Member Author

xudifsd commented Nov 29, 2018

A flat continuous line, since job-exporter will report the same old value when it hangs.

@fanyangCS
Contributor

fanyangCS commented Nov 29, 2018

If this is the case, the system still cannot figure out whether the value is outdated or not. Can the exporter produce no data in case of a timeout, just as if the exporter were not functioning? (Since we have a readiness probe, we know the exporter is healthy even if it misses some metrics.)
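
One way to realize this (purely illustrative, assuming the Python prometheus_client; the metric name and staleness threshold are made up): keep a timestamp with each cached value and stop emitting a metric once it goes stale, so Grafana shows a gap instead of a misleading flat line.

```python
# Sketch only: drop samples that have not been refreshed recently.
import time
from prometheus_client.core import GaugeMetricFamily, REGISTRY

STALENESS_SECONDS = 3 * 60  # hypothetical threshold

class FreshOnlyCollector(object):
    def __init__(self):
        self.value = None
        self.updated_at = 0

    def update(self, value):
        self.value = value
        self.updated_at = time.time()

    def collect(self):
        if time.time() - self.updated_at < STALENESS_SECONDS:
            g = GaugeMetricFamily("container_cpu_percent",
                                  "cpu percent reported by docker stats")
            g.add_metric([], self.value)
            yield g
        # otherwise yield nothing: the metric disappears until refreshed

collector = FreshOnlyCollector()
REGISTRY.register(collector)
```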

@fanyangCS
Contributor

relevant issue: #1729

@xudifsd
Member Author

xudifsd commented Nov 29, 2018

Sounds reasonable, I'll take a note.

@fanyangCS
Contributor

relevant issue: #1719
