This repository has been archived by the owner on Jun 6, 2024. It is now read-only.
job-exporter refactor proposal #1764
Sounds good. Since we will be able to detect the timeout, what will Grafana show in this case? An intermittent line instead of a continuous line?
A flat continuous line, since job-exporter will report the same old value when it hangs.
If this is the case, the system still cannot tell whether the value is outdated. Can the exporter produce no data on timeout, just as if the exporter were not functioning? (Since we have a readiness probe, we know the exporter is healthy even if it misses some metrics.)
Relevant issue: #1729
Sounds reasonable, I'll take a note.
Relevant issue: #1719
Background
Job exporter is essentially a metrics collector. To fulfill its purpose, it needs to call multiple external commands, which makes it vulnerable to command hangs. It calls the following commands:
docker stats --no-stream --format "table {{.Container}},{{.Name}},{{.CPUPerc}},{{.MemUsage}},{{.NetIO}},{{.BlockIO}},{{.MemPerc}}"
docker inspect $container_id
docker logs --tail all $container_id
systemctl is-active docker | if [ $? -eq 0 ]; then echo "active"; else exit 1; fi
nvidia-smi -q -x
iftop -t -P -s 1 -L 10000 -B -n -N
infilter $pid /usr/bin/lsof -I -n -P
Except for the docker logs and infilter commands, we have seen every one of these commands hang.
Problems
The docker stats command has nothing to do with the iftop command; a hang in the iftop command should only affect network-related metrics, not GPU utilization.
Previously we designed some mitigations tailored to nvidia-smi hangs, but they lack some features we require.
Solution
subprocess32, which is a backport of the Python 3 subprocess module to Python 2, provides the timeout functionality we require. Alternatively, we can simply upgrade job-exporter to Python 3.
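To illustrate the timeout approach, here is a minimal sketch assuming the Python 3 upgrade path, using the standard subprocess module. The names run_with_timeout and collect are hypothetical and not taken from job-exporter. A command that times out or fails yields no result at all, so a stale value is never reported, which is the behavior suggested in the comments.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd (an argv list) and return its stdout as text.

    Returns None if the command times out, exits non-zero, or cannot start.
    Hypothetical helper for illustration; not job-exporter's actual code.
    """
    try:
        completed = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout_s,
            check=True,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child and waits for it before raising.
        return None
    except (subprocess.CalledProcessError, OSError):
        # Non-zero exit status, or the executable could not be launched.
        return None
    return completed.stdout.decode("utf-8", errors="replace")


def collect(collectors, timeout_s):
    """Run every command in collectors (name -> argv list).

    Only commands that finished successfully appear in the result, so a
    hang in one command never hides or delays another beyond timeout_s.
    """
    results = {}
    for name, cmd in collectors.items():
        output = run_with_timeout(cmd, timeout_s)
        if output is not None:
            results[name] = output
    return results


if __name__ == "__main__":
    # Stand-ins for real collectors: one quick command, one that "hangs".
    demo = {
        "fast": [sys.executable, "-c", "print('ok')"],
        "slow": [sys.executable, "-c", "import time; time.sleep(30)"],
    }
    print(sorted(collect(demo, timeout_s=3.0)))
```

In the exporter, the dictionary could map names such as "gpu" to ["nvidia-smi", "-q", "-x"]; a missing key would then mean the corresponding metrics are simply not exported for that scrape. This sketch runs commands sequentially, so the worst-case scrape time grows with the number of hanging commands; running each collector in its own thread would be one way to avoid that.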