Alert for abnormal jobs (including long running and low utilization jobs) #2629
We need a way to enroll users' email information and pass the user's email to the job container. Then job-exporter can attach the user's email to the job's resource-usage metrics, and alert-manager can send email using the code from this branch.
Offline synced with @ydye: user management will be ready in the next release, so we can add this feature after the next release.
We will add task_start_time to job-exporter; then we can use PromQL to query tasks whose running time is more than n seconds.
@Binyang2014 We do not need to add start_time to support this. I assume we mean to alert on long-idle jobs; in that case we can write:

```yaml
groups:
- name: idle-job
  rules:
  - alert: zero-gpu-usage
    expr: avg(task_gpu_percent) by (user_email, job_name, vc_name) == 0
    for: 4h
    labels:
      type: user_alert
```

This will send out an alert email through alert-manager, as long as we add
Thanks @xudifsd, we will rethink this. Currently we want a feature that highlights jobs with low GPU usage that have also been running for a long time, so we want a PromQL query for such jobs. Adding start_time to job-exporter is one possible solution.
@Binyang2014 In that case, we should query Prometheus using the avg_over_time function. I'm strongly against adding start_time; it serves no purpose.
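As a rough illustration of the avg_over_time approach (a sketch only, reusing the task_gpu_percent metric and user_alert label from the rule above; the 4h window and 5% threshold are assumptions, not agreed-upon values):

```yaml
groups:
- name: low-utilization-job
  rules:
  - alert: low-gpu-usage-over-window
    # avg_over_time averages each task's GPU utilization over the last 4 hours,
    # so a low value reflects sustained low usage rather than a momentary dip.
    expr: avg_over_time(task_gpu_percent[4h]) < 5
    labels:
      type: user_alert
```

Unlike the `for: 4h` clause in the earlier rule, the windowed average tolerates brief utilization spikes instead of resetting the alert timer on every non-zero sample.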
@xudifsd avg_over_time will not solve this problem, since it will also return jobs that have already exited. I agree that start_time is useless at the current stage. Another solution is to use
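The alternative in that comment is cut off, but one possible way to address the exited-jobs concern (a sketch only, not necessarily what the author had in mind) is to intersect the windowed average with the live series, again assuming the task_gpu_percent metric:

```yaml
groups:
- name: idle-job
  rules:
  - alert: zero-gpu-usage-running-only
    # The `and` operator keeps only series that also currently exist, so tasks
    # that already exited within the 4h window no longer match.
    expr: avg_over_time(task_gpu_percent[4h]) == 0 and task_gpu_percent >= 0
    labels:
      type: user_alert
```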
OK, please add me as a reviewer when you implement this.
Abnormal-job alerting was added in the end-of-August release; closing this issue.
Hi @Binyang2014, sorry to bother you. I have a question about this issue: how can I automatically monitor the status of tasks? For example, when a task hits an error, I would like to receive the alert right away. Thanks!
Hi @IamSunGuangzhi, we don't support alerts for failed user jobs right now. If you need this feature, please create an issue and we will discuss it.
Thanks for your reply, @Binyang2014. I have submitted issue #3602.
If a job runs longer than a predefined time, such as 1 week, we probably need to send an alert email to the job owner to prolong its lease; otherwise the job can be treated as an orphan and killed to free up resources (a sketch of such an alert follows the email thread below). For example:
Thanks, Yanqing, for confirming.
One more thing: we had allocated your team 56 GPU cards on 179, and per recent months' reports this VC has had zero usage since 1/1. On the other hand, other teams on this cluster have waiting jobs due to the lack of VC capacity, so I'm rebalancing the speech VC to other teams. You are still able to submit jobs to the "default" VC, and once your team has better usage, I will increase your VC. Thanks.
Best Regards,
Scarlett
From: Yanqing Liu
Sent: Tuesday, April 16, 2019 10:34 AM
To: Scarlett Li
Cc: PAI DRI
Subject: RE: Long run job on 179 (119d17h)
Hi Scarlett
It's OK to stop it.
Sorry that I didn't check my PAI status; it was started by our intern, who recently left the team.
thx
From: Scarlett Li
Sent: Tuesday, April 16, 2019 9:53 AM
To: Yanqing Liu
Cc: PAI DRI
Subject: Long run job on 179 (119d17h)
Yanqing,
This long-running job (119d17h) seems abnormal (a cifar10 job with 1 GPU shouldn't need to run that long); could you confirm it's okay to stop it and release the resources? Thanks.
Best Regards,
Scarlett
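A minimal sketch of the long-running-job alert suggested above, assuming job-exporter keeps exposing task_gpu_percent for running tasks and a 30-second scrape interval (both assumptions; the sample-count threshold only approximates one week of runtime):

```yaml
groups:
- name: long-running-job
  rules:
  - alert: job-running-over-one-week
    # count_over_time counts the samples seen in the 7-day window; at one sample
    # every 30s, ~20160 samples mean the task has been reporting for about a week.
    # The `and` keeps only tasks that are still alive right now.
    expr: count_over_time(task_gpu_percent[7d]) >= 20160 and task_gpu_percent >= 0
    labels:
      type: user_alert
```

This avoids adding a start_time metric, at the cost of tying the threshold to the scrape interval.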