
Alert for abnormal jobs (including long running and low utilization jobs) #2629

Closed
sterowang opened this issue Apr 22, 2019 · 13 comments

Comments

@sterowang

sterowang commented Apr 22, 2019

If a job runs longer than a predefined time, such as 1 week, we probably need to send an alert email to the job owner asking them to prolong its lease; otherwise the job can be treated as an orphan and killed to free up resources:

Thanks, Yanqing, for the confirmation.

One more thing: we had allocated your team 56 GPU cards on 179. Per the recent months’ reports, this VC has had 0 usage since 1/1. On the other hand, other teams on this cluster have jobs waiting due to the lack of VC capacity, so I’m rebalancing the speech VC to those teams. You are still able to submit jobs to the “default” VC, and once your team has better usage, I will increase your VC. Thanks.

Best Regards,
Scarlett

From: Yanqing Liu
Sent: Tuesday, April 16, 2019 10:34 AM
To: Scarlett Li
Cc: PAI DRI
Subject: RE: Long run job on 179 (119d17h)

Hi Scarlett
It’s OK to stop it.
Sorry that I didn’t check my PAI status; it was started by our intern, who recently left the team.

thx

From: Scarlett Li
Sent: Tuesday, April 16, 2019 9:53 AM
To: Yanqing Liu
Cc: PAI DRI
Subject: Long run job on 179 (119d17h)

Yanqing,

This long-running job (119d17h) seems abnormal (a cifar10 job with 1 GPU should not need to run that long). Could you confirm it’s okay to stop it and release the resources? Thanks.

Best Regards,
Scarlett

@xudifsd
Member

xudifsd commented Apr 22, 2019

We need a way to enroll the user's email information and pass it into the job container; then job-exporter can attach the user's email to the job's resource usage metrics, and the alert manager can send email using the code from this branch.
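
For reference, a minimal sketch of how the Alertmanager side could be wired, assuming job-exporter attaches a user_email label to the task metrics; the SMTP relay, sender address, and the type: user_alert routing label are illustrative placeholders, not settings from this repo:

global:
  smtp_smarthost: 'smtp.example.com:587'     # placeholder SMTP relay
  smtp_from: 'pai-alert@example.com'         # placeholder sender address

route:
  receiver: job-owner-email
  routes:
  - match:
      type: user_alert                       # label assumed to be set by the alerting rule
    receiver: job-owner-email

receivers:
- name: job-owner-email
  email_configs:
  - to: '{{ .CommonLabels.user_email }}'     # user_email label assumed to come from job-exporter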

@sterowang
Author

@xudifsd can sync with @ydye on passing the email info.

@xudifsd
Member

xudifsd commented Apr 24, 2019

Synced offline with @ydye: user management will be ready in the next release, so we can add this feature after the next release.

@Binyang2014
Contributor

Will add task_start_time in job-exporter; then we can use PromQL to query tasks whose running time is more than n seconds.
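
As a rough illustration (task_start_time here is the proposed, not-yet-existing gauge holding each task's start timestamp in Unix seconds, and the 7-day threshold is arbitrary), such a query could look like:

# tasks that have been running for more than 7 days (604800 seconds)
time() - task_start_time > 604800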

@xudifsd
Member

xudifsd commented Jul 29, 2019

@Binyang2014 We do not need to add start_time to support this. I assume we mean to alert on long-idle jobs; in that case we can write:

groups:
    - name: idle-job
      rules:
      - alert: zero-gpu-usage
        expr: avg(task_gpu_percent) by (user_email, job_name, vc_name) == 0
        for: 4h
        labels:
          type: user_alert

This will send out an alert email via alertmanager, as long as we add a user_email label to the task metrics.

@Binyang2014
Contributor

Thanks @xudifsd, I will rethink this. Currently, we want a feature that highlights jobs with low GPU usage that have also been running for a long time.

So we want to use PromQL to query such jobs. Adding start_time in job-exporter is one possible solution.

@xudifsd
Member

xudifsd commented Jul 29, 2019

@Binyang2014 In that case, we should query Prometheus using the avg_over_time function. I'm strongly against adding start_time; it serves no purpose.
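
For example, something along these lines (the 7-day window is illustrative):

# per-task average GPU utilization over the last 7 days; 0 means idle for the whole window
avg_over_time(task_gpu_percent[7d]) == 0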

@Binyang2014
Contributor

@xudifsd avg_over_time alone will not solve this problem, since it will also return jobs which have already exited. I agree that start_time is useless at the current stage. Another solution is to use avg_over_time first to get the low-GPU-usage jobs, then use the and operator to keep only the instances that exist in the current timestamp slice.
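
A sketch of that combined query, with the 7-day window and the 5% threshold chosen arbitrarily for illustration:

# step 1: tasks whose average GPU utilization over the last 7 days is below 5%
# step 2: "and task_gpu_percent" keeps only series that still exist right now,
#         filtering out jobs that have already exited
avg_over_time(task_gpu_percent[7d]) < 5 and task_gpu_percent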

@xudifsd
Member

xudifsd commented Jul 30, 2019

OK, please add me as a reviewer when you're implementing this.

@scarlett2018 changed the title from "Alert long running jobs" to "Alert for abnormal jobs (including long running and low utilization jobs)" on Jul 31, 2019
@Binyang2014
Contributor

Abnormal job alerting was added in the end-of-August release; closing this issue.

@IamSunGuangzhi

Hi @Binyang2014, sorry to bother you. I have a question about this issue: how can I automatically monitor the status of tasks? For example, when a task hits an error, I would like to receive the alert right away. Thanks!
I use branch v0.14.0 of OpenPAI.

@Binyang2014
Contributor

Hi @IamSunGuangzhi, we don't support user job failure alerts right now.
One reason is that user management is not mature yet, so we don't know the job owner's email.
We can only send email to the admin group currently.

If you need this feature, you can create an issue, and we will discuss it.

@IamSunGuangzhi

Thanks for your reply, @Binyang2014. I have submitted issue 3602.
