Alert for abnormal jobs (including long running and low utilization jobs) #2629
We need a way to enroll users' email information and pass the user's email to the job container. Then job-exporter can attach the user's email to the job's resource-usage metrics, and alert-manager can send email using the code from this branch.
Offline synced with @ydye: user management will be ready in the next release, so we can add this feature after the next release.
We will add task_start_time to job-exporter; then we can use PromQL to query tasks whose running time is more than n seconds.
@Binyang2014 We do not need to add start_time to support this. I assume we mean to alert on long-idle jobs; in that case we can write:

```yaml
groups:
- name: idle-job
  rules:
  - alert: zero-gpu-usage
    expr: avg(task_gpu_percent) by (user_email, job_name, vc_name) == 0
    for: 4h
    labels:
      type: user_alert
```

This will send out an alert email through alert-manager, as long as we add
Thanks @xudifsd, we will rethink this. Currently we want a feature that highlights jobs with low GPU usage that have also been running for a long time, so we want a PromQL query for such jobs. Adding start_time to job-exporter is one possible solution.
@Binyang2014 In that case, we should query Prometheus using the avg_over_time function. I'm strongly against adding start_time; it serves no purpose.
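As a rough illustration of the avg_over_time approach (a sketch only, reusing the task_gpu_percent metric and user_alert label from the rule above; the 4h window and 5% threshold are assumptions, not agreed-upon values):

```yaml
groups:
- name: low-utilization-job
  rules:
  - alert: low-gpu-usage-over-window
    # avg_over_time averages each task's GPU utilization over the last 4 hours,
    # so a low value reflects sustained low usage rather than a momentary dip.
    expr: avg_over_time(task_gpu_percent[4h]) < 5
    labels:
      type: user_alert
```

Unlike the `for: 4h` clause in the earlier rule, the windowed average tolerates brief utilization spikes instead of resetting the alert timer on every non-zero sample.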
@xudifsd avg_over_time will not solve this problem, since it will also return jobs that have already exited. I agree that start_time is useless at the current stage. Another solution is to use
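The alternative in that comment is cut off, but one possible way to address the exited-jobs concern (a sketch only, not necessarily what the author had in mind) is to intersect the windowed average with the live series, again assuming the task_gpu_percent metric:

```yaml
groups:
- name: idle-job
  rules:
  - alert: zero-gpu-usage-running-only
    # The `and` operator keeps only series that also currently exist, so tasks
    # that already exited within the 4h window no longer match.
    expr: avg_over_time(task_gpu_percent[4h]) == 0 and task_gpu_percent >= 0
    labels:
      type: user_alert
```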
OK, please add me as a reviewer when you implement this.
Abnormal-job alerting was added in the end-of-August release; closing this issue.
Hi @Binyang2014, sorry to bother you. I have a question about this issue: how can I automatically monitor the status of tasks? For example, when a task hits an error, I would like to receive the alert right away. Thanks!
Hi @IamSunGuangzhi, we don't support alerts for failed user jobs right now. If you need this feature, please create an issue and we will discuss it.
Thanks for your reply, @Binyang2014. I have submitted issue #3602.
If a job runs longer than a predefined time, such as 1 week, we probably need to send an alert email to the job owner to prolong its lease; otherwise the job can be treated as an orphan and killed to free up resources (a sketch of such an alert follows the email thread below). For example:
Thanks, Yanqing, for confirming.
One more thing: we had allocated your team 56 GPU cards on 179, and per recent months' reports this VC has had zero usage since 1/1. On the other hand, other teams on this cluster have waiting jobs due to the lack of VC capacity, so I'm rebalancing the speech VC to other teams. You are still able to submit jobs to the "default" VC, and once your team has better usage, I will increase your VC. Thanks.
Best Regards,
Scarlett
From: Yanqing Liu
Sent: Tuesday, April 16, 2019 10:34 AM
To: Scarlett Li
Cc: PAI DRI
Subject: RE: Long run job on 179 (119d17h)
Hi Scarlett
It's OK to stop it.
Sorry that I didn't check my PAI status; it was started by our intern, who recently left the team.
thx
From: Scarlett Li
Sent: Tuesday, April 16, 2019 9:53 AM
To: Yanqing Liu
Cc: PAI DRI
Subject: Long run job on 179 (119d17h)
Yanqing,
This long-running job (119d17h) seems abnormal (a cifar10 job with 1 GPU shouldn't need to run that long); could you confirm it's okay to stop it and release the resources? Thanks.
Best Regards,
Scarlett
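A minimal sketch of the long-running-job alert suggested above, assuming job-exporter keeps exposing task_gpu_percent for running tasks and a 30-second scrape interval (both assumptions; the sample-count threshold only approximates one week of runtime):

```yaml
groups:
- name: long-running-job
  rules:
  - alert: job-running-over-one-week
    # count_over_time counts the samples seen in the 7-day window; at one sample
    # every 30s, ~20160 samples mean the task has been reporting for about a week.
    # The `and` keeps only tasks that are still alive right now.
    expr: count_over_time(task_gpu_percent[7d]) >= 20160 and task_gpu_percent >= 0
    labels:
      type: user_alert
```

This avoids adding a start_time metric, at the cost of tying the threshold to the scrape interval.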