Skip to content

Commit

Permalink
ddl: design doc for admin pause/resume ddl jobs (pingcap#43159)
Browse files Browse the repository at this point in the history
  • Loading branch information
dhysum committed May 11, 2023
1 parent 5654903 commit b4333c2
Show file tree
Hide file tree
Showing 2 changed files with 106 additions and 0 deletions.
106 changes: 106 additions & 0 deletions docs/design/2023-04-15-ddl-pause-resume.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,106 @@
# Pause/Resume DDL jobs

- Author: [dhysum](https://github.com/dhysum)
- Tracking Issue:
- https://github.com/pingcap/tidb/issues/40041
- https://github.com/pingcap/tidb/issues/18015

## Background and Benefits

DDL jobs are running asynchronously in the background, which may take a very
long time to execute. The Admin could cancel the running DDL job if
necessary, for example, out of resource.

As a DDL job may take a lot of effort, it could be a big waste to
just cancel and restart the job. It has several stages and steps back in the
yard to finish a DDL job, we may just pause it at some step and resume it right
from that place.

Also, such feature will benefit the Upgrade of the TiDB Cluster.

## Goal

Add two commands to pause and resume a long-running DDL job. In particularly:

1. `admin pause ddl jobs 3,5;`

The jobs (here are job 3 and job 5) should be in Running or Queueing(Wait)
state. Other states should be rejected.

2. `admin resume ddl jobs 3,5;`

Only Paused jobs could be resumed. Other states should be rejected.

## Architecture

There is no change on the architecture, just following the current one.

## Detail Design

The whole design is simple:

1. Add pause/resume in parser
2. Add builder in planner
3. Add builder in executor
4. Valid the job's state, and turn it to be Pausing, if `admin pause ...`
by end user

4.1 Valid the state first, only Running job could be paused, just like
`admin cancel ...`

4.2 Turn the state to be Pausing

4.3 Background worker will check the state, and turn the state to be Paused
, then just return and stay in current stage; And this is what is different
from `admin cancel ...`

4.4 Background worker will continue fetching the job and check the state,
and keep doing nothing until the state changed

4.5 Specially, the Reorg state could take a long time (maybe minutes, even
hours), we also need to check the state of the job, and stop the reorg
accordingly

5. Validate the job's state, and turn it to be Queueing, if `admin resume ...`
by end user

5.1 After the job's state changes, the background worker should check the
job's state and continue the work

5.2 No other actions should be taken

### State Machine

![state-machine](./imgs/ddl_job_state_machine.png)

## Usage

1. Create an index

`create index idx_usr on t_user(name);`

2. In another session, show the DDL jobs

`admin show ddl jobs 15;`

It will show 15 DDL jobs, change the number if you want to see more. And
then, find the running one (job_id) you want to pause.

3. Pause the job by

`admin pause ddl jobs $job_id` ---- $job_id is what you get from upper step

4. Resume the paused job

Also, you can find the Paused jobs by (change the number if you want to see
more):

`admin show ddl jobs 15;`

Then, reumse it by:

`admin resume ddl jobs $job_id`

## Future Work

None.
Binary file added docs/design/imgs/ddl_job_state_machine.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.

0 comments on commit b4333c2

Please sign in to comment.