[Proposal] Kill Hadoop MR task on kill of ingestion task and resume ability for Hadoop ingestion tasks #6803
Comments
Phase I sounds pretty useful. What entity would issue the `yarn kill` command? Do you have more details on the interfaces/code areas you plan to change for that? For Phase II, can you describe the use case in more detail? When would a task resume occur? What happens if the input to the ingestion task changes between retries (e.g. the task is reading from some S3 bucket which gets some new files)?
Thanks @ankit0811, sounds useful! For Phase I, I think we need a unified way to support various platforms like Hadoop, Spark, and so on, so I would suggest changing the way Druid tasks are killed. Currently, the Overlord sends a kill request to the MiddleManager where the task is running, and the MiddleManager simply destroys the task process. As a result, the task never gets a chance to prepare for stopping, such as cleaning up its resources or killing Hadoop jobs. Instead, the task could clean up its resources before stopping if we change how the MiddleManager kills it. This makes more sense to me because the Hadoop job would then be started and killed in the same place (the Druid Hadoop task). Fortunately, some of this logic is already implemented:

```java
/**
 * Asks a task to arrange for its "run" method to exit promptly. This method will only be called if
 * {@link #canRestore()} returns true. Tasks that take too long to stop gracefully will be terminated with
 * extreme prejudice.
 */
void stopGracefully();
```

This is currently only for restorable tasks, so you may want to make the Hadoop task restorable (maybe related to Phase II?).
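To make the `stopGracefully()` idea above concrete, here is a minimal sketch of a restorable task that kills its running MR job when asked to stop. The `Task` interface is reduced from Druid's actual API, and `HadoopIndexTaskSketch`, `onJobStarted`, and `runningJobId` are illustrative names for this sketch, not real Druid code; a real task would call the Hadoop/YARN client to kill the job instead of printing.

```java
import java.util.concurrent.atomic.AtomicReference;

// Reduced form of the Task interface quoted above (illustrative, not Druid's full API).
interface Task {
    boolean canRestore();
    void stopGracefully();
}

// Hypothetical restorable Hadoop task that remembers its running MR job id.
class HadoopIndexTaskSketch implements Task {
    private final AtomicReference<String> runningJobId = new AtomicReference<>();
    private volatile boolean stopped = false;

    // Called by the task when it submits an MR job.
    void onJobStarted(String jobId) {
        runningJobId.set(jobId);
    }

    @Override
    public boolean canRestore() {
        // Being restorable is what makes stopGracefully() eligible to be called.
        return true;
    }

    @Override
    public void stopGracefully() {
        stopped = true;
        String jobId = runningJobId.get();
        if (jobId != null) {
            // A real implementation would kill the job via the Hadoop/YARN client;
            // here we only record the intent.
            System.out.println("would kill MR job " + jobId);
        }
    }

    boolean isStopped() {
        return stopped;
    }
}

public class Main {
    public static void main(String[] args) {
        HadoopIndexTaskSketch task = new HadoopIndexTaskSketch();
        task.onJobStarted("job_1546000000000_0001");
        task.stopGracefully();
    }
}
```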
@jon-wei for Phase II, I will share a detailed use case soon.
@jihoonson so you are suggesting we go for Phase II first and then add unified kill task support?
@ankit0811 I'm also not sure what restoring means for the Hadoop indexTask or how it's useful. I just mean you may need to make the Hadoop indexTask restorable no matter what it does on
This issue has been marked as stale due to 280 days of inactivity.
We plan to implement these features in a two-phase solution:
Phase I: Implement the kill task feature
Currently, killing a Hadoop ingestion task from the Overlord UI does not kill the underlying MR job, resulting in unnecessary waste of resources.
The way we are thinking of doing this is by writing the job id of the MR job to a file,
mapReduceJobId.json
which will be stored under the taskBaseDir.
A separate file is required because the MR job executes in a different JVM and has no way to communicate with the kill-task code, so writing the required info to a file was the only option.
Currently, this file will only store the running job id; this way, when someone wishes to kill the ingestion task, it can read the currently running MR job (if any) and issue a yarn kill command.
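The file-based handoff described above could be sketched as follows. The file name `mapReduceJobId.json` and the `taskBaseDir` location come from the proposal; the class and method names, the exact JSON shape, and the hand-rolled parsing are assumptions for illustration (a real implementation would use a JSON library). The kill command is built as a string rather than executed; it assumes the standard mapping from an MR job id (`job_...`) to its YARN application id (`application_...`).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of the Phase I bookkeeping: the Hadoop task writes its running MR job id
// to mapReduceJobId.json under the taskBaseDir, and the kill path reads it back.
class MapReduceJobIdFile {
    private final Path file;

    MapReduceJobIdFile(Path taskBaseDir) {
        this.file = taskBaseDir.resolve("mapReduceJobId.json");
    }

    // Written by the ingestion task's JVM when it submits an MR job.
    void writeRunningJobId(String jobId) throws IOException {
        Files.writeString(file, "{\"jobId\":\"" + jobId + "\"}");
    }

    // Read by the kill path; returns null if no job was recorded.
    String readRunningJobId() throws IOException {
        if (!Files.exists(file)) {
            return null;
        }
        String json = Files.readString(file);
        int start = json.indexOf(":\"") + 2;
        return json.substring(start, json.indexOf('"', start));
    }

    // Command the kill task would issue (shown, not executed). MR job ids map to
    // YARN application ids by swapping the "job_" prefix for "application_".
    static String yarnKillCommand(String jobId) {
        return "yarn application -kill " + jobId.replaceFirst("^job_", "application_");
    }
}

public class Main {
    public static void main(String[] args) throws IOException {
        Path taskBaseDir = Files.createTempDirectory("task");
        MapReduceJobIdFile jobIdFile = new MapReduceJobIdFile(taskBaseDir);
        jobIdFile.writeRunningJobId("job_1546000000000_0001");
        System.out.println(MapReduceJobIdFile.yarnKillCommand(jobIdFile.readRunningJobId()));
    }
}
```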
Phase II: Implement the resume ability for Hadoop ingestion tasks
The above file will now store the job id of each intermediate MR job, so that we can track up to which step the ingestion completed and resume from the following step instead of executing from the beginning.
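A minimal sketch of the progress tracking Phase II implies: an ordered record of which intermediate MR jobs finished, so a resumed task runs only the remaining steps. The class and phase names are illustrative for this sketch (Druid's Hadoop ingestion does run a determine-partitions job followed by an index-generator job, but this is not its actual API), and persistence to the json file is omitted.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical resume bookkeeping: phase name -> job id of the completed MR job.
// In the proposal this state would live in the same file under the taskBaseDir.
class IngestionProgress {
    private final Map<String, String> completed = new LinkedHashMap<>();
    private final List<String> phases;

    IngestionProgress(List<String> phases) {
        this.phases = phases;
    }

    // Record that the MR job for a phase finished successfully.
    void markCompleted(String phase, String jobId) {
        completed.put(phase, jobId);
    }

    // On resume, run only the phases with no recorded job id yet.
    List<String> remainingPhases() {
        List<String> remaining = new ArrayList<>();
        for (String phase : phases) {
            if (!completed.containsKey(phase)) {
                remaining.add(phase);
            }
        }
        return remaining;
    }
}

public class Main {
    public static void main(String[] args) {
        IngestionProgress progress =
                new IngestionProgress(List.of("determine_partitions", "index_generator"));
        progress.markCompleted("determine_partitions", "job_1546000000000_0001");
        // A resumed task would skip straight to the index-generator step.
        System.out.println(progress.remainingPhases());
    }
}
```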