[Job] Repro race issue #34190

Closed

sihanwang41
Contributor


Signed-off-by: Sihan Wang <[email protected]>
architkulkarni added a commit to architkulkarni/ray that referenced this pull request Apr 10, 2023
rkooo567 pushed a commit that referenced this pull request Apr 12, 2023
…t for pending jobs (#34223)

@rkooo567 and @sihanwang41 found a race condition in job submission that causes the job to fail. The failure occurs when the following events happen in this order:

1. A job is submitted and its job_info is put to the internal KV. This happens before the JobSupervisor is actually created.
2. In the constructor of JobManager, we call await self._recover_running_jobs(), which finds the job_info in the internal KV and starts to monitor that job. Because the JobSupervisor actor doesn't exist yet, the JobManager job monitoring loop fails to ping it and sets the status of this job to FAILED in the internal KV.
3. The JobSupervisor is created. JobSupervisor.run() checks that the status is PENDING, finds that it isn't, and raises the error "run should only be called once", which is not helpful to the user.

If step 2 happens before step 1, there's no issue. But both are async, so the order isn't guaranteed.
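
The ordering problem can be illustrated with a small asyncio simulation. This is only a sketch and does not use Ray: the `kv` dict, the `live_supervisors` set, and the functions `submit_job`, `recover_and_monitor`, and `reproduce_race` are hypothetical stand-ins for the internal KV, the JobSupervisor actors, and the JobManager logic described above.

```python
import asyncio

# Hypothetical stand-ins (not Ray APIs): a dict for the internal KV and a set
# recording which supervisors have been created.
kv = {}
live_supervisors = set()


async def submit_job(job_id, supervisor_created):
    # Step 1: job_info is written before the supervisor exists.
    kv[job_id] = "PENDING"
    # Simulated delay between writing job_info and creating the supervisor.
    await supervisor_created.wait()
    live_supervisors.add(job_id)
    # Step 3: run() expects the status to still be PENDING.
    if kv[job_id] != "PENDING":
        raise RuntimeError("run should only be called once")
    kv[job_id] = "RUNNING"


async def recover_and_monitor(job_id):
    # Step 2: recovery finds the job, cannot reach its supervisor,
    # and marks the job as FAILED.
    if job_id in kv and job_id not in live_supervisors:
        kv[job_id] = "FAILED"


async def reproduce_race():
    kv.clear()
    live_supervisors.clear()
    created = asyncio.Event()
    submit = asyncio.create_task(submit_job("job-1", created))
    await asyncio.sleep(0)  # let submit_job write job_info to the KV
    await recover_and_monitor("job-1")  # runs before the supervisor exists
    created.set()  # now the supervisor gets created
    try:
        await submit
    except RuntimeError as exc:
        return str(exc)
    return kv["job-1"]
```

Calling `asyncio.run(reproduce_race())` forces the failing order (step 1, then step 2, then step 3), so the simulated `run()` sees FAILED instead of PENDING and raises.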

The solution in this PR is to let the JobManager monitoring loop handle the PENDING status: while a job is PENDING, the loop skips the ping to the JobSupervisor actor for that iteration.
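
A minimal sketch of that idea, assuming a simplified monitoring loop: the `JobStatus` enum here, the `monitor_once` function, the `statuses` dict, and the `ping` callable are all hypothetical and only illustrate the control flow, not the actual JobManager code.

```python
import asyncio
from enum import Enum


class JobStatus(str, Enum):
    # Hypothetical status enum for this sketch.
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    FAILED = "FAILED"


async def monitor_once(job_id, statuses, ping):
    """Run one iteration of a monitoring loop and return the job's status.

    `statuses` maps job ids to JobStatus values (a stand-in for the internal
    KV) and `ping` is an async callable that raises if the supervisor cannot
    be reached.
    """
    status = statuses[job_id]
    if status == JobStatus.PENDING:
        # The supervisor may not exist yet, so skip the ping this iteration
        # instead of treating the missing actor as a failure.
        return status
    try:
        await ping(job_id)
    except Exception:
        # Only a job that is past PENDING is failed on an unreachable supervisor.
        statuses[job_id] = JobStatus.FAILED
    return statuses[job_id]
```

With this shape, a job that is still PENDING keeps its status even if the ping would fail, while a RUNNING job whose supervisor is unreachable is still marked FAILED.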

This PR adds a unit test that fails with #34190 (which forces the race condition).

This PR also adds a timeout to fail jobs that have been pending for 15 minutes, configurable via environment variable.
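
The timeout check might look roughly like the following sketch. The environment variable name `EXAMPLE_JOB_PENDING_TIMEOUT_SECONDS` and both helper functions are assumptions made for illustration; the commit message does not name the actual variable.

```python
import os
import time

# Hypothetical variable name; the real one is not given in the commit message.
TIMEOUT_ENV_VAR = "EXAMPLE_JOB_PENDING_TIMEOUT_SECONDS"
# Default of 15 minutes, as stated in the commit message.
DEFAULT_PENDING_TIMEOUT_S = 15 * 60


def pending_timeout_seconds(environ=os.environ):
    """Return the pending timeout in seconds, honoring the env override."""
    return float(environ.get(TIMEOUT_ENV_VAR, DEFAULT_PENDING_TIMEOUT_S))


def has_pending_timed_out(submitted_at, now=None, environ=os.environ):
    """Return True if a job submitted at `submitted_at` has been pending too long."""
    now = time.time() if now is None else now
    return now - submitted_at > pending_timeout_seconds(environ)
```

A monitoring loop could call `has_pending_timed_out` for each PENDING job and mark the job FAILED once it returns True.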

Some questions are still open:

  • Why did this only start to fail recently? The only recent change is "[Jobs] Fix race condition on submitting multiple jobs with the same id" (#33259), but it's not clear how this would matter in the case of a single job.
  • What is a reasonable default timeout for pending jobs, and should we even have one? It should be larger than the existing runtime_env setup timeout (10 minutes) in order to distinguish runtime env setup timeouts from other timeouts. Not sure if there are other existing timeouts that we should consider.
architkulkarni added a commit to architkulkarni/ray that referenced this pull request Apr 12, 2023
…t for pending jobs (ray-project#34223)

clarng pushed a commit that referenced this pull request Apr 12, 2023
…t for pending jobs (#34223) (#34318)

elliottower pushed a commit to elliottower/ray that referenced this pull request Apr 22, 2023
…t for pending jobs (ray-project#34223)


Signed-off-by: elliottower <[email protected]>
ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request May 4, 2023
…t for pending jobs (ray-project#34223)


Signed-off-by: Jack He <[email protected]>
stale bot commented May 8, 2023

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

  • If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@stale stale bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label May 8, 2023
stale bot commented Jun 10, 2023

Hi again! The issue will be closed because there has been no more activity in the 14 days since the last message.

Please feel free to reopen or open a new issue if you'd still like it to be addressed.

Again, you can always ask for help on our discussion forum or Ray's public slack channel.

Thanks again for opening the issue!

@stale stale bot closed this Jun 10, 2023