-
Notifications
You must be signed in to change notification settings - Fork 745
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixed intermittent KeyError when handling a finished future in the task worker #11350 #11591
Fixed intermittent KeyError when handling a finished future in the task worker #11350 #11591
Conversation
…es to stimulate race condition in worker.py
Workflow for task_runner tests in multiprocessing mode in the macOS environment to stimulate conditions for the KeyError issue
running specific test cases made to test the fix
…libri into release-v0.16.x
@MisRob we would love for you to review this change! For some reason we can't choose a reviewer under the Reviewers section. |
@bjester we would love for you to review this change to fix the race condition resulting in the KeyError issue! For some reason, we can't choose a reviewer under the Reviewers section. |
…libri into release-v0.16.x
…lows already run all tests
Build Artifacts
|
kolibri/core/tasks/worker.py
Outdated
with self.future_job_mapping_lock: | ||
# Check if the job ID already exists in the future_job_mapping dictionary | ||
if job.job_id in self.future_job_mapping: | ||
logger.error(f"Job ID {job.job_id} is already in future_job_mapping.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Heads up that we can't use f-strings as we currently still have support for Python 2.7 - will need to use older string formatting methods instead!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reformatted and pushed. thank you!
kolibri/core/tasks/worker.py
Outdated
# Acquire locks before accessing dictionaries | ||
with self.job_future_mapping_lock: | ||
with self.future_job_mapping_lock: | ||
if future in self.job_future_mapping: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More so than the locks, it seems like this line is the one doing most of the work here?
Your test above is not obeying the locks at all, so this existence check is preventing errors from occurring, in which case, this may just work without the locks at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to fix the error, although perhaps doesn't quite get at the root of when it happens. Hopefully the additional logging will help to identify that.
Docs build is being fixed independently. |
9d9166e
into
learningequality:release-v0.16.x
Summary
Introduced locks around critical sections of the code where the job_future_mapping and future_job_mapping dictionaries are accessed or modified. This was done to synchronize access to the shared data structures to ensure modifications are performed atomically and race conditions that could lead to KeyError issues are prevented.
We manually verified our fixes work through unit test cases that reproduced the conditions that could lead to a KeyError. One test reproduces a scenario where a KeyError might occur due to the deletion of a future while the job is still running. Another test case includes multiple jobs with the same ID, triggering potential race conditions to check if the fix is effective not only for a single job but also when multiple jobs with the same ID are involved. This is done because Kolibri's implementation reuses job IDs.
References
This attempts to fix the issue #11350
Testing checklist
PR process
Reviewer checklist
yarn
andpip
)