Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed intermittent KeyError when handling a finished future in the task worker #11350 #11591

Merged

Conversation

adviti-mishra
Copy link
Contributor

@adviti-mishra adviti-mishra commented Dec 5, 2023

Summary

Introduced locks around critical sections of the code where the job_future_mapping and future_job_mapping dictionaries are accessed or modified. This was done to synchronize access to the shared data structures to ensure modifications are performed atomically and race conditions that could lead to KeyError issues are prevented.

We manually verified our fixes work through unit test cases that reproduced the conditions that could lead to a KeyError. One test reproduces a scenario where a KeyError might occur due to the deletion of a future while the job is still running. Another test case includes multiple jobs with the same ID, triggering potential race conditions to check if the fix is effective not only for a single job but also when multiple jobs with the same ID are involved. This is done because Kolibri's implementation reuses job IDs.

References

This attempts to fix the issue #11350


Testing checklist

  • Contributor has fully tested the PR manually
  • If there are any front-end changes, before/after screenshots are included
  • Critical user journeys are covered by Gherkin stories
  • Critical and brittle code paths are covered by unit tests

PR process

  • PR has the correct target branch and milestone
  • PR has 'needs review' or 'work-in-progress' label
  • If PR is ready for review, a reviewer has been added. (Don't use 'Assignees')
  • If this is an important user-facing change, PR or related issue has a 'changelog' label
  • If this includes an internal dependency change, a link to the diff is provided

Reviewer checklist

  • Automated test coverage is satisfactory
  • PR is fully functional
  • PR has been tested for accessibility regressions
  • External dependency files were updated if necessary (yarn and pip)
  • Documentation is updated
  • Contributor is in AUTHORS.md

Workflow for task_runner tests in multiprocessing mode in the macOS environment to stimulate conditions for the KeyError issue
running specific test cases made to test the fix
@github-actions github-actions bot added DEV: dev-ops Continuous integration & deployment SIZE: small labels Dec 5, 2023
@adviti-mishra
Copy link
Contributor Author

@MisRob we would love for you to review this change! For some reason we can't choose a reviewer under the Reviewers section.

@adviti-mishra
Copy link
Contributor Author

@bjester we would love for you to review this change to fix the race condition resulting in the KeyError issue! For some reason, we can't choose a reviewer under the Reviewers section.

@github-actions github-actions bot added DEV: backend Python, databases, networking, filesystem... SIZE: medium labels Dec 5, 2023
@github-actions github-actions bot removed the DEV: dev-ops Continuous integration & deployment label Dec 5, 2023
@adviti-mishra
Copy link
Contributor Author

We removed our own GitHub Action workflow from the PR and pushed our fixes to the KeyError changes. Thanks for pointing this out since this is our first-ever PR! : ) @bjester @rtibbles

with self.future_job_mapping_lock:
# Check if the job ID already exists in the future_job_mapping dictionary
if job.job_id in self.future_job_mapping:
logger.error(f"Job ID {job.job_id} is already in future_job_mapping.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heads up that we can't use f-strings as we currently still have support for Python 2.7 - will need to use older string formatting methods instead!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reformatted and pushed. thank you!

# Acquire locks before accessing dictionaries
with self.job_future_mapping_lock:
with self.future_job_mapping_lock:
if future in self.job_future_mapping:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More so than the locks, it seems like this line is the one doing most of the work here?

Your test above is not obeying the locks at all, so this existence check is preventing errors from occurring, in which case, this may just work without the locks at all.

Copy link
Member

@rtibbles rtibbles left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to fix the error, although perhaps doesn't quite get at the root of when it happens. Hopefully the additional logging will help to identify that.

@rtibbles
Copy link
Member

Docs build is being fixed independently.

@rtibbles rtibbles merged commit 9d9166e into learningequality:release-v0.16.x Jan 17, 2024
33 of 34 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
DEV: backend Python, databases, networking, filesystem... SIZE: medium SIZE: small
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants