Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Ray Core] Fix potential race condition when modifying os.environ #20

Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
21 changes: 20 additions & 1 deletion python/ray/_raylet.pyx
Original file line number Diff line number Diff line change
Expand Up @@ -2228,7 +2228,26 @@ cdef execute_task_with_cancellation_handler(

if omp_num_threads_overriden:
# Reset the OMP_NUM_THREADS environ if it was set.
os.environ.pop("OMP_NUM_THREADS", None)
try:
os.environ.pop("OMP_NUM_THREADS", None)
except KeyError as e:
# os.environ is known to have undefined behavior if multiple
# threads is trying to modify the mapping simultaneously.
# Here specifically, race condition could happen when two threads
# poping the environ in the same time, where one of them deleting
# the key successfully, but the other met KeyError. If we met this
# issue, we just ensure the the current os.environ do not contain
# this env var and move on.
# Related issue: https://github.com/python/cpython/issues/120513
logger.error(
"KeyError occurred when popping OMP_NUM_THREADS from "
"os.environ, a possible race condition might happened. "
"Checking if the env var is already cleaned up and move on."
)
if os.environ.get("OMP_NUM_THREADS", None) is not None:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When will this happen? Only condition on top of my mind is it got added back by another thread, but that thread will eventually pop it out.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only way to prevent this from happening is to introduce a lock on any access to os.environ. If we are not doing so then there is always going to be a small chance that this could happen. It is hard for me to come up with an exhausted list. But adding a logging here will make sure this is easier to know in the future.

logger.error(
"Something wrong happened, OMP_NUM_THREADS still remains in os.environ"
)

if execution_info.max_calls != 0:
# Reset the state of the worker for the next task to execute.
Expand Down
Loading