-
Notifications
You must be signed in to change notification settings - Fork 32
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Put error global variables into thread-local storage #112
Conversation
(closed and re-opened to recompute the merge commit now that #115 is merged, so CI should have something useful to say!) |
Hmm, this is looking tedious... it looks like we're pulling in a runtime library somewhere from the error. I'm guessing that |
Pack together the current error code and message into a single structure Ease the transition to putting the error into thread-local storage
I updated the PR with an implementation using Reproducing locally the test that failed, I get a pop-up saying |
Hello @shym: just a heads-up that I am planning to read this PR soon. Thank you for your patience! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this patch! The code is extremely clear and a pleasure to review :)
LGTM (modulo a small question)
Thanks you for your kind and thorough review! |
@@ -357,9 +451,13 @@ static void *find_symbol_global(void *data, const char *name) { | |||
} | |||
|
|||
int flexdll_relocate(void *tbl) { | |||
err_t * err; | |||
err = get_tls_error(TLS_ERROR_RESET_LAST); | |||
if(err == NULL) return 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking more about this, shouldn't we reset err->code = 0
here? Otherwise, the check below in line 460 will fail if this function is called after another function that has set err->code
.
Going one step further, perhaps when in TLS_ERROR_RESET_LAST
mode, we should always set err->code = 0
. Or is there a case where we want to reset one of the error codes, but not the other?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very good point, thank you very much!
I think I ended up with that code because it was not explicitly reset in the original code. That was arguably correct because flexdll_relocate
is called from two places (if I didn't miss any other):
- from
flexdll_dlopen
wherecode
has already been reset, - from
flexdll_init
wherecode
has been set to its initial value 0.
This made me realize that I had forgotten to initialize the values when they are malloc-ed!
So I’ve rewritten the code so that:
- on
TLS_ERROR_RESET
, bothcode
andlast_error
are reset (so no_LAST
), - the explicit reset of
code
near the call toget_tls_error(TLS_ERROR_RESET)
are removed, - the structure is initialized right after
malloc
, just to make sure; the structure should be malloc-ed on a call to one of the initialisation entrypoints, in which case it will be initialised again just a few lines later, but that will ensure that a buggy program callingdlerror
without a previous call todlopen
will get a reliable reasonable behaviour.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the back-and-forth, but it turns out that there is a function SetLastError
https://learn.microsoft.com/en-us/windows/win32/api/errhandlingapi/nf-errhandlingapi-setlasterror
Couldn't we use that instead to restore the result of the call to GetLastError
after calling the Tls*
functions? It should simplify the code (no need for the last_error
field).
Very good idea indeed! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, thanks!
flexdll.c
Outdated
switch (error) { | ||
err_t * err; | ||
err = get_tls_error(TLS_ERROR_NOP); | ||
if(err == NULL) return "error in accessing thread-local storage"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if(err == NULL) return "error in accessing thread-local storage"; | |
if(err == NULL) return "error accessing thread-local storage"; |
flexdll.c
Outdated
DWORD msglen; | ||
err_t * err; | ||
err = get_tls_error(TLS_ERROR_NOP); | ||
if(err == NULL) return "error in accessing thread-local storage"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if(err == NULL) return "error in accessing thread-local storage"; | |
if(err == NULL) return "error accessing thread-local storage"; |
Move the last error into thread-local storage to avoid data races (and thus possible segmentation faults) when the code is used in a multithreaded setting Add a get_tls_error function to access explicitly the thread-local error to bypass limited compiler support for it (`__thread`, etc.) Pass explicitly the current error variable to internal functions to avoid calling get_tls_error when possible Document the mechanism used for TLS errors, to explain its unexpected complexity As a side-effect of that reorganisation of the code, the code of the last error is explicitly reset on all initialisation entry points (flexdll_dlopen, flexdll_wdlopen, flexdll_relocate), even when it was missing before Co-authored-by: Nicolás Ojeda Bär <[email protected]>
Just changed the error message, and added due credit! 😄 |
Thanks, merged! (I took the liberty of squashing the commits into a single commit; this makes it easier to revert, cherry-pick, etc.) |
Global variable
error_buffer
is used to store a string that is returned to callers, so there is a race condition if dynamic linking is invoked from 2 OCaml domains in parallelSince the error message must be returned, a mutex cannot be used to prevent the race condition
GNU libc uses that same solution: keep the last error in thread-local storage; so support for calling dlerror from a different thread than the one calling dlopen is not to be expected
The problem with parallel accesses was discovered investigating segfaults in ocaml-multicore/multicoretests#290.
With this patch, these tests do not segfault.
I see the following 4 dynlink tests from the ocaml test suite failing, but they seem to fail whether or not this patch is applied and the error message seems not related.