Put error global variables into thread-local storage #112

shym · 2023-02-20T13:56:31Z

The global variable error_buffer stores a string that is returned to callers, so there is a race condition if dynamic linking is invoked from two OCaml domains in parallel.
Since the error message must be returned to the caller, who reads it after the function has returned, a mutex cannot be used to prevent the race condition.
GNU libc uses the same solution: it keeps the last error in thread-local storage. Support for calling dlerror from a different thread than the one that called dlopen is therefore not to be expected.
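A minimal sketch of the idea, not the code of this patch: each thread keeps its own copy of the last error message, so two threads reporting errors in parallel cannot overwrite each other. The names tls_error_buffer, set_error, last_error and run_two_workers are hypothetical, and the sketch assumes a toolchain with C11 `_Thread_local` and POSIX threads.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical per-thread error buffer: every thread gets its own copy. */
static _Thread_local char tls_error_buffer[256];

/* Record an error message for the calling thread only. */
static void set_error(const char *msg) {
  assert(msg != NULL);
  snprintf(tls_error_buffer, sizeof tls_error_buffer, "%s", msg);
}

/* Return the calling thread's last message, in the spirit of dlerror. */
static const char *last_error(void) {
  return tls_error_buffer;
}

/* Thread body: store a message, then check that it reads back intact. */
static void *worker(void *arg) {
  const char *msg = arg;
  int i;
  for (i = 0; i < 10000; i++) {
    set_error(msg);
    if (strcmp(last_error(), msg) != 0)
      return (void *)1; /* saw another thread's message */
  }
  return NULL;
}

/* Run two workers in parallel; returns 0 if neither saw a foreign message,
   1 if one did, -1 if a thread could not be created. */
static int run_two_workers(void) {
  pthread_t a, b;
  void *ra = NULL, *rb = NULL;
  if (pthread_create(&a, NULL, worker, "cannot open a.dll") != 0) return -1;
  if (pthread_create(&b, NULL, worker, "cannot open b.dll") != 0) {
    pthread_join(a, NULL);
    return -1;
  }
  pthread_join(a, &ra);
  pthread_join(b, &rb);
  return (ra == NULL && rb == NULL) ? 0 : 1;
}
```

With a plain static buffer instead of `_Thread_local`, both workers would write to the same memory, which is the race described above.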

The problem with parallel accesses was discovered while investigating segfaults in ocaml-multicore/multicoretests#290.
With this patch, these tests no longer segfault.
I see the following 4 dynlink tests from the OCaml test suite failing, but they seem to fail whether or not this patch is applied, and the error message seems unrelated.

> run_win32.c:365: CreateProcess failed: The system cannot find the file specified.
[…]
List of failed tests:
    tests/lib-dynlink-csharp /'main.ml' with 1.1.4.1.1.1.1 (script)
    tests/lib-dynlink-csharp /'main.ml' with 1.1.3.1.1.1 (script)
    tests/lib-dynlink-csharp /'main.ml' with 1.1.2.1.1.1.1 (script)
    tests/lib-dynlink-csharp /'main.ml' with 1.1.1.1.1.1 (script)

dra27 · 2023-03-02T17:08:29Z

(closed and re-opened to recompute the merge commit now that #115 is merged, so CI should have something useful to say!)

flexdll.c

dra27 · 2023-03-03T08:53:06Z

Hmm, this is looking tedious... it looks like we're pulling in a runtime library somewhere from the error. I'm guessing that /usr/i686-w64-mingw32/sys-root/mingw/bin (and the x64 equivalent) need adding to PATH for this to work, although we should nail down exactly which DLL it's trying to pull in. That's a bit too heavy for this - we might instead need to hand-roll the native Windows version using TlsAlloc et al.

Pack together the current error code and message into a single structure.
Ease the transition to putting the error into thread-local storage.

shym · 2023-03-21T12:22:21Z

I updated the PR with an implementation using Tls* functions. This turned out a bit more involved than what I first thought, since TlsGetValue modifies the result of GetLastError which we want to preserve. I hope the comments are enough to clarify the implementation choices.

Reproducing the failed test locally, I get a pop-up saying libwinpthreads-1.dll is missing; that is maybe too big a dependency for that single feature.

nojb · 2023-04-15T16:22:22Z

Hello @shym: just a heads-up that I am planning to read this PR soon. Thank you for your patience!

nojb

Thank you for this patch! The code is extremely clear and a pleasure to review :)

LGTM (modulo a small question)

flexdll.c

shym · 2023-04-19T10:08:54Z

Thank you for your kind and thorough review!
I've updated the branch removing the goto.
After looking at another PR, I added an entry to CHANGES.

nojb · 2023-04-19T12:36:09Z

flexdll.c

@@ -357,9 +451,13 @@ static void *find_symbol_global(void *data, const char *name) {
 }

 int flexdll_relocate(void *tbl) {
+  err_t * err;
+  err = get_tls_error(TLS_ERROR_RESET_LAST);
+  if(err == NULL) return 0;


Thinking more about this, shouldn't we reset err->code = 0 here? Otherwise, the check below in line 460 will fail if this function is called after another function that has set err->code.

Going one step further, perhaps when in TLS_ERROR_RESET_LAST mode, we should always set err->code = 0. Or is there a case where we want to reset one of the error codes, but not the other?

Very good point, thank you very much!

I think I ended up with that code because it was not explicitly reset in the original code. That was arguably correct because flexdll_relocate is called from two places (if I didn't miss any other):

from flexdll_dlopen where code has already been reset,

from flexdll_init where code has been set to its initial value 0.
This made me realize that I had forgotten to initialize the values when they are malloc-ed!

So I’ve rewritten the code so that:

on TLS_ERROR_RESET, both code and last_error are reset (so no _LAST),

the explicit resets of code near the calls to get_tls_error(TLS_ERROR_RESET) are removed,

the structure is initialized right after malloc, just to make sure. The structure should be malloc-ed on a call to one of the initialisation entrypoints, in which case it will be initialised again just a few lines later, but this ensures that a buggy program calling dlerror without a previous call to dlopen gets reliable, reasonable behaviour.
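The three points above can be sketched as follows. This is an illustration under assumptions, not the code in flexdll.c: err_t, get_tls_error, TLS_ERROR_RESET and TLS_ERROR_NOP are the names used in this discussion, while the field layout, the message size, tls_slot, reset_error and tls_error_mode are hypothetical, and C11 `_Thread_local` stands in for the Win32 Tls* functions used by the patch.

```c
#include <assert.h>
#include <stdlib.h>

/* Assumed layout: an error code and its message grouped in one structure. */
typedef struct {
  int code;
  char message[256];
} err_t;

/* Modes named after the ones discussed above. */
typedef enum { TLS_ERROR_NOP, TLS_ERROR_RESET } tls_error_mode;

/* Stand-in for the real TLS slot (the patch uses the Win32 Tls* functions). */
static _Thread_local err_t *tls_slot = NULL;

/* Put the structure into its "no error" state. */
static void reset_error(err_t *err) {
  assert(err != NULL);
  err->code = 0;
  err->message[0] = '\0';
}

/* Return the calling thread's error structure, allocating it on first use.
   Returns NULL only if the allocation fails. */
static err_t *get_tls_error(tls_error_mode mode) {
  err_t *err = tls_slot;
  if (err == NULL) {
    err = malloc(sizeof *err);
    if (err == NULL) return NULL;
    reset_error(err); /* initialise right after malloc */
    tls_slot = err;
  }
  if (mode == TLS_ERROR_RESET) reset_error(err); /* entry points start clean */
  return err;
}
```

Because the structure is initialised as soon as it is allocated, a first call with TLS_ERROR_NOP (the dlerror-without-dlopen case) still sees code 0 and an empty message. The sketch never frees the structure when a thread exits.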

nojb

Sorry for the back-and-forth, but it turns out that there is a function SetLastError https://learn.microsoft.com/en-us/windows/win32/api/errhandlingapi/nf-errhandlingapi-setlasterror
Couldn't we use that instead to restore the result of the call to GetLastError after calling the Tls* functions? It should simplify the code (no need for the last_error field).

Still polishing the patch

shym · 2023-04-21T09:36:51Z

Very good idea indeed!
I noticed that the documentation for SetLastError explicitly states that the last error is stored in TLS and that values with bit 29 set are reserved for user errors. But I didn’t find a trick to reuse that to fully skip using TLS explicitly ourselves, especially since POSIX’s dlerror must report the last error of a dl function, so other functions must not interfere with the result it will report.
So I just updated the patch removing the last_error field and merging all the non-resetting behaviours. It’s a lot simpler to read.
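For reference, the bit 29 convention mentioned above amounts to the following arithmetic. This is a small sketch of the convention only, not something the patch relies on; APP_ERROR_BIT, make_app_error, is_app_error and app_error_code are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Bit 29 marks application-defined codes; system error codes leave it clear. */
#define APP_ERROR_BIT (UINT32_C(1) << 29)

/* Tag a private code so that it cannot collide with a system error code. */
static uint32_t make_app_error(uint32_t code) {
  return code | APP_ERROR_BIT;
}

/* Tell whether a code carries the application-defined bit. */
static int is_app_error(uint32_t code) {
  return (code & APP_ERROR_BIT) != 0;
}

/* Recover the private code from a tagged value. */
static uint32_t app_error_code(uint32_t code) {
  return code & ~APP_ERROR_BIT;
}
```

A tagged value could be passed to SetLastError, but any later system call may overwrite it, which is why it cannot replace a dedicated TLS slot for dlerror.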

nojb

Looks good to me, thanks!

nojb · 2023-04-21T09:44:48Z

flexdll.c

-  switch (error) {
+  err_t * err;
+  err = get_tls_error(TLS_ERROR_NOP);
+  if(err == NULL) return "error in accessing thread-local storage";


Suggested change

-  if(err == NULL) return "error in accessing thread-local storage";
+  if(err == NULL) return "error accessing thread-local storage";

nojb · 2023-04-21T09:45:10Z

flexdll.c

+  DWORD msglen;
+  err_t * err;
+  err = get_tls_error(TLS_ERROR_NOP);
+  if(err == NULL) return "error in accessing thread-local storage";


Suggested change

-  if(err == NULL) return "error in accessing thread-local storage";
+  if(err == NULL) return "error accessing thread-local storage";

Move the last error into thread-local storage to avoid data races (and thus possible segmentation faults) when the code is used in a multithreaded setting.

Add a get_tls_error function to access explicitly the thread-local error to bypass limited compiler support for it (`__thread`, etc.).

Pass explicitly the current error variable to internal functions to avoid calling get_tls_error when possible.

Document the mechanism used for TLS errors, to explain its unexpected complexity.

As a side-effect of that reorganisation of the code, the code of the last error is explicitly reset on all initialisation entry points (flexdll_dlopen, flexdll_wdlopen, flexdll_relocate), even when it was missing before.

Co-authored-by: Nicolás Ojeda Bär <[email protected]>

shym · 2023-04-21T10:11:48Z

Just changed the error message, and added due credit! 😄

nojb · 2023-04-21T11:20:35Z

Thanks, merged! (I took the liberty of squashing the commits into a single commit; this makes it easier to revert, cherry-pick, etc.)

dra27 closed this Mar 2, 2023

dra27 reopened this Mar 2, 2023

dra27 reviewed Mar 2, 2023

shym force-pushed the error-in-tls branch from 611d5c5 to 4e88227 Compare March 3, 2023 07:49

Group error code and message into a structure

9b046a5

Pack together the current error code and message into a single structure.
Ease the transition to putting the error into thread-local storage.

shym force-pushed the error-in-tls branch from 4e88227 to 6a5438a Compare March 21, 2023 11:13

nojb previously approved these changes Apr 18, 2023

shym force-pushed the error-in-tls branch from 6a5438a to 95473e8 Compare April 19, 2023 09:57

nojb reviewed Apr 19, 2023

shym force-pushed the error-in-tls branch from 0574f2a to 41e65c6 Compare April 19, 2023 16:57

nojb reviewed Apr 20, 2023

shym force-pushed the error-in-tls branch from 41e65c6 to e555a84 Compare April 21, 2023 09:35

nojb approved these changes Apr 21, 2023

shym and others added 2 commits April 21, 2023 12:08

Add a CHANGES entry

fe87994

shym force-pushed the error-in-tls branch from e555a84 to fe87994 Compare April 21, 2023 10:11

nojb merged commit bae7593 into ocaml:master Apr 21, 2023

shym deleted the error-in-tls branch April 21, 2023 12:48

jmid mentioned this pull request Mar 15, 2024

[ocaml5-issue] Deadlock in Dynlink test on Cygwin+MinGW+MSVC ocaml-multicore/multicoretests#307

Open

jmid mentioned this pull request Mar 22, 2024

Parallel Dynlink usage under Cygwin+MinGW is unsafe ocaml/ocaml#13046

Open

jmid mentioned this pull request Apr 16, 2024

Fix parallel access crashes and misbehavior #136

Merged
