Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Put error global variables into thread-local storage #112

Merged
merged 3 commits into from
Apr 21, 2023

Conversation

shym
Copy link
Contributor

@shym shym commented Feb 20, 2023

Global variable error_buffer is used to store a string that is returned to callers, so there is a race condition if dynamic linking is invoked from 2 OCaml domains in parallel
Since the error message must be returned, a mutex cannot be used to prevent the race condition
GNU libc uses that same solution: keep the last error in thread-local storage; so support for calling dlerror from a different thread than the one calling dlopen is not to be expected

The problem with parallel accesses was discovered investigating segfaults in ocaml-multicore/multicoretests#290.
With this patch, these tests do not segfault.
I see the following 4 dynlink tests from the ocaml test suite failing, but they seem to fail whether or not this patch is applied and the error message seems not related.

> run_win32.c:365: CreateProcess failed: The system cannot find the file specified.
[…]
List of failed tests:
    tests/lib-dynlink-csharp /'main.ml' with 1.1.4.1.1.1.1 (script)
    tests/lib-dynlink-csharp /'main.ml' with 1.1.3.1.1.1 (script)
    tests/lib-dynlink-csharp /'main.ml' with 1.1.2.1.1.1.1 (script)
    tests/lib-dynlink-csharp /'main.ml' with 1.1.1.1.1.1 (script)

@dra27 dra27 closed this Mar 2, 2023
@dra27 dra27 reopened this Mar 2, 2023
@dra27
Copy link
Member

dra27 commented Mar 2, 2023

(closed and re-opened to recompute the merge commit now that #115 is merged, so CI should have something useful to say!)

flexdll.c Outdated Show resolved Hide resolved
@dra27
Copy link
Member

dra27 commented Mar 3, 2023

Hmm, this is looking tedious... it looks like we're pulling in a runtime library somewhere from the error. I'm guessing that /usr/i686-w64-mingw32/sys-root/mingw/bin (and the x64 equivalent) need adding to PATH for this to work, although we should nail down exactly which DLL it's trying to pull in. That's a bit too heavy for this - we might instead need to hand-roll the native Windows version using TlsAlloc et al.

Pack together the current error code and message into a single structure
Ease the transition to putting the error into thread-local storage
@shym
Copy link
Contributor Author

shym commented Mar 21, 2023

I updated the PR with an implementation using Tls* functions. This turned out a bit more involved than what I first thought, since TlsGetValue modifies the result of GetLastError which we want to preserve. I hope the comments are enough to clarify the implementation choices.

Reproducing locally the test that failed, I get a pop-up saying libwinpthreads-1.dll is missing, maybe too big a dependency for that single feature.

@nojb
Copy link
Collaborator

nojb commented Apr 15, 2023

Hello @shym: just a heads-up that I am planning to read this PR soon. Thank you for your patience!

nojb
nojb previously approved these changes Apr 18, 2023
Copy link
Collaborator

@nojb nojb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this patch! The code is extremely clear and a pleasure to review :)

LGTM (modulo a small question)

flexdll.c Outdated Show resolved Hide resolved
@shym
Copy link
Contributor Author

shym commented Apr 19, 2023

Thanks you for your kind and thorough review!
I've updated the branch removing the goto.
After looking at another PR, I added an entry to CHANGES.

@@ -357,9 +451,13 @@ static void *find_symbol_global(void *data, const char *name) {
}

int flexdll_relocate(void *tbl) {
err_t * err;
err = get_tls_error(TLS_ERROR_RESET_LAST);
if(err == NULL) return 0;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking more about this, shouldn't we reset err->code = 0 here? Otherwise, the check below in line 460 will fail if this function is called after another function that has set err->code.

Going one step further, perhaps when in TLS_ERROR_RESET_LAST mode, we should always set err->code = 0. Or is there a case where we want to reset one of the error codes, but not the other?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good point, thank you very much!

I think I ended up with that code because it was not explicitly reset in the original code. That was arguably correct because flexdll_relocate is called from two places (if I didn't miss any other):

  • from flexdll_dlopen where code has already been reset,
  • from flexdll_init where code has been set to its initial value 0.
    This made me realize that I had forgotten to initialize the values when they are malloc-ed!

So I’ve rewritten the code so that:

  • on TLS_ERROR_RESET, both code and last_error are reset (so no _LAST),
  • the explicit reset of code near the call to get_tls_error(TLS_ERROR_RESET) are removed,
  • the structure is initialized right after malloc, just to make sure; the structure should be malloc-ed on a call to one of the initialisation entrypoints, in which case it will be initialised again just a few lines later, but that will ensure that a buggy program calling dlerror without a previous call to dlopen will get a reliable reasonable behaviour.

Copy link
Collaborator

@nojb nojb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the back-and-forth, but it turns out that there is a function SetLastError https://learn.microsoft.com/en-us/windows/win32/api/errhandlingapi/nf-errhandlingapi-setlasterror
Couldn't we use that instead to restore the result of the call to GetLastError after calling the Tls* functions? It should simplify the code (no need for the last_error field).

@nojb nojb dismissed their stale review April 20, 2023 11:54

Still polishing the patch

@shym
Copy link
Contributor Author

shym commented Apr 21, 2023

Very good idea indeed!
I noticed that the documentation for SetLastError explicit states that the last error is stored in TLS and that values with bit 29 set are reserved for user errors. But I didn’t find a trick to reuse that to fully skip using TLS explicitly ourselves, especially since POSIX’s dlerror must report the last error of a dl function, so other functions must not interfere with the result it will report.
So I just updated the patch removing the last_error field and merging all the non-resetting behaviours. It’s a lot simpler to read.

Copy link
Collaborator

@nojb nojb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, thanks!

flexdll.c Outdated
switch (error) {
err_t * err;
err = get_tls_error(TLS_ERROR_NOP);
if(err == NULL) return "error in accessing thread-local storage";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if(err == NULL) return "error in accessing thread-local storage";
if(err == NULL) return "error accessing thread-local storage";

flexdll.c Outdated
DWORD msglen;
err_t * err;
err = get_tls_error(TLS_ERROR_NOP);
if(err == NULL) return "error in accessing thread-local storage";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if(err == NULL) return "error in accessing thread-local storage";
if(err == NULL) return "error accessing thread-local storage";

shym and others added 2 commits April 21, 2023 12:08
Move the last error into thread-local storage to avoid data races (and
thus possible segmentation faults) when the code is used in a
multithreaded setting

Add a get_tls_error function to access explicitly the thread-local error
to bypass limited compiler support for it (`__thread`, etc.)
Pass explicitly the current error variable to internal functions to
avoid calling get_tls_error when possible
Document the mechanism used for TLS errors, to explain its unexpected
complexity

As a side-effect of that reorganisation of the code, the code of the
last error is explicitly reset on all initialisation entry points
(flexdll_dlopen, flexdll_wdlopen, flexdll_relocate), even when it was
missing before

Co-authored-by: Nicolás Ojeda Bär <[email protected]>
@shym
Copy link
Contributor Author

shym commented Apr 21, 2023

Just changed the error message, and added due credit! 😄

@nojb nojb merged commit bae7593 into ocaml:master Apr 21, 2023
@nojb
Copy link
Collaborator

nojb commented Apr 21, 2023

Thanks, merged! (I took the liberty of squashing the commits into a single commit; this makes it easier to revert, cherry-pick, etc.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants