-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reenable the cleanup of gDirectory in other threads upon file deletion. #11908
Conversation
Starting build on |
Build failed on ROOT-ubuntu18.04/nortcxxmod. Errors:
And 6 more |
Build failed on ROOT-ubuntu2004/python3. Errors:
And 18 more |
Build failed on mac11/cxx14. Errors:
|
Build failed on windows10/cxx14. |
Build failed on mac12/noimt. Errors:
And 14 more |
Build failed on ROOT-debian10-i386/soversion. Errors:
|
Build failed on ROOT-performance-centos8-multicore/cxx17. Failing tests:
And 9 more |
Starting build on |
Starting build on |
Build failed on windows10/cxx14. Failing tests:
|
Build failed on ROOT-performance-centos8-multicore/cxx17. Failing tests:
And 11 more |
Build failed on ROOT-ubuntu2004/python3. Failing tests:
And 4 more |
Starting build on |
Build failed on ROOT-performance-centos8-multicore/cxx17. Failing tests: |
In the previous implementation at every lookup of the thread local gDirectory, its value was checked against nullptr and if it was, it was reset to gROOT. This disabled the pattern of setting gDirectory to nullptr to prevent any registration of newly create Histogram or TTree. This pattern is used for example in TGraph2D::GetHistogram. The net result was that the pattern work perfectly if it was run on the main thread but failed if it was run in a thread. For example, in the case of TGraph2D::GetHistogram, when this was called on a thread, the ownership was shared between the TGraph2D and gROOT but TGraph2D does not implement RecursiveRemove (since it does not need to, since it does not expect shared ownership) and thus there was case of double deletion or access after delete of the histogram.
We now apply the same algorithm for the reset value of other's thread gDirectory (point to 'this' file) as we do for the local gDirectory.
Separate the 2 implementation so that the parameter less implementation can focus on the gDirectory update while the implementation taking a path parameter can focus in the path search (and then call the other one).
@phsft-bot build |
Starting build on |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What an impressively complex system...
namespace ROOT { | ||
namespace Internal { | ||
struct TDirectoryAtomicAdapter { | ||
std::atomic<TDirectory*> &fValue; | ||
TDirectoryAtomicAdapter(std::atomic<TDirectory*> &value) : fValue(value) {} | ||
TDirectory::SharedGDirectory_t &fValue; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How can you know that the referenced shared_ptr
has a sufficient lifetime? Given the introduction of shared state, it might make sense to include the main user-facing interface (gDirectory
) in the lifetime management?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it might make sense to include the main user-facing interface
I am not sure what you mean ...
How can you know that the referenced shared_ptr has a sufficient lifetime?
The lifetime of the shared_ptr
is that of the thread. The TDirectoryAtomicAdapter
are intended to be "temporary-only" object. Indeed it would be a disaster if the user uses this Internal type as a thread_local
or static
variable or anything that might have a lifetime longer than the thread that created it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The lifetime of the shared_ptr is that of the thread.
That's super useful information, thanks, I missed that! Maybe worth calling out in a comment?
The TDirectoryAtomicAdapter are intended to be "temporary-only" object.
I see the following code (line 376):
#define gDirectory (ROOT::Internal::TDirectoryAtomicAdapter{})
So TDirectoryAtomicAdapter
is user facing, because gDirectory
is commonly used.
Is
auto myDirToBePassedToOtherThread = gDirectory;
fine? I guess the key here is the passing-to-other-thread, which would always happen as TDirectory*
and not auto
?
Can this line be problematic if there is task nesting?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And btw
#define gDirectory (ROOT::Internal::TDirectoryAtomicAdapter{})
also makes the Internal
namespace a bit dubious...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#define gDirectory (ROOT::Internal::TDirectoryAtomicAdapter{})
also makes the Internal namespace a bit dubious...
I think it clearly mark the intention that even-though the macro gDirectory
is implemented in term of that class, user should still never ever use it directly.
I guess the key here is the passing-to-other-thread, which would always happen as TDirectory* and not auto?
Right. Unless the user explicitly use the Internal
class as a parameter or data member, then it should not be a problem. (I suppose an implicit way would be to use decltype(gDirectory)
but I am not any extra complexity added to support (and/or explicitly prevent) this case would be worth it).
Can this line be problematic if there is task nesting?
Yes absolutely, but not because of any of the implementation details. To support gDirectory
in a concurrent environment we elected to make it (semantically and now literally) thread-local
. Because it is thread-local
, it can not and should be rely upon if the user is using a task based environment (we can not know from within TFile/TDirectory
if it is the case or not).
Is
auto myDirToBePassedToOtherThread = gDirectory;
fine?
It is sorta fine (but the sorta is pre-existing this PR). Since the lifetime of that object is the that of its function's execution which is by definition shorter than the lifetime of the thread. I wish that even in this case the type used by auto would be TDirectory*
, in part because the above code is the moral equivalent of (if gDirectory was still a static TDirectory*
) TDirectory *& localVar = gDirectory
... that is an issue but it should be tackled in a different PR.
for(auto ptr : fGDirectories) { | ||
if (ptr->load() == this) { | ||
(*ptr) = nullptr; | ||
for (auto &ptr : fGDirectories) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand this part. This is very complex, and the amount of comments that explain intent and design of this seems inadequate for people like me trying to understand this.
Starting build on |
Build failed on mac11/cxx14. Failing tests: |
Starting build on |
It was only lock the access to a atomic that since it is held by a shared pointer is guarantee to exist when it was used in connjunction with the lock
Starting build on |
Starting build on |
Build failed on mac11/cxx14. Errors:
|
The one failure is unrelated (externals not downloading). |
The AddressSanitizer build seems quite unhappy about this: https://lcgapp-services.cern.ch/root-jenkins/job/root-asan/LABEL=ROOT-centos8,SPEC=asan,V=master/1037/console. Also we have lots of failures in |
@pcanal I'll wait for your verdict - as Jonas points out, the sooner the better! |
It is fixed, see #12090 which had been already added as part of the v6.28 PR. |
Ok. We could avoid such problems by first merging everything into |
This fixes #11907
Inadvertently a previous commit (79a669b) disabled the ability to cleanup the thread local gDirectory in other
threads when the TFile they pointed to is deleted.
Also fix a set of rare race conditions: