Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reenable the cleanup of gDirectory in other threads upon file deletion. #11908

Merged
merged 17 commits into from
Jan 20, 2023

Conversation

pcanal
Copy link
Member

@pcanal pcanal commented Dec 15, 2022

This fixes #11907

Inadvertently a previous commit (79a669b) disabled the ability to cleanup the thread local gDirectory in other
threads when the TFile they pointed to is deleted.

Also fix a set of rare race conditions:

Fix race condition between RegisterContext and gDirectory cleanup.

Description of the race conditions:

(1) thread one create TFile, gDirectory now points to that file.
(2) thread two delete TFile, the destructor calls CleanTargets which has 4 distinct phase
(a) take the TFile spin lock and update all the TContext that points to the file
(b) still hold the spin lock, clean the other thread's directory.
(c) deal with the TContext that were being destructed at the same time
(d) update the local gDirectory

If between (2)(a) and (2)(b), thread (1) starts the creation of a TContext, and
is held at the start of RegisterContext after thread 2 release the spin lock,
thread 1 might awaken only after the TFile object has been deleted and thus
RegisterContext would access delete memory.

If during the destruction of the TFile by thread 2, thread (1) starts the
creation of a TContext, but is suspended right before the start of RegisterContext,
when it comes back it will use deleted memory to try to acquire the spin lock.

@pcanal pcanal requested a review from Axel-Naumann as a code owner December 15, 2022 22:09
@phsft-bot
Copy link
Collaborator

Starting build on ROOT-debian10-i386/soversion, ROOT-performance-centos8-multicore/cxx17, ROOT-ubuntu18.04/nortcxxmod, ROOT-ubuntu2004/python3, mac12/noimt, mac11/cxx14, windows10/cxx14
How to customize builds

@phsft-bot
Copy link
Collaborator

Build failed on ROOT-ubuntu18.04/nortcxxmod.
Running on sft-ubuntu-1804-2.cern.ch:/build/workspace/root-pullrequests-build
See console output.

Errors:

  • [2022-12-15T22:13:38.591Z] FAILED: core/cont/CMakeFiles/Cont.dir/src/TClonesArray.cxx.o
  • [2022-12-15T22:13:38.847Z] /mnt/build/workspace/root-pullrequests-build/root/core/base/inc/TDirectory.h:160:38: error: use of deleted function ‘std::atomic<long unsigned int>::atomic(const std::atomic<long unsigned int>&)’
  • [2022-12-15T22:13:38.847Z] FAILED: core/cont/CMakeFiles/Cont.dir/src/TMap.cxx.o
  • [2022-12-15T22:13:38.847Z] /mnt/build/workspace/root-pullrequests-build/root/core/base/inc/TDirectory.h:160:38: error: use of deleted function ‘std::atomic<long unsigned int>::atomic(const std::atomic<long unsigned int>&)’
  • [2022-12-15T22:13:38.847Z] FAILED: core/cont/CMakeFiles/Cont.dir/src/TList.cxx.o
  • [2022-12-15T22:13:38.847Z] /mnt/build/workspace/root-pullrequests-build/root/core/base/inc/TDirectory.h:160:38: error: use of deleted function ‘std::atomic<long unsigned int>::atomic(const std::atomic<long unsigned int>&)’
  • [2022-12-15T22:13:38.847Z] FAILED: core/cont/CMakeFiles/Cont.dir/src/TCollection.cxx.o
  • [2022-12-15T22:13:38.847Z] /mnt/build/workspace/root-pullrequests-build/root/core/base/inc/TDirectory.h:160:38: error: use of deleted function ‘std::atomic<long unsigned int>::atomic(const std::atomic<long unsigned int>&)’
  • [2022-12-15T22:13:38.847Z] FAILED: core/cont/CMakeFiles/Cont.dir/src/THashTable.cxx.o
  • [2022-12-15T22:13:39.103Z] /mnt/build/workspace/root-pullrequests-build/root/core/base/inc/TDirectory.h:160:38: error: use of deleted function ‘std::atomic<long unsigned int>::atomic(const std::atomic<long unsigned int>&)’

And 6 more

@phsft-bot
Copy link
Collaborator

Build failed on ROOT-ubuntu2004/python3.
Running on root-ubuntu-2004-3.cern.ch:/home/sftnight/build/workspace/root-pullrequests-build
See console output.

Errors:

  • [2022-12-15T22:13:53.576Z] FAILED: core/cont/CMakeFiles/Cont.dir/src/TMap.cxx.o
  • [2022-12-15T22:13:53.576Z] /home/sftnight/build/workspace/root-pullrequests-build/root/core/base/inc/TDirectory.h:160:38: error: use of deleted function ‘std::atomic<long unsigned int>::atomic(const std::atomic<long unsigned int>&)’
  • [2022-12-15T22:13:53.577Z] FAILED: core/cont/CMakeFiles/Cont.dir/src/TObjArray.cxx.o
  • [2022-12-15T22:13:53.577Z] /home/sftnight/build/workspace/root-pullrequests-build/root/core/base/inc/TDirectory.h:160:38: error: use of deleted function ‘std::atomic<long unsigned int>::atomic(const std::atomic<long unsigned int>&)’
  • [2022-12-15T22:13:53.577Z] FAILED: core/cont/CMakeFiles/Cont.dir/src/TClonesArray.cxx.o
  • [2022-12-15T22:13:53.577Z] /home/sftnight/build/workspace/root-pullrequests-build/root/core/base/inc/TDirectory.h:160:38: error: use of deleted function ‘std::atomic<long unsigned int>::atomic(const std::atomic<long unsigned int>&)’
  • [2022-12-15T22:13:53.577Z] FAILED: core/cont/CMakeFiles/Cont.dir/src/TObjectTable.cxx.o
  • [2022-12-15T22:13:53.577Z] /home/sftnight/build/workspace/root-pullrequests-build/root/core/base/inc/TDirectory.h:160:38: error: use of deleted function ‘std::atomic<long unsigned int>::atomic(const std::atomic<long unsigned int>&)’
  • [2022-12-15T22:13:53.577Z] FAILED: core/gui/CMakeFiles/GuiCore.dir/src/TBrowser.cxx.o
  • [2022-12-15T22:13:53.577Z] /home/sftnight/build/workspace/root-pullrequests-build/root/core/base/inc/TDirectory.h:160:38: error: use of deleted function ‘std::atomic<long unsigned int>::atomic(const std::atomic<long unsigned int>&)’

And 18 more

@phsft-bot
Copy link
Collaborator

Build failed on mac11/cxx14.
Running on macphsft23.dyndns.cern.ch:/Users/sftnight/build/workspace/root-pullrequests-build
See console output.

Errors:

  • [2022-12-15T22:34:21.106Z] FAILED: core/CMakeFiles/BaseTROOT.dir/base/src/TROOT.cxx.o
  • [2022-12-15T22:34:21.106Z] /Users/sftnight/build/workspace/root-pullrequests-build/root/core/base/inc/TDirectory.h:160:38: error: copying member subobject of type 'std::atomic<size_t>' (aka 'atomic<unsigned long>') invokes deleted constructor

@phsft-bot
Copy link
Collaborator

Build failed on windows10/cxx14.
Running on null:C:\build\workspace\root-pullrequests-build
See console output.

@phsft-bot
Copy link
Collaborator

Build failed on mac12/noimt.
Running on macphsft18.dyndns.cern.ch:/Users/sftnight/build/jenkins/workspace/root-pullrequests-build
See console output.

Errors:

  • [2022-12-15T22:50:37.836Z] FAILED: core/textinput/CMakeFiles/TextInput.dir/src/Getline_color.cxx.o
  • [2022-12-15T22:50:37.836Z] /Users/sftnight/build/jenkins/workspace/root-pullrequests-build/root/core/base/inc/TDirectory.h:160:38: error: copying member subobject of type 'std::atomic<size_t>' (aka 'atomic<unsigned long>') invokes deleted constructor
  • [2022-12-15T22:50:37.836Z] FAILED: core/metacling/src/CMakeFiles/MetaCling.dir/rootclingTCling.cxx.o
  • [2022-12-15T22:50:37.836Z] /Users/sftnight/build/jenkins/workspace/root-pullrequests-build/root/core/base/inc/TDirectory.h:160:38: error: copying member subobject of type 'std::atomic<size_t>' (aka 'atomic<unsigned long>') invokes deleted constructor
  • [2022-12-15T22:50:37.836Z] FAILED: core/gui/CMakeFiles/GuiCore.dir/src/TGuiFactory.cxx.o
  • [2022-12-15T22:50:37.836Z] /Users/sftnight/build/jenkins/workspace/root-pullrequests-build/root/core/base/inc/TDirectory.h:160:38: error: copying member subobject of type 'std::atomic<size_t>' (aka 'atomic<unsigned long>') invokes deleted constructor
  • [2022-12-15T22:50:37.836Z] FAILED: core/gui/CMakeFiles/GuiCore.dir/src/TContextMenu.cxx.o
  • [2022-12-15T22:50:37.837Z] /Users/sftnight/build/jenkins/workspace/root-pullrequests-build/root/core/base/inc/TDirectory.h:160:38: error: copying member subobject of type 'std::atomic<size_t>' (aka 'atomic<unsigned long>') invokes deleted constructor
  • [2022-12-15T22:50:37.837Z] FAILED: core/gui/CMakeFiles/GuiCore.dir/src/TBrowser.cxx.o
  • [2022-12-15T22:50:37.837Z] /Users/sftnight/build/jenkins/workspace/root-pullrequests-build/root/core/base/inc/TDirectory.h:160:38: error: copying member subobject of type 'std::atomic<size_t>' (aka 'atomic<unsigned long>') invokes deleted constructor

And 14 more

@phsft-bot
Copy link
Collaborator

Build failed on ROOT-debian10-i386/soversion.
Running on pcepsft10.dyndns.cern.ch:/build/workspace/root-pullrequests-build
See console output.

Errors:

  • [2022-12-15T22:53:07.771Z] /home/sftnight/build/workspace/root-pullrequests-build/root/core/base/inc/TDirectory.h:160:38: error: use of deleted function ‘std::atomic<unsigned int>::atomic(const std::atomic<unsigned int>&)’

@phsft-bot
Copy link
Collaborator

Starting build on ROOT-debian10-i386/soversion, ROOT-performance-centos8-multicore/cxx17, ROOT-ubuntu18.04/nortcxxmod, ROOT-ubuntu2004/python3, mac12/noimt, mac11/cxx14, windows10/cxx14
How to customize builds

@phsft-bot
Copy link
Collaborator

Starting build on ROOT-debian10-i386/soversion, ROOT-performance-centos8-multicore/cxx17, ROOT-ubuntu18.04/nortcxxmod, ROOT-ubuntu2004/python3, mac12/noimt, mac11/cxx14, windows10/cxx14
How to customize builds

@pcanal pcanal marked this pull request as draft December 16, 2022 01:10
@phsft-bot
Copy link
Collaborator

Starting build on ROOT-debian10-i386/soversion, ROOT-performance-centos8-multicore/cxx17, ROOT-ubuntu18.04/nortcxxmod, ROOT-ubuntu2004/python3, mac12/noimt, mac11/cxx14, windows10/cxx14
How to customize builds

@phsft-bot
Copy link
Collaborator

In the previous implementation at every lookup of the thread local
gDirectory, its value was checked against nullptr and if it was,
it was reset to gROOT.

This disabled the pattern of setting gDirectory to nullptr to
prevent any registration of newly create Histogram or TTree.
This pattern is used for example in TGraph2D::GetHistogram.

The net result was that the pattern work perfectly if it was
run on the main thread but failed if it was run in a thread.

For example, in the case of TGraph2D::GetHistogram, when this
was called on a thread, the ownership was shared between the
TGraph2D and gROOT but TGraph2D does not implement RecursiveRemove
(since it does not need to, since it does not expect shared
ownership) and thus there was case of double deletion or access
after delete of the histogram.
We now apply the same algorithm for the reset value of
other's thread gDirectory (point to 'this' file) as we do
for the local gDirectory.
Separate the 2 implementation so that the parameter less implementation can focus on the gDirectory update
while the implementation taking a path parameter can focus in the path search (and then call the other one).
@pcanal
Copy link
Member Author

pcanal commented Jan 13, 2023

@phsft-bot build

@phsft-bot
Copy link
Collaborator

Starting build on ROOT-debian10-i386/soversion, ROOT-performance-centos8-multicore/cxx17, ROOT-ubuntu18.04/nortcxxmod, ROOT-ubuntu2004/python3, mac12/noimt, mac11/cxx14, windows10/cxx14
How to customize builds

Copy link
Member

@Axel-Naumann Axel-Naumann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What an impressively complex system...

core/base/inc/TDirectory.h Outdated Show resolved Hide resolved
namespace ROOT {
namespace Internal {
struct TDirectoryAtomicAdapter {
std::atomic<TDirectory*> &fValue;
TDirectoryAtomicAdapter(std::atomic<TDirectory*> &value) : fValue(value) {}
TDirectory::SharedGDirectory_t &fValue;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can you know that the referenced shared_ptr has a sufficient lifetime? Given the introduction of shared state, it might make sense to include the main user-facing interface (gDirectory) in the lifetime management?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it might make sense to include the main user-facing interface

I am not sure what you mean ...

How can you know that the referenced shared_ptr has a sufficient lifetime?

The lifetime of the shared_ptr is that of the thread. The TDirectoryAtomicAdapter are intended to be "temporary-only" object. Indeed it would be a disaster if the user uses this Internal type as a thread_local or static variable or anything that might have a lifetime longer than the thread that created it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The lifetime of the shared_ptr is that of the thread.

That's super useful information, thanks, I missed that! Maybe worth calling out in a comment?

The TDirectoryAtomicAdapter are intended to be "temporary-only" object.

I see the following code (line 376):

#define gDirectory (ROOT::Internal::TDirectoryAtomicAdapter{})

So TDirectoryAtomicAdapter is user facing, because gDirectory is commonly used.

Is

auto myDirToBePassedToOtherThread = gDirectory;

fine? I guess the key here is the passing-to-other-thread, which would always happen as TDirectory* and not auto?

Can this line be problematic if there is task nesting?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And btw

#define gDirectory (ROOT::Internal::TDirectoryAtomicAdapter{})

also makes the Internal namespace a bit dubious...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#define gDirectory (ROOT::Internal::TDirectoryAtomicAdapter{})
also makes the Internal namespace a bit dubious...

I think it clearly mark the intention that even-though the macro gDirectory is implemented in term of that class, user should still never ever use it directly.

I guess the key here is the passing-to-other-thread, which would always happen as TDirectory* and not auto?

Right. Unless the user explicitly use the Internal class as a parameter or data member, then it should not be a problem. (I suppose an implicit way would be to use decltype(gDirectory) but I am not any extra complexity added to support (and/or explicitly prevent) this case would be worth it).

Can this line be problematic if there is task nesting?

Yes absolutely, but not because of any of the implementation details. To support gDirectory in a concurrent environment we elected to make it (semantically and now literally) thread-local. Because it is thread-local, it can not and should be rely upon if the user is using a task based environment (we can not know from within TFile/TDirectory if it is the case or not).

Is auto myDirToBePassedToOtherThread = gDirectory;
fine?

It is sorta fine (but the sorta is pre-existing this PR). Since the lifetime of that object is the that of its function's execution which is by definition shorter than the lifetime of the thread. I wish that even in this case the type used by auto would be TDirectory*, in part because the above code is the moral equivalent of (if gDirectory was still a static TDirectory*) TDirectory *& localVar = gDirectory ... that is an issue but it should be tackled in a different PR.

for(auto ptr : fGDirectories) {
if (ptr->load() == this) {
(*ptr) = nullptr;
for (auto &ptr : fGDirectories) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this part. This is very complex, and the amount of comments that explain intent and design of this seems inadequate for people like me trying to understand this.

core/base/src/TDirectory.cxx Show resolved Hide resolved
core/base/src/TDirectory.cxx Show resolved Hide resolved
core/base/src/TDirectory.cxx Outdated Show resolved Hide resolved
@phsft-bot
Copy link
Collaborator

Starting build on ROOT-debian10-i386/soversion, ROOT-performance-centos8-multicore/cxx17, ROOT-ubuntu18.04/nortcxxmod, ROOT-ubuntu2004/python3, mac12/noimt, mac11/cxx14, windows10/cxx14
How to customize builds

@phsft-bot
Copy link
Collaborator

Build failed on mac11/cxx14.
Running on macphsft23.dyndns.cern.ch:/Users/sftnight/build/workspace/root-pullrequests-build
See console output.

Failing tests:

@phsft-bot
Copy link
Collaborator

Starting build on ROOT-debian10-i386/soversion, ROOT-performance-centos8-multicore/cxx17, ROOT-ubuntu18.04/nortcxxmod, ROOT-ubuntu2004/python3, mac12/noimt, mac11/cxx14, windows10/cxx14
How to customize builds

It was only lock the access to a atomic that since it is held by a shared pointer is guarantee to exist
when it was used in connjunction with the lock
@phsft-bot
Copy link
Collaborator

Starting build on ROOT-debian10-i386/soversion, ROOT-performance-centos8-multicore/cxx17, ROOT-ubuntu18.04/nortcxxmod, ROOT-ubuntu2004/python3, mac12/noimt, mac11/cxx14, windows10/cxx14
How to customize builds

@phsft-bot
Copy link
Collaborator

Starting build on ROOT-debian10-i386/soversion, ROOT-performance-centos8-multicore/cxx17, ROOT-ubuntu18.04/nortcxxmod, ROOT-ubuntu2004/python3, mac12/noimt, mac11/cxx14, windows10/cxx14
How to customize builds

@phsft-bot
Copy link
Collaborator

Build failed on mac11/cxx14.
Running on macphsft20.dyndns.cern.ch:/Users/sftnight/build/workspace/root-pullrequests-build
See console output.

Errors:

  • [2023-01-19T19:44:11.181Z] FAILED: VECCORE-prefix/src/VECCORE-stamp/VECCORE-download /Users/sftnight/build/workspace/root-pullrequests-build/build/VECCORE-prefix/src/VECCORE-stamp/VECCORE-download
  • [2023-01-19T19:44:11.181Z] CMake Error at VECCORE-stamp/VECCORE-download-Release.cmake:49 (message):
  • [2023-01-19T19:44:11.952Z] FAILED: VC-prefix/src/VC-stamp/VC-download /Users/sftnight/build/workspace/root-pullrequests-build/build/VC-prefix/src/VC-stamp/VC-download
  • [2023-01-19T19:44:11.952Z] CMake Error at VC-stamp/VC-download-Release.cmake:49 (message):
  • [2023-01-19T19:44:13.320Z] FAILED: CFITSIO-prefix/src/CFITSIO-stamp/CFITSIO-download /Users/sftnight/build/workspace/root-pullrequests-build/build/CFITSIO-prefix/src/CFITSIO-stamp/CFITSIO-download
  • [2023-01-19T19:44:13.320Z] CMake Error at CFITSIO-stamp/CFITSIO-download-Release.cmake:49 (message):

@pcanal
Copy link
Member Author

pcanal commented Jan 20, 2023

The one failure is unrelated (externals not downloading).

@pcanal pcanal merged commit 8199acd into root-project:master Jan 20, 2023
@pcanal pcanal deleted the master-11907 branch January 20, 2023 13:11
@hahnjo
Copy link
Member

hahnjo commented Jan 23, 2023

The AddressSanitizer build seems quite unhappy about this: https://lcgapp-services.cern.ch/root-jenkins/job/root-asan/LABEL=ROOT-centos8,SPEC=asan,V=master/1037/console. Also we have lots of failures in treetreeplayertestUnit since this merge... (cc @Axel-Naumann; I believe this change also went into 6.28, which could / should potentially block the release)

@Axel-Naumann
Copy link
Member

@pcanal I'll wait for your verdict - as Jonas points out, the sooner the better!

@pcanal
Copy link
Member Author

pcanal commented Jan 23, 2023

It is fixed, see #12090 which had been already added as part of the v6.28 PR.

@hahnjo
Copy link
Member

hahnjo commented Jan 23, 2023

It is fixed, see #12090 which had been already added as part of the v6.28 PR.

Ok. We could avoid such problems by first merging everything into master and then backporting...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

thread local gDirectory not properly updated when another delete the file its point to.
4 participants