-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New dictionary-related crash in 6.24.00-patches #7657
Comments
Note that we observed the same type of crash in some other places as well in ATLAS, which could also be handled by cleaning up the dictionaries of the affected libraries. So the error has an upside as well. (Though a simple error message from ROOT would've been more welcome. 😛) (I mainly just wrote this comment to get notifications about this issue's evolution... 😉) |
What a fantastic bug report, thank you SO much, @scott-snyder ! I'll have it fixed in the coming days / hopefully tomorrow. I'll keep you posted. (Just stating the obvious: this is blocking v6.24/00.) |
When a Transaction is unloaded for which *also* a ScopedTransactionRAII is waiting, the latter will potentially access a deleted Transaction. It can be deleted due to the Pool being full. Even if it is not deleted (as is the case in root-project#7657 ) it will get pushed into the Pool once by unload() and a second time by the RAII! The two options to track this case were full-blown ref counting or noting that a Transaction is attached to an RAII scope. If that is the case, freeing the Transaction must be skipped, as the RAII will take care of it.
When a Transaction is unloaded for which *also* a ScopedTransactionRAII is waiting, the latter will potentially access a deleted Transaction. It can be deleted due to the Pool being full. Even if it is not deleted (as is the case in root-project#7657 ) it will get pushed into the Pool once by unload() and a second time by the RAII! The two options to track this case were full-blown ref counting or noting that a Transaction is attached to an RAII scope. If that is the case, freeing the Transaction must be skipped, as the RAII will take care of it.
When a Transaction is unloaded for which *also* a ScopedTransactionRAII is waiting, the latter will potentially access a deleted Transaction. It can be deleted due to the Pool being full. Even if it is not deleted (as is the case in root-project#7657 ) it will get pushed into the Pool once by unload() and a second time by the RAII! The two options to track this case were full-blown ref counting or noting that a Transaction is attached to an RAII scope. If that is the case, freeing the Transaction must be skipped, as the RAII will take care of it.
a cheap way to notice what went wrong in root-project#7657.
When a Transaction is unloaded for which *also* a ScopedTransactionRAII is waiting, the latter will potentially access a deleted Transaction. It can be deleted due to the Pool being full. Even if it is not deleted (as is the case in root-project#7657 ) it will get pushed into the Pool once by unload() and a second time by the RAII! The two options to track this case were full-blown ref counting or noting that a Transaction is attached to an RAII scope. If that is the case, freeing the Transaction must be skipped, as the RAII will take care of it.
a cheap way to notice what went wrong in root-project#7657.
When a Transaction is unloaded for which *also* a ScopedTransactionRAII is waiting, the latter will potentially access a deleted Transaction. It can be deleted due to the Pool being full. Even if it is not deleted (as is the case in root-project#7657 ) it will get pushed into the Pool once by unload() and a second time by the RAII! The two options to track this case were full-blown ref counting or noting that a Transaction is attached to an RAII scope. If that is the case, freeing the Transaction must be skipped, as the RAII will take care of it.
a cheap way to notice what went wrong in root-project#7657.
@scott-snyder @krasznaa is there any chance that you can confirm #7767 (for v6-24-00-patches) or #7752 (for master) is resolving this for ATLAS? |
I'm not 100% sure about this, but #7767 does not seem to be ABI compatible with the current head of Testing an ABI compatible change, while a bit cumbersome, would be doable. But testing a change that's not ABI compatible would require rebuilding pretty much all our code. Including all the parts of LCG that depend on ROOT. That we are just not set up to be able to do. 😦 But Scott may have a different view... |
It is not ABI compatible per se but for your purpose it is sufficiently so. The code in cling/llvm is only used/access from libCling which should only be accessed through the TIntepre ter interface (i.e. no user code should need to be rebuild). |
In that case Scott will probably try this out himself. But if he doesn't, I'll give it a go tomorrow. 😉 |
hi - This fixes at least the specific crashes i was looking at here. |
a cheap way to notice what went wrong in root-project#7657.
Outer RAIIs might still reference the Transaction, and unload is assuming that it owns the transaction and can delete it / put it into the TransactionPool. This fixes root-project#7657 We still need to track ownership as what has happened here (unload of a Transaction held by an RAII) can happen again / elsewhere. This will be addressed by a subsequent PR in master.
a cheap way to notice what went wrong in root-project#7657.
Outer RAIIs might still reference the Transaction, and unload is assuming that it owns the transaction and can delete it / put it into the TransactionPool. This fixes root-project#7657 We still need to track ownership as what has happened here (unload of a Transaction held by an RAII) can happen again / elsewhere. This will be addressed by a subsequent PR in master.
Outer RAIIs might still reference the Transaction, and unload is assuming that it owns the transaction and can delete it / put it into the TransactionPool. This fixes root-project#7657 We still need to track ownership as what has happened here (unload of a Transaction held by an RAII) can happen again / elsewhere. This will be addressed by a subsequent PR in master.
Outer RAIIs might still reference the Transaction, and unload is assuming that it owns the transaction and can delete it / put it into the TransactionPool. This fixes root-project#7657 We still need to track ownership as what has happened here (unload of a Transaction held by an RAII) can happen again / elsewhere. This will be addressed by a subsequent PR in master. (cherry picked from commit 0202a4b)
I had to redo the PRs; I now have the following:
Could you let me know whether that also solves the issues you saw? Any one of them is enough. Thank you, @krasznaa @scott-snyder ! |
Yes, that also seems to fix them. |
Thank you very much, @scott-snyder - both for this excellent analysis of the problem and the your quick turnaround testing! |
a cheap way to notice what went wrong in #7657.
Outer RAIIs might still reference the Transaction, and unload is assuming that it owns the transaction and can delete it / put it into the TransactionPool. This fixes #7657 We still need to track ownership as what has happened here (unload of a Transaction held by an RAII) can happen again / elsewhere. This will be addressed by a subsequent PR in master. (cherry picked from commit 0202a4b)
a cheap way to notice what went wrong in root-project/root#7657.
Outer RAIIs might still reference the Transaction, and unload is assuming that it owns the transaction and can delete it / put it into the TransactionPool. This fixes root-project/root#7657 We still need to track ownership as what has happened here (unload of a Transaction held by an RAII) can happen again / elsewhere. This will be addressed by a subsequent PR in master.
hi -
In the past few days, we've been seeing a new dictionary-related crash
in the atlas builds using root-6.24.00-patches.
I apologize that i don't have a standalone reproducer for this, but it
can be reproduced readily on lxplus in a recent atlas nightly:
Note that the problem does not occur if run against the dbg root libraries.
Here is the stack trace:
The ultimate trigger here is what it usually is --- trying to find the TClass
for something for which no dictionary was generated, leading to autoparsing.
I can get rid of the crash in the atlas build by generating dictionaries
for these classes (here, various specialization of
CondMultChanCollection<LArConditionsSubset<T> >
).However, i also looked at bit at the root cause of the crash.
The summary is that it seems to involve a Transaction object being released
multiple times back to the TransactionPool. And the fact that TransactionPool
is used only in opt builds is why we don't see this crash if libCling
is compiled with debugging.
The crash in RegisterModule happens in the last line here:
because the Transaction T we get back is empty (size()==0),
so T->getFirstDecl() is undefined.
The reason this happens is because we have the same Transaction object
entered twice in the TransactionPool. Interpreter::declare gets the Transaction
object from the pool and fills it in. It has a valid entry at this time.
But it then creates a nested transation. It gets a new pointer from the
pool, which points at the same Transaction object still in use.
When the constructor is run, the entry in that Transaction object is erased,
ultimately leading to getting an empty transaction above.
Now, the TransactionPool gets messed up on an earlier GetClass call,
at this code in LookupHelper.cxx:
The call stack here is
The unload() call here may free the Transaction T
(via IncrementalParser::deregisterTransaction) and return it to the pool.
But the code in findScope is protected by
The dtor of this runs when we hit the return above --- and this will
end up releasing the same Transaction object to the pool.
This version of the code in findScope was introduced by this change:
As a workaround, this change makes the crash go away:
but this doesn't feel like a proper fix.
I apologize again for not having a standalone reproducer for this,
but i hope this is helpful anyway.
The text was updated successfully, but these errors were encountered: