-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix ROOT I/O issues with HepMCProduct #26022
Comments
A new Issue was created by @Dr15Jones Chris Jones. @davidlange6, @Dr15Jones, @smuzaffar, @fabiocos, @kpedro88 can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
assign generator, core |
@pcanal FYI |
New categories assigned: core @Dr15Jones,@smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks |
assign generators |
New categories assigned: generators @alberto-sanchez,@perrozzi,@efeyazgan,@qliphy you have been requested to review this Pull request/Issue and eventually sign? Thanks |
I just did a simple test to check the recursion depth for the original ROOT algorithm and for one based on the proposed changes to HepMC. Unfortunately, this proposal actually made the stack larger, not smaller. It went from 1278 to 1350. |
Philippe and I think we might be able to play with the ROOT I/O rules and work around the problem. The idea would be not to store HepMC::GenVertex::m_particles_in and HepMC::GenParticle::m_production_vertex and reset them on read back. |
By emulating how ROOT tranverses the data structure, I've determined that all we need to do is drop HepMC::GenParticle::m_production_vertex and then the maximum depth is only 25. |
@pcanal I am unable to create io rules for HepMC classes because of this file in ROOT https://github.com/root-project/root/blob/master/etc/class.rules |
what is the error message? (the rule should be cumulative) |
The error while our code checked the class version number
|
To work-around this, you can assign the rule to m_barcode:
which will work as well as the first version since m_barcode is stored/retrieved right after m_end_vertex. |
I assume you meant |
So the iorule caused a segmentation fault when trying to read the old file which was written without the rule.
|
I have worked around the segmentation fault. The crash was caused because GenVertex assumes that it is already holding the GenParticle which causes the erase to do bad things. |
Yes, sorry. |
So I wrote a new file using the new iorules and making one of the members transient. The branch for the new file is
compared to the original file
|
There appears to be a problem with the iorule. The value for m_barcode is incorrect when reading back a file which was written using the new rules. Dumping the streamer info for HepMC::GenParticle we see
One item of note, should the m_barcode and m_generated_mass have the same offset? |
If I got CMSSW_10_6 without the ioread rule and dump the streamer info, I get
|
I have the same ioread problem in CMSSW_10_6 (which uses root 6.12.07) as I do in CMSSW_9_4_7 (which uses root 6.10.08). |
I have provided Philippe with a simple unit test independent of CMSSW which exhibits the failure. |
My attempt at the iorule can be found here: This will not work without a fix in ROOT. |
@smuzaffar could you add to this thread how, from GitHub, to get to the exact version of ROOT that was used for this special version of CMSSW_10_6_X_2019-03-14-1200? That way Philippe can review the extra commits we added. |
In root-project/root#3547 I added a bunch of seeming related commits that are in v6.14 but not in v6.12 ... let me know if they are enough to fix the problem. |
@smuzaffar @fabiocos how would you like to test the latest set of commits from Philippe? |
@Dr15Jones and @pcanal : we started with root commit root-project/root@3f31cef and created cms/v6-12-00-patches/3f31cef branch. The changes we have on top of it are |
@pcanal , can you make a PR identical to root-project/root#3547 for cms-sw/root cms/v6-12-00-patches/3f31cef brach so that we can test them directly in cmssw? |
This cms-sw/root#123 might be what you are looking for. |
thanks @pcanal . cms-bot is running the tests now https://cmssdt.cern.ch/jenkins/job/run-pr-tests/10/console |
@Dr15Jones , @pcanal , tests for cms-sw/root#123 are finished now. All tests passed and I see usual 15 comparison changes. Can you please confirm that this fixes the issue? you can find the PR tests results here |
@smuzaffar to my untrained eye, non of the problems looks worrisome. I would like to get @slava77 's option. |
The tests are not suggesting a disaster (no crash indication in the unit tests). However, the differences in comparisons like at the link below are showing changes that are currently blocking our switch from 6.12 to 6.14 These are indicative of a problem in reading the reco::HitPattern class from AOD inputs older than the recently introduced updates (v12 in the inputs) cmssw/DataFormats/TrackReco/src/classes_def.xml Lines 30 to 107 in 481113a
|
@slava77 just to make sure I understand, this pull request has introduced a bug we see in ROOT 6.14 which is related to iorules, particular on reco::HitPattern? |
@pcanal I did a simple test where I loaded the same file into a version of ROOT 6.12 and ROOT 6.14 with the same CMSSW version. Then I did
ROOT 6.14
|
which one is correct? |
On 3/16/19 9:41 AM, Chris Jones wrote:
I did a simple test where I loaded the same file into a version of ROOT
6.12 and ROOT 6.14 with the same CMSSW version. Then I did
|ROOT 6.12 root [1]
Events->Scan("recoTracks_generalTracks__RECO.obj.hitPattern_.hitPattern[50]")|
I'm unsure what happens in this case.
I was told once that a call like this without a function does not
trigger a constructor call
and direct data member reading may be just as it looks, a direct read
from file.
[50] is out of the array bounds in v12 of the class.
|
I can state for a fact that in the ROOT 6.14 test using 'Scan', that the version 12 read rule is begin run (since I added a print statement to it and see that message). |
Here is the change from v12 to v13 The array in question has size 50 in v12 and size 57 in v13. |
So I checked and the file in question was created using CMSSW_9_4_0. So I ran the same test in that release and get
|
So instead of looking at the bit patterns, I decided to call functions on the class to see what results we get CMSSW_9_4_0
CMSSW_10_6_X_2019-03-15-2300 (which uses the non patched ROOT 6.12)
CMSSW_10_6_ROOT614_X_2019-03-15-2300
So the beginning of the list look identical. |
So I used a very simple FWLite program
and ran the result on both 6.12 and 6.14 and then compared the output, it was identical. @slava77 do you know what differences were seen in the hit patterns? |
I extended my FWLite program to dump the values returned for all const member functions of
Surprisingly, the discrepencies were never very large
Where the first number is the event count (only 1 in this case) the second number is the track number, the third number is which function I'm calling and the number after |
At a quick first glance, 6_14 has fewer 'nulls' but more 'inactive' hits. |
So I added print statements inside the IO rule right after newObj is filled:
In both the ROOT 6.12 and 6.14 versions of CMSSW_10_6, the code prints the same thing
However, for the 6.12 case, once the So I think what is happening is the order in with OTHER variables in the |
@Dr15Jones |
Yes. The algorithm is flawed and just happens to work on 6.12. |
@Dr15Jones @smuzaffar with the latest ROOT614 builds the IB looks clean. I have tried a long run (500 events, runTheMatrix.py -l 10824.0 -t 8 --command "-n 500 --customise Validation/Performance/TimeMemorySummary.py" ), and I got it to succeeed without issues on cmsdev25 |
+core |
Recently CMS production has hit a number of cases where reading back HepMCProduct from a file causes a stack overflow. Philippe Canal and I investigated the issue and we believe we understand what is occuring.
HepMCProduct stores a HepMC::GenEvent. HepMC::GenEvent holds two maps, one of all HepMC::GenParticles the other of all HepMC::GenVertex-s. In addition, HepMC::GenParticle holds a pointer to its production HepMC::GenVertex and a pointer to its end HepMC::GenVertex. HepMC::GenVertex holds a std::vector of HepMC::GenParticles coming into to the vertex and another std::vector of those going out.
This means, from any one HepMC::GenVertex or HepMC::GenParticle, one could follow the various pointers and traverse the entire decay tree structure. In fact, we think this is what is happening. When storing an object, ROOT starts at its first member data, fully writes it out then goes to the next. For both HepMC::GenVertex and HepMC::GenParticle, the member data holding pointers to nodes earlier in the decay tree appear before those holding pointers to farther down the decay tree. This means when ROOT starts following pointers, it first goes up the decay tree before coming down and then keeps trying to go up until it reaches a pointer it has already seen. This effective causes a 'breadth-first' search of the tree all the while pushing more and more functions onto the call stack as it does the traversal.
We should be able to minimize the problem by changing the order in which member variables appear in HepMC::GenVertex and HepMC::GenParticle.
For HepMC::GenVertex, we swap the order of m_particles_in and m_particles_out.
For HepMC::GenParticle, we swap the order of m_production_vertex and m_end_vertex.
Such a change will make ROOT do a depth first search through the decay tree which we expect to be a much shorter call stack.
The text was updated successfully, but these errors were encountered: