Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix ROOT I/O issues with HepMCProduct #26022

Closed
Dr15Jones opened this issue Feb 26, 2019 · 87 comments
Closed

Fix ROOT I/O issues with HepMCProduct #26022

Dr15Jones opened this issue Feb 26, 2019 · 87 comments

Comments

@Dr15Jones
Copy link
Contributor

Recently CMS production has hit a number of cases where reading back HepMCProduct from a file causes a stack overflow. Philippe Canal and I investigated the issue and we believe we understand what is occuring.

HepMCProduct stores a HepMC::GenEvent. HepMC::GenEvent holds two maps, one of all HepMC::GenParticles the other of all HepMC::GenVertex-s. In addition, HepMC::GenParticle holds a pointer to its production HepMC::GenVertex and a pointer to its end HepMC::GenVertex. HepMC::GenVertex holds a std::vector of HepMC::GenParticles coming into to the vertex and another std::vector of those going out.

This means, from any one HepMC::GenVertex or HepMC::GenParticle, one could follow the various pointers and traverse the entire decay tree structure. In fact, we think this is what is happening. When storing an object, ROOT starts at its first member data, fully writes it out then goes to the next. For both HepMC::GenVertex and HepMC::GenParticle, the member data holding pointers to nodes earlier in the decay tree appear before those holding pointers to farther down the decay tree. This means when ROOT starts following pointers, it first goes up the decay tree before coming down and then keeps trying to go up until it reaches a pointer it has already seen. This effective causes a 'breadth-first' search of the tree all the while pushing more and more functions onto the call stack as it does the traversal.

We should be able to minimize the problem by changing the order in which member variables appear in HepMC::GenVertex and HepMC::GenParticle.

For HepMC::GenVertex, we swap the order of m_particles_in and m_particles_out.
For HepMC::GenParticle, we swap the order of m_production_vertex and m_end_vertex.

Such a change will make ROOT do a depth first search through the decay tree which we expect to be a much shorter call stack.

@cmsbuild
Copy link
Contributor

A new Issue was created by @Dr15Jones Chris Jones.

@davidlange6, @Dr15Jones, @smuzaffar, @fabiocos, @kpedro88 can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@Dr15Jones
Copy link
Contributor Author

assign generator, core

@Dr15Jones
Copy link
Contributor Author

@pcanal FYI

@cmsbuild
Copy link
Contributor

New categories assigned: core

@Dr15Jones,@smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks

@kpedro88
Copy link
Contributor

assign generators

@cmsbuild
Copy link
Contributor

New categories assigned: generators

@alberto-sanchez,@perrozzi,@efeyazgan,@qliphy you have been requested to review this Pull request/Issue and eventually sign? Thanks

@Dr15Jones
Copy link
Contributor Author

I just did a simple test to check the recursion depth for the original ROOT algorithm and for one based on the proposed changes to HepMC. Unfortunately, this proposal actually made the stack larger, not smaller. It went from 1278 to 1350.

@Dr15Jones
Copy link
Contributor Author

Philippe and I think we might be able to play with the ROOT I/O rules and work around the problem.

The idea would be not to store HepMC::GenVertex::m_particles_in and HepMC::GenParticle::m_production_vertex and reset them on read back.

@Dr15Jones Dr15Jones reopened this Feb 26, 2019
@Dr15Jones
Copy link
Contributor Author

By emulating how ROOT tranverses the data structure, I've determined that all we need to do is drop HepMC::GenParticle::m_production_vertex and then the maximum depth is only 25.

@Dr15Jones
Copy link
Contributor Author

@pcanal I am unable to create io rules for HepMC classes because of this file in ROOT

https://github.com/root-project/root/blob/master/etc/class.rules

@pcanal
Copy link
Contributor

pcanal commented Feb 26, 2019

what is the error message? (the rule should be cumulative)

@Dr15Jones
Copy link
Contributor Author

The error while our code checked the class version number

>> Checking EDM Class Version for src/SimDataFormats/GeneratorProducts/src/classes_def.xml in libSimDataFormatsGeneratorProducts.so
TGenericClassInfo:0: RuntimeWarning: The rule for class: "HepMC::GenParticle": version, "[1-]" and data members: "m_end_vertex" has been skipped because The existing rule is:
   HepMC::GenParticle m_end_vertex attributes="NotOwner" 
and the ignored rule is:
   type=read sourceClass="HepMC::GenParticle" targetClass="HepMC::GenParticle" version="[1-]" source="HepMC::GenVertex* m_end_vertex" target="m_end_vertex" code="{m_end_vertex = onfile.m_end_vertex;
                  if(m_end_vertex) { m_end_vertex->add_particle_in(newObj); }
}" .
.

@pcanal
Copy link
Contributor

pcanal commented Feb 27, 2019

To work-around this, you can assign the rule to m_barcode:

type=read sourceClass="HepMC::GenParticle" targetClass="HepMC::GenParticle" version="[1-]" source="int m_end_vertex" target="m_barcode" code="{m_barcode = onfile.m_barcode;
                  if(newObj->end_vertex()) { newObj->end_vertex()->add_particle_in(newObj); }

which will work as well as the first version since m_barcode is stored/retrieved right after m_end_vertex.

@Dr15Jones
Copy link
Contributor Author

I assume you meant source="int m_barcode" not source="int m_end_vertex"

@Dr15Jones
Copy link
Contributor Author

So the iorule caused a segmentation fault when trying to read the old file which was written without the rule.

#4  <signal handler called>
#5  0x00000035dbc89c28 in _wordcopy_fwd_aligned () from /lib64/libc.so.6
#6  0x00000035dbc83a42 in memmove () from /lib64/libc.so.6
#7  0x00007fe620b9c6a0 in std::__copy_move<false, true, std::random_access_iterator_tag>::__copy_m<HepMC::GenParticle*> (__result=<optimized out>, __last=<optimized out>, __first=<optimized out>) at /build/cmsbld/auto-builds/CMSSW_9_0_0_pre5-slc6_amd64_gcc630/build/CMSSW_9_0_0_pre5-build/slc6_amd64_gcc630/external/gcc/6.3.0/include/c++/6.3.0/bits/stl_algobase.h:368
#8  std::__copy_move_a<false, HepMC::GenParticle**, HepMC::GenParticle**> (__result=<optimized out>, __last=<optimized out>, __first=<optimized out>) at /build/cmsbld/auto-builds/CMSSW_9_0_0_pre5-slc6_amd64_gcc630/build/CMSSW_9_0_0_pre5-build/slc6_amd64_gcc630/external/gcc/6.3.0/include/c++/6.3.0/bits/stl_algobase.h:386
#9  std::__copy_move_a2<false, __gnu_cxx::__normal_iterator<HepMC::GenParticle**, std::vector<HepMC::GenParticle*, std::allocator<HepMC::GenParticle*> > >, __gnu_cxx::__normal_iterator<HepMC::GenParticle**, std::vector<HepMC::GenParticle*, std::allocator<HepMC::GenParticle*> > > > (__result=..., __last=..., __first=...) at /build/cmsbld/auto-builds/CMSSW_9_0_0_pre5-slc6_amd64_gcc630/build/CMSSW_9_0_0_pre5-build/slc6_amd64_gcc630/external/gcc/6.3.0/include/c++/6.3.0/bits/stl_algobase.h:422
#10 std::copy<__gnu_cxx::__normal_iterator<HepMC::GenParticle**, std::vector<HepMC::GenParticle*, std::allocator<HepMC::GenParticle*> > >, __gnu_cxx::__normal_iterator<HepMC::GenParticle**, std::vector<HepMC::GenParticle*, std::allocator<HepMC::GenParticle*> > > > (__result=..., __last=..., __first=...) at /build/cmsbld/auto-builds/CMSSW_9_0_0_pre5-slc6_amd64_gcc630/build/CMSSW_9_0_0_pre5-build/slc6_amd64_gcc630/external/gcc/6.3.0/include/c++/6.3.0/bits/stl_algobase.h:456
#11 std::vector<HepMC::GenParticle*, std::allocator<HepMC::GenParticle*> >::_M_erase (this=this@entry=0x7fe6176f3f28, __position=...) at /build/cmsbld/auto-builds/CMSSW_9_0_0_pre5-slc6_amd64_gcc630/build/CMSSW_9_0_0_pre5-build/slc6_amd64_gcc630/external/gcc/6.3.0/include/c++/6.3.0/bits/vector.tcc:145
#12 0x00007fe620b9b1dd in std::vector<HepMC::GenParticle*, std::allocator<HepMC::GenParticle*> >::erase (__position=..., this=0x7fe6176f3f28) at /build/cmsbld/auto-builds/CMSSW_9_0_0_pre5-slc6_amd64_gcc630/build/CMSSW_9_0_0_pre5-build/slc6_amd64_gcc630/external/gcc/6.3.0/include/c++/6.3.0/bits/stl_vector.h:1151
#13 HepMC::GenVertex::remove_particle_in (this=0x7fe6176f3f00, particle=<optimized out>) at GenVertex.cc:320
#14 0x00007fe620b9b203 in HepMC::GenVertex::add_particle_in (this=0x7fe6176f3f00, inparticle=<optimized out>) at GenVertex.cc:278
#15 0x00007fe635993dd5 in int TStreamerInfo::ReadBufferArtificial<char**>(TBuffer&, char** const&, TStreamerElement*, int, int) () from /cvmfs/cms.cern.ch/slc6_amd64_gcc630/cms/cmssw/CMSSW_9_4_7/external/slc6_amd64_gcc630/lib/libRIO.so
#16 0x00007fe635a06711 in int TStreamerInfo::ReadBuffer<char**>(TBuffer&, char** const&, TStreamerInfo::TCompInfo* const*, int, int, int, int, int) () from /cvmfs/cms.cern.ch/slc6_amd64_gcc630/cms/cmssw/CMSSW_9_4_7/external/slc6_amd64_gcc630/lib/libRIO.so
#17 0x00007fe63590229d in TStreamerInfoActions::GenericReadAction(TBuffer&, void*, TStreamerInfoActions::TConfiguration const*) () from /cvmfs/cms.cern.ch/slc6_amd64_gcc630/cms/cmssw/CMSSW_9_4_7/external/slc6_amd64_gcc630/lib/libRIO.so
#18 0x00007fe63586ea75 in TBufferFile::ApplySequence(TStreamerInfoActions::TActionSequence const&, void*) () from /cvmfs/cms.cern.ch/slc6_amd64_gcc630/cms/cmssw/CMSSW_9_4_7/external/slc6_amd64_gcc630/lib/libRIO.so
#19 0x00007fe6358722e1 in TBufferFile::ReadClassBuffer(TClass const*, void*, TClass const*) () from /cvmfs/cms.cern.ch/slc6_amd64_gcc630/cms/cmssw/CMSSW_9_4_7/external/slc6_amd64_gcc630/lib/libRIO.so
#20 0x00007fe6358768ab in TBufferFile::ReadObjectAny(TClass const*) () from /cvmfs/cms.cern.ch/slc6_amd64_gcc630/cms/cmssw/CMSSW_9_4_7/external/slc6_amd64_gcc630/lib/libRIO.so
#21 0x00007fe6358d2029 in TGenCollectionStreamer::ReadObjects(int, TBuffer&, TClass const*) () from /cvmfs/cms.cern.ch/slc6_amd64_gcc630/cms/cmssw/CMSSW_9_4_7/external/slc6_amd64_gcc630/lib/libRIO.so
#22 0x00007fe6358d2db7 in TGenCollectionStreamer::ReadBufferGeneric(TBuffer&, void*, TClass const*) () from /cvmfs/cms.cern.ch/slc6_amd64_gcc630/cms/cmssw/CMSSW_9_4_7/external/slc6_amd64_gcc630/lib/libRIO.so
#23 0x00007fe63586f1ea in TBufferFile::ReadFastArray(void*, TClass const*, int, TMemberStreamer*, TClass const*) () from /cvmfs/cms.cern.ch/slc6_amd64_gcc630/cms/cmssw/CMSSW_9_4_7/external/slc6_amd64_gcc630/lib/libRIO.so
...
#120 0x00007fe63590229d in TStreamerInfoActions::GenericReadAction(TBuffer&, void*, TStreamerInfoActions::TConfiguration const*) () from /cvmfs/cms.cern.ch/slc6_amd64_gcc630/cms/cmssw/CMSSW_9_4_7/external/slc6_amd64_gcc630/lib/libRIO.so
#121 0x00007fe63586ea75 in TBufferFile::ApplySequence(TStreamerInfoActions::TActionSequence const&, void*) () from /cvmfs/cms.cern.ch/slc6_amd64_gcc630/cms/cmssw/CMSSW_9_4_7/external/slc6_amd64_gcc630/lib/libRIO.so
#122 0x00007fe6358722e1 in TBufferFile::ReadClassBuffer(TClass const*, void*, TClass const*) () from /cvmfs/cms.cern.ch/slc6_amd64_gcc630/cms/cmssw/CMSSW_9_4_7/external/slc6_amd64_gcc630/lib/libRIO.so
#123 0x00007fe635a064ca in int TStreamerInfo::ReadBuffer<char**>(TBuffer&, char** const&, TStreamerInfo::TCompInfo* const*, int, int, int, int, int) () from /cvmfs/cms.cern.ch/slc6_amd64_gcc630/cms/cmssw/CMSSW_9_4_7/external/slc6_amd64_gcc630/lib/libRIO.so
#124 0x00007fe63590229d in TStreamerInfoActions::GenericReadAction(TBuffer&, void*, TStreamerInfoActions::TConfiguration const*) () from /cvmfs/cms.cern.ch/slc6_amd64_gcc630/cms/cmssw/CMSSW_9_4_7/external/slc6_amd64_gcc630/lib/libRIO.so
#125 0x00007fe63586ea75 in TBufferFile::ApplySequence(TStreamerInfoActions::TActionSequence const&, void*) () from /cvmfs/cms.cern.ch/slc6_amd64_gcc630/cms/cmssw/CMSSW_9_4_7/external/slc6_amd64_gcc630/lib/libRIO.so
#126 0x00007fe635e14e1d in TBranchElement::ReadLeavesMember(TBuffer&) () from /cvmfs/cms.cern.ch/slc6_amd64_gcc630/cms/cmssw/CMSSW_9_4_7/external/slc6_amd64_gcc630/lib/libTree.so
#127 0x00007fe635e0807a in TBranch::GetEntry(long long, int) () from /cvmfs/cms.cern.ch/slc6_amd64_gcc630/cms/cmssw/CMSSW_9_4_7/external/slc6_amd64_gcc630/lib/libTree.so
#128 0x00007fe635e20011 in TBranchElement::GetEntry(long long, int) () from /cvmfs/cms.cern.ch/slc6_amd64_gcc630/cms/cmssw/CMSSW_9_4_7/external/slc6_amd64_gcc630/lib/libTree.so
#129 0x00007fe635e1ffca in TBranchElement::GetEntry(long long, int) () from /cvmfs/cms.cern.ch/slc6_amd64_gcc630/cms/cmssw/CMSSW_9_4_7/external/slc6_amd64_gcc630/lib/libTree.so
#130 0x00007fe6274e8b85 in edm::RootTree::getEntry(TBranch*, long long) const () from /cvmfs/cms.cern.ch/slc6_amd64_gcc630/cms/cmssw/CMSSW_9_4_7/lib/slc6_amd64_gcc630/pluginIOPoolInput.so
#131 0x00007fe6274f38a3 in edm::RootDelayedReader::getProduct_(edm::BranchKey const&, edm::EDProductGetter const*) () from /cvmfs/cms.cern.ch/slc6_amd64_gcc630/cms/cmssw/CMSSW_9_4_7/lib/slc6_amd64_gcc630/pluginIOPoolInput.so
#132 0x00007fe63626d06b in edm::DelayedReader::getProduct(edm::BranchKey const&, edm::EDProductGetter const*, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/slc6_amd64_gcc630/cms/cmssw/CMSSW_9_4_7/lib/slc6_amd64_gcc630/libFWCoreFramework.so
#133 0x00007fe63623d9f9 in edm::InputProductResolver::prefetchAsync_(edm::WaitingTask*, edm::Principal const&, bool, edm::SharedResourcesAcquirer*, edm::ModuleCallingContext const*) const::{lambda()#1}::operator()() const () from /cvmfs/cms.cern.ch/slc6_amd64_gcc630/cms/cmssw/CMSSW_9_4_7/lib/slc6_amd64_gcc630/libFWCoreFramework.so
#134 0x00007fe63623dc3a in edm::SerialTaskQueue::QueuedTask<void edm::SerialTaskQueueChain::push<edm::InputProductResolver::prefetchAsync_(edm::WaitingTask*, edm::Principal const&, bool, edm::SharedResourcesAcquirer*, edm::ModuleCallingContext const*) const::{lambda()#1}>(edm::InputProductResolver::prefetchAsync_(edm::WaitingTask*, edm::Principal const&, bool, edm::SharedResourcesAcquirer*, edm::ModuleCallingContext const*) const::{lambda()#1} const&)::{lambda()#1}>::execute() () from /cvmfs/cms.cern.ch/slc6_amd64_gcc630/cms/cmssw/CMSSW_9_4_7/lib/slc6_amd64_gcc630/libFWCoreFramework.so
#135 0x00007fe634e8f22c in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all (this=0x7fe633e07200, parent=..., child=<optimized out>) at ../../src/tbb/custom_scheduler.h:509
#136 0x00007fe6362dc1cc in edm::EventProcessor::readAndProcessEvents() () from /cvmfs/cms.cern.ch/slc6_amd64_gcc630/cms/cmssw/CMSSW_9_4_7/lib/slc6_amd64_gcc630/libFWCoreFramework.so
#137 0x00007fe6362e6e91 in edm::EventProcessor::runToCompletion() () from /cvmfs/cms.cern.ch/slc6_amd64_gcc630/cms/cmssw/CMSSW_9_4_7/lib/slc6_amd64_gcc630/libFWCoreFramework.so
#138 0x000000000040e7e2 in main::{lambda()#1}::operator()() const ()
#139 0x000000000040d19a in main ()

@Dr15Jones
Copy link
Contributor Author

I have worked around the segmentation fault. The crash was caused because GenVertex assumes that it is already holding the GenParticle which causes the erase to do bad things.

@pcanal
Copy link
Contributor

pcanal commented Feb 27, 2019

I assume you meant source="int m_barcode" not source="int m_end_vertex"

Yes, sorry.

@Dr15Jones
Copy link
Contributor Author

So I wrote a new file using the new iorules and making one of the members transient.

The branch for the new file is

*............................................................................*
*Br    9 :edmHepMCProduct_generatorSmeared__SIM.obj : edm::HepMCProduct      *
*Entries :     2000 : Total  Size= 1072104096 bytes  File Size  =  301897458 *
*Baskets :      388 : Basket Size=    3802112 bytes  Compression=   3.55     *
*............................................................................*

compared to the original file

*............................................................................*
*Br    9 :edmHepMCProduct_generatorSmeared__SIM.obj : edm::HepMCProduct      *
*Entries :     2000 : Total  Size= 1120449000 bytes  File Size  =  271620582 *
*Baskets :      350 : Basket Size=      16384 bytes  Compression=   4.13     *
*............................................................................*

@Dr15Jones
Copy link
Contributor Author

There appears to be a problem with the iorule. The value for m_barcode is incorrect when reading back a file which was written using the new rules. Dumping the streamer info for HepMC::GenParticle we see

StreamerInfo for class: HepMC::GenParticle, version=10, checksum=0xbb144396
  HepMC::GenParticle@@10 @@alloc         offset=  0 type=1001                     
  HepMC::FourVector m_momentum      offset=  8 type=62 momentum vector     
  int            m_pdg_id        offset= 40 type= 3 id according to PDG convention
  int            m_status        offset= 44 type= 3 As defined for HEPEVT
  HepMC::Flow    m_flow          offset= 48 type=62                     
  HepMC::Polarization m_polarization  offset=112 type=62                     
  HepMC::GenVertex* m_production_vertex offset=144 type=69  (nodelete) null if vacuum or beam
  HepMC::GenVertex *m_production_vertex offset=144 type=1000                     
  HepMC::GenVertex* m_end_vertex    offset=152 type=69  (nodelete) null if not-decayed 
  HepMC::GenVertex *m_end_vertex   offset=152 type=1000                     
  int            m_barcode       offset=168 type= 3  (cached) unique identifier in the event
  int            m_barcode       offset=  0 type= 3  (write) unique identifier in the event
  int            m_barcode       offset=160 type=1000                     
  double         m_generated_mass offset=168 type= 8 mass of this particle when it was generated
  HepMC::GenParticle@@10 @@dealloc       offset=  0 type=1002                     
   i= 0, @@alloc         type=1001, offset=  0, len=1, method=0
   i= 1, m_momentum      type= 62, offset=  8, len=1, method=0
   i= 2, m_pdg_id        type= 23, offset= 40, len=2, method=0 [optimized]
   i= 3, m_flow          type= 62, offset= 48, len=1, method=0
   i= 4, m_polarization  type= 62, offset=112, len=1, method=0
   i= 5, m_production_vertex type= 69, offset=144, len=1, method=0 [nodelete]
   i= 6, *m_production_vertex type=1000, offset=144, len=1, method=0
   i= 7, m_end_vertex    type= 69, offset=152, len=1, method=0 [nodelete]
   i= 8, *m_end_vertex   type=1000, offset=152, len=1, method=0
   i= 9, m_barcode       type=  3, offset=168, len=1, method=0 [cached]
   i=10, m_barcode       type=  3, offset=  0, len=1, method=0 [write]
   i=11, m_barcode       type=1000, offset=160, len=1, method=0
   i=12, m_generated_mass type=  8, offset=168, len=1, method=0
   i=13, @@dealloc       type=1002, offset=  0, len=1, method=0

One item of note, should the m_barcode and m_generated_mass have the same offset?

@Dr15Jones
Copy link
Contributor Author

If I got CMSSW_10_6 without the ioread rule and dump the streamer info, I get

StreamerInfo for class: HepMC::GenParticle, version=10, checksum=0xbb144396
  HepMC::FourVector m_momentum      offset=  8 type=62 momentum vector     
  int            m_pdg_id        offset= 40 type= 3 id according to PDG convention
  int            m_status        offset= 44 type= 3 As defined for HEPEVT
  HepMC::Flow    m_flow          offset= 48 type=62                     
  HepMC::Polarization m_polarization  offset=112 type=62                     
  HepMC::GenVertex* m_production_vertex offset=144 type=69  (nodelete) null if vacuum or beam
  HepMC::GenVertex *m_production_vertex offset=144 type=1000                     
  HepMC::GenVertex* m_end_vertex    offset=152 type=69  (nodelete) null if not-decayed 
  HepMC::GenVertex *m_end_vertex   offset=152 type=1000                     
  int            m_barcode       offset=160 type= 3 unique identifier in the event
  double         m_generated_mass offset=168 type= 8 mass of this particle when it was generated
   i= 0, m_momentum      type= 62, offset=  8, len=1, method=0
   i= 1, m_pdg_id        type= 23, offset= 40, len=2, method=0 [optimized]
   i= 2, m_flow          type= 62, offset= 48, len=1, method=0
   i= 3, m_polarization  type= 62, offset=112, len=1, method=0
   i= 4, m_production_vertex type= 69, offset=144, len=1, method=0 [nodelete]
   i= 5, *m_production_vertex type=1000, offset=144, len=1, method=0
   i= 6, m_end_vertex    type= 69, offset=152, len=1, method=0 [nodelete]
   i= 7, *m_end_vertex   type=1000, offset=152, len=1, method=0
   i= 8, m_barcode       type=  3, offset=160, len=1, method=0
   i= 9, m_generated_mass type=  8, offset=168, len=1, method=0

@Dr15Jones
Copy link
Contributor Author

I have the same ioread problem in CMSSW_10_6 (which uses root 6.12.07) as I do in CMSSW_9_4_7 (which uses root 6.10.08).

@Dr15Jones
Copy link
Contributor Author

I have provided Philippe with a simple unit test independent of CMSSW which exhibits the failure.

@Dr15Jones
Copy link
Contributor Author

My attempt at the iorule can be found here:
https://github.com/Dr15Jones/cmssw/tree/GenEventIORules

This will not work without a fix in ROOT.

@Dr15Jones
Copy link
Contributor Author

@smuzaffar could you add to this thread how, from GitHub, to get to the exact version of ROOT that was used for this special version of CMSSW_10_6_X_2019-03-14-1200? That way Philippe can review the extra commits we added.

@pcanal
Copy link
Contributor

pcanal commented Mar 15, 2019

In root-project/root#3547 I added a bunch of seeming related commits that are in v6.14 but not in v6.12 ... let me know if they are enough to fix the problem.

@Dr15Jones
Copy link
Contributor Author

@smuzaffar @fabiocos how would you like to test the latest set of commits from Philippe?

@smuzaffar
Copy link
Contributor

@Dr15Jones and @pcanal : we started with root commit root-project/root@3f31cef and created cms/v6-12-00-patches/3f31cef branch. The changes we have on top of it are

cms-sw/root@3f31cef...cms-sw:cms/v6-12-00-patches/3f31cef

@smuzaffar
Copy link
Contributor

@pcanal , can you make a PR identical to root-project/root#3547 for cms-sw/root cms/v6-12-00-patches/3f31cef brach so that we can test them directly in cmssw?

@pcanal
Copy link
Contributor

pcanal commented Mar 15, 2019

This cms-sw/root#123 might be what you are looking for.

@smuzaffar
Copy link
Contributor

thanks @pcanal . cms-bot is running the tests now https://cmssdt.cern.ch/jenkins/job/run-pr-tests/10/console

@smuzaffar
Copy link
Contributor

smuzaffar commented Mar 16, 2019

@Dr15Jones , @pcanal , tests for cms-sw/root#123 are finished now. All tests passed and I see usual 15 comparison changes. Can you please confirm that this fixes the issue? you can find the PR tests results here
cms-sw/root#123 (comment)

FYI, @slava77 , @fabiocos

@Dr15Jones
Copy link
Contributor Author

@smuzaffar to my untrained eye, non of the problems looks worrisome. I would like to get @slava77 's option.

@slava77
Copy link
Contributor

slava77 commented Mar 16, 2019

The tests are not suggesting a disaster (no crash indication in the unit tests).

However, the differences in comparisons like at the link below are showing changes that are currently blocking our switch from 6.12 to 6.14
https://cmssdt.cern.ch/SDT/jenkins-artifacts/baseLineComparisons/CMSSW_10_6_X_2019-03-14-2300+c1f413/30807/136.8311_RunJetHT2017F_reminiaod+RunJetHT2017F_reminiaod+REMINIAOD_data2017+HARVEST2017_REMINIAOD_data2017/Tracking_PackedCandidate_lostTracks.html

These are indicative of a problem in reading the reco::HitPattern class from AOD inputs older than the recently introduced updates (v12 in the inputs)

<ioread
sourceClass="reco::HitPattern"
source="uint16_t hitPattern[50]"
targetClass="reco::HitPattern"
target="hitPattern"
version="[12]"
include="utility"
>
<![CDATA[
using namespace reco;
const unsigned short HitSize = 11;
const unsigned short PatternSize = 50;
const int MaxHits = (PatternSize * sizeof(uint16_t) * 8) / HitSize;
auto getHitFromOldHitPattern = [](const uint16_t hitPattern[], const int position) {
uint16_t bitEndOffset = (position + 1) * HitSize;
uint8_t secondWord = (bitEndOffset >> 4);
uint8_t secondWordBits = bitEndOffset & (16 - 1); // that is, bitEndOffset % 32
if (secondWordBits >= HitSize) {
// full block is in this word
uint8_t lowBitsToTrash = secondWordBits - HitSize;
return (hitPattern[secondWord] >> lowBitsToTrash) & ((1 << HitSize) - 1);
} else {
uint8_t firstWordBits = HitSize - secondWordBits;
uint16_t firstWordBlock = hitPattern[secondWord - 1] >> (16 - firstWordBits);
uint16_t secondWordBlock = hitPattern[secondWord] & ((1 << secondWordBits) - 1);
return firstWordBlock + (secondWordBlock << firstWordBits);
}
};
auto appendOldHitPattern = [&](const uint16_t pattern) {
// for this version we just have to add a 0 bit to the top of the pattern
const static unsigned short HitTypeMask = 0x3;
const static unsigned short HitTypeOffset = 0;
const uint16_t VALID_CONST = (uint16_t) TrackingRecHit::valid;
const uint16_t MISSING_CONST = (uint16_t) TrackingRecHit::missing;
const uint16_t INACTIVE_CONST = (uint16_t) TrackingRecHit::inactive;
const uint16_t BAD_CONST = (uint16_t) TrackingRecHit::bad;
uint16_t rawHitType = (pattern >> HitTypeOffset) & HitTypeMask;
TrackingRecHit::Type hitType = TrackingRecHit::valid;
switch (rawHitType) {
case VALID_CONST:
hitType = TrackingRecHit::valid;
break;
case MISSING_CONST:
hitType = TrackingRecHit::missing;
break;
case INACTIVE_CONST:
hitType = TrackingRecHit::inactive;
break;
case BAD_CONST:
hitType = TrackingRecHit::bad;
break;
}
return newObj->appendHit(pattern,hitType);
};
auto fillNewHitPatternWithOldHitPattern = [&](const uint16_t oldHitPattern[]) {
newObj->clear();
for (int i = 0; i < MaxHits; i++) {
uint16_t pattern = getHitFromOldHitPattern(oldHitPattern, i);
if (pattern == 0) {
break;
}
if(!appendOldHitPattern(pattern)) {
return false;
}
}
return true;
};
fillNewHitPatternWithOldHitPattern(onfile.hitPattern);
]]>
</ioread>

@Dr15Jones
Copy link
Contributor Author

@slava77 just to make sure I understand, this pull request has introduced a bug we see in ROOT 6.14 which is related to iorules, particular on reco::HitPattern?

@Dr15Jones
Copy link
Contributor Author

@pcanal I did a simple test where I loaded the same file into a version of ROOT 6.12 and ROOT 6.14 with the same CMSSW version. Then I did

ROOT 6.12
root [1] Events->Scan("recoTracks_generalTracks__RECO.obj.hitPattern_.hitPattern[50]")
***********************************
*    Row   * Instance * recoTrack *
***********************************
*        0 *        0 *         0 *
*        0 *        1 *         0 *
*        0 *        2 *         0 *
*        0 *        3 *         0 *
*        0 *        4 *        17 *
*        0 *        5 *         0 *
*        0 *        6 *     39321 *
*        0 *        7 *     39321 *
*        0 *        8 *     20832 *
*        0 *        9 *      3530 *
*        0 *       10 *         0 *
*        0 *       11 *         0 *
*        0 *       12 *         0 *
*        0 *       13 *         0 *
*        0 *       14 *         0 *
*        0 *       15 *         0 *
*        0 *       16 *         0 *
*        0 *       17 *         0 *
*        0 *       18 *     39321 *
*        0 *       19 *     39321 *
*        0 *       20 *        81 *
*        0 *       21 *         0 *
*        0 *       22 *         0 *
*        0 *       23 *         0 *
*        0 *       24 *     24536 *
Type <CR> to continue or q to quit ==> q
***********************************

ROOT 6.14

root [2] Events->Scan("recoTracks_generalTracks__RECO.obj.hitPattern_.hitPattern[50]")
***********************************
*    Row   * Instance * recoTrack *
***********************************
*        0 *        0 *         0 *
*        0 *        1 *         0 *
*        0 *        2 *         0 *
*        0 *        3 *         0 *
*        0 *        4 *         0 *
*        0 *        5 *         0 *
*        0 *        6 *         0 *
*        0 *        7 *         0 *
*        0 *        8 *         0 *
*        0 *        9 *         0 *
*        0 *       10 *         0 *
*        0 *       11 *         0 *
*        0 *       12 *         0 *
*        0 *       13 *         0 *
*        0 *       14 *         0 *
*        0 *       15 *         0 *
*        0 *       16 *         0 *
*        0 *       17 *         0 *
*        0 *       18 *         0 *
*        0 *       19 *         0 *
*        0 *       20 *         0 *
*        0 *       21 *         0 *
*        0 *       22 *         0 *
*        0 *       23 *         0 *
*        0 *       24 *         0 *

@pcanal
Copy link
Contributor

pcanal commented Mar 16, 2019

which one is correct?

@slava77
Copy link
Contributor

slava77 commented Mar 16, 2019 via email

@Dr15Jones
Copy link
Contributor Author

I can state for a fact that in the ROOT 6.14 test using 'Scan', that the version 12 read rule is begin run (since I added a print statement to it and see that message).

@Dr15Jones
Copy link
Contributor Author

@slava77

[50] is out of the array bounds in v12 of the class.

Here is the change from v12 to v13
95dbe75#diff-72b994d8c70121f8e5d64f0b2b9728a8

The array in question has size 50 in v12 and size 57 in v13.

@Dr15Jones
Copy link
Contributor Author

So I checked and the file in question was created using CMSSW_9_4_0. So I ran the same test in that release and get

root [1] Events->Scan("recoTracks_generalTracks__RECO.obj.hitPattern_.hitPattern[50]")
***********************************
*    Row   * Instance * recoTrack *
***********************************
*        0 *        0 *     33928 *
*        0 *        1 *      9764 *
*        0 *        2 *     43329 *
*        0 *        3 *     51288 *
*        0 *        4 *      5714 *
*        0 *        5 *     34995 *
*        0 *        6 *     13414 *
*        0 *        7 *      6594 *
*        0 *        8 *     28814 *
*        0 *        9 *     25480 *
*        0 *       10 *     58396 *
*        0 *       11 *     26400 *
*        0 *       12 *     51769 *
*        0 *       13 *     20049 *
*        0 *       14 *     39027 *
*        0 *       15 *      7403 *
*        0 *       16 *       232 *
*        0 *       17 *         0 *
*        0 *       18 *         0 *
*        0 *       19 *         0 *
*        0 *       20 *         0 *
*        0 *       21 *         0 *
*        0 *       22 *         0 *
*        0 *       23 *         0 *
*        0 *       24 *         0 *

@Dr15Jones
Copy link
Contributor Author

So instead of looking at the bit patterns, I decided to call functions on the class to see what results we get

CMSSW_9_4_0

root [1] Events->Scan("recoTracks_generalTracks__RECO.obj.hitPattern_.numberOfValidHits()")
***********************************
*    Row   * Instance * recoTrack *
***********************************
*        0 *        0 *        24 *
*        0 *        1 *        16 *
*        0 *        2 *        17 *
*        0 *        3 *        12 *
*        0 *        4 *        18 *
*        0 *        5 *        15 *
*        0 *        6 *        19 *
*        0 *        7 *        17 *
*        0 *        8 *        17 *
*        0 *        9 *        16 *
*        0 *       10 *        14 *
*        0 *       11 *        17 *
*        0 *       12 *        16 *
*        0 *       13 *         5 *
*        0 *       14 *        19 *
*        0 *       15 *        16 *
*        0 *       16 *        20 *
*        0 *       17 *        19 *
*        0 *       18 *        19 *
*        0 *       19 *        19 *
*        0 *       20 *         7 *
*        0 *       21 *        16 *
*        0 *       22 *        24 *
*        0 *       23 *         5 *
*        0 *       24 *         6 *

CMSSW_10_6_X_2019-03-15-2300 (which uses the non patched ROOT 6.12)

***********************************
*    Row   * Instance * recoTrack *
***********************************
*        0 *        0 *        24 *
*        0 *        1 *        16 *
*        0 *        2 *        17 *
*        0 *        3 *        12 *
*        0 *        4 *        18 *
*        0 *        5 *        15 *
*        0 *        6 *        19 *
*        0 *        7 *        17 *
*        0 *        8 *        17 *
*        0 *        9 *        16 *
*        0 *       10 *        14 *
*        0 *       11 *        17 *
*        0 *       12 *        16 *
*        0 *       13 *         5 *
*        0 *       14 *        19 *
*        0 *       15 *        16 *
*        0 *       16 *        20 *
*        0 *       17 *        19 *
*        0 *       18 *        19 *
*        0 *       19 *        19 *
*        0 *       20 *         7 *
*        0 *       21 *        16 *
*        0 *       22 *        24 *
*        0 *       23 *         5 *
*        0 *       24 *         6 *

CMSSW_10_6_ROOT614_X_2019-03-15-2300

***********************************
*    Row   * Instance * recoTrack *
***********************************
*        0 *        0 *        24 *
*        0 *        1 *        16 *
*        0 *        2 *        17 *
*        0 *        3 *        12 *
*        0 *        4 *        18 *
*        0 *        5 *        15 *
*        0 *        6 *        19 *
*        0 *        7 *        17 *
*        0 *        8 *        17 *
*        0 *        9 *        16 *
*        0 *       10 *        14 *
*        0 *       11 *        17 *
*        0 *       12 *        16 *
*        0 *       13 *         5 *
*        0 *       14 *        19 *
*        0 *       15 *        16 *
*        0 *       16 *        20 *
*        0 *       17 *        19 *
*        0 *       18 *        19 *
*        0 *       19 *        19 *
*        0 *       20 *         7 *
*        0 *       21 *        16 *
*        0 *       22 *        24 *
*        0 *       23 *         5 *
*        0 *       24 *         6 *

So the beginning of the list look identical.

@Dr15Jones
Copy link
Contributor Author

So I used a very simple FWLite program

#include "DataFormats/FWLite/interface/Event.h"
#include "DataFormats/FWLite/interface/Handle.h"
#include "DataFormats/TrackReco/interface/Track.h"
#include <vector>
#include "TFile.h"

void dump() {

  TFile f("0054D144-36DF-E711-B074-02163E01A47B.root","r");

  fwlite::Event ev(&f);

  unsigned int evCount=0;
  for(ev.toBegin(); not ev.atEnd(); ++ev) {
    ++evCount;
    fwlite::Handle<std::vector<reco::Track>> h;
    h.getByLabel(ev, "generalTracks");

    for(auto const& t: *h) {
      std::cout <<">>> "<< evCount<<" "<<t.hitPattern().numberOfValidHits() <<std::endl;
    }
  }
}

and ran the result on both 6.12 and 6.14 and then compared the output, it was identical.

@slava77 do you know what differences were seen in the hit patterns?

@Dr15Jones
Copy link
Contributor Author

I extended my FWLite program to dump the values returned for all const member functions of HitPattern which take no arguments. I then compared the values for all three cases. I found differences right in the first event. CMSSW_9_4_0 and CMSSW_10_6_X_2019-03-15-2300 agreed on all the values. CMSSW_10_6_ROOT614_X_2019-03-15-2300 disagreed for values of

 h.trackerLayersNull()
 h.pixelLayersNull()
 h.stripLayersNull()
 h.pixelBarrelLayersNull()
 h.pixelEndcapLayersNull()
 h.stripTIBLayersNull()
 h.stripTIDLayersNull()
 h.stripTOBLayersNull()
 h.stripTECLayersNull()
 h.numberOfInactiveHits()
 h.numberOfInactiveTrackerHits()

Surprisingly, the discrepencies were never very large

230c238
< >>> 1,3 51 -> 17
---
> >>> 1,3 51 -> 20
232c240
< >>> 1,3 53 -> 14
---
> >>> 1,3 53 -> 17
237c245
< >>> 1,3 58 -> 2
---
> >>> 1,3 58 -> 5
289c297
< >>> 1,4 51 -> 15
---
> >>> 1,4 51 -> 16
291c299
< >>> 1,4 53 -> 12
---
> >>> 1,4 53 -> 13
296c304
< >>> 1,4 58 -> 0
---
> >>> 1,4 58 -> 1
525c533
< >>> 1,8 51 -> 17
---
> >>> 1,8 51 -> 18
527c535
< >>> 1,8 53 -> 14
---
> >>> 1,8 53 -> 15
533c541
< >>> 1,8 59 -> 1
---
> >>> 1,8 59 -> 2

Where the first number is the event count (only 1 in this case) the second number is the track number, the third number is which function I'm calling and the number after -> is the value returned from the function.

@Dr15Jones
Copy link
Contributor Author

At a quick first glance, 6_14 has fewer 'nulls' but more 'inactive' hits.

@Dr15Jones
Copy link
Contributor Author

Dr15Jones commented Mar 17, 2019

So I added print statements inside the IO rule right after newObj is filled:

std::cout <<"trackerLayersNull() "<<newObj->trackerLayersNull()<<std::endl;

In both the ROOT 6.12 and 6.14 versions of CMSSW_10_6, the code prints the same thing

trackerLayersNull() 14
trackerLayersNull() 16
trackerLayersNull() 16
trackerLayersNull() 17

However, for the 6.12 case, once the HitPattern object is retrieved from the fwlite::Event the '4'th entry's number changes to 20!

So I think what is happening is the order in with OTHER variables in the HitPattern are being filled is different from ROOT 6.12 to 6.14. I bet in 6.14 the other variables are being filled first and then the call to clear() gets rid of them and calling appendHit does not properly restore the state of the object. Then in ROOT 6.12 the other variables are being set AFTER the call to the IO rule and they are compensating for the incorrect algorithm.

@slava77
Copy link
Contributor

slava77 commented Mar 17, 2019

@Dr15Jones
I'm catching up now.
Do I read your last message correctly that there is a bug in the iorule?

@Dr15Jones
Copy link
Contributor Author

Do I read your last message correctly that there is a bug in the iorule?

Yes. The algorithm is flawed and just happens to work on 6.12.

@Dr15Jones
Copy link
Contributor Author

@slava77 #26205 should fix the problem with HitPattern.

@fabiocos
Copy link
Contributor

@Dr15Jones @smuzaffar with the latest ROOT614 builds the IB looks clean. I have tried a long run (500 events, runTheMatrix.py -l 10824.0 -t 8 --command "-n 500 --customise Validation/Performance/TimeMemorySummary.py" ), and I got it to succeeed without issues on cmsdev25

@smuzaffar
Copy link
Contributor

+core

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants