Error reading STL vectors inside TClonesArrays #374

rklasen · 2019-10-09T10:07:41Z

Hi,

this issue is an addition or refinement to #373, but we think we found the root cause. It appears uproot crashes when it tries to access a vector inside a struct/class which was written to a TClonesArray. My colleague wrote a minimum example using only ROOT and uproot, not boost whatsoever.

rootFileTest.C:

#include "TClonesArray.h"
#include "TFile.h"
#include "TTree.h"

struct A {
  double foo;
  int bar;
};

struct B : public TObject {
  std::vector<A> foobars;
};

void rootFileTest() {
  TFile f("output.root", "RECREATE");
  TTree t("asdf", "asdf");
  TClonesArray arr("B", 1);

  new (arr[0]) B;
  B *b = (B *)arr.ConstructedAt(0);
  b->foobars.push_back({0.1, 1});

  t.Branch("myarr", &arr);
  t.Fill();
  t.Write();
}

readTree.py

import uproot
import numpy

f = uproot.open("output.root")
events = f["asdf"]

print(events["myarr.foobars"].array())

Compile and run with:

root -l rootFileTest.C+
root [0] 
Processing rootFileTest.C+...
Info in <TUnixSystem::ACLiC>: creating shared library /home/roman/temp/rootTest/./rootFileTest_C.so
Warning in cling::IncrementalParser::CheckABICompatibility():
  Possible C++ standard library mismatch, compiled with __GLIBCXX__ '20180720'
  Extraction of runtime standard library version was: '20181206'
root [1] rootFileTest()
root [2] .q

The file output.root is now created and can be read with ROOTs TBrowser:

However, uproot cannot read this branch:

python3 readTree.py 
Traceback (most recent call last):
  File "readTree.py", line 7, in <module>
    print(events["myarr.foobars"].array())
  File "/usr/local/lib/python3.6/dist-packages/uproot/tree.py", line 1386, in array
    basket_itemoffset = self._basket_itemoffset(interpretation, basketstart, basketstop, keycache)
  File "/usr/local/lib/python3.6/dist-packages/uproot/tree.py", line 1341, in _basket_itemoffset
    numitems = interpretation.numitems(key.border, self.basket_numentries(i))
  File "/usr/local/lib/python3.6/dist-packages/uproot/interp/jagged.py", line 64, in numitems
    return self.content.numitems(numbytes - numentries * self.skipbytes, numentries)
  File "/usr/local/lib/python3.6/dist-packages/uproot/interp/objects.py", line 158, in numitems
    return self.content.numitems(numbytes, numentries)
  File "/usr/local/lib/python3.6/dist-packages/uproot/interp/numerical.py", line 152, in numitems
    assert remainder == 0
AssertionError

The text was updated successfully, but these errors were encountered:

jpivarski · 2019-10-09T10:53:09Z

Quick question: is this reproducer an example of an "in-principle problem" or a simplified version of a problem you're actually facing in physics analysis?

The reason I ask is because uproot's main mode of use is on simple data, in which the processing can be entirely in Numpy and it can be quick. Some complex data structures are handled: the infrastructure is available for any complex data, but many of these structures defy the general rule and have to be handled specially. (There's already special logic in there to handle TClonesArray, for instance.) I do the work to solve particular cases of complex data if they're necessary for physics analysis.

(Maintaining uproot is something I do in addition to my normal work, so I have to find a balance somewhere.)

rklasen · 2019-10-09T11:13:50Z

This is for my day to day work, yes. In this case I'm trying to read the track parameters and the reco hits from which this track came to see if the track fit has worked correctly. Then I need the residuals to perform software alignment for our sensor modules. TClonesArrays are practically everywhere in PandaROOT, but I don't know if that's generally the case in FairROOT (from which we inherit) or ROOT in general.

I understand that this is not a trivial task and I really appreciate all the work you're doing. If you see an easy way to access the fIndex values in the track file I've included in #373 in Python/uproot, that would immensely help me.

jpivarski · 2019-10-09T15:02:56Z

I've looked into the vector-in-TClonesArray and it's surprising: the std::vector is split inside the event, a pattern I haven't seen anywhere else in ROOT serialization. It's too odd to automatically support (would require some re-architecting, all around this special case of TClonesArray), but I can give you a recipe for solving your particular problem.

import uproot, struct, numpy

tree = uproot.open("issue374.root")["asdf"]
branch = tree["myarr.foobars"]

class B(uproot.rootio.ROOTStreamedObject):
    _n_format = struct.Struct(">i")
    _foo_dtype = numpy.dtype(">f8")
    _bar_dtype = numpy.dtype(">i4")

    @classmethod
    def _readinto(cls, self, source, cursor, context, parent):
        start, cnt, self._classversion = uproot.rootio._startcheck(source, cursor)
        cursor.skip(6)                                  # the std::vector header
        self.n = cursor.field(source, self._n_format)   # the std::vector size
        self.foo = cursor.array(source, self.n, self._foo_dtype)
        self.bar = cursor.array(source, self.n, self._bar_dtype)
        uproot.rootio._endcheck(start, cursor, cnt)
        return self

    def __repr__(self):
        return "<B n={0} foo={1} bar={2}>".format(
                   self.n, self.foo.tolist(), self.bar.tolist())

interp = uproot.asgenobj(B, branch._context, skipbytes=0)
branch.array(interp)

The output of the last line is

<ObjectArray [<B n=3 foo=[0.1, 0.2, 0.3] bar=[1, 2, 3]>] at 0x7f792993a390>

(for a file that I made with three foobars in the first event; if we only look at one, we might think that it's unsplit, and you'd hit an error as soon as you looked at something more general).

One aspect of this serialization that I've used (and didn't have to) is the fact that it gives a cross-check on the object size. This is common for objects outside of TTrees, so I'm using the _startcheck/_enccheck from uproot.rootio to do that check. You could skip them with a class/interpretation like this:

class B(uproot.rootio.ROOTStreamedObject):
    _n_format = struct.Struct(">i")
    _foo_dtype = numpy.dtype(">f8")
    _bar_dtype = numpy.dtype(">i4")

    @classmethod
    def _readinto(cls, self, source, cursor, context, parent):
        self.n = cursor.field(source, self._n_format)
        self.foo = cursor.array(source, self.n, self._foo_dtype)
        self.bar = cursor.array(source, self.n, self._bar_dtype)
        return self

    def __repr__(self):
        return "<B n={0} foo={1} bar={2}>".format(
                   self.n, self.foo.tolist(), self.bar.tolist())

interp = uproot.asgenobj(B, branch._context, skipbytes=12)

This gives the same output, does less checking, and may be a little bit faster.

I know that your real data doesn't have foo and bar in it, but I think this example illustrates how you can go about deserializing your case. The cursor.field method (to extract one field) and cursor.fields (to extract a tuple of contiguous fields) use type specifiers from the struct library and cursor.array uses Numpy dtypes. Everything in ROOT serialization is big-endian, so the > is necessary. The cursor is an object that updates its position when you call it (see documentation), while the source is a stateless source of data.

If you want to turn this into a jagged array, see the awkward.fromiter trick I described last time.

Good luck!
(I'm also going to look at the fIndex from your other file.)

jpivarski · 2020-08-31T18:27:55Z

As it turns out, the error above is because I was unaware of ROOT's "memberwise splitting," and (if I said anything to the contrary above), it has nothing to do with Boost serialization. This same error came up in 6 different issues, so further discussion on it will be consolidated into scikit-hep/uproot5#38. (This comment is a form message I'm writing on all 6 issues.)

As of PR scikit-hep/uproot5#87, we can now detect such cases, so at least we'll raise a NotImplementedError instead of letting the deserializer fail in mysterious ways. Someday, it will actually be implemented (watch scikit-hep/uproot5#38), but in the meantime, the thing you can do is write your data "objectwise," not "memberwise." (See this comment for ideas on how to do that, and if you manage to do it, you can help a lot of people out by sharing a recipe.)

jpivarski closed this as completed Oct 9, 2019

jpivarski mentioned this issue Oct 9, 2019

Error reading stl vector from root file #373

Closed

jpivarski mentioned this issue Jul 2, 2020

Handle ROOT's memberwise splitting scikit-hep/uproot5#38

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Error reading STL vectors inside TClonesArrays #374

Error reading STL vectors inside TClonesArrays #374

rklasen commented Oct 9, 2019

jpivarski commented Oct 9, 2019

rklasen commented Oct 9, 2019

jpivarski commented Oct 9, 2019

jpivarski commented Aug 31, 2020

Error reading STL vectors inside TClonesArrays #374

Error reading STL vectors inside TClonesArrays #374

Comments

rklasen commented Oct 9, 2019

jpivarski commented Oct 9, 2019

rklasen commented Oct 9, 2019

jpivarski commented Oct 9, 2019

jpivarski commented Aug 31, 2020