Skip to content
This repository has been archived by the owner on Jun 21, 2022. It is now read-only.

Error reading STL vectors inside TClonesArrays #374

Closed
rklasen opened this issue Oct 9, 2019 · 4 comments
Closed

Error reading STL vectors inside TClonesArrays #374

rklasen opened this issue Oct 9, 2019 · 4 comments

Comments

@rklasen
Copy link

rklasen commented Oct 9, 2019

Hi,

this issue is an addition or refinement to #373, but we think we found the root cause. It appears uproot crashes when it tries to access a vector inside a struct/class which was written to a TClonesArray. My colleague wrote a minimum example using only ROOT and uproot, not boost whatsoever.

rootFileTest.C:

#include "TClonesArray.h"
#include "TFile.h"
#include "TTree.h"

struct A {
  double foo;
  int bar;
};

struct B : public TObject {
  std::vector<A> foobars;
};

void rootFileTest() {
  TFile f("output.root", "RECREATE");
  TTree t("asdf", "asdf");
  TClonesArray arr("B", 1);

  new (arr[0]) B;
  B *b = (B *)arr.ConstructedAt(0);
  b->foobars.push_back({0.1, 1});

  t.Branch("myarr", &arr);
  t.Fill();
  t.Write();
}

readTree.py

import uproot
import numpy

f = uproot.open("output.root")
events = f["asdf"]

print(events["myarr.foobars"].array())

Compile and run with:

root -l rootFileTest.C+
root [0] 
Processing rootFileTest.C+...
Info in <TUnixSystem::ACLiC>: creating shared library /home/roman/temp/rootTest/./rootFileTest_C.so
Warning in cling::IncrementalParser::CheckABICompatibility():
  Possible C++ standard library mismatch, compiled with __GLIBCXX__ '20180720'
  Extraction of runtime standard library version was: '20181206'
root [1] rootFileTest()
root [2] .q

The file output.root is now created and can be read with ROOTs TBrowser:

TBrowser

However, uproot cannot read this branch:

python3 readTree.py 
Traceback (most recent call last):
  File "readTree.py", line 7, in <module>
    print(events["myarr.foobars"].array())
  File "/usr/local/lib/python3.6/dist-packages/uproot/tree.py", line 1386, in array
    basket_itemoffset = self._basket_itemoffset(interpretation, basketstart, basketstop, keycache)
  File "/usr/local/lib/python3.6/dist-packages/uproot/tree.py", line 1341, in _basket_itemoffset
    numitems = interpretation.numitems(key.border, self.basket_numentries(i))
  File "/usr/local/lib/python3.6/dist-packages/uproot/interp/jagged.py", line 64, in numitems
    return self.content.numitems(numbytes - numentries * self.skipbytes, numentries)
  File "/usr/local/lib/python3.6/dist-packages/uproot/interp/objects.py", line 158, in numitems
    return self.content.numitems(numbytes, numentries)
  File "/usr/local/lib/python3.6/dist-packages/uproot/interp/numerical.py", line 152, in numitems
    assert remainder == 0
AssertionError
@jpivarski
Copy link
Member

Quick question: is this reproducer an example of an "in-principle problem" or a simplified version of a problem you're actually facing in physics analysis?

The reason I ask is because uproot's main mode of use is on simple data, in which the processing can be entirely in Numpy and it can be quick. Some complex data structures are handled: the infrastructure is available for any complex data, but many of these structures defy the general rule and have to be handled specially. (There's already special logic in there to handle TClonesArray, for instance.) I do the work to solve particular cases of complex data if they're necessary for physics analysis.

(Maintaining uproot is something I do in addition to my normal work, so I have to find a balance somewhere.)

@rklasen
Copy link
Author

rklasen commented Oct 9, 2019

This is for my day to day work, yes. In this case I'm trying to read the track parameters and the reco hits from which this track came to see if the track fit has worked correctly. Then I need the residuals to perform software alignment for our sensor modules. TClonesArrays are practically everywhere in PandaROOT, but I don't know if that's generally the case in FairROOT (from which we inherit) or ROOT in general.

I understand that this is not a trivial task and I really appreciate all the work you're doing. If you see an easy way to access the fIndex values in the track file I've included in #373 in Python/uproot, that would immensely help me.

@jpivarski
Copy link
Member

I've looked into the vector-in-TClonesArray and it's surprising: the std::vector is split inside the event, a pattern I haven't seen anywhere else in ROOT serialization. It's too odd to automatically support (would require some re-architecting, all around this special case of TClonesArray), but I can give you a recipe for solving your particular problem.

import uproot, struct, numpy

tree = uproot.open("issue374.root")["asdf"]
branch = tree["myarr.foobars"]

class B(uproot.rootio.ROOTStreamedObject):
    _n_format = struct.Struct(">i")
    _foo_dtype = numpy.dtype(">f8")
    _bar_dtype = numpy.dtype(">i4")

    @classmethod
    def _readinto(cls, self, source, cursor, context, parent):
        start, cnt, self._classversion = uproot.rootio._startcheck(source, cursor)
        cursor.skip(6)                                  # the std::vector header
        self.n = cursor.field(source, self._n_format)   # the std::vector size
        self.foo = cursor.array(source, self.n, self._foo_dtype)
        self.bar = cursor.array(source, self.n, self._bar_dtype)
        uproot.rootio._endcheck(start, cursor, cnt)
        return self

    def __repr__(self):
        return "<B n={0} foo={1} bar={2}>".format(
                   self.n, self.foo.tolist(), self.bar.tolist())

interp = uproot.asgenobj(B, branch._context, skipbytes=0)
branch.array(interp)

The output of the last line is

<ObjectArray [<B n=3 foo=[0.1, 0.2, 0.3] bar=[1, 2, 3]>] at 0x7f792993a390>

(for a file that I made with three foobars in the first event; if we only look at one, we might think that it's unsplit, and you'd hit an error as soon as you looked at something more general).

One aspect of this serialization that I've used (and didn't have to) is the fact that it gives a cross-check on the object size. This is common for objects outside of TTrees, so I'm using the _startcheck/_enccheck from uproot.rootio to do that check. You could skip them with a class/interpretation like this:

class B(uproot.rootio.ROOTStreamedObject):
    _n_format = struct.Struct(">i")
    _foo_dtype = numpy.dtype(">f8")
    _bar_dtype = numpy.dtype(">i4")

    @classmethod
    def _readinto(cls, self, source, cursor, context, parent):
        self.n = cursor.field(source, self._n_format)
        self.foo = cursor.array(source, self.n, self._foo_dtype)
        self.bar = cursor.array(source, self.n, self._bar_dtype)
        return self

    def __repr__(self):
        return "<B n={0} foo={1} bar={2}>".format(
                   self.n, self.foo.tolist(), self.bar.tolist())

interp = uproot.asgenobj(B, branch._context, skipbytes=12)

This gives the same output, does less checking, and may be a little bit faster.

I know that your real data doesn't have foo and bar in it, but I think this example illustrates how you can go about deserializing your case. The cursor.field method (to extract one field) and cursor.fields (to extract a tuple of contiguous fields) use type specifiers from the struct library and cursor.array uses Numpy dtypes. Everything in ROOT serialization is big-endian, so the > is necessary. The cursor is an object that updates its position when you call it (see documentation), while the source is a stateless source of data.

If you want to turn this into a jagged array, see the awkward.fromiter trick I described last time.

Good luck!
(I'm also going to look at the fIndex from your other file.)

@jpivarski
Copy link
Member

As it turns out, the error above is because I was unaware of ROOT's "memberwise splitting," and (if I said anything to the contrary above), it has nothing to do with Boost serialization. This same error came up in 6 different issues, so further discussion on it will be consolidated into scikit-hep/uproot5#38. (This comment is a form message I'm writing on all 6 issues.)

As of PR scikit-hep/uproot5#87, we can now detect such cases, so at least we'll raise a NotImplementedError instead of letting the deserializer fail in mysterious ways. Someday, it will actually be implemented (watch scikit-hep/uproot5#38), but in the meantime, the thing you can do is write your data "objectwise," not "memberwise." (See this comment for ideas on how to do that, and if you manage to do it, you can help a lot of people out by sharing a recipe.)

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants