uproot 3.11.7 fails to read branch with AssertionError #495

davehadley · 2020-06-09T15:20:48Z

uproot fails to read a branch raising an AssertionError from numerical.py line 159.
The branch contains a std::vector of custom objects inheriting from TObject, each containing 3 primitives (1 int, 1 float and 1 double).

The branch interpretation appears to be correct (although I notice that it sets fBits and fUniqueID to 8 bytes each rather than the 4 bytes that I was expecting).

import uproot

print(uproot.__version__) # prints: 3.11.7

rootfile = uproot.open("example.root")
branch = rootfile["T"]["ev.pmt"]

print(branch.interpretation) # prints: asjagged(astable(asdtype("[(' fBits', '>u8'), (' fUniqueID', '>u8'), ('id', '>i4'), ('charge', '>f4'), ('time', '>f8')]", "[('id', '<i4'), ('charge', '<f4'), ('time', '<f8')]")), 10)

branch.array() # raises: AssertionError (from "assert remainder == 0" in uproot/interp/numerical.py, line 159)
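
To make the size observation concrete, here is a minimal sketch that compares the record size implied by the reported interpretation with the size that 4-byte fBits and fUniqueID would give. The "expected" dtype is an assumption written by hand for comparison; it is not something uproot produced.

```python
import numpy as np

# Field layout copied from the interpretation printed above
# (the leading space in ' fBits' and ' fUniqueID' is part of the name).
reported = np.dtype([(" fBits", ">u8"), (" fUniqueID", ">u8"),
                     ("id", ">i4"), ("charge", ">f4"), ("time", ">f8")])

# Assumed layout: the same fields, but with 4-byte fBits and fUniqueID.
expected = np.dtype([(" fBits", ">u4"), (" fUniqueID", ">u4"),
                     ("id", ">i4"), ("charge", ">f4"), ("time", ">f8")])

print("reported bytes per element:", reported.itemsize)  # 32
print("expected bytes per element:", expected.itemsize)  # 24
```

The two layouts differ by 8 bytes per element, so any offset computed from the reported layout would drift with each element if the 4-byte assumption is the right one.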

An example ROOT file is stored in example-root.zip.


jpivarski · 2020-06-09T16:27:33Z

Thanks for pointing this out! Apparently, the interpretation is wrong. Depending on timing, a fix for this might make it into uproot4 and not uproot3. One of the things I'm building into uproot4 is a set of "expert tools" for investigating missing or wrong interpretations like this one.

I was looking at this file using uproot.asdebug (one of the things we have in uproot3; needs to be better documented):

>>> branch.array(uproot.asdebug)[0]
array([ 64,   0,   0,  86,  64,   9,   0,   1,   0,   0,   0,   3,   0,
         1,   0,   0,   0,   0,   2,   0,   0,   0,   0,   1,   0,   0,
         0,   0,   2,   0,   0,   0,   0,   1,   0,   0,   0,   0,   2,
         0,   0,   0,   0,   0,  13,  83,   0,   0,  13,  94,   0,   0,
        13, 146,  63, 197, 128, 193,  63, 203,  77,  82,  63, 104, 210,
       116,   0,   0,   0,   0,   0,   0,   0,   0,  64,  77, 148, 232,
        17, 171,  18,   0,  64, 106, 159, 139,   3, 134, 241,   0],
      dtype=uint8)

and I was looking for "charge" with type >f4 to try to find out if we just have a wrong offset or something. I was expecting charges to maybe be -1 or 1 as a way to work backward:

>>> import numpy as np
>>> np.array([-1], ">f4").view("u1")
array([191, 128,   0,   0], dtype=uint8)
>>> np.array([1], ">f4").view("u1")
array([ 63, 128,   0,   0], dtype=uint8)

but I don't see anything like that. Do you see any of the values you expect in there?

davehadley · 2020-06-09T16:50:35Z

Thanks for looking at this issue.

In this example file, the charge is the charge measured by a photomultiplier tube, scaled to units of photoelectrons (for this data it should be a positive floating-point number of order 1, but not exactly 1).

It is a small file so there is only 1 event with 3 PMT hits. The values I get from a ROOT TTree scan are:

root [9] ((TTree*)_file0->Get("T"))->Scan("ev.pmt.charge:ev.pmt.time:ev.pmt.id")
***********************************************************
*    Row   * Instance * ev.pmt.ch * ev.pmt.ti * ev.pmt.id *
***********************************************************
*        0 *        0 * 1.5429917 *         0 *      3411 *
*        0 *        1 * 1.5882971 * 59.163332 *      3422 *
*        0 *        2 * 0.9094612 * 212.98571 *      3474 *
***********************************************************

So, to answer your question, none of the values in your debug array match the expected values (although I'm not sure how to interpret the debug array). For debugging, it might be better to focus on the ID since it should be exact integers 3411, 3422, or 3474.

davehadley · 2020-06-09T17:52:54Z

I'm not sure if this is useful, but with uproot.asdebug it does seem that there may be some offset issue. Element zero of the debug array, when converted to bits, does contain the first hit's PMT ID=3411.

For example:

import uproot

rootfile = uproot.open("example.root")
branch = rootfile["T"]["ev.pmt"]

arr = branch.array(uproot.asdebug)[0]

bytestr = "".join(f"{x:08b}" for x in arr)
print(f"3411 (0b{3411:032b}) is at index:", bytestr.find(f"{3411:032b}"))

for bitshift in range(0, 33):
    bytes_ = [bytestr[i:i+32] for i in range(bitshift, len(bytestr), 32)]
    if any((b==f"{3411:032b}" for b in bytes_)):
        print("shifted by:", bitshift)

prints the output:

3411 (0b00000000000000000000110101010011) is at index: 336 
shifted by: 16
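
The same search can be done at byte granularity, which is a little easier to relate to field offsets. The sketch below is self-contained: the byte values are copied from the uproot.asdebug dump earlier in this thread rather than read from the file, and the variable names are illustrative only.

```python
import numpy as np

# Entry 0 of branch.array(uproot.asdebug), copied from the dump in this thread.
entry = np.array([
     64,   0,   0,  86,  64,   9,   0,   1,   0,   0,   0,   3,   0,
      1,   0,   0,   0,   0,   2,   0,   0,   0,   0,   1,   0,   0,
      0,   0,   2,   0,   0,   0,   0,   1,   0,   0,   0,   0,   2,
      0,   0,   0,   0,   0,  13,  83,   0,   0,  13,  94,   0,   0,
     13, 146,  63, 197, 128, 193,  63, 203,  77,  82,  63, 104, 210,
    116,   0,   0,   0,   0,   0,   0,   0,   0,  64,  77, 148, 232,
     17, 171,  18,   0,  64, 106, 159, 139,   3, 134, 241,   0,
], dtype=np.uint8)
raw = entry.tobytes()

# Look for each expected PMT id encoded as a big-endian 4-byte integer.
offsets = {}
for pmt_id in (3411, 3422, 3474):
    pattern = np.array([pmt_id], dtype=">i4").tobytes()
    offsets[pmt_id] = raw.find(pattern)  # -1 would mean "not found"

print(offsets)  # {3411: 42, 3422: 46, 3474: 50}
```

Byte offset 42 corresponds to bit index 336, and 42 is not a multiple of 4, which matches the 16-bit shift found above. The three ids also sit back to back, 4 bytes apart, with no charge or time values between them.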

jpivarski · 2020-06-09T18:02:44Z

Thanks—that's right; I could have used Scan. It looks like this is not ROOT serialized: there are a few libraries that use Boost serialization inside of a ROOT file, for example, #475 (comment) and #403 (comment). The thing that gives it away is that unsplit ROOT data interleaves the serialization of fields in a struct while Boost serialization puts all values of one field together before moving on to the next.
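
As an illustration of the two layouts described here (record by record versus field by field), this minimal sketch serializes the three hits both ways with NumPy. It only shows the byte ordering: headers are left out, fBits and fUniqueID are omitted for brevity, and nothing in it is taken from ROOT's or any other library's actual writer.

```python
import numpy as np

# One hit = (id, charge, time), big-endian as in the file.
hit = np.dtype([("id", ">i4"), ("charge", ">f4"), ("time", ">f8")])
hits = np.array([(3411, 1.5429918, 0.0),
                 (3422, 1.5882971, 59.16333218),
                 (3474, 0.90946126, 212.98571946)], dtype=hit)

# Interleaved: id, charge, time of hit 0, then id, charge, time of hit 1, ...
interleaved = hits.tobytes()

# Grouped: all ids, then all charges, then all times.
grouped = b"".join(hits[name].tobytes() for name in hit.names)

# Same number of bytes, different order.
print(len(interleaved), len(grouped))

# In the grouped layout the first 12 bytes are the three ids.
print(np.frombuffer(grouped[:12], dtype=">i4"))

# In the interleaved layout the ids are one record (16 bytes) apart.
print(np.frombuffer(interleaved, dtype=hit)["id"])
```

A reader that assumes the interleaved layout but is handed grouped bytes would pick up the second id where it expects the first charge.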

I don't plan to support Boost serialization, and I don't know if it can even be detected. I think the C++ libraries that usually load these data override the TStreamers that are included in the file, meaning that there is no way, looking only at the file, to know how to deserialize them. If they can be read in ROOT without .L loading a library, then I stand corrected and I'll need to see how ROOT knows that this is serialized in such a wildly different way.

For your example, an entry of data can be deserialized like this:

>>> entry_number = 0     # the entry we want to read
>>> debug_array = branch.array(uproot.asdebug)
>>> debug_entry = debug_array[entry_number]

>>> pos = 8              # some outer header
>>> length = debug_entry[pos : pos + 4].view(">i4")[0]; pos += 4
>>> length
3
>>> pos += 6             # some inner header

>>> fBits = debug_entry[pos : pos + length*4].view(">u4"); pos += length*4
>>> fBits                # but you don't care about the fBits
array([33554432,    65536,      512], dtype=uint32)

>>> fUniqueID = debug_entry[pos : pos + length*4].view(">u4"); pos += length*4
>>> fUniqueID            # but you don't care about the fUniqueID
array([       1,        0, 33554432], dtype=uint32)

>>> id = debug_entry[pos : pos + length*4].view(">i4"); pos += length*4
>>> id
array([3411, 3422, 3474], dtype=int32)

>>> charge = debug_entry[pos : pos + length*4].view(">f4"); pos += length*4
>>> charge
array([1.5429918 , 1.5882971 , 0.90946126], dtype=float32)

>>> time = debug_entry[pos : pos + length*8].view(">f8"); pos += length*8
>>> time
array([  0.        ,  59.16333218, 212.98571946])

>>> assert pos == len(debug_entry)    # Did we use all the bytes? Good.

This would have to be a Python for loop over all entries because of the way the fields are interleaved—it can't be a NumPy all-at-once operation.
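
A minimal sketch of that per-entry loop follows, wrapping the steps above in a function. The function name and its parameters are hypothetical, and the header sizes (8 and 6 bytes) are simply the values that worked for this file; they are assumptions that may not hold for other files. The byte values are copied from the uproot.asdebug dump earlier in this thread so that the block runs without the ROOT file.

```python
import numpy as np

FIELDS = [("fBits", ">u4"), ("fUniqueID", ">u4"),
          ("id", ">i4"), ("charge", ">f4"), ("time", ">f8")]

def decode_entry(raw, outer_header=8, inner_header=6):
    """Decode one entry whose fields are stored field by field (hypothetical helper)."""
    pos = outer_header
    length = int(np.frombuffer(raw, dtype=">i4", count=1, offset=pos)[0])
    pos += 4 + inner_header
    out = {}
    for name, fmt in FIELDS:
        out[name] = np.frombuffer(raw, dtype=fmt, count=length, offset=pos)
        pos += length * np.dtype(fmt).itemsize
    if pos != len(raw):  # every byte should be accounted for
        raise ValueError(f"used {pos} of {len(raw)} bytes")
    return out

# Entry 0 of branch.array(uproot.asdebug), copied from the dump in this thread.
entry = np.array([
     64,   0,   0,  86,  64,   9,   0,   1,   0,   0,   0,   3,   0,
      1,   0,   0,   0,   0,   2,   0,   0,   0,   0,   1,   0,   0,
      0,   0,   2,   0,   0,   0,   0,   1,   0,   0,   0,   0,   2,
      0,   0,   0,   0,   0,  13,  83,   0,   0,  13,  94,   0,   0,
     13, 146,  63, 197, 128, 193,  63, 203,  77,  82,  63, 104, 210,
    116,   0,   0,   0,   0,   0,   0,   0,   0,  64,  77, 148, 232,
     17, 171,  18,   0,  64, 106, 159, 139,   3, 134, 241,   0,
], dtype=np.uint8)

# With the real file, this list would be branch.array(uproot.asdebug).
debug_array = [entry]

decoded = [decode_entry(e.tobytes()) for e in debug_array]
print(decoded[0]["id"], decoded[0]["charge"], decoded[0]["time"])
```

The loop is over entries because the number of hits, and therefore every field offset, can change from one entry to the next.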

If this is not Boost serialization, or there's some indicator in the ROOT file specifying that we should follow this very different kind of deserialization algorithm, then I'll have to figure out what that indicator is. As pointed out above, there have been several files so far with this weird feature.

Your new message crossed in the mail—I'll check it out.

davehadley · 2020-06-09T18:31:15Z

Thanks again for looking at this.

The file can be read in ROOT without loading libraries with ".L". By "can be read" I mean that I can open the file in a TBrowser and double click branches to produce plots. Although it does complain about missing dictionaries when you do this.

I don't know if Boost serialization or something else weird is being done to write these files. They are produced by experiment software that I don't control. I will have to delve into the code and try to figure out what is being done.

jpivarski · 2020-06-09T19:29:53Z

In other cases that looked like this, it was Boost serialization used in custom streamers, but if you're able to read these data without any .L libraries, then it must be something implemented in ROOT itself. I wonder why there would be this dramatically different serialization method in the same package. Well, I'll look into it. Thanks!

jpivarski · 2020-08-31T18:28:10Z

As it turns out, the error above is because I was unaware of ROOT's "memberwise splitting," and (if I said anything to the contrary above), it has nothing to do with Boost serialization. This same error came up in 6 different issues, so further discussion on it will be consolidated into scikit-hep/uproot5#38. (This comment is a form message I'm writing on all 6 issues.)

As of PR scikit-hep/uproot5#87, we can now detect such cases, so at least we'll raise a NotImplementedError instead of letting the deserializer fail in mysterious ways. Someday, it will actually be implemented (watch scikit-hep/uproot5#38), but in the meantime, the thing you can do is write your data "objectwise," not "memberwise." (See this comment for ideas on how to do that, and if you manage to do it, you can help a lot of people out by sharing a recipe.)

jpivarski closed this as completed Jun 9, 2020

jpivarski mentioned this issue Jul 2, 2020

Handle ROOT's memberwise splitting scikit-hep/uproot5#38
