This repository has been archived by the owner on Jun 21, 2022. It is now read-only.

uproot 3.11.7 fails to read branch with AssertionError #495

Closed
davehadley opened this issue Jun 9, 2020 · 7 comments

Comments

@davehadley

uproot fails to read a branch, raising an AssertionError from numerical.py line 159.
The branch contains a std::vector of a custom object that inherits from TObject and holds 3 primitives (1 int, 1 float and 1 double).

The branch interpretation appears to be correct (although I notice that it sets fBits and fUniqueID to 8 bytes each rather than the 4 bytes I was expecting).

import uproot

print(uproot.__version__) # prints: 3.11.7

rootfile = uproot.open("example.root")
branch = rootfile["T"]["ev.pmt"]

print(branch.interpretation) # prints: asjagged(astable(asdtype("[(' fBits', '>u8'), (' fUniqueID', '>u8'), ('id', '>i4'), ('charge', '>f4'), ('time', '>f8')]", "[('id', '<i4'), ('charge', '<f4'), ('time', '<f8')]")), 10)

branch.array() # raises: AssertionError (coming from "assert remainder == 0" in uproot/interp/numerical.py line 159)

An example ROOT file is stored in example-root.zip.

@jpivarski
Member

Thanks for pointing this out! Apparently, the interpretation is wrong. Depending on timing, this might make it into uproot4 and not uproot3. One of the things I'm building into uproot4 is a set of more "expert tools" for investigating missing or wrong interpretations like this one.

I was looking at this file using uproot.asdebug (one of the things we have in uproot3; needs to be better documented):

>>> branch.array(uproot.asdebug)[0]
array([ 64,   0,   0,  86,  64,   9,   0,   1,   0,   0,   0,   3,   0,
         1,   0,   0,   0,   0,   2,   0,   0,   0,   0,   1,   0,   0,
         0,   0,   2,   0,   0,   0,   0,   1,   0,   0,   0,   0,   2,
         0,   0,   0,   0,   0,  13,  83,   0,   0,  13,  94,   0,   0,
        13, 146,  63, 197, 128, 193,  63, 203,  77,  82,  63, 104, 210,
       116,   0,   0,   0,   0,   0,   0,   0,   0,  64,  77, 148, 232,
        17, 171,  18,   0,  64, 106, 159, 139,   3, 134, 241,   0],
      dtype=uint8)

and I was looking for "charge" with type >f4 to try to find out if we just have a wrong offset or something. I was expecting charges to maybe be -1 or 1 as a way to work backward:

>>> import numpy as np
>>> np.array([-1], ">f4").view("u1")
array([191, 128,   0,   0], dtype=uint8)
>>> np.array([1], ">f4").view("u1")
array([ 63, 128,   0,   0], dtype=uint8)

but I don't see anything like that. Do you see any of the values you expect in there?
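
The size mismatch behind the AssertionError can be illustrated from the numbers quoted so far. This is only a sketch: it assumes the trailing 10 in asjagged(...) is a per-entry header that gets skipped before the remaining bytes are cut into fixed-size records, and it uses the 90-byte length of the asdebug array printed above. The names from_dtype, entry_nbytes and skip_bytes are illustrative.

```python
import numpy as np

# The "from" dtype copied from the printed branch.interpretation.
from_dtype = np.dtype([
    (" fBits", ">u8"),
    (" fUniqueID", ">u8"),
    ("id", ">i4"),
    ("charge", ">f4"),
    ("time", ">f8"),
])

entry_nbytes = 90  # length of the asdebug array for entry 0
skip_bytes = 10    # ASSUMPTION: header bytes skipped per entry by asjagged(..., 10)

payload = entry_nbytes - skip_bytes
quotient, remainder = divmod(payload, from_dtype.itemsize)

print(from_dtype.itemsize, payload, quotient, remainder)
# prints: 32 80 2 16
```

Under these assumptions the 80 payload bytes do not divide into 32-byte records, which is consistent with a failing "remainder == 0" check.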

@davehadley
Author

Thanks for looking at this issue.

In this example file, the charge is the charge measured by a photomultiplier tube, scaled to units of photoelectrons (for this data it should be a positive floating-point number of order 1, but not exactly 1).

It is a small file so there is only 1 event with 3 PMT hits. The values I get from a ROOT TTree scan are:

root [9] ((TTree*)_file0->Get("T"))->Scan("ev.pmt.charge:ev.pmt.time:ev.pmt.id")
***********************************************************
*    Row   * Instance * ev.pmt.ch * ev.pmt.ti * ev.pmt.id *
***********************************************************
*        0 *        0 * 1.5429917 *         0 *      3411 *
*        0 *        1 * 1.5882971 * 59.163332 *      3422 *
*        0 *        2 * 0.9094612 * 212.98571 *      3474 *
***********************************************************

So, to answer your question, none of the values in your debug array match the expected values (although I'm not sure how to interpret the debug array). For debugging, it might be better to focus on the ID since it should be exact integers 3411, 3422, or 3474.

@davehadley
Author

davehadley commented Jun 9, 2020

I'm not sure if this is useful, but with uproot.asdebug it does seem that there may be some offset issue. Element zero of the debug array, when converted to bits, does contain the first hit's PMT ID = 3411.

For example:

import uproot

rootfile = uproot.open("example.root")
branch = rootfile["T"]["ev.pmt"]

arr = branch.array(uproot.asdebug)[0]

# Render the entry as one long string of bits, 8 per byte.
bitstr = "".join(f"{x:08b}" for x in arr)
target = f"{3411:032b}"
print(f"3411 (0b{target}) is at index:", bitstr.find(target))

# Cut the bit string into 32-bit words at each possible bit offset.
for bitshift in range(32):
    words = [bitstr[i:i+32] for i in range(bitshift, len(bitstr), 32)]
    if target in words:
        print("shifted by:", bitshift)

prints the output:

3411 (0b00000000000000000000110101010011) is at index: 336 
shifted by: 16
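
The same search can be done at byte granularity, which makes the offset easier to relate to field sizes. This is a small sketch that hard-codes the asdebug bytes quoted earlier in the thread, so it runs without the ROOT file; the names arr, needle and byte_index are illustrative.

```python
import struct

# Entry 0 as printed by branch.array(uproot.asdebug)[0] earlier in the thread.
arr = bytes([
    64, 0, 0, 86, 64, 9, 0, 1, 0, 0, 0, 3, 0,
    1, 0, 0, 0, 0, 2, 0, 0, 0, 0, 1, 0, 0,
    0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 0, 0, 2,
    0, 0, 0, 0, 0, 13, 83, 0, 0, 13, 94, 0, 0,
    13, 146, 63, 197, 128, 193, 63, 203, 77, 82, 63, 104, 210,
    116, 0, 0, 0, 0, 0, 0, 0, 0, 64, 77, 148, 232,
    17, 171, 18, 0, 64, 106, 159, 139, 3, 134, 241, 0,
])

# 3411 as a big-endian 4-byte integer, searched for as a byte pattern.
needle = (3411).to_bytes(4, "big")
byte_index = arr.find(needle)
print(byte_index, byte_index * 8, byte_index % 4)
# prints: 42 336 2

# The three IDs sit back to back starting at that byte offset.
ids = struct.unpack(">3i", arr[byte_index:byte_index + 12])
print(ids)
# prints: (3411, 3422, 3474)
```

Byte 42 corresponds to bit 336, and 42 % 4 == 2 matches the 16-bit shift reported above. All three IDs being contiguous suggests the values of one field are stored together rather than interleaved with the other fields.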

@jpivarski
Member

Thanks—that's right; I could have used Scan. It looks like this is not ROOT serialized: there are a few libraries that use Boost serialization inside of a ROOT file, for example, #475 (comment) and #403 (comment). The thing that gives it away is that unsplit ROOT data interleaves the serialization of fields in a struct while Boost serialization puts all values of one field together before moving on to the next.

I don't plan to support Boost serialization, and I don't know if it can even be detected. I think the C++ libraries that usually load these data override the TStreamers that are included in the file, meaning that there is no way, looking only at the file, to know how to deserialize them. If they can be read in ROOT without .L loading a library, then I stand corrected and I'll need to see how ROOT knows that this is serialized in such a wildly different way.

For your example, an entry of data can be deserialized like this:

>>> entry_number = 0     # the entry we want to read
>>> debug_array = branch.array(uproot.asdebug)
>>> debug_entry = debug_array[entry_number]

>>> pos = 8              # some outer header
>>> length = debug_entry[pos : pos + 4].view(">i4")[0]; pos += 4
>>> length
3
>>> pos += 6             # some inner header

>>> fBits = debug_entry[pos : pos + length*4].view(">u4"); pos += length*4
>>> fBits                # but you don't care about the fBits
array([33554432,    65536,      512], dtype=uint32)

>>> fUniqueID = debug_entry[pos : pos + length*4].view(">u4"); pos += length*4
>>> fUniqueID            # but you don't care about the fUniqueID
array([       1,        0, 33554432], dtype=uint32)

>>> id = debug_entry[pos : pos + length*4].view(">i4"); pos += length*4
>>> id
array([3411, 3422, 3474], dtype=int32)

>>> charge = debug_entry[pos : pos + length*4].view(">f4"); pos += length*4
>>> charge
array([1.5429918 , 1.5882971 , 0.90946126], dtype=float32)

>>> time = debug_entry[pos : pos + length*8].view(">f8"); pos += length*8
>>> time
array([  0.        ,  59.16333218, 212.98571946])

>>> assert pos == len(debug_entry)    # Did we use all the bytes? Good.

This would have to be a Python for loop over all entries: each entry carries its own length and packs its fields one after another, so it can't be a NumPy all-at-once operation.
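
The walkthrough above can be wrapped into a function and applied entry by entry. This is a minimal sketch, assuming every entry uses the same 8-byte outer header and 6-byte inner header seen in this file, which the thread does not confirm for other files. The function name read_memberwise_entry and the variable sample_entry are illustrative, and the bytes are the asdebug output quoted earlier.

```python
import numpy as np

# Field order and big-endian types, as used in the walkthrough.
FIELDS = [("fBits", ">u4"), ("fUniqueID", ">u4"),
          ("id", ">i4"), ("charge", ">f4"), ("time", ">f8")]

def read_memberwise_entry(raw):
    """Decode one asdebug entry in which each field's values are stored together."""
    buf = bytes(raw)
    pos = 8                                   # ASSUMPTION: 8-byte outer header
    n = int(np.frombuffer(buf, ">i4", count=1, offset=pos)[0])
    pos += 4
    pos += 6                                  # ASSUMPTION: 6-byte inner header
    out = {}
    for name, fmt in FIELDS:
        out[name] = np.frombuffer(buf, fmt, count=n, offset=pos)
        pos += n * np.dtype(fmt).itemsize
    if pos != len(buf):                       # did we use all the bytes?
        raise ValueError(f"{len(buf) - pos} bytes left over")
    return out

# Entry 0 as printed by branch.array(uproot.asdebug)[0] earlier in the thread.
sample_entry = np.array([
    64, 0, 0, 86, 64, 9, 0, 1, 0, 0, 0, 3, 0,
    1, 0, 0, 0, 0, 2, 0, 0, 0, 0, 1, 0, 0,
    0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 0, 0, 2,
    0, 0, 0, 0, 0, 13, 83, 0, 0, 13, 94, 0, 0,
    13, 146, 63, 197, 128, 193, 63, 203, 77, 82, 63, 104, 210,
    116, 0, 0, 0, 0, 0, 0, 0, 0, 64, 77, 148, 232,
    17, 171, 18, 0, 64, 106, 159, 139, 3, 134, 241, 0,
], dtype=np.uint8)

# With uproot 3 and the example file, the loop over entries would look like:
#     hits = [read_memberwise_entry(raw) for raw in branch.array(uproot.asdebug)]
decoded = read_memberwise_entry(sample_entry)
print(decoded["id"].tolist())
# prints: [3411, 3422, 3474]
```

The leftover-bytes check plays the same role as the final assert in the walkthrough: if the header sizes assumed here are wrong for some entry, it fails loudly instead of returning shifted values.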

If this is not Boost serialization, or there's some indicator in the ROOT file specifying that we should follow this very different kind of deserialization algorithm, then I'll have to figure out what that indicator is. As pointed out above, there have been several files so far with this weird feature.

Your new message crossed in the mail—I'll check it out.

@davehadley
Author

davehadley commented Jun 9, 2020

Thanks again for looking at this.

The file can be read in ROOT without loading libraries with ".L". By "can be read" I mean that I can open the file in a TBrowser and double-click branches to produce plots, although it does complain about missing dictionaries when you do this.

I don't know if Boost serialization or something else weird is being done to write these files. It is produced by experiment software that I don't control. I will have to delve into the code and try to figure out what is being done.

@jpivarski
Member

In other cases that looked like this, it was Boost serialization used in custom streamers, but if you're able to read these data without any .L libraries, then it must be something implemented in ROOT itself. I wonder why there would be this dramatically different serialization method in the same package. Well, I'll look into it. Thanks!

@jpivarski
Member

As it turns out, the error above is because I was unaware of ROOT's "memberwise splitting," and (if I said anything to the contrary above), it has nothing to do with Boost serialization. This same error came up in 6 different issues, so further discussion on it will be consolidated into scikit-hep/uproot5#38. (This comment is a form message I'm writing on all 6 issues.)

As of PR scikit-hep/uproot5#87, we can now detect such cases, so at least we'll raise a NotImplementedError instead of letting the deserializer fail in mysterious ways. Someday, it will actually be implemented (watch scikit-hep/uproot5#38), but in the meantime, the thing you can do is write your data "objectwise," not "memberwise." (See this comment for ideas on how to do that, and if you manage to do it, you can help a lot of people out by sharing a recipe.)
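
As a rough illustration of what such detection could look like, the first six bytes of the asdebug entry quoted earlier can be split into a byte count and a version word. This is a sketch under assumptions the thread does not confirm: that bytes 0 to 3 are a ROOT byte count carrying the 0x40000000 marker, that bytes 4 to 5 are the version word, and that bit 14 of that word is ROOT's kStreamedMemberWise flag. The constant and variable names are illustrative and this is not the code from the PR.

```python
import struct

K_BYTE_COUNT_MASK = 0x40000000    # ASSUMPTION: marker bit set on ROOT byte counts
K_STREAMED_MEMBERWISE = 1 << 14   # ASSUMPTION: ROOT's kStreamedMemberWise flag

# First six bytes of entry 0 from the asdebug output earlier in the thread.
header = bytes([64, 0, 0, 86, 64, 9])

# Big-endian 4-byte unsigned int followed by a 2-byte unsigned int.
byte_count_word, version_word = struct.unpack(">IH", header)

num_bytes = byte_count_word & ~K_BYTE_COUNT_MASK
is_memberwise = bool(version_word & K_STREAMED_MEMBERWISE)
version = version_word & ~K_STREAMED_MEMBERWISE

print(num_bytes, version, is_memberwise)
# prints: 86 9 True
```

Under this reading, the byte count of 86 plus its own 4 bytes matches the 90-byte entry, and the flag is set for this file.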
