Reading branches with varying jaggedness #550

MoAly98 · 2022-01-25T17:27:03Z

Hi,

I am running into problems trying to process data from a ROOT file with uproot where the branches read are differently jagged. For example, I want to read the leptons_pt branch and the jets_pt branch. I need to extract these branches from many files, then join them together so that I have one data structure with the kinematics for all events. I can read the data through awkward, but then I am unable to join awkward arrays from different files on event index (I raised a discussion on the awkward GitHub page). To be clearer, if I have an awkward array from each of two files

array1 = {'leptons_pt': [[1, 2, 3], [4, 5]], 'jets_pt': [[7, 8, 9, 10], [11]]}  # from file1
array2 = {'leptons_pt': [[12, 13], [14]], 'jets_pt': [[15, 16], [17, 18]]}  # from file2

I am unable to merge them such that I get

array3 = {'leptons_pt': [[1, 2, 3], [4, 5], [12, 13], [14]], 'jets_pt': [[7, 8, 9, 10], [11], [15, 16], [17, 18]]}

I have tried to get around this problem by reading the data with pandas and then using pd.concat or similar functions, but I wasn't able to find a way to get a single pandas dataframe to encapsulate this hierarchical structure with varied indexes. Even if there were a way, I would think that it would involve generating a new index in a multi-index dataframe for each branch -- this would be both redundant and difficult to use, since it is hard to keep track of which column follows which index. I would love to be corrected on that if I have a misconception.

One solution is to build a pandas dataframe with column elements being arrays -- this way the dataframe will only worry about the event index. To achieve this I tried the following:

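import pandas as pd
import uproot

# file and branches are defined elsewhere (input files and branch names)
data = uproot.concatenate(file, branches, library='ak')
df = pd.DataFrame(columns=list(data.fields))
for field in data.fields:
    df[field] = data[field].tolist()  # the slow step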

but the conversion of an awkward array to a list seems to be incredibly slow -- O(1 min) per branch for 0.5 million events, for a simple branch with shape (nevents, 2). Is it normal for the conversion to take this long? This also makes me wonder whether uproot can output the data as a list of lists directly as one of its options.

It's worth noting that the set of branches to be extracted from the ROOT file varies each time this code is used.

Do you have any suggestions on how to better handle a situation like this without sacrificing performance?

jpivarski (Maintainer) · 2022-01-25T22:25:08Z

What you would get out of files 1 and 2, if you are only using uproot.TTree.arrays like

with uproot.open("file1.root:tree_name") as tree:
    array1 = tree.arrays(filter_name=["leptons_pt", "jets_pt"])
with uproot.open("file2.root:tree_name") as tree:
    array2 = tree.arrays(filter_name=["leptons_pt", "jets_pt"])

is

>>> array1 = ak.Array([{"leptons_pt": [1, 2, 3], "jets_pt": [7, 8, 9, 10]}, {"leptons_pt": [4, 5], "jets_pt": [11]}])
>>> array2 = ak.Array([{"leptons_pt": [12, 13], "jets_pt": [15, 16]}, {"leptons_pt": [14], "jets_pt": [17, 18]}])

That is, uproot.TTree.arrays makes an Awkward Array with a field for each requested TBranch name. You can project-out fields from an array to look at them individually:

>>> array1.leptons_pt
<Array [[1, 2, 3], [4, 5]] type='2 * var * int64'>
>>> array1.jets_pt
<Array [[7, 8, 9, 10], [11]] type='2 * var * int64'>
>>> array2.leptons_pt
<Array [[12, 13], [14]] type='2 * var * int64'>
>>> array2.jets_pt
<Array [[15, 16], [17, 18]] type='2 * var * int64'>

Then, to get array3, you ak.concatenate them:

>>> array3 = ak.concatenate([array1, array2])

which looks like

>>> array3.tolist()
[
    {'leptons_pt': [1, 2, 3], 'jets_pt': [7, 8, 9, 10]},
    {'leptons_pt': [4, 5], 'jets_pt': [11]},
    {'leptons_pt': [12, 13], 'jets_pt': [15, 16]},
    {'leptons_pt': [14], 'jets_pt': [17, 18]},
]

or

>>> array3.leptons_pt
<Array [[1, 2, 3], [4, 5], [12, 13], [14]] type='4 * var * int64'>
>>> array3.jets_pt
<Array [[7, 8, 9, 10], ... 16], [17, 18]] type='4 * var * int64'>

But that's also what uproot.concatenate does. One call to

array3 = uproot.concatenate(["file1.root:tree_name", "file2.root:tree_name"])

and you should be done. I suspect that your troubles with concatenation are solvable: see my answer to your question on scikit-hep/awkward/discussions/1251.

Conversion to Pandas is often the slowest link, though I just re-read your question and you've actually hit a bottleneck in tolist. I'll get to that later. I wouldn't use Pandas just to concatenate here—it's a lot of overhead to convert into and out of it for functionality that is already available in the array form. Worse still, Pandas's data model doesn't allow for columns in the same DataFrame to have different MultiIndexes, so everything lepton-related would have to be in different DataFrames from everything jet-related, which further complicates this potential solution.
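To make the MultiIndex point concrete, here is a small sketch using only pandas and the toy values from the question. The flatten helper and the (entry, subentry) index names are illustrative assumptions, not part of any library API. Each jagged branch flattens to its own index, and forcing both into one DataFrame pads the mismatched rows with NaN.

```python
import pandas as pd

leptons_pt = [[1, 2, 3], [4, 5]]
jets_pt = [[7, 8, 9, 10], [11]]

def flatten(jagged, name):
    """Flatten a list of lists into a Series indexed by (entry, subentry)."""
    index = pd.MultiIndex.from_tuples(
        [(i, j) for i, event in enumerate(jagged) for j in range(len(event))],
        names=["entry", "subentry"],
    )
    values = [x for event in jagged for x in event]
    return pd.Series(values, index=index, name=name)

leptons = flatten(leptons_pt, "leptons_pt")
jets = flatten(jets_pt, "jets_pt")

# The two branches have different jaggedness, so their indexes differ
same_index = leptons.index.equals(jets.index)

# Joining them anyway takes the union of the indexes and fills gaps with NaN
combined = pd.concat([leptons, jets], axis=1)
n_missing = int(combined.isna().sum().sum())
```

In this toy case the combined frame has six rows, two of which contain a NaN that corresponds to no real lepton or jet.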

About getting data directly from Uproot that is already lists, this is approximately what library="np" does. The NumPy backend doesn't touch anything from Awkward Array or Pandas (which is why Uproot has NumPy as its own strict dependency) and if you have jagged arrays, it returns a NumPy dtype="O" array of NumPy arrays.

That is, instead of this:

>>> import uproot, skhep_testdata
>>> tree = uproot.open(skhep_testdata.data_path("uproot-HZZ.root"))["events"]
>>> tree["Muon_Px"].array()
<Array [[-52.9, 37.7], ... 1.14], [23.9]] type='2421 * var * float32'>

you get this:

>>> tree["Muon_Px"].array(library="np")
array([array([-52.899456,  37.73778 ], dtype=float32),
       array([-0.81645936], dtype=float32),
       array([48.98783  ,  0.8275667], dtype=float32), ...,
       array([-29.756786], dtype=float32),
       array([1.1418698], dtype=float32),
       array([23.913206], dtype=float32)], dtype=object)

This NumPy dtype actually just stores pointers to Python objects, so any Python data type can be stored in it.

>>> import numpy as np
>>> np.array([1, [1, 2, 3], "hello", tree], dtype="O")
array([1, list([1, 2, 3]), 'hello',
       <TTree 'events' (51 branches) at 0x7f5a6f6500d0>], dtype=object)

In other words, this container type is essentially a generic Sequence: it can be iterated over, accessed by index, changed in place, etc. The only thing that Python lists have and NumPy arrays with dtype="O" lack is the ability to append/extend. But, being a generic sequence and not an array-like block of memory, arrays with dtype="O" also lack NumPy's performance and advanced slicing features—NumPy doesn't know what types of objects such an array contains, so it can't do any deep manipulations.

>>> as_awkward = tree["Muon_Px"].array(library="ak")
>>> as_awkward[as_awkward > 10]
<Array [[37.7], [], [49], ... [], [], [23.9]] type='2421 * var * float32'>

>>> as_numpy = tree["Muon_Px"].array(library="np")
>>> as_numpy[as_numpy > 10]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
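If the dtype="O" form is what you have, the same cut has to be applied one event at a time in Python. The following is a minimal sketch with made-up values standing in for the output of tree["Muon_Px"].array(library="np"); it is only meant to show the shape of the workaround, which gives up the vectorized performance discussed above.

```python
import numpy as np

# Stand-in for the library="np" output: one float32 array per event
as_numpy = np.empty(3, dtype=object)
as_numpy[0] = np.array([-52.9, 37.7], dtype=np.float32)
as_numpy[1] = np.array([-0.8], dtype=np.float32)
as_numpy[2] = np.array([49.0, 0.8], dtype=np.float32)

# Preallocate the output so it stays a 1-D object array even if
# every event happens to keep the same number of elements
selected = np.empty(len(as_numpy), dtype=object)
for i, event in enumerate(as_numpy):
    # Inside one event the data is a normal NumPy array, so masking works
    selected[i] = event[event > 10]

counts = [len(event) for event in selected]
```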

Okay, about tolist being slow. The general design philosophy of Awkward Array is that arrays are large, precompiled operations on large memory blocks are fast, and Python is slow. When you ask ak.this or ak.that to do something, it can do a considerable amount of work setting up to call one precompiled function because the setup time is independent of the size of the array (i.e. time complexity O(1)) and the precompiled function depends on the size of the array (time complexity O(n), where n is the length of the array). That's not the case for an operation that extracts one element, such as my_big_array[42]. Here, the fact that we have a large set-up/tear-down time (microseconds or milliseconds) is a killer because this operation doesn't do anything O(n); it's all set-up/tear-down. Iteration is even worse because it's calling my_big_array[0], my_big_array[1], ...
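The same trade-off can be seen with plain NumPy, used here only as an analogy and not as a measurement of Awkward Array itself. One precompiled call pays its set-up cost once, while element-by-element access pays a per-call cost for every item. The array size is an arbitrary assumption, and timings will vary by machine.

```python
import timeit

import numpy as np

big = np.random.normal(0, 1, 100_000)

def one_call():
    # A single precompiled O(n) operation: set-up cost is paid once
    return big.sum()

def many_calls():
    # n separate get-item calls: per-call overhead is paid n times
    total = 0.0
    for i in range(len(big)):
        total += big[i]
    return total

t_one = timeit.timeit(one_call, number=1)
t_many = timeit.timeit(many_calls, number=1)
print(f"one call: {t_one:.6f} s, element by element: {t_many:.6f} s")
```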

In the current version of Awkward Array, tolist is implemented through iteration. The idea was that if you're asking for Python objects, then you don't care about performance. However, that's a little too harsh. Although we can't make my_big_array[42] much faster (overriding __getitem__ has a lot of overhead relative to a Python builtin list get-item), tolist does not need to be implemented through iteration. In Awkward version 2.x, it was reimplemented using a different algorithm (inside-out, rather than outside-in), with a substantial speedup:

In [1]: import awkward as ak, numpy as np

In [2]: array_v1 = ak.Array(np.random.normal(0, 1, (500000, 2)))

In [3]: %timeit as_list = array_v1.tolist()
15 s ± 137 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [4]: array_v2 = ak._v2.Array(np.random.normal(0, 1, (500000, 2)))

In [5]: %timeit as_list = array_v2.tolist()
113 ms ± 1.46 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

My computer is apparently faster than yours (15 seconds instead of 1 minute for half a million pairs), but the next major version will make this tolist case 130× faster.
