Replies: 1 comment
-
What you would get out of files 1 and 2, if you are only using uproot.TTree.arrays like

with uproot.open("file1.root:tree_name") as tree:
    array1 = tree.arrays(filter_name=["leptons_pt", "jets_pt"])
with uproot.open("file2.root:tree_name") as tree:
    array2 = tree.arrays(filter_name=["leptons_pt", "jets_pt"])

is

>>> array1 = ak.Array([{"leptons_pt": [1, 2, 3], "jets_pt": [7, 8, 9, 10]}, {"leptons_pt": [4, 5], "jets_pt": [11]}])
>>> array2 = ak.Array([{"leptons_pt": [12, 13], "jets_pt": [15, 16]}, {"leptons_pt": [14], "jets_pt": [17, 18]}])

That is, uproot.TTree.arrays makes an Awkward Array with a field for each requested TBranch name. You can project out fields from an array to look at them individually:

>>> array1.leptons_pt
<Array [[1, 2, 3], [4, 5]] type='2 * var * int64'>
>>> array1.jets_pt
<Array [[7, 8, 9, 10], [11]] type='2 * var * int64'>
>>> array2.leptons_pt
<Array [[12, 13], [14]] type='2 * var * int64'>
>>> array2.jets_pt
<Array [[15, 16], [17, 18]] type='2 * var * int64'>

Then, to get a single array covering both files:

>>> array3 = ak.concatenate([array1, array2])

which looks like

>>> array3.tolist()
[
{'leptons_pt': [1, 2, 3], 'jets_pt': [7, 8, 9, 10]},
{'leptons_pt': [4, 5], 'jets_pt': [11]},
{'leptons_pt': [12, 13], 'jets_pt': [15, 16]},
{'leptons_pt': [14], 'jets_pt': [17, 18]},
]

or

>>> array3.leptons_pt
<Array [[1, 2, 3], [4, 5], [12, 13], [14]] type='4 * var * int64'>
>>> array3.jets_pt
<Array [[7, 8, 9, 10], ... 16], [17, 18]] type='4 * var * int64'>

But that's also what uproot.concatenate does. One call to

array3 = uproot.concatenate(["file1.root:tree_name", "file2.root:tree_name"])

and you should be done.

I suspect that your troubles with concatenation are solvable: see my answer to your question on scikit-hep/awkward/discussions/1251.

Conversion to Pandas is often the slowest link, though I just re-read your question and you've actually hit a bottleneck in the conversion of Awkward Arrays to Python lists.

About getting data directly from Uproot that is already lists, this is approximately what library="np" does. That is, instead of this:

>>> import uproot, skhep_testdata
>>> tree = uproot.open(skhep_testdata.data_path("uproot-HZZ.root"))["events"]
>>> tree["Muon_Px"].array()
<Array [[-52.9, 37.7], ... 1.14], [23.9]] type='2421 * var * float32'>

you get this:

>>> tree["Muon_Px"].array(library="np")
array([array([-52.899456, 37.73778 ], dtype=float32),
array([-0.81645936], dtype=float32),
array([48.98783 , 0.8275667], dtype=float32), ...,
array([-29.756786], dtype=float32),
array([1.1418698], dtype=float32),
array([23.913206], dtype=float32)], dtype=object)

This NumPy array has dtype=object, which means that its elements can be any Python objects at all:

>>> np.array([1, [1, 2, 3], "hello", tree], dtype="O")
array([1, list([1, 2, 3]), 'hello',
<TTree 'events' (51 branches) at 0x7f5a6f6500d0>], dtype=object)

In other words, this container type is essentially a generic Sequence: it can be iterated over, accessed by index, changed in place, etc. The only thing that Python lists have and NumPy arrays with dtype=object lack is the ability to change length (append, extend, and so on). What such an array does not do is slice like an Awkward Array:

>>> as_awkward = tree["Muon_Px"].array(library="ak")
>>> as_awkward[as_awkward > 10]
<Array [[37.7], [], [49], ... [], [], [23.9]] type='2421 * var * float32'>

>>> as_numpy = tree["Muon_Px"].array(library="np")
>>> as_numpy[as_numpy > 10]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
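ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

Okay, about the slow conversion to lists. In the current version of Awkward Array, tolist is much slower than in the next major version, which can already be tried as ak._v2:

In [1]: import awkward as ak, numpy as np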
In [2]: array_v1 = ak.Array(np.random.normal(0, 1, (500000, 2)))
In [3]: %timeit as_list = array_v1.tolist()
15 s ± 137 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [4]: array_v2 = ak._v2.Array(np.random.normal(0, 1, (500000, 2)))
In [5]: %timeit as_list = array_v2.tolist()
113 ms ± 1.46 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

My computer is apparently faster than yours (15 seconds instead of 1 minute for half a million pairs), but the next major version will make this much faster.
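
If you do stay with the dtype=object array that library="np" returns, the jagged selection that failed above has to be done one event at a time. The following is only a minimal sketch of that idea: the small hand-built array is a hypothetical stand-in for tree["Muon_Px"].array(library="np"), and the values are made up so that the example runs without a ROOT file.

```python
import numpy as np

# Hypothetical stand-in for tree["Muon_Px"].array(library="np"):
# a 1-D dtype=object array whose elements are float32 arrays of varying length.
values = [[-52.5, 37.75], [-0.75], [49.0, 0.875], [23.5]]
as_numpy = np.empty(len(values), dtype=object)
for i, v in enumerate(values):
    # Assign element by element so NumPy never tries to build a 2-D array.
    as_numpy[i] = np.array(v, dtype=np.float32)

# "as_numpy > 10" fails on the outer array, so apply the cut per event.
selected = [event[event > 10] for event in as_numpy]

# Each inner array converts to a plain Python list with its own tolist().
as_lists = [event.tolist() for event in selected]
print(as_lists)
```

The Python-level loop is the price of this representation: it is easy to hand to other libraries as a list of lists, but every per-event operation runs in the interpreter, whereas the Awkward Array slice shown earlier handles all events in one call.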
-
Hi,
I am running into problems trying to process data from a ROOT file with uproot where the branches read are differently jagged. For example, I want to read the leptons_pt branch and the jets_pt branch. I need to extract these branches from many files, then join them together so that I have one data structure with the kinematics for all events. I can read the data through awkward, but then I am unable to join awkward arrays from different files on event index (I raised a discussion on the awkward GitHub page). To be clearer, if I have an awkward array from one file

I am unable to merge them such that I get

I have tried to get around this problem by reading data with pandas, then using pd.concat or similar functions, but I wasn't able to find a way to get a single pandas dataframe to encapsulate this hierarchical structure with varied indexes. Even if there was a way, I would think that it would involve generating a new index in a multi-index dataframe for each branch -- this would be both redundant and difficult to use since it is hard to keep track of which column follows which index. I would love to be corrected on that if I have a misconception.

One solution is to build a pandas dataframe with column elements being arrays -- this way the dataframe will only worry about the event index. To achieve this I tried the following:

but the conversion of an awkward array to list seems to be incredibly slow -- O(1 min) per branch for 0.5 million events for a simple branch with shape (nevents,2). Is it normal for the conversion to take this long? This also makes me wonder if uproot can output the data as a list of lists directly as one of the options?

It's worth noting that when this code is used, which branches will be extracted from the ROOT file is something that varies.

Do you have any suggestions on how to better handle a situation like this without sacrificing performance?