-
Notifications
You must be signed in to change notification settings - Fork 76
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(draft): add report=
argument to uproot.dask
; triggers report collection for failed reads
#1050
feat(draft): add report=
argument to uproot.dask
; triggers report collection for failed reads
#1050
Conversation
report=
argument to uproot.dask
; triggers report collection for failed readsreport=
argument to uproot.dask
; triggers report collection for failed reads
We should allow the user to configure the kinds of exceptions that cause empty outputs for a partition rather than hardcoding Exception. Probably defaulted to (OSError, FileNotFoundError)? |
How should we pass that in, as the |
I suppose there should be an argument for custom report fillers and such too? |
That or we have |
It's also worthwhile to get the report on the same side—left or right—as it is in uproot.iterate:
So, report goes on the right if it's a 2-tuple. |
We should have |
Giving this a try and right now I get (this is using dak on the failure-report branch rebased to main, this branch uproot):
|
Ah yes the function classes need a mock_empty along with mock. I'll tinker with it tomorrow |
Tried out the latest branch with the following code: from coffea.nanoevents import NanoEventsFactory, NanoAODSchema
#from distributed import Client
import dask
#import dask_awkward as dak
if __name__ == "__main__":
#client = Client()
#dask.config.set({"awkward.optimization.enabled": True, "awkward.raise-failed-meta": True, "awkward.optimization.on-fail": "raise"})
events, report = NanoEventsFactory.from_root(
["tests/samples/nano_dy.root:Events", "/not/actually/a/root/file.root:Events"],
metadata={"dataset": "nano_dy"},
schemaclass=NanoAODSchema,
uproot_options={"report": True}
).events()
pt, creport = dask.compute(events.Muon.pt, report) the result was:
|
Using uproot alone: import uproot
import dask
if __name__ == "__main__":
events, report = uproot.dask(
["tests/samples/nano_dy.root:Events", "/not/actually/a/root/file.root:Events"],
report=True,
open_files=False,
)
pt, creport = dask.compute(events.Muon_pt, report) yields the exact same error. |
OK After doing that I'm able to get:
tadaa! |
So I can see the utility already of ReportSpec. I think it makes sense that the default report for uproot doesn't fill any information for successful file opening/reading, and in fact we should fill with However, for debugging system-level performance it would be useful to fill every report, regardless of success or failure, with step info (args I guess), bytes read, time taken to read. Adding the exception information when it happens. So I think we should come out of the box with ReportSpec so that folks can dive into debugging/optimizing their clusters if they find performance issues. Possibly even providing a report spec that does what I describe above so it is at hand? |
whoops on missing the backend argument! Fixed in the PR now. On the topic of a report spec: I still need to think about how to make it possible to get a fully customize-able report record that isn't limited in the information it can use to build itself out. We'll likely need users to implement this stuff in their on |
Co-authored-by: Lindsey Gray <[email protected]>
did a test using the full nanoevents machinery and it looks like we're good! report works as expected, no strange side effects. |
@douglasdavis I think the direction you're going is fine in this case, the "user" I believe can be uproot since we have to implement our own thing anyway. We can then present a more black-boxy interface to folks using the library. I think that's OK especially since it's a not-typically user-facing part of dask-awkward. |
Closed in favor of #1058. |
This PR pairs with dask-contrib/dask-awkward#415. It adds a new
report
argument touproot.dask
. It causes the call to return two collections, the actual array of interest represented by the first collection, and a report collection as the second return.As it stands if
report
is notNone
we create a report collection; it's designed to be computed simultaneously with whatever collection of the interest is. An example where we have two field access calls and then adak.max
call:If any of the files at that path end up as failure-to-read (raise an exception), the the
measurement
collection will have those partitions get computed to an empty array, so theresult
array object will be pieced together from a concatenation where some of the ingredients are empty, and thereport
array object will be an awkward array of lengthdataset.npartitions
, (one element in the array per partition)We have some flexibility here with the
report=
argument. Thefrom_map
API in dask-awkward via dask-contrib/dask-awkward#415 has 4 arguments that can steer the behavior hereNone
)cc @lgray