Computational performance of iris.cube.Cube.aggregated_by with lazy data #5455
Comments
I found this project which seems relevant: https://github.com/xarray-contrib/flox
@scitools/peloton our standard answer to these problems is that getting Dask to work well is hard, and you should take out the data and use Dask (!). So, that approach probably waits on #5398. But we'd be keen to investigate if that can solve this case!
Not generally, but I had a look at the use cases in ESMValCore and they are all time-related, and there is always an (almost) repeating structure in the coordinate. In some cases, e.g. day of the month, the structure can easily be made repeating by adding some extra days and then masking out the extra data.
Indeed, slicing so all repeats have the same size, masking the extra data points introduced by slicing, and then reshaping and collapsing along the required dimension seems the best solution here. However, I struggled to reshape a cube; is there some feature available to do this that I overlooked? I'm also concerned that getting the right values for the coordinates is tricky when using this approach. Maybe the best solution would be to use the algorithm described above internally in aggregated_by.
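To make the idea concrete, here is a minimal sketch (not an Iris implementation) of the reshape-and-collapse approach for the monthly-to-annual case, written directly against dask.array; the array shape, chunking, and variable names are illustrative assumptions.

```python
import dask.array as da

n_years, months_per_year = 250, 12

# 250 years of monthly data on a 300 x 300 grid, chunked with one year per chunk.
data = da.random.random((n_years * months_per_year, 300, 300),
                        chunks=(months_per_year, 300, 300))

# Because every "year" group has the same length, the time axis can be reshaped
# into (years, months) and collapsed over the months axis in a single step.
# This preserves the spatial chunking and keeps the task graph small.
# For groups of unequal size (e.g. day of the month), one would first pad each
# group to the maximum length and mask the padded values before collapsing.
annual_mean = data.reshape(n_years, months_per_year, 300, 300).mean(axis=1)

print(annual_mean.shape)      # (250, 300, 300)
print(annual_mean.chunksize)  # one year per chunk along the leading axis
```

As noted above, producing correct coordinate values for the aggregated cube is the part this sketch does not address; it only handles the data.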
My feeling is that whether or not that will be beneficial depends on how numpy actually iterates over the input array when it computes the statistics along a dimension.
Sounds familiar to me as an Xarray user :)
Absolutely. See https://flox.readthedocs.io/en/latest/implementation.html#method-cohorts. I strongly recommend you use it.
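For reference, a minimal sketch of how flox might be applied here, assuming flox's groupby_reduce function with the "cohorts" strategy described in the linked page; the shapes, chunking, and label construction are illustrative, not taken from the issue.

```python
import dask.array as da
import numpy as np
from flox.core import groupby_reduce

n_years = 250
# Time as the trailing axis, chunked with one year (12 months) per chunk.
data = da.random.random((300, 300, n_years * 12), chunks=(300, 300, 12))
years = np.repeat(np.arange(n_years), 12)  # one group label per time step

# Group the trailing (time) axis by year and take the mean; the "cohorts"
# strategy exploits the repeating structure of the labels to keep the task
# graph small.
result, groups = groupby_reduce(data, years, func="mean", method="cohorts")

print(result.shape)  # expected: (300, 300, 250), one value per year
print(groups)        # the unique year labels
```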
Current behaviour

The method iris.cube.Cube.aggregated_by produces many tiny chunks and a very large graph when used with lazy data. When using this method as part of a larger computation, my dask graphs become so large that the computation fails to run. Even if the computation would run, it would be needlessly slow because of the many tiny chunks.

Expected behaviour

The method iris.cube.Cube.aggregated_by should respect the chunks of the input data as much as possible and produce a modestly sized dask graph.

Example code showing the current behaviour
Here is an example script that demonstrates the issue. The cube in the example represents 250 years of monthly data on a 300 x 300 spatial grid, and the script computes the annual mean.
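The original script is not reproduced here; below is a minimal sketch of the kind of script described, building a lazily-loaded cube of 250 years of monthly data on a 300 x 300 grid and aggregating it by year. The coordinate construction and the way chunk and graph sizes are inspected are assumptions for illustration.

```python
import dask.array as da
import numpy as np

import iris.analysis
import iris.coords
import iris.cube

n_years = 250
n_time = n_years * 12

# Lazy data: 250 years of monthly values on a 300 x 300 grid, one year per chunk.
data = da.zeros((n_time, 300, 300), chunks=(12, 300, 300), dtype=np.float32)

time = iris.coords.DimCoord(np.arange(n_time, dtype=np.float64), long_name="time")
year = iris.coords.AuxCoord(np.repeat(np.arange(n_years), 12), long_name="year")

cube = iris.cube.Cube(
    data,
    dim_coords_and_dims=[(time, 0)],
    aux_coords_and_dims=[(year, 0)],
)

# Compute the annual mean lazily.
result = cube.aggregated_by("year", iris.analysis.MEAN)

# Inspect the chunking and task-graph size of the lazy input and result.
for name, c in [("input", cube), ("result", result)]:
    lazy = c.lazy_data()
    print(name, "chunksize:", lazy.chunksize, "tasks:", len(lazy.__dask_graph__()))
```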
Running the script prints the chunk layout and task-graph size before and after the aggregation, showing that the result has many tiny chunks and a much larger graph than the input.