-
Notifications
You must be signed in to change notification settings - Fork 14
Return a Dask array when loading Bedmap2 #45
Conversation
@leouieda I've been experimenting with reading Bedmap2 datasets as Dask arrays. This significantly reduces the consumed memory when fetching the dataset. For example, now I can load more than one dataset (e.g. I'm also wondering if we should include some example showing that in order to perform computation involving Dask arrays we must add the Lastly, I've added Dask as a dependency because CIs were failing due to |
rockhound/bedmap2.py
Outdated
@@ -24,7 +24,7 @@ | |||
} | |||
|
|||
|
|||
def fetch_bedmap2(datasets, *, load=True): | |||
def fetch_bedmap2(datasets, *, load=True, chunks=100, **kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I set the default chunks number arbitrarily. Maybe we should increase it to speed up computations by reducing overheads.
Based on Dask Best Practices on arrays, we could assign chunks size of (1000, 1000)
taking into account that several datasets may be loaded simultaneously.
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@santisoler sounds good to me. We can test what works best for this dataset.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@santisoler I don't think there is a way around loading the entire dataset when plotting. What we can do is slice it to a smaller areat and then plot it. Maybe just the peninsula?
rockhound/bedmap2.py
Outdated
@@ -24,7 +24,7 @@ | |||
} | |||
|
|||
|
|||
def fetch_bedmap2(datasets, *, load=True): | |||
def fetch_bedmap2(datasets, *, load=True, chunks=100, **kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@santisoler sounds good to me. We can test what works best for this dataset.
Co-Authored-By: Leonardo Uieda <[email protected]>
I was thinking that may be a way that matplotlib generates the plot by chunks, like a human would do it (read the value of the array on a pixel, color it, move to the next one and so on) but haven't found anything. Plotting something smaller is a nice solution. What about reducing the resolution of the image and plot the entire continent? In fact we could generate both plots: one to show the whole data, and the peninsula one to show the resolution of the model. |
It would be best to avoid this. Simply taking every other point is not really good practice in this case since it might cause aliasing. We'd have to filter the dataset first, which I'm not keen on doing. But you could do it with a simple gaussian filter if you want to do that. |
Ok. I don't really want to cut the grid (the plot looks very good haha), but filtering the grid it's a little out of the scope of this issue. I'm thinking that the some original grids are given in integers, but when we read them with rasterio we get floats (64 bit floats to be exact). This fix will not only make doc build faster, but will ease the memory consumption for every user. |
I was just checking this and the conversion happens on the This should be mentioned in the docstring somewhere. |
Co-Authored-By: Leonardo Uieda <[email protected]>
You are right @leouieda. After I've written the last comment I was wondering if changing the dtype to ints could be problematic when performing mathematical operations with the dataset. |
That would be great! Whenever we set the dtype of something, it should always be an optional argument that can be controlled by the user. |
Ok. After re-reading this sentence:
I noticed that we are in fact changing the data type when we replace the This changes my thoughts about this. I think we should avoid converting the PS: I want this little PR to be merged. We could leave the memory consumption of building the bedmap2 example discussion for another issue. |
@santisoler I agree that we should get this merged first and then consider what to do about the dtypes. But I don't like the idea of giving users weird nodatavals instead of nans. We can't even plot the data properly without converting to nans so I imagine everyone will always convert to nans, which defeats the purpose of not converting. |
@santisoler please feel free to merge away 👍 |
Ok @leouieda! Thanks for the feedback! I'll merge and then open an issue regarding the dtypes and the memory consumption of the bedmap2 example. |
Add
chunks
argument tofetch_bedmap2
in order to create a Dask array instead of loading the entire dataset into memory.Also, add keyword arguments to be passed to
xarray.open_rasterio
function in case any other parameter of the array creation wants to be changed.Fixes #44
Reminders
make format
andmake check
to make sure the code follows the style guide.doc/api/index.rst
.