This repository has been archived by the owner on Jun 30, 2022. It is now read-only.

Return a Dask array when loading Bedmap2 #45

Merged
merged 8 commits into from
Jun 24, 2019

Conversation

santisoler
Member

Add a chunks argument to fetch_bedmap2 in order to create a Dask array instead of loading the entire dataset into memory.
Also, add keyword arguments to be passed on to the xarray.open_rasterio function in case any other parameter of the array creation needs to be changed.

Fixes #44
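A minimal sketch of the idea behind the new chunks argument (using a hypothetical stand-in grid, not the real Bedmap2 data or the actual fetch_bedmap2 internals): passing a chunk size makes xarray back the DataArray with a lazy dask array instead of an in-memory numpy one.

```python
# Hypothetical stand-in for a Bedmap2 grid; illustrates what chunking does.
import numpy as np
import xarray as xr

grid = xr.DataArray(np.zeros((1000, 1000)), dims=("y", "x"))
lazy = grid.chunk({"y": 100, "x": 100})  # analogous to chunks=100

print(type(lazy.data).__name__)  # Array (a dask array, not numpy)
print(lazy.data.chunksize)       # (100, 100)
```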

Reminders

  • Run make format and make check to make sure the code follows the style guide.
  • Add tests for new features or tests that would have caught the bug that you're fixing.
  • Add new public functions/methods/classes to doc/api/index.rst.
  • Write detailed docstrings for all functions/methods.
  • If adding new functionality, add an example to the docstring, gallery, and/or tutorials.

@santisoler
Member Author

@leouieda I've been experimenting with reading Bedmap2 datasets as Dask arrays. This significantly reduces the memory consumed when fetching the dataset. For example, I can now load more than one dataset (e.g. [bed, geoid, surface, ice]) with minimal RAM requirements.
However, building the documentation still consumes a lot of memory. The memory spike happens when the gallery plot is generated. I don't have much knowledge of Dask or of how to plot Dask arrays "by chunks", i.e. loading only one or a few chunks into memory at a time.
Do you have any idea how to tackle this?

I'm also wondering if we should include an example showing that, in order to perform computations involving Dask arrays, we must call the .compute() method.

Lastly, I've added Dask as a dependency because the CI builds were failing since the dask module wasn't found.
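A hedged sketch of the .compute() point above: operations on a dask-backed DataArray stay lazy (they only build a task graph) until .compute() or .load() is called.

```python
# Toy dask-backed DataArray; nothing is computed until .compute().
import dask.array as da
import xarray as xr

lazy = xr.DataArray(da.ones((2000, 2000), chunks=(500, 500)), dims=("y", "x"))
mean = lazy.mean()       # still lazy: builds a task graph, loads no data
result = mean.compute()  # executes the graph, chunk by chunk
print(float(result))     # 1.0
```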

@@ -24,7 +24,7 @@
}


def fetch_bedmap2(datasets, *, load=True):
def fetch_bedmap2(datasets, *, load=True, chunks=100, **kwargs):
Member Author

@santisoler santisoler May 28, 2019


I set the default chunks value arbitrarily. Maybe we should increase it to speed up computations by reducing overhead.
Based on the Dask best practices for arrays, we could assign a chunk size of (1000, 1000), taking into account that several datasets may be loaded simultaneously.

What do you think?
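A rough illustration of the overhead trade-off (assuming a 6667 x 6667 grid, the approximate Bedmap2 size at 1 km resolution): smaller chunks mean far more tasks in the dask graph for the same amount of data.

```python
# Compare how many chunks the scheduler has to manage for each chunk size.
import dask.array as da

shape = (6667, 6667)  # assumed Bedmap2-like grid size
small = da.zeros(shape, chunks=100)
large = da.zeros(shape, chunks=1000)
print(small.numblocks)  # (67, 67) -> 4489 chunks to schedule
print(large.numblocks)  # (7, 7)   -> only 49 chunks
```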

Member


@santisoler sounds good to me. We can test what works best for this dataset.

Member Author


👍

Member

@leouieda leouieda left a comment


@santisoler I don't think there is a way around loading the entire dataset when plotting. What we can do is slice it to a smaller area and then plot it. Maybe just the peninsula?

@santisoler
Member Author

@santisoler I don't think there is a way around loading the entire dataset when plotting. What we can do is slice it to a smaller area and then plot it. Maybe just the peninsula?

I was thinking that there might be a way for matplotlib to generate the plot by chunks, like a human would do it (read the value of the array at a pixel, color it, move on to the next one, and so on), but I haven't found anything.

Plotting something smaller is a nice solution. What about reducing the resolution of the image and plotting the entire continent? In fact, we could generate both plots: one to show the whole dataset, and the peninsula one to show the resolution of the model.

@leouieda
Member

leouieda commented Jun 1, 2019

What about reducing the resolution of the image and plot the entire continent?

It would be best to avoid this. Simply taking every other point is not really good practice in this case since it might cause aliasing. We'd have to filter the dataset first, which I'm not keen on doing. But you could do it with a simple Gaussian filter if you want to go that route.
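A sketch of the filter-then-decimate idea mentioned above (synthetic data, not Bedmap2): smoothing with a Gaussian filter before taking every n-th point suppresses the high frequencies that would otherwise alias.

```python
# Anti-aliased downsampling: Gaussian smooth, then decimate.
import numpy as np
from scipy.ndimage import gaussian_filter

factor = 10  # decimation factor (assumed value for illustration)
data = np.random.default_rng(0).normal(size=(1000, 1000))
smoothed = gaussian_filter(data, sigma=factor / 2)
coarse = smoothed[::factor, ::factor]
print(coarse.shape)  # (100, 100)
```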

@santisoler
Member Author

santisoler commented Jun 4, 2019

Ok. I don't really want to cut the grid (the plot looks very good haha), but filtering the grid is a little out of the scope of this issue.

I'm thinking that some of the original grids are given as integers, but when we read them with rasterio we get floats (64-bit floats, to be exact).
What about converting these grids to integers?
For some of them we don't need 64-bit ints.
For example, for the thickness, 16-bit integers are enough: an int16 holds values up to 32767 (2^16 = 65536 possible values), while the maximum value of the thickness is 4621.

This fix would not only make the docs build faster, but would also reduce memory consumption for every user.
What do you think?
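The dtype-range argument above can be checked quickly with numpy:

```python
# Verify that int16 can hold the maximum Bedmap2 thickness value.
import numpy as np

max_thickness = 4621  # maximum Bedmap2 ice thickness, in meters
print(np.iinfo(np.int16).max)                 # 32767, comfortably above 4621
print(np.iinfo(np.int16).dtype.itemsize * 8)  # 16 bits per value vs 64 for float64
```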

@leouieda
Member

leouieda commented Jun 5, 2019

I'm thinking that some of the original grids are given as integers, but when we read them with rasterio we get floats (64-bit floats, to be exact).

I was just checking this, and the conversion happens on the .where call, since np.nan only works with floats and floats are 64-bit by default. There is no way of using integers for this since you can't have NaNs with integers. It would also cause some problems with division if users aren't careful. What we can do is use float32 instead to cut the memory in half (or even float16).

This should be mentioned in the docstring somewhere.
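A small demonstration of the promotion described above, using a toy grid and a hypothetical -9999 nodata value (not the real Bedmap2 nodata): .where fills masked cells with NaN, which forces float64, and a later cast to float32 halves the memory.

```python
# Toy example: integer grid + NaN masking promotes the dtype to float64.
import numpy as np
import xarray as xr

grid = xr.DataArray(np.array([[1, 2], [-9999, 4]], dtype="int32"))
masked = grid.where(grid != -9999)  # -9999 is a made-up nodata value

print(masked.dtype)                                     # float64 (NaN needs floats)
print(masked.astype("float32").nbytes / masked.nbytes)  # 0.5
```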

@santisoler
Member Author

You are right, @leouieda. After writing my last comment, I was wondering whether changing the dtype to ints could be problematic when performing mathematical operations on the dataset.
I like the float32 approach. In fact, I'm thinking we could add an optional argument for this.
That way we can add a warning to the documentation and explain what this optional dtype argument means, which will make very clear what we are doing under the hood.

@leouieda
Member

In fact I'm thinking we could add an optional argument for this.

That would be great! Whenever we set the dtype of something, it should always be an optional argument that can be controlled by the user.

@santisoler
Member Author

Ok. After re-reading this sentence:

I was just checking this and the conversion happens on the .where call since np.nan only works on floats and floats are 64bit by default.

I noticed that we are in fact changing the data type when we replace the nodatavals with NaNs.
When loading a dataset with xr.open_rasterio I get the original data type (e.g. np.int64 for surface and bed, but np.float64 for geoid). It's on the next line that we force them all to be np.float64.
This changes my thoughts about this. I think we should avoid converting the nodatavals to NaNs in order to keep the original data type.
The nodatavals value is stored as metadata on each DataArray of the Dataset, so we could let users change it themselves if they need to.
What do you think, @leouieda?

PS: I want this little PR to be merged. We could leave the discussion about the memory consumption of building the bedmap2 example for another issue.
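A sketch of the alternative discussed above: keep the original integer dtype and let users mask nodata themselves, reading the nodata value from the DataArray metadata (xarray.open_rasterio stores it in a "nodatavals" attribute; the grid values and nodata value here are made up).

```python
# Toy grid that keeps its integer dtype; conversion to NaN is opt-in.
import numpy as np
import xarray as xr

grid = xr.DataArray(
    np.array([[100, 32767], [32767, 250]], dtype="int16"),
    attrs={"nodatavals": (32767,)},
)
print(grid.dtype)  # int16: the original dtype is preserved

# Users perform the float64/NaN conversion only when they need it:
nodata = grid.attrs["nodatavals"][0]
masked = grid.where(grid != nodata)
print(int(np.isnan(masked).sum()))  # 2 masked cells
```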

@leouieda
Member

@santisoler I agree that we should get this merged first and then consider what to do about the dtypes.

But I don't like the idea of giving users weird nodatavals instead of nans. We can't even plot the data properly without converting to nans so I imagine everyone will always convert to nans, which defeats the purpose of not converting.

@leouieda
Member

@santisoler please feel free to merge away 👍

@santisoler
Member Author

Ok @leouieda! Thanks for the feedback! I'll merge and then open an issue regarding the dtypes and the memory consumption of the bedmap2 example.

@santisoler santisoler merged commit da79ea6 into master Jun 24, 2019
@santisoler santisoler deleted the bedmap2-dask branch June 24, 2019 12:42
Merging this pull request closed the issue: Decrease memory consumption when reading Bedmap2 dataset (#44)