This repository has been archived by the owner on Jun 30, 2022. It is now read-only.

Return a Dask array when loading Bedmap2 #45

Merged
merged 8 commits into from
Jun 24, 2019

Conversation

santisoler
Member

Add a chunks argument to fetch_bedmap2 in order to create a Dask array instead of loading the entire dataset into memory.
Also, add keyword arguments to be passed on to the xarray.open_rasterio function in case any other parameter of the array creation needs to be changed.

Fixes #44
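A minimal sketch of the idea behind the new chunks argument (using a hypothetical stand-in grid, not the real Bedmap2 data or the actual fetch_bedmap2 internals): passing a chunk size makes xarray back the DataArray with a lazy dask array instead of an in-memory numpy one.

```python
# Hypothetical stand-in for a Bedmap2 grid; illustrates what chunking does.
import numpy as np
import xarray as xr

grid = xr.DataArray(np.zeros((1000, 1000)), dims=("y", "x"))
lazy = grid.chunk({"y": 100, "x": 100})  # analogous to chunks=100

print(type(lazy.data).__name__)  # Array (a dask array, not numpy)
print(lazy.data.chunksize)       # (100, 100)
```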

Reminders

  • Run make format and make check to make sure the code follows the style guide.
  • Add tests for new features or tests that would have caught the bug that you're fixing.
  • Add new public functions/methods/classes to doc/api/index.rst.
  • Write detailed docstrings for all functions/methods.
  • If adding new functionality, add an example to the docstring, gallery, and/or tutorials.

@santisoler
Member Author

@leouieda I've been experimenting with reading Bedmap2 datasets as Dask arrays. This significantly reduces the memory consumed when fetching the dataset. For example, I can now load more than one dataset (e.g. [bed, geoid, surface, ice]) with minimal RAM requirements.
However, building the documentation still consumes a lot of memory. The memory spike happens when the gallery plot is generated. I don't have much knowledge of Dask or of how to plot Dask arrays "by chunks", i.e. loading only one or a few chunks into memory at a time.
Do you have any idea how to tackle this?

I'm also wondering if we should include an example showing that, in order to perform computations involving Dask arrays, we must call the .compute() method.

Lastly, I've added Dask as a dependency because the CI builds were failing since the dask module wasn't found.
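A hedged sketch of the .compute() point above: operations on a dask-backed DataArray stay lazy (they only build a task graph) until .compute() or .load() is called.

```python
# Toy dask-backed DataArray; nothing is computed until .compute().
import dask.array as da
import xarray as xr

lazy = xr.DataArray(da.ones((2000, 2000), chunks=(500, 500)), dims=("y", "x"))
mean = lazy.mean()       # still lazy: builds a task graph, loads no data
result = mean.compute()  # executes the graph, chunk by chunk
print(float(result))     # 1.0
```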

@@ -24,7 +24,7 @@
}


def fetch_bedmap2(datasets, *, load=True):
def fetch_bedmap2(datasets, *, load=True, chunks=100, **kwargs):
Member Author

@santisoler santisoler May 28, 2019


I set the default chunks value arbitrarily. Maybe we should increase it to speed up computations by reducing overhead.
Based on the Dask best practices for arrays, we could assign a chunk size of (1000, 1000), taking into account that several datasets may be loaded simultaneously.

What do you think?
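A rough illustration of the overhead trade-off (assuming a 6667 x 6667 grid, the approximate Bedmap2 size at 1 km resolution): smaller chunks mean far more tasks in the dask graph for the same amount of data.

```python
# Compare how many chunks the scheduler has to manage for each chunk size.
import dask.array as da

shape = (6667, 6667)  # assumed Bedmap2-like grid size
small = da.zeros(shape, chunks=100)
large = da.zeros(shape, chunks=1000)
print(small.numblocks)  # (67, 67) -> 4489 chunks to schedule
print(large.numblocks)  # (7, 7)   -> only 49 chunks
```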

Member


@santisoler sounds good to me. We can test what works best for this dataset.

Member Author


👍

Member

@leouieda leouieda left a comment


@santisoler I don't think there is a way around loading the entire dataset when plotting. What we can do is slice it to a smaller area and then plot it. Maybe just the peninsula?

@santisoler
Member Author

@santisoler I don't think there is a way around loading the entire dataset when plotting. What we can do is slice it to a smaller area and then plot it. Maybe just the peninsula?

I was thinking that there might be a way for matplotlib to generate the plot by chunks, like a human would do it (read the value of the array at a pixel, color it, move on to the next one, and so on), but I haven't found anything.

Plotting something smaller is a nice solution. What about reducing the resolution of the image and plotting the entire continent? In fact, we could generate both plots: one to show the whole dataset, and the peninsula one to show the resolution of the model.

@leouieda
Member

leouieda commented Jun 1, 2019

What about reducing the resolution of the image and plot the entire continent?

It would be best to avoid this. Simply taking every other point is not really good practice in this case since it might cause aliasing. We'd have to filter the dataset first, which I'm not keen on doing. But you could do it with a simple Gaussian filter if you want to go that route.
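A sketch of the filter-then-decimate idea mentioned above (synthetic data, not Bedmap2): smoothing with a Gaussian filter before taking every n-th point suppresses the high frequencies that would otherwise alias.

```python
# Anti-aliased downsampling: Gaussian smooth, then decimate.
import numpy as np
from scipy.ndimage import gaussian_filter

factor = 10  # decimation factor (assumed value for illustration)
data = np.random.default_rng(0).normal(size=(1000, 1000))
smoothed = gaussian_filter(data, sigma=factor / 2)
coarse = smoothed[::factor, ::factor]
print(coarse.shape)  # (100, 100)
```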

@santisoler
Member Author

santisoler commented Jun 4, 2019

Ok. I don't really want to cut the grid (the plot looks very good haha), but filtering the grid is a little out of the scope of this issue.

I'm thinking that some of the original grids are given as integers, but when we read them with rasterio we get floats (64-bit floats, to be exact).
What about converting these grids to integers?
For some of them we don't need 64-bit ints.
For example, for the thickness, 16-bit integers are enough: an int16 holds values up to 32767 (2^16 = 65536 possible values), while the maximum value of the thickness is 4621.

This fix would not only make the docs build faster, but would also reduce memory consumption for every user.
What do you think?
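The dtype-range argument above can be checked quickly with numpy:

```python
# Verify that int16 can hold the maximum Bedmap2 thickness value.
import numpy as np

max_thickness = 4621  # maximum Bedmap2 ice thickness, in meters
print(np.iinfo(np.int16).max)                 # 32767, comfortably above 4621
print(np.iinfo(np.int16).dtype.itemsize * 8)  # 16 bits per value vs 64 for float64
```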

@leouieda
Member

leouieda commented Jun 5, 2019

I'm thinking that some of the original grids are given as integers, but when we read them with rasterio we get floats (64-bit floats, to be exact).

I was just checking this, and the conversion happens on the .where call, since np.nan only works with floats and floats are 64-bit by default. There is no way of using integers for this since you can't have NaNs with integers. It would also cause some problems with division if users aren't careful. What we can do is use float32 instead to cut the memory in half (or even float16).

This should be mentioned in the docstring somewhere.
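A small demonstration of the promotion described above, using a toy grid and a hypothetical -9999 nodata value (not the real Bedmap2 nodata): .where fills masked cells with NaN, which forces float64, and a later cast to float32 halves the memory.

```python
# Toy example: integer grid + NaN masking promotes the dtype to float64.
import numpy as np
import xarray as xr

grid = xr.DataArray(np.array([[1, 2], [-9999, 4]], dtype="int32"))
masked = grid.where(grid != -9999)  # -9999 is a made-up nodata value

print(masked.dtype)                                     # float64 (NaN needs floats)
print(masked.astype("float32").nbytes / masked.nbytes)  # 0.5
```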

@santisoler
Member Author

You are right, @leouieda. After writing my last comment, I was wondering whether changing the dtype to ints could be problematic when performing mathematical operations on the dataset.
I like the float32 approach. In fact, I'm thinking we could add an optional argument for this.
That way we can add a warning to the documentation and explain what this optional dtype argument means, which will make very clear what we are doing under the hood.

@leouieda
Member

In fact I'm thinking we could add an optional argument for this.

That would be great! Whenever we set the dtype of something, it should always be an optional argument that can be controlled by the user.

@santisoler
Member Author

Ok. After re-reading this sentence:

I was just checking this and the conversion happens on the .where call since np.nan only works on floats and floats are 64bit by default.

I noticed that we are in fact changing the data type when we replace the nodatavals with NaNs.
When loading a dataset with xr.open_rasterio I get the original data type (e.g. np.int64 for surface and bed, but np.float64 for geoid). It's on the next line that we force them all to be np.float64.
This changes my thoughts about this. I think we should avoid converting the nodatavals to NaNs in order to keep the original data type.
The nodatavals value is stored as metadata on each DataArray of the Dataset, so we could let users change it themselves if they need to.
What do you think, @leouieda?

PS: I want this little PR to be merged. We could leave the discussion about the memory consumption of building the bedmap2 example for another issue.
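A sketch of the alternative discussed above: keep the original integer dtype and let users mask nodata themselves, reading the nodata value from the DataArray metadata (xarray.open_rasterio stores it in a "nodatavals" attribute; the grid values and nodata value here are made up).

```python
# Toy grid that keeps its integer dtype; conversion to NaN is opt-in.
import numpy as np
import xarray as xr

grid = xr.DataArray(
    np.array([[100, 32767], [32767, 250]], dtype="int16"),
    attrs={"nodatavals": (32767,)},
)
print(grid.dtype)  # int16: the original dtype is preserved

# Users perform the float64/NaN conversion only when they need it:
nodata = grid.attrs["nodatavals"][0]
masked = grid.where(grid != nodata)
print(int(np.isnan(masked).sum()))  # 2 masked cells
```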

@leouieda
Member

@santisoler I agree that we should get this merged first and then consider what to do about the dtypes.

But I don't like the idea of giving users weird nodatavals instead of nans. We can't even plot the data properly without converting to nans so I imagine everyone will always convert to nans, which defeats the purpose of not converting.

@leouieda
Member

@santisoler please feel free to merge away 👍

@santisoler
Member Author

Ok @leouieda! Thanks for the feedback! I'll merge and then open an issue regarding the dtypes and the memory consumption of the bedmap2 example.

@santisoler santisoler merged commit da79ea6 into master Jun 24, 2019
@santisoler santisoler deleted the bedmap2-dask branch June 24, 2019 12:42
Merging this pull request closed the issue: Decrease memory consumption when reading Bedmap2 dataset (#44)