Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Xarray sub-package #1013

Merged
merged 19 commits into from
Nov 7, 2024
Merged

Add Xarray sub-package #1013

merged 19 commits into from
Nov 7, 2024

Conversation

vincentsarago
Copy link
Member

@vincentsarago vincentsarago commented Oct 29, 2024

overtake #1007

ref: developmentseed/titiler-xarray#68

To Do

  • reviews
  • add Docs

if "x" not in da.dims and "y" not in da.dims:
try:
latitude_var_name = next(
x for x in ["lat", "latitude", "LAT", "LATITUDE", "Lat"] if x in da.dims
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to support other variable name?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say no. Dataset with other names will likely not be regular lat/lon grids and fail in other ways.



@define(kw_only=True)
class TilerFactory(BaseTilerFactory):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By sub-classing titiler.core.factory.TilerFactory we avoid re-writing code

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But do we need to redefine all of the class attributes (e.g. stats_dependency) that are already defined in titiler.core.factory.TilerFactory?


# remove some attribute from init
img_preview_dependency: Type[DefaultDependency] = field(init=False)
add_preview: bool = field(init=False)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we remove those 2 attributes because we don't support /preview endpoints

return Response(content, media_type=media_type)

# custom /statistics endpoints (remove /statistics - GET)
def statistics(self):
Copy link
Member Author

@vincentsarago vincentsarago Oct 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

☝️ IMO having a full dataset /statistics in a bit dangerous (as for the /preview endpoints) which is why we support only geojson statistics

@vincentsarago

This comment was marked as outdated.

@vincentsarago

This comment was marked as resolved.

@vincentsarago vincentsarago marked this pull request as ready for review October 30, 2024 15:05
Copy link
Member

@maxrjones maxrjones left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all your work here, @vincentsarago!

This is a very opinionated take, but I think titiler-xarray would be best off with two separate routes, each with its own set of optional dependencies. The first route would be zarr, which would open Zarr and virtual Zarr datasets using xarray.open_zarr. The second route would be md, which would opening any dataset readable by xarray.open_dataset.

The primary reason I think we should do this is that it would enable us to incentivize virtualizing datasets into zarr, which would lead to much faster tile generation. We could do this by:

  1. Having all query parameters in the zarr route only relevant for open_zarr, simplifying API usage.
  2. Automatically detect virtual datasets, removing the need for the reference parameter.
  3. Lightening the image size for titiler-xarray deployments only using zarr because other readers would not be installed (and eventually obstore and/or icechunk could be used instead of the fsspec dependencies)

This would also simplify non-zarr usage for the following reasons:

  1. Zarr specific parameters (e.g., group, consolidated) would not be included in endpoints in the md route
  2. We could use xarray's automatic backend detection rather than writing our own in titiler/xarray/io.py

I also think isolating Zarr usage would simplify the eventual support of the GeoZarr and multiscales specifications.

@vincentsarago
Copy link
Member Author

Thanks @maxrjones 🙏

I see what you're saying. The goal of having a single Reader was to handle all the non-COG dataset so splitting in to two separate reader/set of endpoints would not meat the goal.

This is a very opinionated take, but I think titiler-xarray would be best off with two separate routes, each with its own set of optional dependencies. The first route would be zarr, which would open Zarr and virtual Zarr datasets using xarray.open_zarr. The second route would be md, which would opening any dataset readable by xarray.open_dataset.

We can absolutely use xarray.open_zarr instead of xarray.open_dataset here when reading a zarr

We could use xarray's automatic backend detection rather than writing our own in titiler/xarray/io.py

How so? https://github.com/developmentseed/titiler/pull/1013/files#diff-dd6fab5d1e55a1d860ff8bd2190f145f2574100f734d382cee56c48bd7a7f1f5R43-R49 ?

If I follow your think, it seems we would need a titiler.multidim and a titiler.zarr packages 🤷

What if we make the dependencies optional? I'm going to open a PR on top of this one to try some things

Copy link
Contributor

@hrodmn hrodmn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great! The concept of creating pyramids in a zarr store was new to me, then I googled around and found @maxrjones's notebook 😆.

It is great to have the io methods standardized here so we can import them in titiler.cmr and other applications.

src/titiler/xarray/tests/test_factory.py Show resolved Hide resolved
src/titiler/xarray/titiler/xarray/io.py Show resolved Hide resolved
src/titiler/xarray/titiler/xarray/io.py Show resolved Hide resolved
vincentsarago and others added 4 commits October 31, 2024 18:12
* use xarray.open_zarr and make aiohttp and s3fs optional

* add support for references

* tests prefixed protocol

* use tmp_dir for reference

* add parquet support

* remove kerchunk support
):
"""return available variables."""
with self.dataset_opener(src_path, **io_params.as_dict()) as ds:
return list(ds.data_vars) # type: ignore
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

easier to put this into an extension than to have a list_variable in a Class Method (because in the class method the dataset_opener is customizable

...


def xarray_open_dataset( # noqa: C901
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've moved everything within the xarray_open_dataset function to ease the customization and also because I've made som dependencies optional

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so this is a fine default option for some applications, but it is easier to customize now because all the user has to do is write a new dataset_opener - nice!

variable: str = attr.ib()

# xarray.Dataset options
opener: Callable[..., xarray.Dataset] = attr.ib(default=xarray_open_dataset)
Copy link
Member Author

@vincentsarago vincentsarago Nov 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now the opener MUST be a callable that take 4 arguments:

  • src_path: str
  • group: Any
  • decode_times: bool
  • cache_client

we might change this in the future

* remove cache layer

* Update src/titiler/xarray/README.md

Co-authored-by: Aimee Barciauskas <[email protected]>

* add tile example

---------

Co-authored-by: Aimee Barciauskas <[email protected]>
Copy link
Contributor

@hrodmn hrodmn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all looks great, thanks for all of your work moving xarray support further upstream! I left a few small comments but nothing blocking.

...


def xarray_open_dataset( # noqa: C901
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so this is a fine default option for some applications, but it is easier to customize now because all the user has to do is write a new dataset_opener - nice!

Comment on lines +117 to +119
description="RasterIO resampling algorithm. Defaults to `nearest`.",
),
] = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the default behavior get set somewhere else or do we need to set it to something besides None here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it defaults to what rio-tiler has (nearest)

dataset_dependency: Type[DefaultDependency] = DatasetParams

# Tile/Tilejson/WMTS Dependencies (Not used in titiler.xarray)
tile_dependency: Type[DefaultDependency] = DefaultDependency
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this default to TileParams instead of DefaultDependency?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no because there is not buffer nor padding for the XarrayReader's tile method



@define(kw_only=True)
class TilerFactory(BaseTilerFactory):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But do we need to redefine all of the class attributes (e.g. stats_dependency) that are already defined in titiler.core.factory.TilerFactory?

@vincentsarago
Copy link
Member Author

But do we need to redefine all of the class attributes (e.g. stats_dependency) that are already defined in titiler.core.factory.TilerFactory?

@hrodmn for most of them we don't have to

Copy link
Member

@j08lue j08lue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed the io logic, please see comments below. Nothing blocking.

src/titiler/xarray/titiler/xarray/io.py Outdated Show resolved Hide resolved
src/titiler/xarray/titiler/xarray/io.py Outdated Show resolved Hide resolved
src/titiler/xarray/titiler/xarray/io.py Outdated Show resolved Hide resolved
if "x" not in da.dims and "y" not in da.dims:
try:
latitude_var_name = next(
x for x in ["lat", "latitude", "LAT", "LATITUDE", "Lat"] if x in da.dims
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say no. Dataset with other names will likely not be regular lat/lon grids and fail in other ways.

src/titiler/xarray/titiler/xarray/io.py Show resolved Hide resolved
@vincentsarago vincentsarago force-pushed the feature/add-xarray-package branch from da6ce9b to b69591c Compare November 6, 2024 12:29
Copy link
Member

@maxrjones maxrjones left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit confused by the pyramid and reference components of the code. Is the current plan to include those in this PR or save them for later development? fwiw I would suggest the latter since there's been movement in GeoZarr multiscales and virtualizarr/icechunk since the original extension was developed

src/titiler/xarray/pyproject.toml Show resolved Hide resolved
@vincentsarago
Copy link
Member Author

@maxrjones we've removed the reference part but kept the group option. There is no pyramid support per-say just some tests fixtures which is used to test the group option

"consolidated": False,
"backend_kwargs": {"consolidated": False},
}
)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@maxrjones oh you were talking about this! I think we can let this go 🙏

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll create a new issue to add kerchunk reference!

@j08lue j08lue linked an issue Nov 7, 2024 that may be closed by this pull request
@vincentsarago vincentsarago merged commit d33d60a into main Nov 7, 2024
10 checks passed
@vincentsarago vincentsarago deleted the feature/add-xarray-package branch November 7, 2024 10:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support for Google Cloud Storage URLs
4 participants