Commit

adds docs for open_mfdataset
lazarusA committed Dec 16, 2024
1 parent ada0f7b commit cfff98d
Showing 1 changed file with 91 additions and 6 deletions.
97 changes: 91 additions & 6 deletions docs/src/UserGuide/read.md
@@ -2,7 +2,17 @@

This section describes how to read files, URLs, and directories into YAXArrays and datasets.

## open_dataset

`open_dataset` is the usual entry point for reading any supported format. See its docstring for more information.

````@docs
open_dataset
````

Now, let's explore different examples.

### Read Zarr

Open a Zarr store as a `Dataset`:

@@ -23,7 +33,7 @@ Individual arrays can be accessed using subsetting:
ds.tas
````

### Read NetCDF

Open a NetCDF file as a `Dataset`:

@@ -55,7 +65,7 @@ end

This code ensures that the data is accessed by only one thread at a time, which makes the read effectively single-threaded but thread-safe.
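The one-thread-at-a-time pattern can be sketched without any file I/O. The block below is a minimal illustration, assuming the mechanism is a lock around each read; the names `data`, `data_lock`, and `partial_sums` are hypothetical, and a plain vector stands in for the NetCDF-backed array.

```julia
# Hypothetical stand-in: a plain vector plays the role of the file-backed data.
data = collect(1.0:100.0)
data_lock = ReentrantLock()
partial_sums = zeros(4)

Threads.@threads for chunk in 1:4
    idx = (25 * (chunk - 1) + 1):(25 * chunk)
    # Only one thread at a time may run the body of this `lock` block,
    # so reads from `data` never overlap.
    lock(data_lock) do
        partial_sums[chunk] = sum(data[idx])
    end
end

total = sum(partial_sums)
```

Because every read happens while the lock is held, the reads are serialized regardless of how many threads Julia was started with. Only work placed outside the locked block can still run in parallel.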

### Read GDAL (GeoTIFF, GeoJSON)

All GDAL compatible files can be read as a `YAXArrays.Dataset` after loading [ArchGDAL](https://yeesian.com/ArchGDAL.jl/latest/):

@@ -68,11 +78,11 @@ path = download("https://github.com/yeesian/ArchGDALDatasets/raw/307f8f0e584a39a
ds = open_dataset(path)
````

### Load data into memory

For datasets or variables that fit in RAM, you might want to load them completely into memory. This can be done using the `readcubedata` function. The example uses the NetCDF workflow; the same approach should apply to the other formats.

#### readcubedata

:::tabs

@@ -99,4 +109,79 @@ ds_loaded["tos"] # Load the variable of interest; the loaded status is shown for

:::

Note how the loading status changes from `loaded lazily` to `loaded in memory`.

## open_mfdataset

There are situations when we would like to open and concatenate a list of dataset paths along a certain dimension. For example, to concatenate a list of `NetCDF` files along a new `time` dimension, one can use:

::: details Creation of NetCDF files

````@example open_list_netcdf
using YAXArrays, NetCDF, Dates
using YAXArrays: YAXArrays as YAX
# Two consecutive three-day ranges: 2020-01-02..04 and 2020-01-05..07
dates_1 = [Date(2020, 1, 1) + Dates.Day(i) for i in 1:3]
dates_2 = [Date(2020, 1, 4) + Dates.Day(i) for i in 1:3]
# a1, a2: 2-D arrays (lon, lat) without a time dimension
a1 = YAXArray((lon(1:5), lat(1:7)), rand(5, 7))
a2 = YAXArray((lon(1:5), lat(1:7)), rand(5, 7))
# a3, a4: 3-D arrays that already carry a time dimension
a3 = YAXArray((lon(1:5), lat(1:7), YAX.time(dates_1)), rand(5, 7, 3))
a4 = YAXArray((lon(1:5), lat(1:7), YAX.time(dates_2)), rand(5, 7, 3))
# Write each array to its own NetCDF file
savecube(a1, "a1.nc")
savecube(a2, "a2.nc")
savecube(a3, "a3.nc")
savecube(a4, "a4.nc")
````
:::

### Along a new dimension

````@example open_list_netcdf
using YAXArrays, NetCDF, Dates
using YAXArrays: YAXArrays as YAX
import DimensionalData as DD
files = ["a1.nc", "a2.nc"]
dates_read = [Date(2024, 1, 1) + Dates.Day(i) for i in 1:2]
ds = open_mfdataset(DD.DimArray(files, YAX.time(dates_read)))
````

It is even possible to open files that already have a `time` dimension along a new `Time` dimension:

````@example open_list_netcdf
files = ["a3.nc", "a4.nc"]
ds = open_mfdataset(DD.DimArray(files, YAX.Time(dates_read)))
````

Opening along a new dimension name without specifying values also works; in that case the dimension values default to `1:length(files)`.

````@example open_list_netcdf
files = ["a1.nc", "a2.nc"]
ds = open_mfdataset(DD.DimArray(files, YAX.time))
````

### Along an existing dimension

Another use case is opening files along a dimension that they already share. In this case, `open_mfdataset` concatenates the files along the specified dimension:

````@example open_list_netcdf
using YAXArrays, NetCDF, Dates
using YAXArrays: YAXArrays as YAX
import DimensionalData as DD
files = ["a3.nc", "a4.nc"]
ds = open_mfdataset(DD.DimArray(files, YAX.time()))
````

Here the contents of the `time` dimension are the merged values from both files:

````@ansi open_list_netcdf
ds["time"]
````

Together, these modes give us a wide range of options to work with.
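To make the difference between the modes concrete, here is a minimal sketch using plain Julia arrays and `Base.cat`. It illustrates only how the shapes combine and says nothing about how `open_mfdataset` is implemented. The array names are stand-ins for the files created earlier, and the dimension order of the real result may differ.

```julia
# Stand-ins for the data in a1.nc ... a4.nc (shapes only; values are random).
a1 = rand(5, 7);    a2 = rand(5, 7)       # lon × lat
a3 = rand(5, 7, 3); a4 = rand(5, 7, 3)    # lon × lat × time

# New dimension: two 2-D arrays are stacked along an added third axis.
along_new = cat(a1, a2; dims=3)

# Existing dimension: the two time axes are joined end to end.
along_existing = cat(a3, a4; dims=3)

# New dimension on data that already has a time axis: a fourth axis is added.
along_new_4d = cat(a3, a4; dims=4)

size(along_new), size(along_existing), size(along_new_4d)
```

Concatenating along a new dimension adds an axis whose length equals the number of inputs, whereas concatenating along an existing dimension keeps the number of axes and sums the lengths along that axis.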
