
Newbettersaving #150

Merged
merged 41 commits into from Aug 19, 2022
Commits
41 commits
167e3d2
First iteration of new savedataset
meggart May 2, 2022
86cbd22
New model to save cubes
meggart May 4, 2022
d9b2892
Add some saving examples
meggart May 5, 2022
e81ea39
More docs
meggart May 5, 2022
e73767a
Even more examples
meggart May 6, 2022
191165e
All tests pass
meggart May 6, 2022
8ff1027
Add save tests
meggart May 6, 2022
83a191d
Bug in test
meggart May 6, 2022
c1c0556
Fixing test
meggart May 6, 2022
8225d1f
Penalize buffers larger than the dimlength
felixcremer May 19, 2022
06c9c3e
Add docs for savecube function
felixcremer May 19, 2022
5f8be60
fix
meggart Jun 3, 2022
41010e6
fix bugs
meggart Jun 27, 2022
2d8bdef
remove loadorgenerate for now
meggart Jun 27, 2022
eb9da50
Never allow missing when writing data
meggart Jul 20, 2022
362335c
Bunp YAXArrayBase version
meggart Jul 20, 2022
6f2965c
fix missings
meggart Jul 22, 2022
d09772f
Fix tests
meggart Jul 22, 2022
0b512b6
Remove unnecessary show statements
meggart Aug 16, 2022
e79bc8e
Require YAXArrayBase 0.6
meggart Aug 17, 2022
b017b33
Update docs/src/examples/Saving and rechunking.md
felixcremer Aug 17, 2022
4e5012a
Update docs/src/examples/Saving and rechunking.md
meggart Aug 17, 2022
ff383f3
Update docs/src/examples/Saving and rechunking.md
meggart Aug 17, 2022
bdf7e39
Update docs/src/examples/Saving and rechunking.md
meggart Aug 17, 2022
4ef9baa
Update docs/src/examples/Saving and rechunking.md
meggart Aug 17, 2022
d153278
Update docs/src/examples/Saving and rechunking.md
meggart Aug 17, 2022
d816b19
Update docs/src/examples/Saving and rechunking.md
meggart Aug 17, 2022
e84162e
Update src/Cubes/Cubes.jl
meggart Aug 17, 2022
85b3ca8
Update src/Cubes/Cubes.jl
meggart Aug 17, 2022
bb8469c
Update src/Cubes/Rechunker.jl
meggart Aug 17, 2022
7b30831
Update src/Cubes/Rechunker.jl
meggart Aug 17, 2022
9ab251e
Update src/Cubes/Rechunker.jl
meggart Aug 17, 2022
54d604d
Rename the skeleton_only keyword to skeleton
felixcremer Aug 18, 2022
b598825
Put the skeleton keyword back into append_dataset
felixcremer Aug 18, 2022
7ef3bb8
Use eachchunk(c) instead of c.chunks
felixcremer Aug 18, 2022
b1aba71
Use the max_cache from YAXDefaults in the Rechunker scripts
felixcremer Aug 18, 2022
c0adda1
Switch to explicit using for ProgressMeter
felixcremer Aug 18, 2022
b317239
Add test for non-missing append_dataset
felixcremer Aug 18, 2022
24ef5ed
Fully remove lines related to @loadorgenerate
felixcremer Aug 18, 2022
a34b149
Update src/DatasetAPI/Datasets.jl
meggart Aug 19, 2022
0ee0d8e
Update src/DatasetAPI/Datasets.jl
meggart Aug 19, 2022
5 changes: 3 additions & 2 deletions Project.toml
@@ -1,7 +1,7 @@
name = "YAXArrays"
uuid = "c21b50f5-aa40-41ea-b809-c0f5e47bfa5c"
authors = ["Fabian Gans <[email protected]>"]
version = "0.3.0"
version = "0.4.0"

[deps]
CFTime = "179af706-886a-5703-950a-314cd64e0468"
@@ -21,6 +21,7 @@ IterTools = "c8e1da08-722c-5040-9ed9-7db0dc04731e"
Markdown = "d6f4376e-aef5-505a-96c1-9c027394607a"
OffsetArrays = "6fe1bfb0-de20-5000-8ca7-80f57d26f881"
OnlineStats = "a15396b6-48d5-5d58-9928-6d29437db91e"
Optim = "429524aa-4258-5aef-a3af-852621145aeb"
OrderedCollections = "bac558e1-5e72-5ebc-8fee-abe8a469f55d"
ParallelUtilities = "fad6cfc8-4f83-11e9-06cc-151124046ad0"
ProgressMeter = "92933f4c-e287-5a05-a399-4b506db050ca"
@@ -55,5 +56,5 @@ Requires = "1"
StatsBase = "0.32, 0.33"
Tables = "0.2, 1.0"
WeightedOnlineStats = "0.3, 0.4, 0.5, 0.6"
YAXArrayBase = "0.4"
YAXArrayBase = "0.5"
julia = "1.6"
185 changes: 185 additions & 0 deletions docs/src/examples/Saving and rechunking.md
@@ -0,0 +1,185 @@
# Saving and Rechunking Datasets and YAXArrays

## Saving

### Saving a YAXArray to Zarr

One can save any `YAXArray` using the `savecube` function. Simply add a path as an argument and the cube will be saved.

````@jldoctest
julia> using YAXArrays, Zarr, NetCDF

julia> a = YAXArray(rand(10,20));

julia> f = tempname();

julia> savecube(a,f,driver=:zarr)
YAXArray with the following dimensions
Dim_1 Axis with 10 Elements from 1 to 10
Dim_2 Axis with 20 Elements from 1 to 20
Total size: 1.56 KB
````


In case the path name ends with ".zarr", the driver argument can be omitted.
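As a minimal sketch of relying on the extension (not a doctest; the printed output is omitted here):

````julia
using YAXArrays, Zarr

a = YAXArray(rand(10, 20))
# The ".zarr" suffix lets savecube choose the Zarr backend,
# equivalent to passing driver=:zarr explicitly.
savecube(a, tempname() * ".zarr")
````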

### Saving a YAXArray to NetCDF

Saving to NetCDF works exactly the same way. The `driver` argument can be omitted when the filename ends with ".nc"

````@jldoctest
julia> using YAXArrays, Zarr, NetCDF

julia> a = YAXArray(rand(10,20));

julia> f = tempname();

julia> savecube(a,f,driver=:netcdf)
YAXArray with the following dimensions
Dim_1 Axis with 10 Elements from 1 to 10
Dim_2 Axis with 20 Elements from 1 to 20
Total size: 1.56 KB
````

### Saving a Dataset

Saving Datasets can be done using the `savedataset` function.

````@jldoctest saveds
julia> using YAXArrays, Zarr

julia> ds = Dataset(x = YAXArray(rand(10,20)), y = YAXArray(rand(10)));

julia> f = tempname();

julia> savedataset(ds,path=f,driver=:zarr)
YAXArray Dataset
Dimensions:
Dim_2 Axis with 20 Elements from 1 to 20
Dim_1 Axis with 10 Elements from 1 to 10
Variables: x y
````

### Overwriting a Dataset

If a path already exists, an error will be thrown. Set `overwrite=true` to delete the existing dataset:

````@jldoctest saveds
julia> savedataset(ds,path=f,driver=:zarr, overwrite=true)
YAXArray Dataset
Dimensions:
Dim_2 Axis with 20 Elements from 1 to 20
Dim_1 Axis with 10 Elements from 1 to 10
Variables: x y
````

### Appending to a Dataset

New variables can be added to an existing dataset using the `append=true` keyword.

````@jldoctest
julia> ds2 = Dataset(z = YAXArray(rand(10,20,5)));

julia> savedataset(ds2,path=f,backend=:zarr,append=true);

julia> open_dataset(f, driver=:zarr)
YAXArray Dataset
Dimensions:
Dim_2 Axis with 20 Elements from 1 to 20
Dim_1 Axis with 10 Elements from 1 to 10
Dim_3 Axis with 5 Elements from 1 to 5
Variables: x z y
````

### Creating a Dataset without writing the actual data

Sometimes one merely wants to create a Dataset "skeleton" on disk and gradually fill it with data.
Here we create a Dataset and write only the axis data and array metadata, while no actual array data is
copied:

````@jldoctest
julia> using YAXArrays, Zarr

julia> a = YAXArray(zeros(Union{Missing, Int32},10,20))
YAXArray with the following dimensions
Dim_1 Axis with 10 Elements from 1 to 10
Dim_2 Axis with 20 Elements from 1 to 20
Total size: 800.0 bytes


julia> f = tempname();

julia> r = savecube(a,f,driver=:zarr,skeleton_only=true);

julia> all(ismissing,r[:,:])
true
````

The `skeleton_only` argument is also available for `savedataset`.
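A hedged sketch of the same pattern for a whole Dataset, assuming `savedataset` accepts the keyword under the same name (not a doctest):

````julia
using YAXArrays, Zarr

ds = Dataset(x = YAXArray(zeros(Union{Missing, Int32}, 10, 20)))
f = tempname()
# Writes only axes and metadata; the array chunks remain unwritten,
# so reading the variable back should yield all-missing values.
savedataset(ds, path=f, driver=:zarr, skeleton_only=true)
````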

## Rechunking

### Saving a YAXArray with user-defined chunks

To determine the chunk size of the array representation on disk, call the `setchunks` function prior to saving:

````@jldoctest chunks1
julia> using YAXArrays, Zarr, NetCDF

julia> a = YAXArray(rand(10,20));

julia> f = tempname();

julia> a_chunked = setchunks(a,(5,10));

julia> savecube(a_chunked,f,backend=:zarr);

julia> Cube(f).chunks
2×2 DiskArrays.GridChunks{2}:
(1:5, 1:10) (1:5, 11:20)
(6:10, 1:10) (6:10, 11:20)
````

Alternatively chunk sizes can be given by dimension name, so the following results in the same chunks:

````@jldoctest chunks1
a_chunked = setchunks(a,(Dim_2=10, Dim_1=5));
````
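To check that the named form produces the same chunk grid, one can save it and compare against the tuple-based result from above (a sketch reusing `a` and `f` from the previous snippets; not a doctest):

````julia
a_named = setchunks(a, (Dim_2=10, Dim_1=5))
f2 = tempname()
savecube(a_named, f2, backend=:zarr)
# Expected to match the 2×2 grid of (5,10)-sized chunks shown above:
Cube(f2).chunks == Cube(f).chunks
````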

## Rechunking Datasets

### Set Chunks by Axis

Set the chunk size for each axis occurring in a dataset. This will be applied to all variables in the dataset:

````@jldoctest
using YAXArrays, Zarr
ds = Dataset(x = YAXArray(rand(10,20)), y = YAXArray(rand(10)), z = YAXArray(rand(10,20,5)));
dschunked = setchunks(ds,Dict("Dim_1"=>5, "Dim_2"=>10, "Dim_3"=>2));
f = tempname();
savedataset(dschunked,path=f,driver=:zarr)
````

### Set chunking by Variable

The following will set the chunk size for each variable separately and results in exactly the same chunking as the example above:

````@jldoctest
using YAXArrays, Zarr
ds = Dataset(x = YAXArray(rand(10,20)), y = YAXArray(rand(10)), z = YAXArray(rand(10,20,5)));
dschunked = setchunks(ds,(x = (5,10), y = Dict("Dim_1"=>5), z = (Dim_1 = 5, Dim_2 = 10, Dim_3 = 2)));
f = tempname();
savedataset(dschunked,path=f,driver=:zarr)
````

### Set chunking for all variables

The following code snippet only works when all member variables of the dataset have the same shape and sets the output chunks for all arrays.

````@jldoctest
using YAXArrays, Zarr
ds = Dataset(x = YAXArray(rand(10,20)), y = YAXArray(rand(10,20)), z = YAXArray(rand(10,20)));
dschunked = setchunks(ds,(5,10));
f = tempname();
savedataset(dschunked,path=f,driver=:zarr)
````
80 changes: 59 additions & 21 deletions src/Cubes/Cubes.jl
@@ -3,7 +3,7 @@ The functions provided by YAXArrays are supposed to work on different types of c
Data types that
"""
module Cubes
using DiskArrays: DiskArrays, eachchunk, approx_chunksize, max_chunksize, grid_offset
using DiskArrays: DiskArrays, eachchunk, approx_chunksize, max_chunksize, grid_offset, GridChunks
using Distributed: myid
using Dates: TimeType
using IntervalSets: Interval, (..)
@@ -14,7 +14,7 @@ import YAXArrayBase: getattributes, iscontdim, dimnames, dimvals, getdata
using DiskArrayTools: CFDiskArray
using DocStringExtensions

export concatenatecubes, caxes, subsetcube, readcubedata, renameaxis!, YAXArray
export concatenatecubes, caxes, subsetcube, readcubedata, renameaxis!, YAXArray, setchunks

"""
This function calculates a subset of a cube's data
@@ -88,12 +88,13 @@ It can wrap normal arrays or, more typically DiskArrays.
* `axes` a `Vector{CubeAxis}` containing the Axes of the Cube
* `data` N-D array containing the data
"""
struct YAXArray{TypeOfData,NumberOfAxes,A<:AbstractArray{TypeOfData,NumberOfAxes},AxesTypes}
struct YAXArray{T,N,A<:AbstractArray{T,N},AxesTypes}
axes::AxesTypes
data::A
properties::Dict{String}
chunks::GridChunks{N}
cleaner::Vector{CleanMe}
function YAXArray(axes, data, properties, cleaner)
function YAXArray(axes, data, properties, chunks, cleaner)
if ndims(data) != length(axes) # case: mismatched Arguments
throw(
ArgumentError(
@@ -106,22 +107,28 @@ struct YAXArray{TypeOfData,NumberOfAxes,A<:AbstractArray{TypeOfData,NumberOfAxe
"Can not construct YAXArray, supplied data size is $(size(data)) while axis lenghts are $(ntuple(i->length(axes[i]),ndims(data)))",
),
)
else # case: create new YAXArray
elseif ndims(chunks) != ndims(data)
throw(ArgumentError("Can not construct YAXArray, supplied chunk dimension is $(ndims(chunks)) while the number of dims is $(length(axes))"))
else
return new{eltype(data),ndims(data),typeof(data),typeof(axes)}(
axes,
data,
properties,
chunks,
cleaner,
)
end
end
end
YAXArray(axes, data, properties=Dict{String,Any}(); cleaner=CleanMe[]) =
YAXArray(axes, data, properties, cleaner)

YAXArray(axes, data, properties = Dict{String,Any}(); cleaner = CleanMe[], chunks = eachchunk(data)) =
YAXArray(axes, data, properties, chunks, cleaner)
YAXArray(axes,data,properties,cleaner) = YAXArray(axes,data,properties,eachchunk(data),cleaner)
function YAXArray(x::AbstractArray)
ax = caxes(x)
props = getattributes(x)
return YAXArray(ax, x, props)
chunks = eachchunk(x)
YAXArray(ax, x, props,chunks=chunks)
end

# Base utility overloads
@@ -146,19 +153,15 @@ function Base.propertynames(a::YAXArray, private::Bool=false)
(axsym.(caxes(a))..., :axes, :data)
end
end
Base.ndims(a::YAXArray{<:Any,NumberOfAxes}) where {NumberOfAxes} = NumberOfAxes
Base.eltype(a::YAXArray{TypeOfData}) where {TypeOfData} = TypeOfData
# really needed? it sounds like bad performance to permute the raw data?
Base.permutedims(c::YAXArray, p) =
YAXArray(caxes(c)[collect(p)], permutedims(getdata(c), p), c.properties, c.cleaner)
Base.getindex(x::YAXArray, i...) = getdata(x)[i...]

"""
caxes(x)

returns the axes of a cube
"""
#TODO: is the general version really needed?
Base.ndims(a::YAXArray{<:Any,N}) where {N} = N
Base.eltype(a::YAXArray{T}) where {T} = T
function Base.permutedims(c::YAXArray, p)
newaxes = caxes(c)[collect(p)]
newchunks = DiskArrays.GridChunks(c.chunks.chunks[collect(p)])
YAXArray(newaxes, permutedims(getdata(c), p), c.properties, newchunks, c.cleaner)
end
function caxes(x)
map(enumerate(dimnames(x))) do a
index, symbol = a
@@ -177,9 +180,41 @@ function readcubedata(x)
YAXArray(collect(CubeAxis, caxes(x)), getindex_all(x), getattributes(x))
end

cubechunks(c) = approx_chunksize(eachchunk(getdata(c)))
interpret_cubechunks(cs::NTuple{N,Int},cube) where N = DiskArrays.GridChunks(cube.data,cs)
interpret_cubechunks(cs::DiskArrays.GridChunks,_) = cs
interpret_dimchunk(cs::Integer,s) = DiskArrays.RegularChunks(cs,0,s)
interpret_dimchunk(cs::DiskArrays.ChunkType, _) = cs

function interpret_cubechunks(cs,cube)
oldchunks = DiskArrays.eachchunk(cube).chunks
for k in keys(cs)
i = findAxis(k,cube)
if i !== nothing
dimchunk = interpret_dimchunk(cs[k],size(cube.data,i))
oldchunks = Base.setindex(oldchunks,dimchunk,i)
end
end
GridChunks(oldchunks)
end

"""
setchunks(c::YAXArray,chunks)

Resets the chunks of a YAXArray and returns a new YAXArray. Note that this will not change the chunking of the underlying data itself,
it will just make the data "look" like it had a different chunking. If you need a persistent on-disk representation
of this chunking, use `savecube` on the resulting array. The `chunks` argument can take one of the following forms:

- a `DiskArrays.GridChunks` object
- a tuple specifying the chunk size along each dimension
- an AbstractDict or NamedTuple mapping one or more axis names to chunk sizes

"""
setchunks(c::YAXArray,chunks) = YAXArray(c.axes,c.data,c.properties,interpret_cubechunks(chunks,c),c.cleaner)
cubechunks(c) = approx_chunksize(c.chunks)
DiskArrays.eachchunk(c) = c.chunks
getindex_all(a) = getindex(a, ntuple(_ -> Colon(), ndims(a))...)
chunkoffset(c) = grid_offset(eachchunk(getdata(c)))
Base.getindex(x::YAXArray, i...) = getdata(x)[i...]
chunkoffset(c) = grid_offset(c.chunks)

# Implementation for YAXArrayBase interface
YAXArrayBase.dimvals(x::YAXArray, i) = caxes(x)[i].values
@@ -312,6 +347,9 @@ function show_yax(io::IO, c)
println(io, "Total size: ", formatbytes(cubesize(c)))
end



include("TransformedCubes.jl")
include("Slices.jl")
include("Rechunker.jl")
end #module