Usability of this package and the JuliaGeo ecosystem #1

Open
mkborregaard opened this issue Jan 29, 2019 · 62 comments

Comments

@mkborregaard
Owner

Ported from Slack. Me:

Over the last week or two I've been working on a very simple raster implementation in pure Julia. The reason for doing that is that when I had a workshop earlier this month, with 30 very interested biogeographers at a conference, I asked them what it would take for them to move to Julia. And they almost all said "convenient raster support". And it is my impression that that support is not quite there yet in Julia. Of course you can do amazing things with ArchGDAL, but in truth almost all of my work with rasters involves getting data out of them and using that for analyses, and some plotting.

So, I designed an analysis in R: https://www.dropbox.com/s/n8790r0s63d56hm/R%2Bworked%2Braster%2Bexample.html?dl=0 and set out to create a package that could replicate it in Julia. The result is https://github.com/mkborregaard/VerySimpleRasters.jl and the replicated analysis is here: https://www.dropbox.com/s/d5mc52corz5hm7s/VerySimpleRasters%2Bexample.html?dl=0

All operations happen on disk, out of memory, and the script is very fast even for the relatively big raster in the example (except the predictable waiting time on plots).

There's still lots to do. So far it uses R's native raster format as its own native format, and supports importing ESRI ASCII grids to that format. Other formats could possibly be provided by GDAL in some way, like the old RasterIO package?

I'd really love input: how to make this consistent with GeoInterface, whether it is worth continuing in this direction, or whether I have overlooked some obvious existing capabilities, etc.

For anyone wanting to play with the package, notebooks and data, I’ve collected everything in a nice folder here: https://www.dropbox.com/sh/4shiomz5gj4ggx5/AAASQv_K5p2OadhwL-CdRuEja?dl=0

cc @yeesian @visr @c42f

@mkborregaard
Owner Author

Then @yeesian:
Yeesian Ng [5:18 AM]
I think the best way forward is to create the package you want to be using for your daily needs, and to just start using it. And feel free to evangelize it to people if it’s something you feel like maintaining and supporting. @juliohm has been developing and maintaining GeoStats.jl successfully for instance. It could be that parts of the JuliaGeo ecosystem may mature and make things easier in the future, but there’s no timeline for it (since there’s no consistent developmental effort that we can count on). So IMO people shouldn’t refrain from working on their own packages for processing rasters/etc in the meantime.
I am interested in the questions you’re bringing up too, e.g. when you write: “Here are a number of issues. First of all, VerySimpleRasters doesn’t know about Shapefiles. It just knows about polygons, and that they are a vector of Tuples. Maybe it would be useful for Shapefiles to define a polygon in accordance with GeometryTypes? Maybe GeometryTypes could satisfy the Geointerface, or the other way around? So, first I have to extract the points as a Tuple. Then, when I plot it, I see there is a problem: There are actually two polygons here, but they are treated as one. Since this breaks my masking code (but not R’s), I have to define a function to split the Polygons whenever a line returns to the starting point.”
Ideally GeoInterface should have a sufficiently granular interface for others to use. The most common comprehensive one that I know of is https://www.opengeospatial.org/standards/sfa. But it will be hard to provide and maintain implementations for the full spec upfront, and it’ll be more reasonable for us to build up to it as we go (according to demonstrated need).
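As an illustration, a minimal sketch of the ring-splitting workaround quoted above, assuming the polygon arrives as a flat vector of (x, y) tuples; split_rings is a hypothetical helper, not part of Shapefile.jl or VerySimpleRasters.jl:

# Split a flat vector of (x, y) tuples into separate rings whenever the
# coordinates return to the current ring's starting point.
function split_rings(points::Vector{<:Tuple})
    rings = Vector{typeof(points)}()
    start = 1
    for i in eachindex(points)
        if i > start && points[i] == points[start]
            push!(rings, points[start:i])  # ring closed: emit it
            start = i + 1                  # next ring begins after the closure
        end
    end
    start <= length(points) && push!(rings, points[start:end])  # any trailing open ring
    return rings
end

split_rings([(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 0.0),
             (2.0, 2.0), (3.0, 2.0), (2.0, 2.0)])  # returns two separate rings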

@mkborregaard
Owner Author

@yeesian:
In yeesian/ArchGDAL.jl#76, it is now:

julia> using ArchGDAL
ArchGDAL

julia> dataset = ArchGDAL.read("temperature/bio1.bil")
GDAL Dataset (Driver: MEM/In Memory Raster)
File(s):

Dataset (width x height): 2160 x 900 (pixels)
Number of raster bands: 1
[GA_Update] Band 1 (Undefined): 2160 x 900 (Int16)


julia> ArchGDAL.read(dataset, 1) # reads the first (and only) band
2160×900 Array{Int16,2}:
-9999  -9999  -9999  -9999  -9999  -9999  …  -9999  -9999  -9999  -9999  -9999  -9999
-9999  -9999  -9999  -9999  -9999  -9999     -9999  -9999  -9999  -9999  -9999  -9999
-9999  -9999  -9999  -9999  -9999  -9999     -9999  -9999  -9999  -9999  -9999  -9999
    ⋮                                     ⋱                                       ⋮
-9999  -9999  -9999  -9999  -9999  -9999     -9999  -9999  -9999  -9999  -9999  -9999
-9999  -9999  -9999  -9999  -9999  -9999  …  -9999  -9999  -9999  -9999  -9999  -9999

So the main thing it saves you is not having to write the driver code for parsing the file format. Alternatively, you don't have to check out that PR, and can do the same (more efficiently) with

using ArchGDAL

raster = ArchGDAL.read("temperature/bio1.bil") do dataset
    ArchGDAL.read(dataset, 1)
end

but I don't think it'll be as fast as the mmap in

function VerySimpleRaster(grdfile::String)
    @assert grdfile[end-3:end] ∈ (".grd", ".gri")
    fname = grdfile[1:end-4]
    data = read_grd(fname*".grd")
    ncols = parse(Int, data["nrows"])
    nrows = parse(Int, data["ncols"])
    dt = datatype_translation[data["datatype"]]
    fname = abspath(fname*".gri")
    mat = open(fname, "r") do IO
        Mmap.mmap(IO, Matrix{dt}, (nrows, ncols))
    end
    xmin, xmax = parse(Float64, data["xmin"]), parse(Float64, data["xmax"])
    ymin, ymax = parse(Float64, data["ymin"]), parse(Float64, data["ymax"])
    celly = (ymax - ymin) / ncols
    cellx = (xmax - xmin) / nrows
    VerySimpleRaster(mat, parse(dt, data["nodatavalue"]), xmin, ymin, cellx, celly, data["projection"], fname)
end

@mkborregaard
Owner Author

Me:
So could these things be unified? I mean, could a solution be to have GDAL open an mmapped on-disk matrix with metadata (essentially what a VerySimpleRaster is)? That would allow us to close the file again (making the operation safe and avoiding the function syntax), and also to treat the raster as a completely normal Julia Matrix (giving access to e.g. the entire Images stack for analysis). And if it is necessary to use GDAL to work directly on the on-disk representation, the VerySimpleRaster format keeps the path to the input file, so functions on VerySimpleRaster could open the file, do the operation with GDAL, then close it again, creating a seamless on-disk experience.

@mkborregaard
Owner Author

@visr :
The thing with GDAL and mmap is that GDAL supports many formats, and only a few of them have binary blocks that can be fully memory mapped. Some formats are ASCII, or have compression, tiling etc., so their on-disk representation is not the same as their in-memory representation.
This is why the ArchGDAL example shows (Driver: MEM/In Memory Raster).

It's the reason we need GDAL as an abstraction layer if we want to support so many formats.
That of course leaves room for packages like VerySimpleRasters that only support some formats, for memory-mapping reasons.

@mkborregaard
Owner Author

Great! And yes agreed, I’d be perfectly content to support the mmappable ones natively and provide a conversion for the others (via GDAL most likely)

@yeesian

yeesian commented Jan 29, 2019

GDAL allows you to bring your own format-specific drivers, but we (JuliaGeo and/or OSGeo) are not at the point where we can let you easily swap a native Julia implementation of a driver into GDAL. Nonetheless, if you're interested, here is their tutorial. Their code is X/MIT-licensed, so you can see how they currently do it. E.g. here's open() for ehdr.

Is it really so slow that you need to be implementing your own driver though? You can always file an issue upstream in the GDAL repository if so: it might benefit GDAL users in R/Python/Rust/etc too.

@mkborregaard
Owner Author

mkborregaard commented Jan 29, 2019

No, it's not that it's slow; it's that I want to avoid the clunky interface that GDAL gives you. Julia's AbstractMatrix stack is so strong that I think much can be said for having a direct Julia representation of rasters, with the option of calling GDAL on them once you need something advanced. In my practical work I would never need all the capabilities of GDAL (in fact none, other than opening different raster formats and possibly reprojection), so for me it's overkill to have to use registerdrivers() do etc. for simple things.

@yeesian

yeesian commented Jan 29, 2019

No, it's not that it's slow; it's that I want to avoid the clunky interface that GDAL gives you. [...] so for me it's overkill to have to use registerdrivers() do etc. for simple things.

Just to be clear: that's just a limitation of an existing Julia wrapper, which I'm actively working to address in yeesian/ArchGDAL.jl#76, and not a fundamental limitation of working with GDAL.

You can still easily read in the raster data (see the examples in #1 (comment)) and have access to all the other metadata and functionality for free via GDAL.

@yeesian

yeesian commented Jan 29, 2019

No it's not that it's slow

The only reason to be parsing the grd file yourself, like in

function VerySimpleRaster(grdfile::String)
    @assert grdfile[end-3:end] ∈ (".grd", ".gri")
    fname = grdfile[1:end-4]
    data = read_grd(fname*".grd")
    ncols = parse(Int, data["nrows"])
    nrows = parse(Int, data["ncols"])
    dt = datatype_translation[data["datatype"]]
    fname = abspath(fname*".gri")
    mat = open(fname, "r") do IO
        Mmap.mmap(IO, Matrix{dt}, (nrows, ncols))
    end
    xmin, xmax = parse(Float64, data["xmin"]), parse(Float64, data["xmax"])
    ymin, ymax = parse(Float64, data["ymin"]), parse(Float64, data["ymax"])
    celly = (ymax - ymin) / ncols
    cellx = (xmax - xmin) / nrows
    VerySimpleRaster(mat, parse(dt, data["nodatavalue"]), xmin, ymin, cellx, celly, data["projection"], fname)
end
is if you deem the corresponding GDAL driver "too slow", which doesn't seem to be the case for you.

@mkborregaard
Owner Author

mkborregaard commented Jan 29, 2019

Nice. So, just to understand: do you think there's no scope for the pure-Julia mmap approach once you're further along with ArchGDAL?

By the way,

The only reason to be parsing the grd file yourself

another reason for doing this is that this package then doesn't need a dependency on GDAL. The GDAL integration could live somewhere else.

@mkborregaard
Owner Author

mkborregaard commented Jan 29, 2019

I thought you wrote that GDAL always loads it into memory? The thing is, I want a handle on the raster contents, on disk, that I can access directly (e.g. via getindex); I don't care about which driver.
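For illustration, a rough sketch of that kind of direct handle: wrapping a memory-mapped matrix in an AbstractMatrix so getindex works while the data stays on disk. DiskRaster is a hypothetical name, not an existing type in any of the packages discussed here.

using Mmap

struct DiskRaster{T} <: AbstractMatrix{T}
    data::Matrix{T}   # Mmap-backed matrix, so the contents stay on disk
    path::String      # path to the underlying file, e.g. for later GDAL round-trips
end

function DiskRaster(path::String, ::Type{T}, dims::Tuple{Int,Int}) where {T}
    mat = open(path, "r") do io
        Mmap.mmap(io, Matrix{T}, dims)
    end
    DiskRaster{T}(mat, abspath(path))
end

# Minimal AbstractArray interface: size and getindex delegate to the mmapped data.
Base.size(r::DiskRaster) = size(r.data)
Base.getindex(r::DiskRaster, I...) = getindex(r.data, I...)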

@yeesian

yeesian commented Jan 29, 2019

I thought you wrote that GDAL always loads it into memory?

That's just my own implementation of it in https://github.com/yeesian/ArchGDAL.jl/blob/d5ee1fd0c7b6b3932178d1696ac3a6284fab48dd/src/dataset.jl#L259-L271.

GDAL doesn't load data into memory by default. But you might run into nasty memory bugs later, e.g. like in JuliaGeo/LibGEOS.jl#58.

@mkborregaard
Owner Author

So, just to understand what you suggest exactly: scope for separate packages? Integration? Abandon this as soon as ArchGDAL finishes the developments currently being merged?

@yeesian

yeesian commented Jan 29, 2019

There are currently GeoJSON.jl and Shapefile.jl, which natively read the geojson and shapefile formats. They are vector data formats, so we borrowed an "informal" interface from Python for https://github.com/JuliaGeo/GeoInterface.jl to get some degree of unification across those packages.

There is a discussion (which you're involved in) on a similar kind of interface for rasters. JuliaGeo has NetCDF.jl too, so it could help e.g. unify plotting for rasters across NetCDF, "R Raster and EHdr" (which this package seems to be oriented around; see https://www.gdal.org/frmt_various.html and the references therein for details) and Arch/GDAL. But I don't work enough with rasters, or know enough about them, to have an informed opinion about the direction it (the raster interface) might take.

@rafaqz

rafaqz commented May 27, 2019

I found this while thinking about writing something similar myself. I agree entirely that better raster handling is pretty fundamental to doing biogeography in Julia!

I feel that we need a RasterBase package that defines AbstractRaster and methods for working with coordinates and projections. I'm writing a bunch of packages for spatial simulations and this should be a basic type for all of them... It would also be great if my simulation packages kept spatial metadata bundled with matrices, so we can remove the need for passing in spatial parameters, and just plot the results with a recipe.

VerySimpleRaster seems like a good demonstration of this, but it would be great to abstract the interface, then provide integrations for NetCDF, ArchGDAL and similar packages.

I've read some of the other threads talking about something like this. Is there a plan for something similar already in progress, or should I just go ahead and implement something?
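To make the RasterBase idea above concrete, a sketch of what such a package might define; all names here are hypothetical and only illustrate the shape of the interface.

abstract type AbstractRaster{T,N} <: AbstractArray{T,N} end

# Methods a concrete raster type would implement:
function bbox end         # bbox(r): (xmin, ymin, xmax, ymax)
function cellsize end     # cellsize(r): (cellx, celly)
function crs end          # crs(r): projection / coordinate reference system
function nodatavalue end  # nodatavalue(r): sentinel value for empty cells

# Generic functionality then falls out of the interface, e.g. mapping a
# geographic coordinate to an array index:
function coord2index(r::AbstractRaster, x, y)
    xmin, ymin, _, _ = bbox(r)
    cx, cy = cellsize(r)
    return floor(Int, (x - xmin) / cx) + 1, floor(Int, (y - ymin) / cy) + 1
end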

@mkborregaard
Owner Author

I'd happily depend on and adhere to an abstract interface. I think it's just a question of doing it. In addition to @yeesian, it'd probably be useful to speak to @evetion, who has GeoRasters.

An alternative (possibly better) would be to put the abstract interface here, with the concrete implementation in this package as a fallback for the lightest case? This package has essentially 0 dependencies (well, except RecipesBase, but that's essentially a stdlib), to make it possible for more advanced implementations to depend on it at 0 cost. Happy to change this package to accommodate.

BTW, there's also some discussion in JuliaGeo/GeoInterface.jl#16, as any abstract interface should be compatible with GeoInterface IMHO.

@rafaqz

rafaqz commented May 27, 2019

Yeah I was also thinking the interface should include a concrete component for the most basic case.

I'll read through those threads too. Unfortunately I'm not that experienced working with rasters and spatial data; I've mostly written models and frameworks that handle matrices, and let others do the actual raster manipulation in R! But that's a short-term solution that needs attention soon. I'll put a small base package together in a few weeks (after thesis hand-in!), but feedback will be really useful.

Aren't there HDF5 and Mmap deps in this package? I was thinking it should be much more lightweight than that.

@mkborregaard
Owner Author

mkborregaard commented May 27, 2019

Mmap is part of the base Julia distribution, so it's not a dep, just like Dates. HDF5 isn't used; it's just in the Project.toml because I was experimenting with it.
Before you start building an alternative package, do consider whether we can do it in the context of this one. Happy to give you push rights if you're interested in collaborating.

@rafaqz

rafaqz commented May 27, 2019

Sure! If you're open to that I can work here.

I noticed you mentioned the problem of time as well in the other thread. It's a common problem for me too. I also have to deal with the vertical dimension of layered environmental data, so it would be good to define common ways of accessing additional temporal and spatial dimensions. I was going to write a MicroclimateBase package specifically for this, but it would be good if that could depend on a more general base package and just add methods for the specific environmental variables, instead of the whole framework for data access.

@mkborregaard
Owner Author

Great, you should have push rights now. And yes, good point on the time dimension.

@juliohm

juliohm commented May 27, 2019 via email

@rafaqz

rafaqz commented May 27, 2019

Great! NetCDF integration is something I need too.

The packages I'm developing that are specifically related are Dispersal.jl and Cellular.jl, for general cellular-automata-style models and for dispersal modelling with heterogeneous spatial data and various dispersal and growth processes. GrowthRates.jl provides simple growth-rate grids for dispersal modelling from raster climate data (from HDF5), and Microclimate.jl provides common microclimate data, from NetCDF files, for more complex species distribution modelling. They are all in various stages of development, and all need more generalised interfaces and further abstraction of the NetCDF and HDF5 data sources. Some kind of AbstractRaster seems to be a key component, although it's taken me a while to realise that.

Honestly, I don't really understand what GeoStats and ClimateTools do! There's a bit of a mismatch with the applications and language we use in ecology/biogeography.
Browsing the code, it seems like AbstractSpatialData in GeoStatsBase could be what we are looking for? Although the T<:Real limit would be a problem for me (Unitful).

I see ClimateTools.jl deals with 3D and 4D data, which is something I specifically need for working with microclimates and am currently handling in ad-hoc ways. The ClimGrid type also seems to have a lot of what we need for biogeography, but also a lot more that we don't need! But inheriting from some shared abstract type would be good.

@mkborregaard
Owner Author

@rafaqz if you're working on Species Distribution Models you may also want to check out @tpoisot 's work here: https://github.com/EcoJulia/SimpleSDMLayers.jl (just started last week) and here: https://gitlab.com/tpoisot/BioClim

@mkborregaard
Owner Author

After looking at your repos, in general you might actually be interested in the EcoJulia organisation: the mission is to create a framework for various aspects of community ecology and geographical ecology in Julia.

@rafaqz

rafaqz commented May 27, 2019

I've been thinking about EcoJulia for a while! Was waiting until I had a few things to contribute :)

I mostly develop mechanistic SDMs, dispersal models and biophysical modelling tools (also see my other works in progress Photosynthesis.jl and DynamicEnergyBudgets.jl)

But this means my data requirements are often a lot more fine-grained spatially and temporally than the WorldClim-style data in SimpleSDMLayers. But both of those packages seem like they would be good candidates for incorporation into a more abstract approach to raster data?

@juliohm

juliohm commented May 27, 2019

The AbstractSpatialData type in GeoStatsBase.jl covers all spatial data types in the docs: https://juliohm.github.io/GeoStats.jl/stable/spatialdata/

I did put the T<:Real restriction there without thinking about units. But we can surely extend the type to any T and handle units more gracefully. I have to check out the units packages in Julia first; I am not using them yet.

@rafaqz

rafaqz commented May 27, 2019

Ok great, that would probably be a good type to inherit from then, although I'll have to get my head around your package more.

(I like to add units whenever I load NetCDF data with known units into memory, so that everything is automatically correct after that.)

@mkborregaard
Owner Author

IMHO the priority should be to be compatible with / inherit from the JuliaGeo types. Your types are just compatible among your own packages, @juliohm, right?

@rafaqz

rafaqz commented May 27, 2019

One issue for me is the "Stats" part of GeoStatsBase!

I rarely do statistical modelling, and stats tooling complicates the package considerably, both for contributing to it and for orienting the ecosystem of (non-statistical) tools I would be building around it. I was imagining something more like a GeoBase, with no deps at all: just an abstract interface for basic spatial types.

@rafaqz

rafaqz commented May 27, 2019

Ok, I still don't have my head around how JuliaGeo and all these packages relate!

Should we be aiming to produce some common basis for all (or as many as possible) of these use cases?

@rafaqz

rafaqz commented May 27, 2019

I agree on leaving Missing up to implementations; sometimes I specifically disallow missing in modelling applications, as it can be much slower. (Edit: and you specifically can't use missing in dispersal simulations unless you want to output a whole grid of missings. But I still want spatial metadata to pass through to the output array...)
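A rough illustration of the performance point (assumed, not benchmarked here): a Union{Missing,Float64} eltype adds branching in tight loops, whereas a concrete eltype with a nodata sentinel, as rasters typically use, does not.

a = rand(10_000_000)                              # concrete Float64 eltype
b = convert(Vector{Union{Missing,Float64}}, a)    # Union eltype that allows missing

# Sentinel-based "skip nodata" on a concrete eltype.
sum_skipnodata(x, nodata) = sum(v for v in x if v != nodata)

sum(skipmissing(b))          # must branch on the Union eltype
sum_skipnodata(a, -9999.0)   # plain Float64 loop with a nodata sentinel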

@mkborregaard
Owner Author

mkborregaard commented May 27, 2019

Yeah, sounds good. Also, many of the functions I define here are essentially part of a minimal interface, I'd think, e.g. getindex, extract, crop, mask, bbox and in. Maybe even aggregate, though that's not as minimal?
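A sketch of what that minimal interface could look like as bare generic functions (illustrative only; AbstractRaster stands in for whatever abstract type is agreed on):

function extract end    # extract(r, points): raster values at a vector of coordinates
function crop end       # crop(r, bounds): subset to a bounding box
function mask end       # mask(r, polygon): nodata outside the polygon
function bbox end       # bbox(r): spatial extent
function aggregate end  # aggregate(r, factor, f): coarsen resolution (perhaps not minimal)

# getindex and in extend Base rather than defining new generics, e.g.
# Base.getindex(r::AbstractRaster, i, j) and Base.in(point::Tuple, r::AbstractRaster).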

@rafaqz

rafaqz commented May 27, 2019

It might be good to add some kind of abstract type hierarchy to deal with additional spatial and temporal dimensions, clarifying what N (in AbstractGeoArray{N,T}) means for any particular dataset in a standard way, and handling 3D and 4D datasets. I'm mostly using 4D microclimates (sometimes 5D), so this is particularly important to me.

ClimateTools.jl deals with 3 or 4 dimensions like this, so N is 3 or 4. Time is always the last dimension, the first two dimensions are lon and lat, and the third dimension is elevation for 4D:

  if ndims(data) == 3
    # Convert data to AxisArray
    dataOut = AxisArray(data, Axis{Symbol(lonname)}(lon_raw), Axis{Symbol(latname)}(lat_raw), Axis{:time}(timeV))
  elseif ndims(data) == 4 # this implies a 3D field (height component)
    # Get level vector
    plev = ds["plev"][:] # NetCDF.ncread(file, "plev")
    # Convert data to AxisArray
    dataOut = AxisArray(data, Axis{Symbol(lonname)}(lon_raw), Axis{Symbol(latname)}(lat_raw), Axis{:plev}(plev), Axis{:time}(timeV))
  else
    throw(error("load takes only 3D and 4D variables for the moment"))
  end
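One way the proposed dimension hierarchy could look, sketched with hypothetical names: tag each axis with a type rather than relying on a fixed position, so 3-, 4- or 5-dimensional data can all declare their dimensions explicitly.

abstract type AbstractDimension{T} end

struct Lon{T}  <: AbstractDimension{T}; val::T end
struct Lat{T}  <: AbstractDimension{T}; val::T end
struct Vert{T} <: AbstractDimension{T}; val::T end   # elevation / pressure level
struct Time{T} <: AbstractDimension{T}; val::T end

# A dataset then carries its dimensions as a tuple, in whatever order it was stored:
dims3d = (Lon(0.0:0.5:359.5), Lat(-90.0:0.5:90.0), Time(1:12))
dims4d = (Lon(0.0:0.5:359.5), Lat(-90.0:0.5:90.0), Vert([1000.0, 850.0, 500.0]), Time(1:12))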

@mkborregaard
Owner Author

@juliohm just to be explicit, I'm 100% in favour of integration across the Julia ecosystem, and I think the ecosystem you're singlehandedly building is impressive. I'd just prefer this happened through shared compatibility with the JuliaGeo org. Does this make sense to you?

@juliohm

juliohm commented May 27, 2019 via email

@tpoisot

tpoisot commented May 27, 2019

Would love to attend the meeting - as @mkborregaard said, we are working on SimpleSDMLayers for rapid development of bioclim-type models, but there are a lot of niches to fill here. We can decide to work with compatible types, or decide on a common interface to implement for all types, or both. These are not trivial decisions, and the more people express their opinions, the better.

@rafaqz

rafaqz commented May 28, 2019

@juliohm I'm keen to attend such a meeting. It would be good to delay it for two weeks, so that there is time to learn enough about everyone's packages and write a simple API package to have as a discussion point? (I also need to finish my thesis this week.)

@mkborregaard The number of use cases here makes me think we need a very simple package without an implementation, just for clarity. So starting a fresh package might be better.

@mkborregaard
Owner Author

So starting a fresh package might be better.

Sure, if you think so, we can always add stuff in from existing packages later

@rafaqz

rafaqz commented Jun 7, 2019

Here's a prototype base package for discussion:

https://github.com/rafaqz/GeoArrayBase.jl/blob/master/src/GeoArrayBase.jl

It might need to be called GeoBase, as I incorporated non-array geo data types (as in GeoStatsBase etc.) that should use the same methods.

As well as basic methods, it has some definitions of array dimensions, CRS projection formats and calendars that don't seem to be standardised anywhere. They probably belong in other base packages.

@juliohm

juliohm commented Jun 16, 2019

@rafaqz, the T<:Real restriction in GeoStatsBase.jl is gone now, so in theory one should be able to use Unitful.jl coordinates with the spatial objects defined therein.

Thanks for putting together a GeoArrayBase.jl proposal. I believe that we also need to consider other spatial configurations in this GeoBase.jl package we are discussing, for example structured grids, point collections, images, etc. These types are kind of what I have in GeoStatsBase.jl + GeoStatsDevTools.jl, and they serve my current needs. However, they lack the concept of projections, for example. Also, I did not have a chance to include the time dimension in any of those; I don't know if space and time should actually be mixed.

I did a major rewrite of the interface in GeoStatsBase.jl to facilitate some code reuse (e.g. plot recipes) and other planned features. In the spatialobject.jl file you can find a general interface for anything spatial (that fits in the concept of point-based measurements). From this abstract type, we can define at least two sub-types, which are still abstract: spatial domain, and spatial data.

The domain type is pretty much the same as a general spatial object type. It only materializes the idea of a collection of points with coordinates. It is basically a trick to use the spatial object interface, and still have a new type on which to dispatch.

The spatial data type is a domain plus some data. For example, for each point of the underlying domain, we can save properties. Most implementations in GeoStatsDevTools.jl save these properties in a dictionary mapping symbols to flat vectors of values. Note, however, that because we inherit from spatial object, and because we have domain types already implemented, all we need to do is wrap the domain in a new type containing the domain and an extra dictionary. Everything else we get for free regarding spatial queries.

This interface is working nicely so far within GeoStats.jl, and I'd be happy to go over it in a call if you guys feel it is appropriate. My hope is that at some point these domain types that I am writing can be reviewed and extended to become a separate GeoBase.jl or GeoInterface.jl package.

As a final comment, notice that all these types that I wrote above are point-based. I don't have much experience with polygon-based or area-based spatial objects. This GeoInterface.jl package that we are discussing could encompass both classes of spatial data.
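A condensed sketch of the hierarchy described above, just to make the structure concrete (the actual definitions live in GeoStatsBase.jl / GeoStatsDevTools.jl and differ in detail):

abstract type AbstractSpatialObject end                          # anything with point coordinates
abstract type AbstractDomain      <: AbstractSpatialObject end   # just the points
abstract type AbstractSpatialData <: AbstractSpatialObject end   # a domain plus values

# A concrete spatial data type can simply wrap a domain and a table of properties,
# inheriting all spatial queries from the underlying domain:
struct PointSetData{D<:AbstractDomain} <: AbstractSpatialData
    domain::D
    properties::Dict{Symbol,Vector{Float64}}   # e.g. :temperature => flat vector of values
end

domain(d::PointSetData) = d.domain
Base.getindex(d::PointSetData, var::Symbol) = d.properties[var]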

@rafaqz

rafaqz commented Aug 6, 2019

I have an update on this. I ended up putting together DimensionalData.jl as an abstract xarray-like indexing system that can be used to manipulate and plot tagged dimensional data, and which suits this problem better than AxisArrays.

GeoData.jl inherits from that, but adds spatial metadata, and hopefully integrates with spatial data as you are describing, @juliohm. Maybe we could start an issue there to discuss this further, if this direction seems useful to you? I've looked at AbstractSpatialObject etc. but there is still a lot I don't understand the reasoning for; it's a big package.

There are some examples, one of which is this (g is a 3D GeoArray):

# plot the mean sea surface temperature around Australia from August to December 2002
dimranges = Time((DateTime360Day(2002, 08, 1), DateTime360Day(2002, 012, 30))), Lat(-45:0.5), Lon(110:160)
select(g, dimranges) |> x->mean(x; dims=Time) |> plot

Which plots ~correctly with axis labels and metadata even after select() and mean(). It would be great if select could accept a polygon as well.

@mkborregaard I'm not sure how much this covers what you wanted with VerySimpleRasters. The main missing piece is to integrate with NCDatasets, GDAL (or wrapper packages) for specific load/save/conversion, which will probably require a lot more changes to GeoDims as well... There are lots of other small pieces unfinished too, and feedback would be very useful.

@evetion Although this doesn't look like it will work with affine transforms, it will... But they are optional, as a lot of the time we don't need them. AffineDims will just wrap any subset of the dimensions and hold the AffineMap. So a concrete type doesn't need an affine field, just dims, and any combination will just work. That also means normal indexing just uses dimensions, but select() will use the affine transform when it exists.

@juliohm

juliohm commented Aug 6, 2019

Thank you again @rafaqz for your efforts! I am still trying to understand to what extent the types proposed can be reused directly within GeoStats.jl. Could you please confirm that the types support non-regularly sampled points? For example, something that looks like this or simple point sets without array-like layouts.

Besides these non-regular types, I need to extend the existing types in GeoStatsBase.jl to introduce the notion of a volume element. Some geostatistical methods can handle spatial measurements over different supports. For example, it is common to have measurements of variable A in one resolution, and measurements of another variable B in a different resolution. In order to handle this scale difference explicitly in statistical models, we need to encode the volume of the measurement.

I am starting to think that we don't need types, but a comprehensive traits interface. It will be hard to reconcile all the proposed packages, since everyone in JuliaGeo has different use cases.

@mkborregaard
Owner Author

Hi, I'm really sorry for not engaging more with this. I'm currently in Ecuador and will be travelling to various places until mid-September, at which point I would absolutely love to engage fully with this again.

@visr

visr commented Aug 6, 2019

@rafaqz awesome that you're exploring this space. Haven't had time to look into everything, but just briefly checked out DimensionalData. Have you considered using/working with the folks at https://github.com/invenia/NamedDims.jl and https://github.com/invenia/IndexedDims.jl? They have similar goals, and aim to be the next AxisArray (and are inspired by xarray).

@rafaqz

rafaqz commented Aug 6, 2019

@juliohm non-regular grids are doable. Dims could hold the matrices and handle dispatch, so it would work on regular array types:

dims = Lat(latmatrix), Lon(lonmatrix)
a = GeoArray(data, dims)
select(a, Lat((1, 7)), Lon((1, 7)))

There would just need to be a few methods that dispatch on AbstractDimension{Matrix} to retrieve the indices with a splat.

But dims with indexable integer lat/long also aren't the usual case... Maybe a wrapper type on the matrix could make that clearer, e.g. Lat(Curvilinear(latmatrix)).

Otherwise, the main change this approach would mean for GeoStatsBase is either adding/loosening a dims field in structs and using it to store dimension order and bounds, or just generating dims in the dims() method using your current fields and fixed dimension order. Also adding a D parameter, as in AbstractSpatialObject{T,N,D} <: AbstractGeoDataset{N,D}, for dispatch.

The other thing is a rebuild() method, so that dims can be updated after subsetting with select() or reduction of dimensions. That's how the above example still plots through the reduction and crop.

I'm not sure if you will see these changes as worthwhile! But separating out dims seems pretty powerful for interop to me: number and order of dimensions, grid spacing, coordinates and type (like DateTime etc.), bounds, units, affine transforms, and possibly other metadata are all available from one method/field without changing anything else.

You can also just add a dims() method (without inheritance) and half of DimensionalData still works, but it would mean more code in the end.

@mkborregaard if I was travelling in Ecuador I wouldn't be engaging either :) But try the example when you get back; the plotting is pretty smooth.

@rafaqz

rafaqz commented Aug 6, 2019

@visr thanks for pointing that out. I hadn't seen IndexedDims yet, but NamedDims has some structural issues, similar to AxisArrays, and that's the reason I didn't use them: they are locked into using the concrete type NamedDimsArray. Using a wrapper means we can't dispatch on the wrapped type for other purposes, which limits ecosystem potential. I'm also not sure how I would use them with arbitrary indexing ranges, affine maps or the coordinate matrices described above.

In contrast, any array type can use DimensionalData.jl indexing by adding only one method: dims(::SomeType).

AbstractDimension is the main component used for dispatch, rather than the array type. There are only ~50 lines dedicated to AbstractDimensionalArray, mostly the basic array interface and making similar() etc. maintain the array type. And this is the actual code for the concrete array type, which will update its dims through reduce functions and use plot recipes:

struct DimensionalArray{T,N,D,R,A<:AbstractArray{T,N}} <: AbstractDimensionalArray{T,N,D}
    data::A
    dims::D
    refdims::R
end
DimensionalArray(a::AbstractArray{T,N}, dims; refdims=()) where {T,N} = 
    DimensionalArray(a, checkdims(a, dims), refdims)

# Array interface (AbstractDimensionalArray takes care of everything else)
Base.parent(a::DimensionalArray) = a.data

# DimensionalArray interface
rebuild(a::DimensionalArray, data, dims, refdims) = DimensionalArray(data, dims, refdims)

dims(a::DimensionalArray) = a.dims
refdims(a::DimensionalArray) = a.refdims

So adding it to other packages is pretty easy.
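A rough usage sketch for the snippet above, assuming Lat/Lon dimension types like those in the earlier select() example (the exact constructors and reduction syntax may differ in the package):

using Statistics

data = rand(180, 360)
a = DimensionalArray(data, (Lat(-89.5:1.0:89.5), Lon(-179.5:1.0:179.5)))

a[10, 20]          # plain indexing still works
mean(a; dims=Lat)  # reductions can name the dimension instead of its position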

@rafaqz

rafaqz commented Aug 6, 2019

Also, @visr, I really recommend running the examples script! Data manipulation and plotting are so easy.

@visr

visr commented Aug 6, 2019

They are locked into using the concrete type NamedDimsArray. Using a wrapper means we can't dispatch on the wrapped type for other purposes, which limits ecosystem potential. I'm also not sure how I would use them with arbitrary indexing ranges, affine maps or the coordinate matrices described above.

Maybe it's good to bring this up with the devs; you have some good ideas and clear use cases that they are looking for. Just trying to save you maintenance work in the long run ;) It seems that most of your code is not geo-specific anyway, which is a good sign. If I understand correctly, IndexedDims.jl wants to allow different kinds of indexes by using AcceleratedArrays (great package). I guess making an AffineIndex using that package should be relatively straightforward.

@rafaqz

rafaqz commented Aug 6, 2019

I'll let them know what I've been working on, for sure. But IndexedDims is all concrete again, while DimensionalData is an abstraction designed to be bolted onto things... So I can write an easy pull request to NCDatasets (etc.) to add dimensional indexing, data manipulation and plotting without getting in the way. It will be less work because the abstraction fits better.

@rafaqz

rafaqz commented Aug 8, 2019

This works for NCDatasets with a very small pull request. Just need to make it save back to NetCDF and write some tests (and probably handle a lot of other cases I haven't considered yet):
JuliaGeo/NCDatasets.jl#35

It should be easy to do a similar thing for other packages.

@meggart

meggart commented Sep 18, 2019

Cross-posting from the JuliaGeo/meta thread. As there is some consensus that we should have a teleconference about the future of geo raster types, I started a Google Docs document and would invite everyone interested in joining such a teleconference to add their availability to the list, so that we can try to fix a date and time to discuss the future common interface for geo raster data in Julia.

https://docs.google.com/document/d/1ccaSltPDb5n-bLNgWyrotsWXz2n7ckMggOIPejy_GvQ/edit?usp=sharing

@Balinus

Balinus commented Dec 19, 2019

It might be good to add some kind of abstract type hierarchy to deal with additional spatial and temporal dimensions, clarifying what N (in AbstractGeoArray{N,T}) means for any particular dataset in a standard way, and handling 3D and 4D datasets. I'm mostly using 4D microclimates (sometimes 5D), so this is particularly important to me.

ClimateTools.jl deals with 3 or 4 dimensions like this, so N is 3 or 4. Time is always the last dimension, the first two dimensions are lon and lat, and the third dimension is elevation for 4D:

  if ndims(data) == 3
    # Convert data to AxisArray
    dataOut = AxisArray(data, Axis{Symbol(lonname)}(lon_raw), Axis{Symbol(latname)}(lat_raw), Axis{:time}(timeV))
  elseif ndims(data) == 4 # this implies a 3D field (height component)
    # Get level vector
    plev = ds["plev"][:] # NetCDF.ncread(file, "plev")
    # Convert data to AxisArray
    dataOut = AxisArray(data, Axis{Symbol(lonname)}(lon_raw), Axis{Symbol(latname)}(lat_raw), Axis{:plev}(plev), Axis{:time}(timeV))
  else
    throw(error("load takes only 3D and 4D variables for the moment"))
  end

It might be useful to add here that the dimensions are standardized by the Climate and Forecast (CF) conventions. Thus, having the time dimension as the trailing dimension is set by this convention (Fortran ordering in this case; the C implementation uses the reverse dimension ordering, see http://www.bic.mni.mcgill.ca/users/sean/Docs/netcdf/guide.txn_22.html). Now, the interesting part that I haven't looked into is that NetCDF.jl seems to be a wrapper around the C library, but it returns the Fortran ordering. Maybe because Julia's memory layout is akin to Fortran's?

Anyway, I just wanted to point out that in the climate and forecasting domain there are some conventions, so it is easier to develop (read "hardcode") some decisions.
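The ordering point can be seen directly from Julia's memory layout; a quick illustration:

# Julia arrays are column-major (Fortran order): the first index varies fastest in
# memory. A variable stored as (time, lat, lon) in C order therefore comes back as
# lon × lat × time when read into Julia, without any data being rearranged.
A = zeros(3, 4, 5)   # e.g. lon × lat × time
strides(A)           # (1, 3, 12): the first dimension is contiguous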

@visr

visr commented Dec 19, 2019

Now, the interesting part that I haven't looked into is that NetCDF.jl seems to be a wrapper around the C library, but it returns the Fortran ordering. Maybe because Julia's memory layout is akin to Fortran's?

Yeah that's right, see also JuliaIO/Zarr.jl#1 (comment) and yeesian/ArchGDAL.jl#37 for some more discussion about this topic. Nice link about the Fortran dimension ordering, that's what we are implicitly doing.

@rafaqz

rafaqz commented Dec 19, 2019

But the conventions break anyway once you slice the array. Making it arbitrary is better, as you get to keep dim labels after slicing. GeoData.jl already does all this seamlessly.

@mkborregaard
Owner Author

@rafaqz do you see scope for folding more of this package into GeoData?

@rafaqz

rafaqz commented Dec 19, 2019

Grd load/save works, although slightly differently, and I have a half-finished ASCII implementation. Plotting works well, also copied from here originally. I just wrote an aggregate routine (you can aggregate any dims/shapes and use point sampling or mean/sum etc. methods; it also aggregates the coordinate index).

Mostly what is missing is the coordinate vector methods like extract, and something like isinside, as we were discussing on Slack. I've been hesitant to copy those over from here, as it would be good to do that integrated with GeometryBase/GeoInterface, and I don't really understand where that is going yet.

Edit: but yes that is the plan

@Balinus

Balinus commented Dec 20, 2019

Now, the interesting part that I haven't looked into is that NetCDF.jl seems to be a wrapper around the C library, but it returns the Fortran ordering. Maybe because Julia's memory layout is akin to Fortran's?

Yeah that's right, see also meggart/Zarr.jl#1 (comment) and yeesian/ArchGDAL.jl#37 for some more discussion about this topic. Nice link about the Fortran dimension ordering, that's what we are implicitly doing.

That was my suspicion. I do admit that back then, I spent quite a lot of time trying to understand why I was getting F ordering when NetCDF was calling C functions! 😄
