Metadata encoding options for GeoArrow-encoded columns in GeoParquet metadata #185
Comments
I do think that we should probably require to use (which I think also makes this option of using
I think it would be best to list the options that are allowed. We can always expand that later if geoarrow grows more options.
For the interleaved vs separated layout: I think it is clear that the separated layout has the most benefit in combination with Parquet, because of the statistics you get for free (and maybe better compression / faster reading). But I am not fully sure we should only allow that layout. It's certainly possible to have a case where you don't care about this, and you just need the fastest possible option to store and re-read a bunch of data. And if your target system needs interleaved data (like shapely/geopandas), storing as interleaved might be the fastest option (although I should verify this in practice!)
For the actual specification update, we should probably detail, for each geoarrow type, which Parquet type it maps to.
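To make the "which type maps to what" point concrete, here is a minimal sketch of the Arrow storage types behind the native GeoArrow geometry encodings, rendered as plain strings. The function names and the string rendering are illustrative assumptions, not part of any spec or library; the nesting depths follow my reading of the GeoArrow specification. With the separated layout, each of `x` and `y` ends up as its own Parquet leaf column, which is where the per-column statistics come from.

```python
# Number of list levels wrapped around the coordinate type for each
# native GeoArrow geometry encoding (per the GeoArrow spec, as I read it).
LIST_DEPTH = {
    "point": 0,
    "linestring": 1,
    "multipoint": 1,
    "polygon": 2,
    "multilinestring": 2,
    "multipolygon": 3,
}


def coord_type(layout, dims="xy"):
    """Render the coordinate storage type for a layout (illustrative strings only)."""
    if layout == "separated":
        # One double field per dimension; each becomes a Parquet leaf column.
        fields = ", ".join(f"{d}: double" for d in dims)
        return f"struct<{fields}>"
    if layout == "interleaved":
        # All dimensions of a coordinate packed into one fixed-size list.
        return f"fixed_size_list<double>[{len(dims)}]"
    raise ValueError(f"unknown layout: {layout!r}")


def storage_type(geometry, layout="separated", dims="xy"):
    """Wrap the coordinate type in one list level per nesting depth."""
    rendered = coord_type(layout, dims)
    for _ in range(LIST_DEPTH[geometry]):
        rendered = f"list<{rendered}>"
    return rendered


if __name__ == "__main__":
    for geometry in LIST_DEPTH:
        print(geometry, "->", storage_type(geometry))
```

A spec table could be derived from something like this, with one row per geometry type and layout.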
Some advantages/disadvantages I can think of for the different options how to specify this:
Pro is that the encoding value fully describes the geoarrow type. But a disadvantage is that this adds a whole series of possible values for the "encoding" key. This makes handling of this key a bit more complex (although in Python terms it would be
Pro is that this adds only a single new "encoding" value. But then you also still need to check the value of the other key to get the actual type.
Similar advantage of only adding a single "encoding" value, and additional advantage of not having to add a custom key that is only needed for geoarrow encoded data like above. But clear disadvantage is that you need to transform and combine the two keys manually to get the actual geoarrow type name.
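As a rough illustration of the reader-side cost of each of the three options above, here is a sketch of how the geoarrow type name could be recovered from a column's metadata. The option labels, the `geoarrow_type` key, and the function name are hypothetical (they are not in the GeoParquet spec or in any proposal text); only the `encoding` and `geometry_types` keys and the `geoarrow.*` names come from the existing specs.

```python
def resolve_geoarrow_type(column, option):
    """Return the geoarrow type name for a column, or None if not geoarrow-encoded.

    `option` selects which of the three metadata conventions to assume:
      "full_name"    - "encoding" holds the full geoarrow type name
      "separate_key" - "encoding" is "geoarrow", name is in a second key
      "inferred"     - "encoding" is "geoarrow", name derived from "geometry_types"
    """
    encoding = column["encoding"]

    if option == "full_name":
        # Many possible "encoding" values, but a single prefix check suffices.
        return encoding if encoding.startswith("geoarrow.") else None

    if encoding != "geoarrow":
        return None

    if option == "separate_key":
        # Hypothetical key name, used here only for illustration.
        return column["geoarrow_type"]

    if option == "inferred":
        # Transform and combine the two keys manually.
        geometry_types = column["geometry_types"]
        if len(geometry_types) != 1:
            raise ValueError("cannot infer a single geoarrow type")
        return "geoarrow." + geometry_types[0].lower()

    raise ValueError(f"unknown option: {option!r}")
```

The first two branches are lookups; only the last one needs a transformation and has a failure mode of its own when `geometry_types` lists more than one type.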
Since #1 was opened, the https://github.com/geoarrow/geoarrow repo has seen quite a lot of exciting activity...we're getting close to releasing our initial version! We have had a lot of great anecdotal conversations about how or if geometry encoded as GeoArrow should be included in this specification and I wanted to open this issue to formalize some of the points that have been made.
Anecdotally, there has been general agreement that including a columnar-friendly memory layout (i.e., one that does not require a parser of any kind to access coordinate values) as an option under the "encoding" metadata key would be good for GeoParquet because:
I think there are two orthogonal things to consider if GeoArrow will be included in a future (e.g., 1.1.0) GeoParquet specification. First, there is the question of how to structure the "encoding". Currently GDAL's experimental support uses the extension type name (as summarised here: https://github.com/geoarrow/geoarrow/blob/main/extension-types.md ) as the encoding key. This is sufficient for a reader to reconstruct a GeoArrow type when reading a Parquet file:
We could also just use "geoarrow" and declare the extension name somewhere else:
...or infer the extension name from the geometry type:
(I don't like that last option because there are GeoArrow extension types for WKT and WKB. Even if they aren't necessarily allowed/encouraged for use in this spec, I don't think we can guarantee that there is one canonical extension name per combination of geometry types and functionally the extension name is what is required for a reader implementation)
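For readers following along, the three ways of structuring the column metadata described above might look roughly like the dictionaries below. This is a sketch only: the `geoarrow_type` key is a made-up name for "somewhere else", and the multipolygon column is an arbitrary example. The small helper illustrates the objection to the third option: once the WKB and WKT extension types are counted, a geometry type alone does not pin down one extension name.

```python
import json

# Option 1: the extension name is used directly as the encoding.
option_1 = {
    "encoding": "geoarrow.multipolygon",
    "geometry_types": ["MultiPolygon"],
}

# Option 2: a single "geoarrow" encoding, with the extension name declared
# in a separate key ("geoarrow_type" is a hypothetical key name).
option_2 = {
    "encoding": "geoarrow",
    "geoarrow_type": "geoarrow.multipolygon",
    "geometry_types": ["MultiPolygon"],
}

# Option 3: a single "geoarrow" encoding; the reader infers the extension
# name from "geometry_types".
option_3 = {
    "encoding": "geoarrow",
    "geometry_types": ["MultiPolygon"],
}


def candidate_extension_names(geometry_type):
    """List the GeoArrow extension names that could hold this geometry type."""
    native = "geoarrow." + geometry_type.lower()
    # WKB and WKT extension arrays can hold any geometry type, so the
    # mapping from geometry type to extension name is one-to-many.
    return [native, "geoarrow.wkb", "geoarrow.wkt"]


if __name__ == "__main__":
    for label, column in [("1", option_1), ("2", option_2), ("3", option_3)]:
        print(f"option {label}:", json.dumps(column))
    print(candidate_extension_names("MultiPolygon"))
```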
The second consideration is which GeoArrow memory layouts to allow. The GeoArrow specification, like the Arrow specification, expanded to fit ways that we know people are already storing geospatial data in Arrow (i.e., it is currently more descriptive than prescriptive). The GeoParquet format and the discussions that went into creating it seem to favour a more prescriptive approach (i.e., restricting the allowed encodings/values to simplify implementations). For example, GeoParquet could provide language like:
The other end of the spectrum would be to just punt to the GeoArrow spec and allow any of the values we've defined.
Looking forward to discussion on this! (cc @kylebarron @jorisvandenbossche )