Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add geometric data types and functions #543

Merged
merged 5 commits into from
Sep 28, 2023

Conversation

richtia
Copy link
Member

@richtia richtia commented Aug 16, 2023

PR to add new geometric data types and functions.

extensions/functions_geometry.yaml Outdated Show resolved Hide resolved
extensions/functions_geometry.yaml Outdated Show resolved Hide resolved
value: fp32
return: point
-
name: "st_makeline"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is the point creation function name st_point but this function is st_makeline?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the names are from postgis

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we correct the naming here to avoid perpetuating their mistake? st_makeline seems to be the problematic one as it's actually performing a join on two arbitrary geometries instead of two points. Or is it making a line between the two geometries?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ST_ (Spatial Type) prefix isn't really necessary in Substrait as we're already associating all of the geometry functions within the geometry YAML file. I believe we should probably use MakePoint and MakeLine to be compatible with what PostGIS supports. Some Substrait implementations will only support the 2 argument version of MakePoint which would be the same as Point.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, ST_Point() is in the simple features access standard (which is probably why it is not ST_MakePoint()). If you want to correct the name it would probably be ST_Line()but PostGIS names are probably the option that is most likely to get this spec taken seriously by anybody who will actually implement it. You could also just omit any functions you're not sure about.

extensions/functions_geometry.yaml Outdated Show resolved Hide resolved
Copy link

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for getting this started!

For now, I wonder if what you want is just:

types:
  - name: geometry
    # ... is constraining the storage type mandatory?

Getting into the details of exactly how coordinates are stored and which dimensions are stored (e.g., sometimes there's a Z coordinate for elevation or an M coordinate with time) is a rather large can of worms that geospatial people have a lot of opinions about. If you must specify some constraint on the type, you might want to rename the type after the encoding (e.g., call it wkb and specify that the encoding is well-known binary).

Having just one type simplifies the function definitions: many are "just" geometry in, geometry out. I don't know if you want to get into the weeds on list and structure with point, linestring, polygon, multipoint, multielinestring, and multipolygon...every function treats those differently and I'm not sure that any two GIS engine people would give you the same answer if you asked them to come up with that list. Even then you still need a generic geometry (e.g., st_intersection(polygon, polygon) might give you any combination of points, lines, and/or polygons). You can see how the GeoArrow memory layout spec deals with this if you really want to be specific about types.

One choice you may have to make is whether to add a string parameter "crs" to the type. This is sort of like "timezone"...geospatial people care quite a lot about it. PostGIS handles this by making it datum specific (every value has its own uint32 SRID, which refers to a row in an SRID table); the GeoArrow metadata specification makes it part of the type. You could also punt and say that if it's missing, the CRS is "OGC:CRS84", which is GIS speak for "longitude/latitude". This is what the GeoParquet specification does.

Another term you may see is "GEOGRAPHY"...PostGIS has this type and so does BigQuery and Snowflake. GeoArrow handles this as a type parameter (edges: spherical | planar, with spherical being the one that corresponds to GEOGRAPHY). Geography is always longitude/latitude and never has Z or M values.

@richtia richtia force-pushed the geometry_constructors branch from dc03bbb to ac6f82a Compare September 6, 2023 16:03
@richtia
Copy link
Member Author

richtia commented Sep 6, 2023

Thanks for the feedback @paleolimbot! For now i'll stick with just adding a new geometry extension type you suggested. That also makes the new function specs easier.

@richtia richtia requested a review from paleolimbot September 6, 2023 18:20
value: fp32
return: point
-
name: "st_makeline"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we correct the naming here to avoid perpetuating their mistake? st_makeline seems to be the problematic one as it's actually performing a join on two arbitrary geometries instead of two points. Or is it making a line between the two geometries?

extensions/extension_types.yaml Outdated Show resolved Hide resolved
@richtia
Copy link
Member Author

richtia commented Sep 6, 2023

@EpsilonPrime i'll update the description for st_makeline, the geometries shouldn't be entirely arbitrary, they should both either be a point, a multipoint (collection of points), or a linestring (points that are already connected to each other)

https://postgis.net/docs/en/ST_MakeLine.html

value: fp32
return: point
-
name: "st_makeline"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ST_ (Spatial Type) prefix isn't really necessary in Substrait as we're already associating all of the geometry functions within the geometry YAML file. I believe we should probably use MakePoint and MakeLine to be compatible with what PostGIS supports. Some Substrait implementations will only support the 2 argument version of MakePoint which would be the same as Point.

extensions/extension_types.yaml Outdated Show resolved Hide resolved
extensions/functions_geometry.yaml Show resolved Hide resolved
Copy link

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few more comments!

You may also be interested in the Simple Features Access standard for SQL which defines these functions and their return types (link and a few relevant sections pasted below). The functions that never made it to PostGIS are probably not worth including.

Link: https://portal.ogc.org/files/?artifact_id=25354

Methods on abstract geometry class:

METHOD ST_Dimension()
RETURNS SMALLINT
LANGUAGE SQL
DETERMINISTIC
CONTAINS SQL
RETURNS NULL ON NULL INPUT,

METHOD ST_GeometryType()
RETURNS CHARACTER VARYING(ST_MaxTypeNameLength)
LANGUAGE SQL
DETERMINISTIC
CONTAINS SQL
RETURNS NULL ON NULL INPUT,

METHOD ST_AsText()
RETURNS CHARACTER LARGE OBJECT(ST_MaxGeometryAsText)
LANGUAGE SQL
DETERMINISTIC
CONTAINS SQL
RETURNS NULL ON NULL INPUT
RETURNS NULL ON NULL INPUT,

METHOD ST_AsBinary()
RETURNS BINARY LARGE OBJECT(ST_MaxGeometryAsBinary)
LANGUAGE SQL
DETERMINISTIC
CONTAINS SQL
RETURNS NULL ON NULL INPUT,

METHOD ST_IsEmpty()
RETURNS INTEGER
LANGUAGE SQL
DETERMINISTIC
CONTAINS SQL
RETURNS NULL ON NULL INPUT,

METHOD ST_IsSimple()
RETURNS INTEGER
LANGUAGE SQL
DETERMINISTIC
CONTAINS SQL
RETURNS NULL ON NULL INPUT,

METHOD ST_Boundary()
RETURNS ST_Geometry
LANGUAGE SQL
DETERMINISTIC
CONTAINS SQL
RETURNS NULL ON NULL INPUT,
METHOD ST_Envelope()
RETURNS ST_Polygon
LANGUAGE SQL
DETERMINISTIC
CONTAINS SQL
RETURNS NULL ON NULL INPUT,

METHOD ST_Equals(ageometry ST_Geometry)
RETURNS INTEGER
LANGUAGE SQL
DETERMINISTIC
CONTAINS SQL
RETURNS NULL ON NULL INPUT,

METHOD ST_Disjoint(ageometry ST_Geometry)
RETURNS INTEGER
LANGUAGE SQL
DETERMINISTIC
CONTAINS SQL
RETURNS NULL ON NULL INPUT,

METHOD ST_Intersects (ageometry ST_Geometry)
RETURNS INTEGER
LANGUAGE SQL
DETERMINISTIC
CONTAINS SQL
RETURNS NULL ON NULL INPUT,

METHOD ST_Touches(ageometry ST_Geometry)
RETURNS INTEGER
LANGUAGE SQL
DETERMINISTIC
CONTAINS SQL
RETURNS NULL ON NULL INPUT,

METHOD ST_Crosses(ageometry ST_Geometry)
RETURNS INTEGER
LANGUAGE SQL
DETERMINISTIC
CONTAINS SQL
RETURNS NULL ON NULL INPUT,

METHOD ST_Within (ageometry ST_Geometry)
RETURNS INTEGER
LANGUAGE SQL
DETERMINISTIC
CONTAINS SQL
RETURNS NULL ON NULL INPUT,

METHOD ST_Contains(ageometry ST_Geometry)
RETURNS INTEGER
LANGUAGE SQL
DETERMINISTIC
CONTAINS SQL
RETURNS NULL ON NULL INPUT,

METHOD ST_Overlaps(ageometry ST_Geometry)
RETURNS INTEGER
LANGUAGE SQL
DETERMINISTIC
CONTAINS SQL
RETURNS NULL ON NULL INPUT,

METHOD ST_Relate(ageometry ST_Geometry, amatrix CHARACTER(9))
RETURNS INTEGER
LANGUAGE SQL
DETERMINISTIC
CONTAINS SQL
RETURNS NULL ON NULL INPUT,

METHOD ST_Distance(ageometry ST_Geometry)
RETURNS DOUBLE PRECISION
LANGUAGE SQL
DETERMINISTIC
CONTAINS SQL
RETURNS NULL ON NULL INPUT,

METHOD ST_Intersection(ageometry ST_Geometry)
RETURNS ST_Geometry
LANGUAGE SQL
DETERMINISTIC
CONTAINS SQL
RETURNS NULL ON NULL INPUT,

METHOD ST_Difference(ageometry ST_Geometry)
RETURNS ST_Geometry
LANGUAGE SQL
DETERMINISTIC
CONTAINS SQL
RETURNS NULL ON NULL INPUT,

METHOD ST_Union(ageometry ST_Geometry)
RETURNS ST_Geometry
LANGUAGE SQL
DETERMINISTIC
CONTAINS SQL
RETURNS NULL ON NULL INPUT,

METHOD ST_SymDifference (ageometry ST_Geometry)
RETURNS ST_Geometry
LANGUAGE SQL
DETERMINISTIC
CONTAINS SQL
RETURNS NULL ON NULL INPUT,

METHOD ST_Buffer (adistance DOUBLE PRECISION)
RETURNS ST_Geometry
LANGUAGE SQL
DETERMINISTIC
CONTAINS SQL
RETURNS NULL ON NULL INPUT,

METHOD ST_Buffer ( adistance DOUBLE PRECISION,
aunit CHARACTER VARYING(ST_MaxUnitNameLength))
RETURNS ST_Geometry
LANGUAGE SQL
DETERMINISTIC
CONTAINS SQL
RETURNS NULL ON NULL INPUT,

METHOD ST_ConvexHull()
RETURNS ST_Geometry
LANGUAGE SQL
DETERMINISTIC
CONTAINS SQL
RETURNS NULL ON NULL INPUT

...there are more but you probably don't need all of them?

extensions/extension_types.yaml Outdated Show resolved Hide resolved
extensions/functions_geometry.yaml Show resolved Hide resolved
extensions/functions_geometry.yaml Outdated Show resolved Hide resolved
value: fp32
return: point
-
name: "st_makeline"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, ST_Point() is in the simple features access standard (which is probably why it is not ST_MakePoint()). If you want to correct the name it would probably be ST_Line()but PostGIS names are probably the option that is most likely to get this spec taken seriously by anybody who will actually implement it. You could also just omit any functions you're not sure about.

extensions/extension_types.yaml Outdated Show resolved Hide resolved
extensions/extension_types.yaml Outdated Show resolved Hide resolved
extensions/extension_types.yaml Outdated Show resolved Hide resolved
extensions/extension_types.yaml Outdated Show resolved Hide resolved
@richtia richtia force-pushed the geometry_constructors branch from 0196131 to e750368 Compare September 12, 2023 23:15
value: fp64
- name: y
value: fp64
return: geometry
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

geometry as a type would only be valid if it was a core substrait type (e.g i64). To reference a user-defined type, the syntax would be u!geometry.

Currently though, there's a limitation where types from one extension file can't be referenced from a different extension file: #496

This means that you can't reference the geometry type from extension_types.yaml (or really any of the types) in this file.

One way to work around this for now would be to define geometry and the other types we need directly in this file.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Made some updates!

Copy link
Member

@EpsilonPrime EpsilonPrime left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this should work nicely. Thanks for handling all the iterations.

Copy link

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for working through all of these details! Geometry is hard but I am excited to this addition (+ future additions that it enables!) to the spec!

@richtia richtia merged commit db52bbd into substrait-io:main Sep 28, 2023
Comment on lines +31 to +34
- name: x
value: u!geometry
- name: y
value: u!geometry
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small nitpick (and I also don't know if substrait has something like positional args where the actual name doesn't matter, in which case it is only for documentation): you might want to use different argument names than x and y, since naively that would easily be interpreted as x and y coordinates (like in point()), while here it's rather "point 1" and "point 2".
So I would rather go with something like geom1/geom2, or if preferring something generic, eg a/b

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh...I think I missed this when we decided to change to using the opaque geometry type. Thanks for catching this! I'm addressing it in the PR: https://github.com/substrait-io/substrait/pull/554/files

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants