-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Spatial joins #113
Spatial joins #113
Conversation
Can't merge until I figure out why the benchmark theme is leaking in the docs, but this is essentially complete aside from tests now. |
|
||
@testset "Basic integration" begin | ||
|
||
@test_nowarn joined_df = FlexiJoins.innerjoin((poly_df, points_df), by_pred(:geometry, GO.contains, :geometry)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to wrap this and avoid the :geometry
:geometry
part
Like this maybe:
joined_df = GO.innerjoin(poly_df, points_df, GO.contains)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see what you're saying here now - but until we resolve the DataFrames/GI.geometrycolumn/ArchGDAL mess, it's probably best to keep this explicit IMO :D
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it actually a mess? It mostly just works
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case yes - GADM for example only outputs tables with geom
columns, as do many other ArchGDAL-based loaders.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ArchGDAL tables are fine, GADM returns a NamedTuple. Theres an issue for that
Is the idea we would do this on prepared geometries, or at least GI wrappers with the extent calculated? Nested loops that recalculate extent for every combination are going to be pretty slow. Probably we should do an extent generating pass before doing the join, but that breaks the Flexijoins abstraction a little... |
Good point - in that case I can see the argument for wrapping it. Though this loses the ability to do true SQL-like joins via FlexiJoins (since I don't know whether we are able to define a new predicate which can mutate the geometry...) We can always say that this is just for convenience, and you should use FlexiJoins for any real work? |
Interesting direction! The user-facing API in FlexiJoins is stable, well-tested, and reliable. As for extending with more predicates – this part may evolve somewhat over time, for the sole reason that it's more more rarely used and not all sharp edges have been smoothed out. The main feature of FlexiJoins is doing composable & performant joins, preferably avoiding n^2 loops at the default. It definitely has an ability for precomputations, they are extensively used for all basic join predicates. Basically, first |
Hmm, at least for one side we could use Reading the code, it seems like |
Sounds like exactly what
Yeah, that's the current design.
I understand this can significantly improve performance sometimes, but such a workflow isn't implemented yet. But would be nice to have indeed at some point! Actually, preparing both sides can give a boost in many cases, from sorted-joins to tree NN joins. |
Yeah I can see that being a pain for sure. It's probably not too much trouble to ask users to run |
And still prepare on one side + loop over the other makes sense to implement IMO in general. Is it a reasonable thing for geojoin? |
I think so, we can at least start with providing extents. We are creating a prepared geometry interface in #105 which will slot into the position of the |
Calculating extent for the "a" side and re-wrapping geoms in But eventually we probably don't want to run those double nested loops on large datasets either, and instead precalculate an STRtree or similar instead that we can just query to return the minimum list. We can use existing packages like https://github.com/JuliaGeo/LibSpatialIndex.jl or native in https://github.com/maxfreu/SortTileRecursiveTree.jl. This is what e.g. PostGIS does with an R-tree and a few other options. Maybe FlexiJoins is still good as a small dataset/single use approach. |
I think FlexiJoins has a tree based approach as well, I just haven't implemented it yet :D |
Ok amazing if it can handle trees to. @maxfreu s package is already GI compatible, if we can hook up SortTileRecursiveTree.jl and FlexiJoins.jl then this could really work well. |
Totally, build tree in Btw, if you know of a better/faster packages for general range search, please tell me! NN.jl unfortunately doesn't have very generic implementation. |
Ok great that's all built in already. I'm not sure about the general problem! But for geometries NN.jl is less suitable. The standard approach is to match geometries by precalculated extent tiles and then run actual checks only on the short-list of crossing extents. Crossing extents doesn't mean geometries actually cross, but they are quite likely to. So the R-tree gets us most of the way. The plots here are a good overview: |
Co-authored-by: Alexander Plavin <[email protected]>
Thanks for the review @aplavin! I wasn't sure how to implement some of these but it makes a lot more sense now :) If you'd like I can write a quick blurb on how to get FlexiJoins working with user-defined predicates, now that we've done it here! Not sure where it would go though. |
|
||
```julia | ||
my_predicate_function = <(5) ∘ abs ∘ GO.distance |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is an actual distance, it's probably already supported by FlexiJoins :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is supported by Distances.jl, and there are a bunch of other GO functions one might want to use :D - for example, testing whether centroids are close to each other, or something. So I figured a general approach would be best.
Just out of curiosity, is there a reason that NestedLoopFast
isn't supported by default?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for example, testing whether centroids are close to each other
Isn't it just by_distance(x -> centroid(x.geometry), Euclidean(), <=(3))
? Or whatever other distance you need instead of Euclidean
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is supported by Distances.jl
Wonder why is that the case? Does the function break some distance properties?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a reason that NestedLoopFast isn't supported by default?
I consider falling back to n^2 join without user explicitly requesting it is a footgun.
NestedLoopFast is really intended for cheap filtering operations on top of the "main" join predicate. Such as NotSame
in FlexiJoins itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For centroids yes, but if computing the distance between geometries (basically distance between closest linesegments) then that's not a Distances.jl thing. The centroid comparison would be that though, and I should probably add an example of that syntax to the docs as well!
This PR hooks the spatial DE-9IM predicates up to FlexiJoins.jl's
by_pred
functionality via a package extension, to enable spatial joins on tables.An example is also added (but this still needs tests, just to check that the functionality works).
cc @JuliaAPlavin @aplavin