st_is_valid performance #1666
Comments
As you suggested, work here does not involve actually implementing these functions, but interfacing existing implementations. The hopeful news is that there is activity in this area both in GEOS (3.10-dev) and in the R s2 package (for spherical validity).
With trivial inputs, the timing is dominated by R overhead rather than by the underlying implementations: in the 34 s example, less than 10% of the time is spent in GEOS. R overhead aside, a primary difference here is that you're unlikely to see MakeValid outperform IsValid, because the first step of MakeValid is to check IsValid. If you have a geometry that performs especially poorly, your best bet is to make sure you have GEOS 3.9 and open an issue at https://github.com/libgeos/geos.
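(Not from the thread: profiling shows where the time goes on a given call. A minimal sketch, assuming the profvis package is installed and using the b object from the MWE below:)

library(profvis)
# Flame graph separates R-level overhead from time spent inside GEOS:
profvis(sf::st_is_valid(b[sample(nrow(b), 50000, replace = TRUE), ]))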
Looks like you can bypass a lot of the R overhead with
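(The snippet itself is elided above; a rough sketch of the idea, assuming the geos package, a thin R binding to GEOS, rather than whatever was originally posted:)

library(geos)
g <- as_geos_geometry(sf::st_geometry(b))  # convert to GEOS handles once
geos_is_valid(g)  # vectorized check with little per-feature R overhead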
It seems like the culprit is the use of
Interesting. I thought this solved my problem, but now I realize that it'll throw an error. Still, it's helpful, so thank you. For my purposes, I can run this on subsets of the data anyway, and can use tryCatch to fall back to the NA_on_exception = TRUE call only when needed.
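(A minimal sketch of that fallback, not from the thread; chunk stands for a hypothetical subset of the data:)

valid <- tryCatch(
  sf::st_is_valid(chunk, NA_on_exception = FALSE),  # fast path; errors on geometries that raise exceptions
  error = function(e) sf::st_is_valid(chunk, NA_on_exception = TRUE)  # slower; returns NA instead of erroring
)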
Oops, I didn't mean to push this against the master branch. I think that this will increase the dependencies.

library(sf)
# Linking to GEOS 3.9.0, GDAL 3.2.1, PROJ 7.2.1
library(dplyr)
# Attaching package: ‘dplyr’
# The following objects are masked from ‘package:stats’:
# filter, lag
# The following objects are masked from ‘package:base’:
# intersect, setdiff, setequal, union
library(tictoc)
b0 <- st_polygon(list(rbind(c(-1,-1), c(1,-1), c(1,1), c(-1,1), c(-1,-1))))
b1 <- st_polygon(list(rbind(c(-1,-1), c(1,-1), c(1,1), c(0,-1), c(-1,-1))))
b <- st_sfc(b0, b1) %>% st_sf()
tic()
invisible(st_is_valid(b %>% sample_n(5000, replace = T)))
toc()
# 0.086 sec elapsed
# 1.2s on my laptop
tic()
invisible(st_make_valid(b %>% sample_n(5000, replace = T)))
toc()
# 0.485 sec elapsed
# 0.7 s on my laptop
tic()
invisible(st_is_valid(b %>% sample_n(50000, replace = T)))
toc()
# 0.777 sec elapsed
# 34s on my laptop
tic()
invisible(st_make_valid(b %>% sample_n(50000, replace = T)))
toc()
# 4.399 sec elapsed
# 6.5s on my laptop
It would really be nice to find some alternative to
Hmm. Without doubt it would be faster to call
I also noticed that for a large polygon:

system.time(sf::st_is_valid(pol))
#>    user  system elapsed
#>  44.203   5.087  49.952
system.time(rgeos::gIsValid(sf::as_Spatial(pol)))
#>    user  system elapsed
#>   4.210   0.338   4.604

Am I missing something here? For a fully reproducible example, download the ETOPO1_Ice_g_gmt4.grd.gz grid from here and:

devtools::install_github("MikkoVihtakari/ggOceanMapsData") # required by ggOceanMaps
devtools::install_github("MikkoVihtakari/ggOceanMaps")
library(ggOceanMaps)
bathy <- ggOceanMaps::raster_bathymetry(
bathy = "pathToTheETOPO1Grid",
depths = c(50, 300, 500, 1000, 1500, 2000, 4000, 6000, 10000),
proj.out = "EPSG:3995",
boundary = c(-180.0083, 180.0083, 30, 90),
aggregation.factor = 2
)
pol <- sf::st_as_sf(stars::st_as_stars(bathy$raster), as_points = FALSE, merge = TRUE)
@MikkoVihtakari which versions of GEOS do sf and rgeos link to?
> library(sf)
Linking to GEOS 3.8.1, GDAL 3.2.1, PROJ 7.2.1
> library(rgeos)
Loading required package: sp
rgeos version: 0.5-5, (SVN revision 640)
GEOS runtime version: 3.8.1-CAPI-1.13.3
Linking to sp version: 1.4-5
Polygon checking: TRUE
I'm seeing this:

> system.time(sf::st_is_valid(pol))
   user  system elapsed
  0.573   0.001   0.574
> #> user system elapsed
> #> 44.203 5.087 49.952
>
> system.time(rgeos::gIsValid(sf::as_Spatial(pol)))
   user  system elapsed
  4.785   0.000   4.786

with

> library(sf)
Linking to GEOS 3.9.0, GDAL 3.2.1, PROJ 7.2.1
> library(rgeos)
Loading required package: sp
rgeos version: 0.5-5, (SVN revision 640)
GEOS runtime version: 3.9.0-CAPI-1.16.2
Linking to sp version: 1.4-5
Polygon checking: TRUE

But why would you check the validity of polygons derived from a regular raster?
Thank you for the great support @edzer. It indeed seems that the problems are on my machine/coding, as I expected. I should probably ask this in another thread, but how do I make sf link to the updated GEOS? I have updated it:

brew reinstall geos
==> Downloading https://homebrew.bintray.com/bottles/geos-3.9.1.big_sur.bottle.t
==> Reinstalling geos
==> Pouring geos-3.9.1.big_sur.bottle.tar.gz
🍺 /usr/local/Cellar/geos/3.9.1: 444 files, 10.2MB

I have tried all of the following:

install.packages("sf")
install.packages("sf", configure.args = "--with-proj-lib=/usr/local/lib/")
install.packages("sf", configure.args = "--with-proj-lib=/usr/local/Cellar/geos/3.9.1/")

And always get:

library(sf)
#> Linking to GEOS 3.8.1, GDAL 3.2.1, PROJ 7.2.1

Because:

all(sf::st_is_valid(pol))
#> [1] FALSE
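(Not from the thread: on macOS the CRAN binary bundles its own GEOS, so configure.args only take effect for a source install. A sketch, with the Homebrew path assumed, verified afterwards with sf::sf_extSoftVersion():)

install.packages(
  "sf",
  type = "source",  # binaries ignore configure.args and ship their own GEOS
  configure.args = "--with-geos-config=/usr/local/Cellar/geos/3.9.1/bin/geos-config"
)
sf::sf_extSoftVersion()  # reports the GEOS/GDAL/PROJ versions sf is linked to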
Never mind the question above.

system.time(sf::st_is_valid(pol))
#   user  system elapsed
#  0.862   0.023   0.894

Does this mean that this operation is still 10 times slower in the CRAN version? I think I'll begin to move the
Ah, of course, I missed the
In principle yes; it depends on the amount of code and the length of my backlog at the time you ask.
Thanks @edzer. I'll let you know when and if... Although perhaps it's better to ask specific questions than dump the entire code on you. Or let you review one function, and I'll copy that approach for the rest. I'll make a thread on the ggOceanMaps forum and tag you when I need help. For my part, this thread can be closed. The performance of
This may be a dumb question, but I was surprised to find that st_is_valid not only runs more slowly than st_make_valid, but also scales non-linearly as the number of features increases.

I include a MWE demonstrating the two observations below (I ran this with sf 0.9-8). For my actual use case, I have around 11m features, and at least one of them is causing some sort of downstream validity problem when I try to join or crop to other datasets. Unfortunately it's not one of the problems that st_make_valid can resolve (invalid number of points exception), so I'm reliant on st_is_valid, which hasn't been able to complete in a reasonable amount of time.

I know these functions rely on backend code, so maybe this is not a resolvable problem, but if anyone has ideas for improving performance here I'd be very interested. Also happy to relocate this to StackOverflow if it's more appropriately asked there.