Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

st_is_valid performance #1666

Closed
pbaylis opened this issue May 1, 2021 · 15 comments
Closed

st_is_valid performance #1666

pbaylis opened this issue May 1, 2021 · 15 comments

Comments

@pbaylis
Copy link

pbaylis commented May 1, 2021

This is may be a dumb question, but I was surprised to find that st_is_valid not only runs more slowly than st_make_valid, but it also scales non-linearly as the number of features increases.

I include a MWE demonstrating the two observations below (I ran this with sf_0.9-8). For my actual use case, I have around 11m features, and at least one of them is causing some sort of downstream validity problem when I try to join or crop to other datasets. Unfortunately it's not one of the problems that st_make_valid can resolve (invalid number of points exception), so I'm reliant on st_is_valid, which hasn't been able to complete in a reasonable amount of time.

I know these functions are reliant on backend code, so maybe this is not a resolvable problem, but if anyone has any ideas for improving performance here I'd be very interested. Also happy to relocate this to StackOverflow if it's more appropriately asked there.

library(sf)
library(dplyr)
library(tictoc)

b0 <- st_polygon(list(rbind(c(-1,-1), c(1,-1), c(1,1), c(-1,1), c(-1,-1))))
b1 <- st_polygon(list(rbind(c(-1,-1), c(1,-1), c(1,1), c(0,-1), c(-1,-1))))

b <- st_sfc(b0, b1) %>% st_sf()

tic()
invisible(st_is_valid(b %>% sample_n(5000, replace = T)))
toc()
# 1.2s on my laptop            

tic()
invisible(st_make_valid(b %>% sample_n(5000, replace = T)))
toc()
# 0.7s s on my laptop

tic()
invisible(st_is_valid(b %>% sample_n(50000, replace = T)))
toc()
# 34s on my laptop            

tic()
invisible(st_make_valid(b %>% sample_n(50000, replace = T)))
toc()
# 6.5s on my laptop
@edzer
Copy link
Member

edzer commented May 1, 2021

As you suggested, work here does not involve actual implementing these functions, but interfacing existing implementations. Hopeful news is that there's both activity in this area going on in GEOS (3.10-dev) and the R s2 package (for spherical validity).

@dbaston
Copy link
Contributor

dbaston commented May 3, 2021

With trivial inputs, the timing is dominated by R overhead rather than the underlying implementations. in the 34s example, less than 10% of the time is spent in GEOS. It looks like a (primary?) difference here is that st_is_valid uses the subsetting operator for an sfc, which does a bit of work, whereas st_make_valid does not.

R overhead aside, you're unlikely to see MakeValid out-perform IsValid, because the first step of MakeValid is to check IsValid. If you have a geometry that is performing especially poorly, your best bet is to make sure you have GEOS 3.9 and open an issue with https://github.com/libgeos/geos.

@dbaston
Copy link
Contributor

dbaston commented May 3, 2021

Looks like you can bypass a lot of the R overhead with st_is_valid(x, NA_on_exception = FALSE), which pushes the iteration from R into C++:

invisible(st_is_valid(b %>% sample_n(50000, replace = T), NA_on_exception = TRUE)) # 34s
invisible(st_is_valid(b %>% sample_n(50000, replace = T), NA_on_exception = FALSE)) # 3.7s

It seems like the culprit is the use of unclass in [.sfc but I don't know any way around it.

@pbaylis
Copy link
Author

pbaylis commented May 3, 2021

Interesting. I thought this solved my problem, but now I realize that it'll throw an error. Still, it's helpful, so thank you. For my purposes, I'm can run this on subsets of the data anyway and can tryCatch using to only use the NA_on_exception = TRUE call when needed.

edzer added a commit that referenced this issue May 3, 2021
@edzer
Copy link
Member

edzer commented May 3, 2021

Oops I didn't mean to push this against the master branch. It think that this will increas the dependency of sf to GEOS 3.5.0 rather than 3.4.0 as now stated, as it requires the error handlers to be set IIRC, but that might not be a bad idea anyway.

library(sf)
# Linking to GEOS 3.9.0, GDAL 3.2.1, PROJ 7.2.1
library(dplyr)

# Attaching package: ‘dplyr’

# The following objects are masked from ‘package:stats’:

#     filter, lag

# The following objects are masked from ‘package:base’:

#     intersect, setdiff, setequal, union

library(tictoc)

b0 <- st_polygon(list(rbind(c(-1,-1), c(1,-1), c(1,1), c(-1,1), c(-1,-1))))
b1 <- st_polygon(list(rbind(c(-1,-1), c(1,-1), c(1,1), c(0,-1), c(-1,-1))))

b <- st_sfc(b0, b1) %>% st_sf()

tic()
invisible(st_is_valid(b %>% sample_n(5000, replace = T)))
toc()
# 0.086 sec elapsed
# 1.2s on my laptop            

tic()
invisible(st_make_valid(b %>% sample_n(5000, replace = T)))
toc()
# 0.485 sec elapsed
# 0.7s s on my laptop

tic()
invisible(st_is_valid(b %>% sample_n(50000, replace = T)))
toc()
# 0.777 sec elapsed
# 34s on my laptop            

tic()
invisible(st_make_valid(b %>% sample_n(50000, replace = T)))
toc()
# 4.399 sec elapsed
# 6.5s on my laptop

@dbaston
Copy link
Contributor

dbaston commented May 4, 2021

It would really be nice to find some alternative to unclass so that an entire sfc is not copied each time you access an element (this must come up elsewhere, not just in st_is_valid).

@edzer
Copy link
Member

edzer commented May 4, 2021

Hmm. Without doubt it would be faster to call list(x[[i]]) instead of x[i], but what do you know about assumptions made later on e.g. that the object passed has a valid double attribute called precision.

@MikkoVihtakari
Copy link

MikkoVihtakari commented May 29, 2021

I also noticed that for a large polygon, sf::st_is_valid is about 10 times slower than a conversion to sp object and rgeos::gIsValid together:

system.time(sf::st_is_valid(pol))
#>  user  system elapsed 
#> 44.203   5.087  49.952 

system.time(rgeos::gIsValid(sf::as_Spatial(pol)))
#>   user  system elapsed 
#>  4.210   0.338   4.604 

Am I missing something here?

For a fully reproducible example, download the ETOPO1_Ice_g_gmt4.grd.gz grid from here and:

devtools::install_github("MikkoVihtakari/ggOceanMapsData") # required by ggOceanMaps
devtools::install_github("MikkoVihtakari/ggOceanMaps")

library(ggOceanMaps)

bathy <- ggOceanMaps::raster_bathymetry(
  bathy =  "pathToTheETOPO1Grid", 
  depths = c(50, 300, 500, 1000, 1500, 2000, 4000, 6000, 10000), 
  proj.out = "EPSG:3995", 
  boundary = c(-180.0083, 180.0083, 30, 90), 
  aggregation.factor = 2
)

pol <- sf::st_as_sf(stars::st_as_stars(bathy$raster), as_points = FALSE, merge = TRUE)

@edzer
Copy link
Member

edzer commented May 29, 2021

@MikkoVihtakari which versions of GEOS do sf and rgeos link to?

@MikkoVihtakari
Copy link

@edzer

> library(sf)
Linking to GEOS 3.8.1, GDAL 3.2.1, PROJ 7.2.1
> library(rgeos)
Loading required package: sp
rgeos version: 0.5-5, (SVN revision 640)
 GEOS runtime version: 3.8.1-CAPI-1.13.3 
 Linking to sp version: 1.4-5 
 Polygon checking: TRUE 

@edzer
Copy link
Member

edzer commented May 30, 2021

I'm seeing this:

> system.time(sf::st_is_valid(pol))
   user  system elapsed 
  0.573   0.001   0.574 
> #>  user  system elapsed 
> #> 44.203   5.087  49.952 
> 
> system.time(rgeos::gIsValid(sf::as_Spatial(pol)))
   user  system elapsed 
  4.785   0.000   4.786 

with

> library(sf)
liLinking to GEOS 3.9.0, GDAL 3.2.1, PROJ 7.2.1
> library(rgeos)
Loading required package: sp
rgeos version: 0.5-5, (SVN revision 640)
 GEOS runtime version: 3.9.0-CAPI-1.16.2 
 Linking to sp version: 1.4-5 
 Polygon checking: TRUE 

But wjy would you check the validity of polygons derived from a regular raster?

@MikkoVihtakari
Copy link

Thank you for the great support @edzer. It indeed seems that the problems are on my machine/coding as I expected. Should probably ask this in another thread, but how do I make sf to use the newest GEOS on Mac?

I have updated GEOS:

brew reinstall geos
==> Downloading https://homebrew.bintray.com/bottles/geos-3.9.1.big_sur.bottle.t
==> Reinstalling geos 
==> Pouring geos-3.9.1.big_sur.bottle.tar.gz
🍺  /usr/local/Cellar/geos/3.9.1: 444 files, 10.2MB

I have tried all of the following:

install.packages("sf")
install.packages("sf", configure.args = "--with-proj-lib=/usr/local/lib/")
install.packages("sf", configure.args = "--with-proj-lib=/usr/local/Cellar/geos/3.9.1/")

And always get:

library(sf)
#> Linking to GEOS 3.8.1, GDAL 3.2.1, PROJ 7.2.1

But wjy would you check the validity of polygons derived from a regular raster?

Because:

all(sf::st_is_valid(pol))
#> [1] FALSE

@MikkoVihtakari
Copy link

MikkoVihtakari commented May 30, 2021

Nevermind the question above. devtools::install_github("r-spatial/sf", configure.args = "--with-proj-lib=/usr/local/lib/") did the trick. Now I get:

system.time(sf::st_is_valid(pol))
#   user  system elapsed 
# 0.862   0.023   0.894 

Does this mean that this operation in the CRAN version is still 10 times slower in sf than rgeos?

I think I'll begin to move the ggOceanMaps package from the old GIS packages to sf (see here: MikkoVihtakari/ggOceanMaps#5). During this process, I may have to ask some technical questions. Would anyone over here be willing to review the code once I am done? I am sure there are smarter ways to write certain parts than I can come up with...

@edzer
Copy link
Member

edzer commented May 30, 2021

Because:

all(sf::st_is_valid(pol))
#> [1] FALSE

Ah, of course, I missed the merge = TRUE.

Would anyone over here be willing to review the code once I am done?

In principle yes, depends on the amount of code and the length of my backlogs at the time you'll ask.

@MikkoVihtakari
Copy link

MikkoVihtakari commented May 31, 2021

Thanks @edzer. I'll let you know when and if...Although, perhaps just better to ask specific questions than dumb the entire code on you. Or let you review one function, and copy that approach for the rest.

I'll make a thread on the ggOceanMaps forum and tag you, when I need help.

For my part, this thread can be closed. The performance of st_is_valid is impressive in sf v0.9.9 with GEOS 3.9.1.

@edzer edzer closed this as completed May 31, 2021
edzer added a commit that referenced this issue Aug 24, 2021
moves the [.sfc logic to c++; fixes the case where reason=TRUE and exceptions arise
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants