st_is_valid performance #1666

pbaylis · 2021-05-01T20:13:50Z

This is may be a dumb question, but I was surprised to find that st_is_valid not only runs more slowly than st_make_valid, but it also scales non-linearly as the number of features increases.

I include a MWE demonstrating the two observations below (I ran this with sf_0.9-8). For my actual use case, I have around 11m features, and at least one of them is causing some sort of downstream validity problem when I try to join or crop to other datasets. Unfortunately it's not one of the problems that st_make_valid can resolve (invalid number of points exception), so I'm reliant on st_is_valid, which hasn't been able to complete in a reasonable amount of time.

I know these functions are reliant on backend code, so maybe this is not a resolvable problem, but if anyone has any ideas for improving performance here I'd be very interested. Also happy to relocate this to StackOverflow if it's more appropriately asked there.

library(sf)
library(dplyr)
library(tictoc)

b0 <- st_polygon(list(rbind(c(-1,-1), c(1,-1), c(1,1), c(-1,1), c(-1,-1))))
b1 <- st_polygon(list(rbind(c(-1,-1), c(1,-1), c(1,1), c(0,-1), c(-1,-1))))

b <- st_sfc(b0, b1) %>% st_sf()

tic()
invisible(st_is_valid(b %>% sample_n(5000, replace = T)))
toc()
# 1.2s on my laptop            

tic()
invisible(st_make_valid(b %>% sample_n(5000, replace = T)))
toc()
# 0.7s s on my laptop

tic()
invisible(st_is_valid(b %>% sample_n(50000, replace = T)))
toc()
# 34s on my laptop            

tic()
invisible(st_make_valid(b %>% sample_n(50000, replace = T)))
toc()
# 6.5s on my laptop

The text was updated successfully, but these errors were encountered:

edzer · 2021-05-01T20:30:49Z

As you suggested, work here does not involve actual implementing these functions, but interfacing existing implementations. Hopeful news is that there's both activity in this area going on in GEOS (3.10-dev) and the R s2 package (for spherical validity).

dbaston · 2021-05-03T00:24:18Z

With trivial inputs, the timing is dominated by R overhead rather than the underlying implementations. in the 34s example, less than 10% of the time is spent in GEOS. It looks like a (primary?) difference here is that st_is_valid uses the subsetting operator for an sfc, which does a bit of work, whereas st_make_valid does not.

R overhead aside, you're unlikely to see MakeValid out-perform IsValid, because the first step of MakeValid is to check IsValid. If you have a geometry that is performing especially poorly, your best bet is to make sure you have GEOS 3.9 and open an issue with https://github.com/libgeos/geos.

dbaston · 2021-05-03T01:01:13Z

Looks like you can bypass a lot of the R overhead with st_is_valid(x, NA_on_exception = FALSE), which pushes the iteration from R into C++:

invisible(st_is_valid(b %>% sample_n(50000, replace = T), NA_on_exception = TRUE)) # 34s
invisible(st_is_valid(b %>% sample_n(50000, replace = T), NA_on_exception = FALSE)) # 3.7s

It seems like the culprit is the use of unclass in [.sfc but I don't know any way around it.

pbaylis · 2021-05-03T15:59:31Z

Interesting. I thought this solved my problem, but now I realize that it'll throw an error. Still, it's helpful, so thank you. For my purposes, I'm can run this on subsets of the data anyway and can tryCatch using to only use the NA_on_exception = TRUE call when needed.

edzer · 2021-05-03T20:24:49Z

Oops I didn't mean to push this against the master branch. It think that this will increas the dependency of sf to GEOS 3.5.0 rather than 3.4.0 as now stated, as it requires the error handlers to be set IIRC, but that might not be a bad idea anyway.

library(sf)
# Linking to GEOS 3.9.0, GDAL 3.2.1, PROJ 7.2.1
library(dplyr)

# Attaching package: ‘dplyr’

# The following objects are masked from ‘package:stats’:

#     filter, lag

# The following objects are masked from ‘package:base’:

#     intersect, setdiff, setequal, union

library(tictoc)

b0 <- st_polygon(list(rbind(c(-1,-1), c(1,-1), c(1,1), c(-1,1), c(-1,-1))))
b1 <- st_polygon(list(rbind(c(-1,-1), c(1,-1), c(1,1), c(0,-1), c(-1,-1))))

b <- st_sfc(b0, b1) %>% st_sf()

tic()
invisible(st_is_valid(b %>% sample_n(5000, replace = T)))
toc()
# 0.086 sec elapsed
# 1.2s on my laptop            

tic()
invisible(st_make_valid(b %>% sample_n(5000, replace = T)))
toc()
# 0.485 sec elapsed
# 0.7s s on my laptop

tic()
invisible(st_is_valid(b %>% sample_n(50000, replace = T)))
toc()
# 0.777 sec elapsed
# 34s on my laptop            

tic()
invisible(st_make_valid(b %>% sample_n(50000, replace = T)))
toc()
# 4.399 sec elapsed
# 6.5s on my laptop

dbaston · 2021-05-04T02:20:31Z

It would really be nice to find some alternative to unclass so that an entire sfc is not copied each time you access an element (this must come up elsewhere, not just in st_is_valid).

edzer · 2021-05-04T06:57:03Z

Hmm. Without doubt it would be faster to call list(x[[i]]) instead of x[i], but what do you know about assumptions made later on e.g. that the object passed has a valid double attribute called precision.

MikkoVihtakari · 2021-05-29T12:26:19Z

I also noticed that for a large polygon, sf::st_is_valid is about 10 times slower than a conversion to sp object and rgeos::gIsValid together:

system.time(sf::st_is_valid(pol))
#>  user  system elapsed 
#> 44.203   5.087  49.952 

system.time(rgeos::gIsValid(sf::as_Spatial(pol)))
#>   user  system elapsed 
#>  4.210   0.338   4.604

Am I missing something here?

For a fully reproducible example, download the ETOPO1_Ice_g_gmt4.grd.gz grid from here and:

devtools::install_github("MikkoVihtakari/ggOceanMapsData") # required by ggOceanMaps
devtools::install_github("MikkoVihtakari/ggOceanMaps")

library(ggOceanMaps)

bathy <- ggOceanMaps::raster_bathymetry(
  bathy =  "pathToTheETOPO1Grid", 
  depths = c(50, 300, 500, 1000, 1500, 2000, 4000, 6000, 10000), 
  proj.out = "EPSG:3995", 
  boundary = c(-180.0083, 180.0083, 30, 90), 
  aggregation.factor = 2
)

pol <- sf::st_as_sf(stars::st_as_stars(bathy$raster), as_points = FALSE, merge = TRUE)

edzer · 2021-05-29T17:59:23Z

@MikkoVihtakari which versions of GEOS do sf and rgeos link to?

MikkoVihtakari · 2021-05-29T21:01:06Z

@edzer

> library(sf)
Linking to GEOS 3.8.1, GDAL 3.2.1, PROJ 7.2.1
> library(rgeos)
Loading required package: sp
rgeos version: 0.5-5, (SVN revision 640)
 GEOS runtime version: 3.8.1-CAPI-1.13.3 
 Linking to sp version: 1.4-5 
 Polygon checking: TRUE

edzer · 2021-05-30T10:05:35Z

I'm seeing this:

> system.time(sf::st_is_valid(pol))
   user  system elapsed 
  0.573   0.001   0.574 
> #>  user  system elapsed 
> #> 44.203   5.087  49.952 
> 
> system.time(rgeos::gIsValid(sf::as_Spatial(pol)))
   user  system elapsed 
  4.785   0.000   4.786

with

> library(sf)
liLinking to GEOS 3.9.0, GDAL 3.2.1, PROJ 7.2.1
> library(rgeos)
Loading required package: sp
rgeos version: 0.5-5, (SVN revision 640)
 GEOS runtime version: 3.9.0-CAPI-1.16.2 
 Linking to sp version: 1.4-5 
 Polygon checking: TRUE

But wjy would you check the validity of polygons derived from a regular raster?

MikkoVihtakari · 2021-05-30T12:18:13Z

Thank you for the great support @edzer. It indeed seems that the problems are on my machine/coding as I expected. Should probably ask this in another thread, but how do I make sf to use the newest GEOS on Mac?

I have updated GEOS:

brew reinstall geos
==> Downloading https://homebrew.bintray.com/bottles/geos-3.9.1.big_sur.bottle.t
==> Reinstalling geos 
==> Pouring geos-3.9.1.big_sur.bottle.tar.gz
🍺  /usr/local/Cellar/geos/3.9.1: 444 files, 10.2MB

I have tried all of the following:

install.packages("sf")
install.packages("sf", configure.args = "--with-proj-lib=/usr/local/lib/")
install.packages("sf", configure.args = "--with-proj-lib=/usr/local/Cellar/geos/3.9.1/")

And always get:

library(sf)
#> Linking to GEOS 3.8.1, GDAL 3.2.1, PROJ 7.2.1

But wjy would you check the validity of polygons derived from a regular raster?

Because:

all(sf::st_is_valid(pol))
#> [1] FALSE

MikkoVihtakari · 2021-05-30T12:39:16Z

Nevermind the question above. devtools::install_github("r-spatial/sf", configure.args = "--with-proj-lib=/usr/local/lib/") did the trick. Now I get:

system.time(sf::st_is_valid(pol))
#   user  system elapsed 
# 0.862   0.023   0.894

Does this mean that this operation in the CRAN version is still 10 times slower in sf than rgeos?

I think I'll begin to move the ggOceanMaps package from the old GIS packages to sf (see here: MikkoVihtakari/ggOceanMaps#5). During this process, I may have to ask some technical questions. Would anyone over here be willing to review the code once I am done? I am sure there are smarter ways to write certain parts than I can come up with...

edzer · 2021-05-30T13:10:49Z

Because:

all(sf::st_is_valid(pol))
#> [1] FALSE

Ah, of course, I missed the merge = TRUE.

Would anyone over here be willing to review the code once I am done?

In principle yes, depends on the amount of code and the length of my backlogs at the time you'll ask.

MikkoVihtakari · 2021-05-31T05:13:01Z

Thanks @edzer. I'll let you know when and if...Although, perhaps just better to ask specific questions than dumb the entire code on you. Or let you review one function, and copy that approach for the rest.

I'll make a thread on the ggOceanMaps forum and tag you, when I need help.

For my part, this thread can be closed. The performance of st_is_valid is impressive in sf v0.9.9 with GEOS 3.9.1.

moves the [.sfc logic to c++; fixes the case where reason=TRUE and exceptions arise

edzer added a commit that referenced this issue May 3, 2021

addresses #1666

4911f5a

dbaston mentioned this issue May 4, 2021

Remove unclass call from [.sfc #1673

Merged

MikkoVihtakari mentioned this issue May 30, 2021

Interested in the development of ggOceanMaps? dankelley/oce#1834

Closed

edzer closed this as completed May 31, 2021

This was referenced Aug 17, 2021

st_is_valid returns NA for all geometries when encountering an exception from GEOS when processing any single geometry #1760

Closed

performance of st_is_valid (for projected coordinates) #1761

Closed

edzer added a commit that referenced this issue Aug 17, 2021

revert "fix" to #1666; fixes #1760; continued in #1761

f1397e3

edzer added a commit that referenced this issue Aug 24, 2021

fix #1760 #1761; of interest to #1666

dfeee48

moves the [.sfc logic to c++; fixes the case where reason=TRUE and exceptions arise

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

st_is_valid performance #1666

st_is_valid performance #1666

pbaylis commented May 1, 2021 •

edited

Loading

edzer commented May 1, 2021

dbaston commented May 3, 2021

dbaston commented May 3, 2021

pbaylis commented May 3, 2021

edzer commented May 3, 2021

dbaston commented May 4, 2021

edzer commented May 4, 2021

MikkoVihtakari commented May 29, 2021 •

edited

Loading

edzer commented May 29, 2021

MikkoVihtakari commented May 29, 2021

edzer commented May 30, 2021

MikkoVihtakari commented May 30, 2021

MikkoVihtakari commented May 30, 2021 •

edited

Loading

edzer commented May 30, 2021

MikkoVihtakari commented May 31, 2021 •

edited

Loading

st_is_valid performance #1666

st_is_valid performance #1666

Comments

pbaylis commented May 1, 2021 • edited Loading

edzer commented May 1, 2021

dbaston commented May 3, 2021

dbaston commented May 3, 2021

pbaylis commented May 3, 2021

edzer commented May 3, 2021

dbaston commented May 4, 2021

edzer commented May 4, 2021

MikkoVihtakari commented May 29, 2021 • edited Loading

edzer commented May 29, 2021

MikkoVihtakari commented May 29, 2021

edzer commented May 30, 2021

MikkoVihtakari commented May 30, 2021

MikkoVihtakari commented May 30, 2021 • edited Loading

edzer commented May 30, 2021

MikkoVihtakari commented May 31, 2021 • edited Loading

pbaylis commented May 1, 2021 •

edited

Loading

MikkoVihtakari commented May 29, 2021 •

edited

Loading

MikkoVihtakari commented May 30, 2021 •

edited

Loading

MikkoVihtakari commented May 31, 2021 •

edited

Loading