Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance of different polygonizing methods analyzed #13

Closed
adrfantini opened this issue Nov 26, 2017 · 37 comments
Closed

Performance of different polygonizing methods analyzed #13

adrfantini opened this issue Nov 26, 2017 · 37 comments

Comments

@adrfantini
Copy link

adrfantini commented Nov 26, 2017

From #12 I performed some tests on the performance of different polygonizing methods for raster data, both for stars and for raster data.

The input files are the same used in the first blog post about stars, that is the NetCDF file from ftp://ftp.dwd.de/pub/data/gpcc/full_data_daily_V1/full_data_daily_2013.nc.gz.

Code:

library(microbenchmark)
library(ggplot2)

inputFile <- "full_data_daily_2013.nc" #Same data used in star's first blog post:http://r-spatial.org/r/2017/11/23/stars1.html#reading-a-raster-time-series-netcdf

r <- raster::raster(inputFile, varname="p")
s <- raster::stack(inputFile, varname="p", bands=1:2)

r2 <- stars::st_as_stars(r)
s2 <- stars::st_as_stars(s)

mb <- microbenchmark(times=5,
    tr1 <- spex::polygonize(r, na.rm=TRUE),
    tr2 <- raster::rasterToPolygons(r, na.rm=TRUE),
    tr3 <- as(raster::rasterToPolygons(r, na.rm=TRUE), "sf"),
    tr4 <- sf::st_as_sf(r2, as_points = FALSE, na.rm=TRUE),
    tr5 <- stars::st_xy2sfc(r2, as_points = FALSE, na.rm=TRUE),
    ts1 <- spex::polygonize(s, na.rm=TRUE),
    ts2 <- raster::rasterToPolygons(s, na.rm=TRUE),
    ts3 <- as(raster::rasterToPolygons(s, na.rm=TRUE), "sf"),
    ts4 <- sf::st_as_sf(s2, as_points = FALSE, na.rm=TRUE),
    ts5 <- stars::st_xy2sfc(s2, as_points = FALSE, na.rm=TRUE),
    fr1 <- spex::polygonize(r, na.rm=FALSE),
    fr2 <- raster::rasterToPolygons(r, na.rm=FALSE),
    fr3 <- as(raster::rasterToPolygons(r, na.rm=FALSE), "sf"),
    fr4 <- sf::st_as_sf(r2, as_points = FALSE, na.rm=FALSE),
    fr5 <- stars::st_xy2sfc(r2, as_points = FALSE, na.rm=FALSE),
    fs1 <- spex::polygonize(s, na.rm=FALSE),
    fs2 <- raster::rasterToPolygons(s, na.rm=FALSE),
    fs3 <- as(raster::rasterToPolygons(s, na.rm=FALSE), "sf"),
    fs4 <- sf::st_as_sf(s2, as_points = FALSE, na.rm=FALSE),
    fs5 <- stars::st_xy2sfc(s2, as_points = FALSE, na.rm=FALSE)
)

mb$expr <- factor(mb$expr, levels=(as_tibble(mb) %>% group_by(expr) %>% summarise(meantime=mean(time)) %>% arrange(meantime))$expr)
mb$na.rm <- FALSE
mb$na.rm[grep("na.rm = TRUE", mb$expr)] <- TRUE
gg <- ggplot(mb, aes(x=time/1000000000, y=expr, color=na.rm)) +
    geom_point() +
    scale_y_discrete("", position = "right") +
    scale_x_continuous("Seconds", expand=c(0.02, 0)) +
    ggtitle("Timings of different methods for polygonizing raster data in R, 5 iterations each")
ggsave(gg, file="timings.png", h=5, w=10)

Result:
Result:

spex::polygonize is by far the fastest. The methods from sf and stars also are missing na.rm so they take a performance hit on a dataset with many NAs, but in any case they do not outperform the other methods. Note that each different function outputs slightly different data: some output SpatialPolygonsDataFrame, some sf and some stars objects.
Overall spex::polygonize is not only the fastest but also the most convenient, outputting a clean simple features collection with a different column for each band (timestep).

Note that, as it should be, polygonizing a multi-band file (s, s2) has almost no performance hit over a single-band file (r, r2).

@edzer
Copy link
Member

edzer commented Nov 26, 2017

Thanks -- helpful!!

@adrfantini
Copy link
Author

adrfantini commented Mar 13, 2018

@mdsumner
Would like to point out that gdal_polygonize.py, for some reason, is still ~4-5x faster than spex::polygonize, at least for this one file (only the first timestep). I do not know why I did not include that in the benchmark back then.

@adrfantini
Copy link
Author

@mdsumner but note that gdal_polygonize.py makes use of GDALPolygonize, which truncates floats into ints. There is a GDALFPolygonize function, but it is not used AFAIK.

@edzer
Copy link
Member

edzer commented Mar 20, 2018

Looks like a candidate (or two?) to be added to sf::gdal_utils, seems to follow the same pattern.

@adrfantini
Copy link
Author

Indeed, it would be awesome. Two candidates, I would say. The integer one is very useful for categorical variables, the float version is a lot slower. It's easy enough to modify gdal_polygonize.py to output floats instead, to test: the float version used 10x the disk size and took ~15x times the computation time, for a small test file at least.

@edzer
Copy link
Member

edzer commented Sep 21, 2018

GDALPolygonize is now the default in st_as_sf for stars objects, if possible (raster, regular grid)

@edzer
Copy link
Member

edzer commented Sep 21, 2018

Two examples here.

@adrfantini
Copy link
Author

This really should be a huge improvement to my workflow. I'll do some test against spex::polygonize %>% group_by %>% summarise but this approach should be much faster and cleaner. Thanks!

@edzer
Copy link
Member

edzer commented Sep 21, 2018

It currently writes to disk before polygonizing; I was contemplating writing to a memory driver instead, but haven't sorted out yet how. Not sure if it would matter either.

@adrfantini
Copy link
Author

You could just use /dev/shm, on Linux at least.

@edzer
Copy link
Member

edzer commented Sep 21, 2018

Yeah, but operating systems also use memory to buffer/cache files, so that should not be needed (I hope).

@adrfantini
Copy link
Author

Yep, was suggesting just a temporary workaround for testing purposes.

edzer added a commit that referenced this issue Sep 21, 2018
@edzer
Copy link
Member

edzer commented Sep 21, 2018

st_rasterize also needs an sf update (r-spatial/sf@17d1028).

I thought about squeezing this into st_as_stars, but decided against: polygonizing a raster is lossless, and st_as_sf means going to simple features. st_as_stars doesn't mean going to rasters, and rasterizing is lossy.

Haven't tried out the options argument, might as well work.

@adrfantini
Copy link
Author

I'm getting a:

Error in gdal_polygonize(x, mask, use_integer = use_integer, geotransform = get_geotransform(x)) : 
  could not find function "gdal_polygonize"
sf_extSoftVersion()
          GEOS           GDAL         proj.4 GDAL_with_GEOS 
       "3.5.1"        "2.2.2"        "4.9.2"         "true"

@edzer
Copy link
Member

edzer commented Sep 21, 2018

Reinstall sf, using devtools::install_github("r-spatial/sf@polygonize")

edzer added a commit that referenced this issue Sep 21, 2018
@adrfantini
Copy link
Author

adrfantini commented Sep 21, 2018

The new implementation is 2x faster than spex::polygonize, but the latter does not merge identical polygons, so in total this is 10x faster than spex::polygonize %>% group_by %>% summarise. However, the latter works for multi-layer raster files (returning a different column for each), while st_as_sf only seems to polygonize one layer (i'm not sure which, I need more testing).

EDIT: this is on an oldish i7 machine with only rotational drives.

@edzer
Copy link
Member

edzer commented Sep 21, 2018

Thanks. Yes, I chose to make this only work for a single layer, because cells are merged and this would happen differently for every layer: the input of GDALPolygonize is a single raster band.

@adrfantini
Copy link
Author

Yes, I agree with this. Should output a warning tho.

@adrfantini
Copy link
Author

Also a warning for the fact that it only polygonizes the first variable, other than the first 'layer', would be necessary.

@adrfantini
Copy link
Author

I did another test on a different dataset. The st_as_sf method took 164ms, the spex::polygonize %>% group_by %>% summarise method took 34295ms.

~200x improvement!

The slow part is group_by %>% summarise by the way, it takes ~31 of the 34 seconds of the slow method.

This is on a modern laptop with a decent SSD.

@mdsumner
Copy link
Member

mdsumner commented Sep 23, 2018

Is this point-interpretation (like contour,filled.contour) or area (like rasterToPolygons, spex::polygonize)? Can that be controlled with GDAL? I will check it out but keen to discuss a bit.

@edzer
Copy link
Member

edzer commented Sep 23, 2018

Description here, the only controllable option I can see is

  • 8CONNECTED: May be set to 8 to use 8 connectedness. Otherwise 4 connectedness will be applied to the algorithm

You'd set it by passing options = "8CONNECTED=8". I haven't sorted out what this means, but my feeling goes into when identical valued cells are grouped together (like queen or rook?).

@adrfantini
Copy link
Author

adrfantini commented Sep 23, 2018

@edzer you are correct.
@mdsumner gdal_polygonize does what you call 'area-interpretation'.

@mdsumner
Copy link
Member

Yes, thanks! I just did some quick checks with gdal_polygonize.py - it's definitely a different thing. FWIW, I have explored the code under filled.contour and would like to bring it out for the point-interpretation, I often need it and there are ways to achieve it currently but it's pretty painful. Unioning the data structures from filled.contour with sf is probably promising (it's not exactly related to this issue here, but I wanted to air it to make sure). It's also the difference between rgl::shade3d and rgl::rgl.surface, I've recently figured out.

I'm impressed that it's so fast, spex::polygonize can go in the bin, finally :)

@edzer
Copy link
Member

edzer commented Sep 23, 2018

I agree, it would be extremely cool if we could get contour polygons out, in addition to those following the pixel boundaries. Since stars already has a field indicating whether pixels refer to points or areas, we could use that to guide the default.

@adrfantini
Copy link
Author

@mdsumner spex::polygonize still has its uses, it is still much faster than st_as_sf when not going through GDAL_polygonize, i.e. with multivariate / multilayer files.

@edzer
Copy link
Member

edzer commented Sep 28, 2018

It seems that GDALContourGenerateEx has an option to create contour polygons, rather than polygon lines. Worth exploring?

@mdsumner
Copy link
Member

oh gosh - I've been scratching around - thought to search GDAL for marching squares while I was marching around but didn't actually do it! I've learnt enough about the basic problem, this looks great!

@mdsumner
Copy link
Member

Oh, it's brand new - looks like -p only since 2.4 https://gdal.org/gdal_contour.html

@edzer
Copy link
Member

edzer commented Sep 28, 2018

Indeed, fresh from the needle!

edzer added a commit to r-spatial/sf that referenced this issue Sep 28, 2018
edzer added a commit to r-spatial/sf that referenced this issue Sep 28, 2018
edzer added a commit that referenced this issue Sep 28, 2018
edzer added a commit to r-spatial/sf that referenced this issue Sep 29, 2018
@edzer
Copy link
Member

edzer commented Sep 29, 2018

x

@mdsumner
Copy link
Member

That is so great, it's going to be really powerful!

edzer added a commit to r-spatial/sf that referenced this issue Sep 29, 2018
@adrfantini
Copy link
Author

Is this much faster than cutting then polygonizing with st_as_sf?

@edzer
Copy link
Member

edzer commented Sep 29, 2018

How does this look?

library(stars)
# Loading required package: abind
# Loading required package: sf
# Linking to GEOS 3.5.1, GDAL 2.4.0dev-04296c6, PROJ 5.1.0
tif = system.file("tif/L7_ETMs.tif", package = "stars")
x = read_stars(tif)[,1:100,1:100,6] # 100 x 100 area
summary(x[[1]])
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#   11.00   32.00   41.00   49.09   59.00  255.00 
breaks = c(0, 70, 90, 255)
system.time(s1 <- st_as_sf(cut(x, breaks = breaks)))
#    user  system elapsed 
#   0.090   0.004   0.093 
system.time(s2 <- st_as_sf(x, as_points = TRUE, breaks = breaks))
#    user  system elapsed 
#    0.38    0.00    0.38 

s1:
s1
s2:
s2

@adrfantini
Copy link
Author

adrfantini commented Sep 30, 2018

Personally I am not a fan of contouring gridded data, it seems an unnecessary tinkering that can sometimes lead to misunderstandings (e.g.: it hides what the real resolution of the data is).
However I realize I am a minority in believing this, so I see this as a very welcome addition for most people.
@edzer, do you want to keep this open or should I close?

@edzer
Copy link
Member

edzer commented Sep 30, 2018

Thanks!

I think we need to distinguish between resolution (# of pixels per area or length unit) and support (the size of the element a pixel refers to). stars objects have a field for support for each dimension, called point: TRUE if the pixel values refer to point locations (e.g. a DEM), FALSE if they are constant or aggregates over the whole pixel, NA if unknown. GDAL has a metadata field called AREA_OR_POINT, which stars reads. In case we have points, contouring makes more sense than polygonizing the grid cells. This all needs to be written down better (in the data model vignette?), but is IMO a pretty essential part of stars' deisgn. In the above, I'm planning to make this automatic, and but raise an error (i.e., user is required to set as_points) in case the points field is NA.

Yes, OK to close here.

@adrfantini
Copy link
Author

I'm used to dealing with climate models (so the value is an aggregate over the pixel area) but I see your point, makes a lot of sense. Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants