Performance of different polygonizing methods analyzed #13

adrfantini · 2017-11-26T19:29:36Z

From #12 I performed some tests on the performance of different polygonizing methods for raster data, both for stars and for raster data.

The input files are the same used in the first blog post about stars, that is the NetCDF file from ftp://ftp.dwd.de/pub/data/gpcc/full_data_daily_V1/full_data_daily_2013.nc.gz.

Code:

library(microbenchmark)
library(ggplot2)

inputFile <- "full_data_daily_2013.nc" #Same data used in star's first blog post:http://r-spatial.org/r/2017/11/23/stars1.html#reading-a-raster-time-series-netcdf

r <- raster::raster(inputFile, varname="p")
s <- raster::stack(inputFile, varname="p", bands=1:2)

r2 <- stars::st_as_stars(r)
s2 <- stars::st_as_stars(s)

mb <- microbenchmark(times=5,
    tr1 <- spex::polygonize(r, na.rm=TRUE),
    tr2 <- raster::rasterToPolygons(r, na.rm=TRUE),
    tr3 <- as(raster::rasterToPolygons(r, na.rm=TRUE), "sf"),
    tr4 <- sf::st_as_sf(r2, as_points = FALSE, na.rm=TRUE),
    tr5 <- stars::st_xy2sfc(r2, as_points = FALSE, na.rm=TRUE),
    ts1 <- spex::polygonize(s, na.rm=TRUE),
    ts2 <- raster::rasterToPolygons(s, na.rm=TRUE),
    ts3 <- as(raster::rasterToPolygons(s, na.rm=TRUE), "sf"),
    ts4 <- sf::st_as_sf(s2, as_points = FALSE, na.rm=TRUE),
    ts5 <- stars::st_xy2sfc(s2, as_points = FALSE, na.rm=TRUE),
    fr1 <- spex::polygonize(r, na.rm=FALSE),
    fr2 <- raster::rasterToPolygons(r, na.rm=FALSE),
    fr3 <- as(raster::rasterToPolygons(r, na.rm=FALSE), "sf"),
    fr4 <- sf::st_as_sf(r2, as_points = FALSE, na.rm=FALSE),
    fr5 <- stars::st_xy2sfc(r2, as_points = FALSE, na.rm=FALSE),
    fs1 <- spex::polygonize(s, na.rm=FALSE),
    fs2 <- raster::rasterToPolygons(s, na.rm=FALSE),
    fs3 <- as(raster::rasterToPolygons(s, na.rm=FALSE), "sf"),
    fs4 <- sf::st_as_sf(s2, as_points = FALSE, na.rm=FALSE),
    fs5 <- stars::st_xy2sfc(s2, as_points = FALSE, na.rm=FALSE)
)

mb$expr <- factor(mb$expr, levels=(as_tibble(mb) %>% group_by(expr) %>% summarise(meantime=mean(time)) %>% arrange(meantime))$expr)
mb$na.rm <- FALSE
mb$na.rm[grep("na.rm = TRUE", mb$expr)] <- TRUE
gg <- ggplot(mb, aes(x=time/1000000000, y=expr, color=na.rm)) +
    geom_point() +
    scale_y_discrete("", position = "right") +
    scale_x_continuous("Seconds", expand=c(0.02, 0)) +
    ggtitle("Timings of different methods for polygonizing raster data in R, 5 iterations each")
ggsave(gg, file="timings.png", h=5, w=10)

Result:

spex::polygonize is by far the fastest. The methods from sf and stars also are missing na.rm so they take a performance hit on a dataset with many NAs, but in any case they do not outperform the other methods. Note that each different function outputs slightly different data: some output SpatialPolygonsDataFrame, some sf and some stars objects.
Overall spex::polygonize is not only the fastest but also the most convenient, outputting a clean simple features collection with a different column for each band (timestep).

Note that, as it should be, polygonizing a multi-band file (s, s2) has almost no performance hit over a single-band file (r, r2).

The text was updated successfully, but these errors were encountered:

edzer · 2017-11-26T20:19:37Z

Thanks -- helpful!!

adrfantini · 2018-03-13T21:18:26Z

@mdsumner
Would like to point out that gdal_polygonize.py, for some reason, is still ~4-5x faster than spex::polygonize, at least for this one file (only the first timestep). I do not know why I did not include that in the benchmark back then.

adrfantini · 2018-03-20T13:28:50Z

@mdsumner but note that gdal_polygonize.py makes use of GDALPolygonize, which truncates floats into ints. There is a GDALFPolygonize function, but it is not used AFAIK.

edzer · 2018-03-20T13:33:09Z

Looks like a candidate (or two?) to be added to sf::gdal_utils, seems to follow the same pattern.

adrfantini · 2018-03-20T13:50:04Z

Indeed, it would be awesome. Two candidates, I would say. The integer one is very useful for categorical variables, the float version is a lot slower. It's easy enough to modify gdal_polygonize.py to output floats instead, to test: the float version used 10x the disk size and took ~15x times the computation time, for a small test file at least.

edzer · 2018-09-21T11:00:52Z

GDALPolygonize is now the default in st_as_sf for stars objects, if possible (raster, regular grid)

edzer · 2018-09-21T11:36:13Z

Two examples here.

adrfantini · 2018-09-21T11:52:41Z

This really should be a huge improvement to my workflow. I'll do some test against spex::polygonize %>% group_by %>% summarise but this approach should be much faster and cleaner. Thanks!

edzer · 2018-09-21T11:57:10Z

It currently writes to disk before polygonizing; I was contemplating writing to a memory driver instead, but haven't sorted out yet how. Not sure if it would matter either.

adrfantini · 2018-09-21T11:58:06Z

You could just use /dev/shm, on Linux at least.

edzer · 2018-09-21T11:59:33Z

Yeah, but operating systems also use memory to buffer/cache files, so that should not be needed (I hope).

adrfantini · 2018-09-21T12:44:06Z

Yep, was suggesting just a temporary workaround for testing purposes.

edzer · 2018-09-21T15:13:14Z

st_rasterize also needs an sf update (r-spatial/sf@17d1028).

I thought about squeezing this into st_as_stars, but decided against: polygonizing a raster is lossless, and st_as_sf means going to simple features. st_as_stars doesn't mean going to rasters, and rasterizing is lossy.

Haven't tried out the options argument, might as well work.

adrfantini · 2018-09-21T15:46:17Z

I'm getting a:

Error in gdal_polygonize(x, mask, use_integer = use_integer, geotransform = get_geotransform(x)) : 
  could not find function "gdal_polygonize"

sf_extSoftVersion()
          GEOS           GDAL         proj.4 GDAL_with_GEOS 
       "3.5.1"        "2.2.2"        "4.9.2"         "true"

edzer · 2018-09-21T15:57:48Z

Reinstall sf, using devtools::install_github("r-spatial/sf@polygonize")

adrfantini · 2018-09-21T16:35:24Z

The new implementation is 2x faster than spex::polygonize, but the latter does not merge identical polygons, so in total this is 10x faster than spex::polygonize %>% group_by %>% summarise. However, the latter works for multi-layer raster files (returning a different column for each), while st_as_sf only seems to polygonize one layer (i'm not sure which, I need more testing).

EDIT: this is on an oldish i7 machine with only rotational drives.

edzer · 2018-09-21T17:00:58Z

Thanks. Yes, I chose to make this only work for a single layer, because cells are merged and this would happen differently for every layer: the input of GDALPolygonize is a single raster band.

adrfantini · 2018-09-21T18:16:17Z

Yes, I agree with this. Should output a warning tho.

adrfantini · 2018-09-23T08:02:52Z

Also a warning for the fact that it only polygonizes the first variable, other than the first 'layer', would be necessary.

adrfantini · 2018-09-23T08:16:10Z

I did another test on a different dataset. The st_as_sf method took 164ms, the spex::polygonize %>% group_by %>% summarise method took 34295ms.

~200x improvement!

The slow part is group_by %>% summarise by the way, it takes ~31 of the 34 seconds of the slow method.

This is on a modern laptop with a decent SSD.

mdsumner · 2018-09-23T08:44:58Z

Is this point-interpretation (like contour,filled.contour) or area (like rasterToPolygons, spex::polygonize)? Can that be controlled with GDAL? I will check it out but keen to discuss a bit.

edzer · 2018-09-23T08:55:35Z

Description here, the only controllable option I can see is

8CONNECTED: May be set to 8 to use 8 connectedness. Otherwise 4 connectedness will be applied to the algorithm

You'd set it by passing options = "8CONNECTED=8". I haven't sorted out what this means, but my feeling goes into when identical valued cells are grouped together (like queen or rook?).

adrfantini · 2018-09-23T09:06:04Z

@edzer you are correct.
@mdsumner gdal_polygonize does what you call 'area-interpretation'.

mdsumner · 2018-09-23T09:34:08Z

Yes, thanks! I just did some quick checks with gdal_polygonize.py - it's definitely a different thing. FWIW, I have explored the code under filled.contour and would like to bring it out for the point-interpretation, I often need it and there are ways to achieve it currently but it's pretty painful. Unioning the data structures from filled.contour with sf is probably promising (it's not exactly related to this issue here, but I wanted to air it to make sure). It's also the difference between rgl::shade3d and rgl::rgl.surface, I've recently figured out.

I'm impressed that it's so fast, spex::polygonize can go in the bin, finally :)

edzer · 2018-09-23T09:40:01Z

I agree, it would be extremely cool if we could get contour polygons out, in addition to those following the pixel boundaries. Since stars already has a field indicating whether pixels refer to points or areas, we could use that to guide the default.

adrfantini · 2018-09-24T12:24:48Z

@mdsumner spex::polygonize still has its uses, it is still much faster than st_as_sf when not going through GDAL_polygonize, i.e. with multivariate / multilayer files.

edzer · 2018-09-28T13:49:49Z

It seems that GDALContourGenerateEx has an option to create contour polygons, rather than polygon lines. Worth exploring?

mdsumner · 2018-09-28T14:00:25Z

oh gosh - I've been scratching around - thought to search GDAL for marching squares while I was marching around but didn't actually do it! I've learnt enough about the basic problem, this looks great!

mdsumner · 2018-09-28T14:03:33Z

Oh, it's brand new - looks like -p only since 2.4 https://gdal.org/gdal_contour.html

edzer · 2018-09-28T14:46:22Z

Indeed, fresh from the needle!

edzer · 2018-09-29T08:17:44Z

mdsumner · 2018-09-29T08:33:57Z

That is so great, it's going to be really powerful!

adrfantini · 2018-09-29T12:49:28Z

Is this much faster than cutting then polygonizing with st_as_sf?

edzer · 2018-09-29T14:11:30Z

How does this look?

library(stars)
# Loading required package: abind
# Loading required package: sf
# Linking to GEOS 3.5.1, GDAL 2.4.0dev-04296c6, PROJ 5.1.0
tif = system.file("tif/L7_ETMs.tif", package = "stars")
x = read_stars(tif)[,1:100,1:100,6] # 100 x 100 area
summary(x[[1]])
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#   11.00   32.00   41.00   49.09   59.00  255.00 
breaks = c(0, 70, 90, 255)
system.time(s1 <- st_as_sf(cut(x, breaks = breaks)))
#    user  system elapsed 
#   0.090   0.004   0.093 
system.time(s2 <- st_as_sf(x, as_points = TRUE, breaks = breaks))
#    user  system elapsed 
#    0.38    0.00    0.38

s1:

s2:

adrfantini · 2018-09-30T12:19:15Z

Personally I am not a fan of contouring gridded data, it seems an unnecessary tinkering that can sometimes lead to misunderstandings (e.g.: it hides what the real resolution of the data is).
However I realize I am a minority in believing this, so I see this as a very welcome addition for most people.
@edzer, do you want to keep this open or should I close?

edzer · 2018-09-30T12:45:24Z

Thanks!

I think we need to distinguish between resolution (# of pixels per area or length unit) and support (the size of the element a pixel refers to). stars objects have a field for support for each dimension, called point: TRUE if the pixel values refer to point locations (e.g. a DEM), FALSE if they are constant or aggregates over the whole pixel, NA if unknown. GDAL has a metadata field called AREA_OR_POINT, which stars reads. In case we have points, contouring makes more sense than polygonizing the grid cells. This all needs to be written down better (in the data model vignette?), but is IMO a pretty essential part of stars' deisgn. In the above, I'm planning to make this automatic, and but raise an error (i.e., user is required to set as_points) in case the points field is NA.

Yes, OK to close here.

adrfantini · 2018-09-30T12:53:22Z

I'm used to dealing with climate models (so the value is an aggregate over the pixel area) but I see your point, makes a lot of sense. Thanks.

This was referenced Nov 26, 2017

Errors reading NetCDF files coming from climate model #12

Closed

polygonize.stars mdsumner/spex#11

Closed

This was referenced Mar 22, 2018

Faster rasterToPolygons() ecohealthalliance/fasterize#12

Closed

Add geom_stars for ggplot plotting #21

Closed

edzer added a commit to r-spatial/sf that referenced this issue Sep 21, 2018

add gdal_polygonize; r-spatial/stars#13

abb8b5f

edzer added a commit that referenced this issue Sep 21, 2018

use GDAL(F)Polygonize in st_as_sf.stars; #13

2ead896

edzer added a commit that referenced this issue Sep 21, 2018

add st_rasterize; #13

5ba57a0

edzer added a commit that referenced this issue Sep 21, 2018

make useful names; #13

95d681d

edzer added a commit to r-spatial/sf that referenced this issue Sep 28, 2018

add contourize; r-spatial/stars#13

07ab664

edzer added a commit to r-spatial/sf that referenced this issue Sep 28, 2018

update contourize; r-spatial/stars#13

4802241

edzer added a commit that referenced this issue Sep 28, 2018

update contourize; #13

7d96c79

edzer added a commit to r-spatial/sf that referenced this issue Sep 29, 2018

add ID and ELEV fields for contour bands; r-spatial/stars#13

9ad11de

edzer added a commit to r-spatial/sf that referenced this issue Sep 29, 2018

user interface for contouring; r-spatial/stars#13

4441631

adrfantini closed this as completed Sep 30, 2018

Performance of different polygonizing methods analyzed #13

Performance of different polygonizing methods analyzed #13

Comments

adrfantini commented Nov 26, 2017 • edited Loading

edzer commented Nov 26, 2017

adrfantini commented Mar 13, 2018 • edited Loading

adrfantini commented Mar 20, 2018

edzer commented Mar 20, 2018

adrfantini commented Mar 20, 2018

edzer commented Sep 21, 2018

edzer commented Sep 21, 2018

adrfantini commented Sep 21, 2018

edzer commented Sep 21, 2018

adrfantini commented Sep 21, 2018

edzer commented Sep 21, 2018

adrfantini commented Sep 21, 2018

edzer commented Sep 21, 2018

adrfantini commented Sep 21, 2018

edzer commented Sep 21, 2018

adrfantini commented Sep 21, 2018 • edited Loading

edzer commented Sep 21, 2018

adrfantini commented Sep 21, 2018

adrfantini commented Sep 23, 2018

adrfantini commented Sep 23, 2018

mdsumner commented Sep 23, 2018 • edited Loading

edzer commented Sep 23, 2018

adrfantini commented Sep 23, 2018 • edited Loading

mdsumner commented Sep 23, 2018

edzer commented Sep 23, 2018

adrfantini commented Sep 24, 2018

edzer commented Sep 28, 2018

mdsumner commented Sep 28, 2018

mdsumner commented Sep 28, 2018

edzer commented Sep 28, 2018

edzer commented Sep 29, 2018

mdsumner commented Sep 29, 2018

adrfantini commented Sep 29, 2018

edzer commented Sep 29, 2018 • edited Loading

adrfantini commented Sep 30, 2018 • edited Loading

edzer commented Sep 30, 2018

adrfantini commented Sep 30, 2018

adrfantini commented Nov 26, 2017 •

edited

Loading

adrfantini commented Mar 13, 2018 •

edited

Loading

adrfantini commented Sep 21, 2018 •

edited

Loading

mdsumner commented Sep 23, 2018 •

edited

Loading

adrfantini commented Sep 23, 2018 •

edited

Loading

edzer commented Sep 29, 2018 •

edited

Loading

adrfantini commented Sep 30, 2018 •

edited

Loading