# Preparation of soil covariates for soil mapping {#soil-covs-chapter}
*Edited by: T. Hengl and R. A. MacMillan*
## Soil covariate data sources
### Types of soil covariates
Soils (and vegetation and ecosystems) form under complex interactions between climate, living organisms and anthropogenic influences, modified by relief and hydrological processes and operating over long periods of time. This was first clearly formalized by @jenny1994factors with his CLORPT factors of soil formation, and subsequently extended by @McBratney2003Geoderma with the SCORPAN formulation (see section \@ref(soil-mapping-theory)).
The following groups of covariates are commonly considered for use in
Predictive Soil Mapping:
1. Climate-related covariates, which include:
- temperature maps,
- precipitation maps,
- snow cover maps,
- potential evapotranspiration,
- cloud fraction and other atmospheric images,
2. Vegetation and living organisms, which include:
- vegetation indices e.g. FAPAR (mean, median), NDVI, EVI,
- biomass, Leaf Area Index,
- land cover type maps,
- vegetation types and communities (if mapped at high accuracy),
3. Relief and topography-related covariates, which include:
- standard window-based calculations e.g. slope, curvatures, standard deviation,
- standard flow model outputs,
- landform classes / landform class likelihoods,
- hydrological / soil accumulation and deposition indices — MRVBFI, Wetness index, height above channel, height below ridge, horizontal distance to channel, horizontal distance to ridge,
- climatic and micro-climatic indices determined by relief e.g. incoming solar insolation and similar,
4. Parent material / geologic material covariates, which include:
- bedrock type and age,
- bedrock mineralogy (acid, basic),
- surface material type, texture, age, mineralogy, thickness,
- volcanic activity, historic earthquake density,
- seismic activity level,
- gamma ray spectroscopy grids,
- gravity measurements,
- electrical conductivity/resistance,
5. Estimated geological age of surface, which include:
- bedrock age / surface material age,
- recent disturbance age,
6. Spatial position or spatial context, which include:
- latitude and longitude,
- distance to nearest large ocean,
- Northing — distance to north pole,
- Southing — distance to south pole,
- Easting — distance to east,
- Westing — distance to west,
- shortest distance in any direction,
- distance to nearest high mountain,
- distance to nearest moderate hill,
- distance to nearest major river,
7. Human or Anthropogenic Influences, which include:
- land use / land management maps,
- probability / intensity of agricultural land use,
- probability / intensity of pasture or grazing use,
- probability / intensity of forest land management,
- probability / intensity of urbanization,
- soil dredging, surface sealing,
- night time illumination (nightlights) images,
- probability of gullying or human-induced erosion,
- soil nutrient fertilization, liming and similar maps,
In the following sections we provide some practical advice (with links to the most important data sources) on how to prepare soil covariates for use in PSM.
### Soil covariate data sources (30–100 m resolution) {#soil-covs-30m}
Adding relevant covariates that can help explain the distribution of soil properties increases the accuracy of spatial predictions. Hence, prior to generating predictions of soil properties, it is a good idea to invest in preparing a list of Remote Sensing (RS), geomorphological/lithologic and DEM-based covariates that can potentially help explain the spatial distribution of soil properties and classes. There are now many finer resolution (30–250 m) covariates with global coverage that are publicly available without restrictions. The spatial detail, accessibility and accuracy of RS-based products have been growing exponentially and there is no evidence that this trend is going to slow down in the coming decades [@Herold2016].
The most relevant (global) publicly available remote sensing-based covariates that can be downloaded and used to improve predictive soil mapping at high spatial resolutions are, for example:
* [SRTM](https://lta.cr.usgs.gov/SRTM1Arc) and/or [ALOS W3D](http://www.eorc.jaxa.jp/ALOS/en/aw3d/index_e.htm) Digital Elevation Model (DEM) at 30 m and [MERIT DEM](http://hydro.iis.u-tokyo.ac.jp/~yamadai/MERIT_DEM/) at 100 m (these can be used to derive at least 8–12 DEM derivatives of which some generally prove to be beneficial for mapping of soil chemical and hydrological properties);
* Landsat 7, 8 satellite images, either available from USGS's [GloVis](http://glovis.usgs.gov/) / [EarthExplorer](http://earthexplorer.usgs.gov/), or from the [GlobalForestChange project](https://earthenginepartners.appspot.com/science-2013-global-forest/download_v1.2.html) repository [@hansen2013high];
* [Landsat-based Global Surface Water (GSW) dynamics images](https://global-surface-water.appspot.com/download) at 30 m resolution for the period 1985–2016 [@pekel2016high];
* Global Land Cover (GLC) maps based on the [GLC30 project](http://www.globallandcover.com) at 30 m resolution for 2000 and 2010 [@Chen2014] and similar land cover projects [@Herold2016];
* USGS's [global bare surface images](https://landcover.usgs.gov/glc/) at 30 m resolution;
* [JAXA's ALOS](http://www.eorc.jaxa.jp/ALOS/en/dataset/dataset_index.htm) (PALSAR/PALSAR-2) radar images at 20 m resolution [@shimada2014new]; the radar bands HH: -27.7 (5.3) dB and HV: -35.8 (3.0) dB from JAXA's ALOS project are especially interesting for mapping rock outcrops and exposed bedrock, but are also used to distinguish between bare soil and dense vegetation;
Note that the required download time for 30 m global RS data can be significant
if the data are needed for a larger area (hence you might consider using some
RS data processing hub such as [Sentinel hub](http://www.sentinel-hub.com),
[Google Earth Engine](https://earthengine.google.com) and/or
[Amazon Web Services](https://aws.amazon.com/public-datasets/) instead of trying
to download large mosaics yourself).
Most recently, soil mappers can also use more advanced (often commercial) remote sensing products available at finer spatial resolutions, which include:
* WorldDEM (https://worlddem-database.terrasar.com) multiband elevation products at 12 m resolution,
* German hyperspectral satellite mission EnMAP (http://www.enmap.org) products, which have been shown to be useful for mapping soil nutrients and minerals [@rs8070613],
* Sentinel-1 soil moisture products, currently limited to 1 km to 500 m resolutions but available at fast revisit times [@Bauer-Marschallinger2019],
Hyperspectral imaging systems (which produce data analogous to field-based soil spectroscopy) and upcoming missions such as SHALOM (Italy and Israel), HypXIM (France) and HyspIRI (USA) will most likely revolutionize the use of remote sensing for soil mapping.
### Soil covariate data sources (250 m resolution or coarser) {#soil-covs-250m}
@Hengl2017SoilGrids250m used a large stack of slightly coarser resolution covariate layers for producing
SoilGrids250m predictions, most of which were based on remote sensing data:
- DEM-derived surfaces — slope, profile curvature, Multiresolution
Index of Valley Bottom Flatness (VBF), deviation from Mean Value,
valley depth, negative and positive Topographic Openness and SAGA
Wetness Index — all based on a global merge of SRTMGL3 DEM and
GMTED2010 [@Danielson2011GMTED]. All DEM derivatives were computed
using SAGA GIS [@gmd-8-1991-2015],
- Long-term averaged monthly mean and standard deviation of the MODIS
Enhanced Vegetation Index (EVI). Derived using a stack of MOD13Q1
EVI images [@Savtchenko2004ASR],
- Long-term averaged mean monthly surface reflectances for MODIS bands
4 (NIR) and 7 (MIR). Derived using a stack of MCD43A4 images
[@mira2015modis],
- Long-term averaged monthly mean and standard deviation of the MODIS
land surface temperature (daytime and nighttime). Derived using a
stack of MOD11A2 LST images [@wan2006modis],
- Long-term averaged mean monthly hours under snow cover based on a
stack of MOD10A2 8-day snow occurrence images [@hall2007accuracy],
- Land cover classes (cultivated land, forests, grasslands,
shrublands, wetlands, tundra, artificial surfaces and
bareland cover) for the year 2010 based on the GlobeLand30 product
by the National Geomatics Center of China [@Chen2014]. Upscaled to
250 m resolution and expressed as percent of pixel coverage,
- Monthly precipitation images derived as the weighted average between
the WorldClim monthly precipitation [@Hijmans2005IJC] and GPCP Version
2.2 [@huffman2009gpcp],
- Lithologic units (acid plutonics, acid volcanic, basic plutonics,
basic volcanics, carbonate sedimentary rocks, evaporite, ice and
glaciers, intermediate plutonics, intermediate volcanics,
metamorphics, mixed sedimentary rocks, pyroclastics, siliciclastic
sedimentary rocks, unconsolidated sediment) based on a Global
Lithological Map GLiM [@GGGE:GGGE2352],
- Landform classes (breaks/foothills, flat plains, high mountains/deep
canyons, hills, low hills, low mountains, smooth plains) based on
the USGS’s Map of Global Ecological Land Units [@sayre2014new],
- Global Water Table Depth in meters [@fan2013global],
- Landsat-based estimated distribution of Mangroves [@giri2011status],
- Average soil and sedimentary-deposit thickness in meters [@Pelletier2016].
The covariates above were selected to represent factors of soil formation
according to @jenny1994factors: climate, relief, living organisms,
water dynamics and parent material. Of the five main factors, water
dynamics and living organisms (especially vegetation dynamics) are not
trivial to represent, as these operate over long periods of time and
often exhibit chaotic behavior. Using reflectance bands such as the mid-infrared MODIS bands from a single day would be of little use for soil mapping in areas with dynamic vegetation, i.e. with strong seasonal changes in vegetation cover. To account for seasonal fluctuation and for inter-annual variations in surface reflectance, long-term temporal signatures of the soil surface, derived as monthly averages from long-term MODIS imagery (15 years of data), are likely to be more effective [@Hengl2017SoilGrids250m].
Long-term average seasonal signatures of surface reflectance or vegetation
index provide a better indication of soil characteristics than only a single snapshot of
surface reflectance. Computing temporal signatures of the land surface requires a considerable investment of time (comparable to generating long-term climatic surfaces vs. momentary weather maps), but it is probably the best way to represent the cumulative influence of living organisms on soil formation.
@Behrens2018128 recently reported that, for example, DEM derivatives computed at coarser resolutions correlated better with some targeted soil properties than the same derivatives computed at finer resolutions. In
this case, resolution (or scale) was represented through various DEM aggregation levels
and filter sizes. Some physical and chemical processes of soil formation
or vegetation distribution might not be effective or obvious at finer aggregation
levels, but these can become very visible at coarser aggregation levels. In
fact, it seems that spatial dependencies and interactions of the
covariates can often be explained better simply by aggregating DEM and its
derivatives [@Behrens2018128].
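As a minimal illustration of this idea, coarser-scale representations of a DEM can be produced in R by simple aggregation (a sketch only; it uses the Ebergötzen 100 m DEM introduced later in this chapter, and the aggregation factors are arbitrary):

```{r, eval=FALSE}
library(plotKML)
library(raster)
data(eberg_grid)
gridded(eberg_grid) <- ~x+y
proj4string(eberg_grid) <- CRS("+init=epsg:31467")
dem <- raster(eberg_grid["DEMSRT6"])
## coarser-scale copies of the 100 m DEM (200 m, 500 m, 1 km):
dem.scales <- lapply(c(2, 5, 10), function(f) aggregate(dem, fact=f, fun=mean))
```

Derivatives such as slope could then be computed at each aggregation level and offered to the model as separate covariates.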
## Preparing soil covariate layers
Before we are able to fit spatial prediction models and generate soil maps, a significant amount of effort is first spent on preparing covariate “layers” that can be used as independent variables (i.e. “predictor variables”) in the statistical modelling. Typical operations used to generate soil covariate layers include:
* Converting polygon maps to rasters,
* Downscaling or upscaling (aggregating) rasters to match the target resolution (i.e. preparing a *stack*),
* Filtering out missing pixels / reducing noise and multicollinearity (data overlap) problems,
* Overlaying and subsetting raster stacks and points,
The following examples should provide some ideas about how to program these steps using concise syntax and fast, robust algorithms. Raster data can often be very large (e.g. millions of pixels), so processing large stacks of remote sensing scenes in R needs to be planned carefully. The complete R tutorial can be downloaded from the [GitHub repository](https://github.com/envirometrix/PredictiveSoilMapping). Instructions on how to install and set up all software used in this example can be found in the software installation chapter \@ref(software).
### Converting polygon maps to rasters
Before we can attach a polygon map to other stacks of covariates, it needs to be rasterized i.e. converted to a raster layer defined with its bounding box (extent) and spatial resolution. Consider for example the [Ebergötzen data set](http://plotkml.r-forge.r-project.org/eberg.html) polygon map from the plotKML package (Fig. \@ref(fig:eberg-zones-spplot)):
```{r eberg-zones-spplot, fig.cap="Ebergotzen parent material polygon map with legend.", out.width="70%"}
library(rgdal)
library(raster)
library(plotKML)
library(viridis)
data(eberg_zones)
spplot(eberg_zones[1])
```
We can convert this object to a raster by using the [raster package](https://cran.r-project.org/web/packages/raster/). Note that before we can run the operation, we need to know the target grid system, i.e. the extent of the grid and its spatial resolution. We can take these parameters from an existing layer:
```{r}
library(plotKML)
data("eberg_grid25")
gridded(eberg_grid25) <- ~x+y
proj4string(eberg_grid25) <- CRS("+init=epsg:31467")
r <- raster(eberg_grid25)
r
```
The `eberg_grid25` object is a `SpatialPixelsDataFrame`, which is a spatial gridded data structure of the [sp package](https://cran.r-project.org/web/packages/sp/). The raster package also offers data structures for spatial (gridded) data, and stores such data using the `RasterLayer` class. Gridded data can be converted from class `SpatialPixelsDataFrame` to a `RasterLayer` with the [`raster`](http://www.rdocumentation.org/packages/raster/functions/raster) command. The
[`CRS`](http://www.inside-r.org/packages/cran/sp/docs/CRS) command of the sp package can be used to set a spatial projection.
[EPSG projection 31467](http://spatialreference.org/ref/epsg/31467/) is the German coordinate system (each coordinate system has an associated EPSG number that can be obtained from http://spatialreference.org/).
Conversion from polygon to raster is now possible via the [`rasterize`](http://www.rdocumentation.org/packages/raster/functions/rasterize) command:
```{r eberg-zones-grid, fig.cap="Ebergotzen parent material polygon map rasterized to 25 m spatial resolution.", out.width="70%"}
names(eberg_zones)
eberg_zones_r <- rasterize(eberg_zones, r, field="ZONES")
plot(eberg_zones_r)
```
Converting large polygons in R using the raster package can be very time-consuming. To speed up the rasterization of polygons we highly recommend using instead the `fasterize` function:
```{r}
library(sf)
library(fasterize)
eberg_zones_sf <- as(eberg_zones, "sf")
eberg_zones_r <- fasterize(eberg_zones_sf, r, field="ZONES")
```
The `fasterize` function is an order of magnitude faster and hence more suitable for operational work; it only works with Simple Feature (`sf`) objects, however, so the `sp` polygon object first needs to be coerced to an `sf` object.
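To verify the speed difference on your own machine, you can time both approaches (an optional check; actual timings will vary with polygon complexity and hardware):

```{r, eval=FALSE}
system.time( rasterize(eberg_zones, r, field="ZONES") )
system.time( fasterize(eberg_zones_sf, r, field="ZONES") )
```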
Another efficient approach to rasterize polygons is to use SAGA GIS, which can handle large data sets and is easy to run in parallel. First, you need to export the polygon map to shapefile format, which can be done with commands of the [rgdal package](https://cran.r-project.org/web/packages/rgdal/):
```{r}
eberg_zones$ZONES_int <- as.integer(eberg_zones$ZONES)
writeOGR(eberg_zones["ZONES_int"], "extdata/eberg_zones.shp", ".", "ESRI Shapefile")
```
The `writeOGR()` command writes a SpatialPolygonsDataFrame (the data structure for polygon data in R) to an ESRI shapefile. Here we only write the attribute `"ZONES_int"` to the shapefile. It is, however, also possible to write all attributes of the SpatialPolygonsDataFrame to a shapefile.
Next, you can locate the (previously installed) SAGA GIS command line program (on Microsoft Windows OS or Linux system):
```{r}
if(.Platform$OS.type=="unix"){
saga_cmd = "saga_cmd"
}
if(.Platform$OS.type=="windows"){
saga_cmd = "C:/Progra~1/SAGA-GIS/saga_cmd.exe"
}
saga_cmd
```
and finally use the module `grid_gridding` to convert the shapefile to a grid:
```{r}
pix = 25
system(paste0(saga_cmd, ' grid_gridding 0 -INPUT \"extdata/eberg_zones.shp\" ',
'-FIELD \"ZONES_int\" -GRID \"extdata/eberg_zones.sgrd\" -GRID_TYPE 0 ',
'-TARGET_DEFINITION 0 -TARGET_USER_SIZE ', pix, ' -TARGET_USER_XMIN ',
extent(r)[1]+pix/2,' -TARGET_USER_XMAX ', extent(r)[2]-pix/2,
' -TARGET_USER_YMIN ', extent(r)[3]+pix/2,' -TARGET_USER_YMAX ',
extent(r)[4]-pix/2))
eberg_zones_r2 <- readGDAL("extdata/eberg_zones.sdat")
```
With the `system()` command we can invoke an operating system (OS) command; here we use it to run the SAGA command line program (`saga_cmd`) from R. The `paste0` function is used to paste together the string that is passed to the `system()` command. The string starts with the OS command we would like to invoke (here the value of `saga_cmd`), followed by the input required for running that command.
Note that the bounding box in SAGA GIS needs to be defined using the centres of the corner pixels, not the outer corners, hence we shift the extent coordinates obtained from the raster package inwards by half the pixel size.
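For illustration, the target extent passed to SAGA above can be computed as follows (a short sketch reusing the `r` and `pix` objects defined earlier):

```{r, eval=FALSE}
te <- as.vector(extent(r))  ## xmin, xmax, ymin, ymax
## centres of the corner pixels, as expected by SAGA GIS:
c(te[1]+pix/2, te[2]-pix/2, te[3]+pix/2, te[4]-pix/2)
```

Also note that the class names have been lost during rasterization (SAGA GIS works with integer grids), but we can attach them back by using e.g.: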
```{r}
levels(eberg_zones$ZONES)
eberg_zones_r2$ZONES <- as.factor(eberg_zones_r2$band1)
levels(eberg_zones_r2$ZONES) <- levels(eberg_zones$ZONES)
summary(eberg_zones_r2$ZONES)
```
```{r eberg-zones-rasterized, echo=FALSE, fig.cap="Ebergotzen zones rasterized to 25 m resolution and with correct factor labels.", out.width="55%"}
knitr::include_graphics("figures/eberg_zones_rasterized.png")
```
### Downscaling or upscaling (aggregating) rasters {#downscaling-upscaling}
In order for all covariates to perfectly *stack*, we also need to adjust the resolution of some covariates that have either too coarse or too fine a resolution compared to the target resolution. The process of bringing raster layers to a target grid resolution is also known as **resampling**. Consider the following example from the Ebergotzen case study:
```{r}
data(eberg_grid)
gridded(eberg_grid) <- ~x+y
proj4string(eberg_grid) <- CRS("+init=epsg:31467")
names(eberg_grid)
```
In this case we have a few layers that we would like to use for spatial prediction in combination with the maps produced in the previous sections, but their resolution is 100 m, i.e. 4 times coarser than the 25 m target grid (16 times fewer pixels). Probably the most robust way to resample rasters is to use the [`gdalwarp`](http://www.gdal.org/gdalwarp.html) function from the GDAL software. Assuming that you have already installed GDAL, you only need to locate the program on your system, and then you can again run [`gdalwarp`](http://www.gdal.org/gdalwarp.html) via the system command:
```{r}
writeGDAL(eberg_grid["TWISRT6"], "extdata/eberg_grid_TWISRT6.tif")
system(paste0('gdalwarp extdata/eberg_grid_TWISRT6.tif',
' extdata/eberg_grid_TWISRT6_25m.tif -r \"cubicspline\" -te ',
paste(as.vector(extent(r))[c(1,3,2,4)], collapse=" "),
' -tr ', pix, ' ', pix, ' -overwrite'))
```
The `writeGDAL` command writes the `TWISRT6` layer, stored in the `eberg_grid` object, to a GeoTIFF file. This file is subsequently read by `gdalwarp` and resampled to a 25 m GeoTIFF using the `cubicspline` method, which fills in values between the original grid nodes using smooth surfaces. Note that the `paste0` function in the `system()` command pastes together the following string:
```{bash, eval=FALSE}
"C:/Progra~1/GDAL/gdalwarp.exe eberg_grid_TWISRT6.tif
eberg_grid_TWISRT6_25m.tif -r \"cubicspline\"
-te 3570000 5708000 3580000 5718000 -tr 25 25 -overwrite"
```
We can compare the two maps (the original and the downscaled) next to each other by using:
```{r, eval=FALSE, echo=FALSE, results='hide', fig.keep='none'}
par(mfrow=c(1,2))
zlim = range(eberg_grid$TWISRT6, na.rm=TRUE)
image(raster(eberg_grid["TWISRT6"]), col=SAGA_pal[[1]], zlim=zlim, main="Original", asp=1)
image(raster("extdata/eberg_grid_TWISRT6_25m.tif"), col=SAGA_pal[[1]], zlim=zlim, main="Downscaled", asp=1)
```
```{r eberg-original-vs-downscaled, echo=FALSE, fig.cap="Original TWI vs downscaled map from 100 m to 25 m.", out.width="100%"}
knitr::include_graphics("figures/eberg_original_vs_downscaled.png")
```
The map on the right looks much smoother of course (assuming that this variable varies continuously in space, this could very well be an accurate picture), but it is important to realize that downscaling can only be implemented up to a certain target resolution, i.e. only for certain features. For example, downscaling TWI from 100 to 25 m is not much of a problem, but going beyond 10 m would probably result in large differences from a TWI calculated directly at 10 m resolution (in other words: be careful with downscaling, because it is often not trivial).
The opposite process to downscaling is upscaling or aggregation. Although this can also potentially be tricky, it is a much more straightforward process than downscaling. We recommend using the `average` method in GDAL for aggregating values e.g.:
```{r}
system(paste0('gdalwarp extdata/eberg_grid_TWISRT6.tif',
' extdata/eberg_grid_TWISRT6_250m.tif -r \"average\" -te ',
paste(as.vector(extent(r))[c(1,3,2,4)], collapse=" "),
' -tr 250 250 -overwrite'))
```
```{r eberg-original-vs-aggregated, echo=FALSE, fig.cap="Original TWI vs aggregated map from 100 m to 250 m.", out.width="100%"}
knitr::include_graphics("figures/eberg_original_vs_aggregated.png")
```
### Deriving DEM parameters using SAGA GIS
Now that we have established a connection between R and SAGA GIS, we can use SAGA GIS to derive standard DEM parameters of interest for soil mapping. To automate further processing, we define the following function:
```{r}
saga_DEM_derivatives <- function(INPUT, MASK=NULL,
sel=c("SLP","TWI","CRV","VBF","VDP","OPN","DVM")){
if(!is.null(MASK)){
## Fill in missing DEM pixels:
suppressWarnings( system(paste0(saga_cmd,
' grid_tools 25 -GRID=\"', INPUT,
'\" -MASK=\"', MASK, '\" -CLOSED=\"',
INPUT, '\"')) )
}
## Slope:
if(any(sel %in% "SLP")){
try( suppressWarnings( system(paste0(saga_cmd,
' ta_morphometry 0 -ELEVATION=\"',
INPUT, '\" -SLOPE=\"',
gsub(".sgrd", "_slope.sgrd", INPUT),
'\" -C_PROF=\"',
gsub(".sgrd", "_cprof.sgrd", INPUT), '\"') ) ) )
}
## TWI:
if(any(sel %in% "TWI")){
try( suppressWarnings( system(paste0(saga_cmd,
' ta_hydrology 15 -DEM=\"',
INPUT, '\" -TWI=\"',
gsub(".sgrd", "_twi.sgrd", INPUT), '\"') ) ) )
}
## MrVBF:
if(any(sel %in% "VBF")){
try( suppressWarnings( system(paste0(saga_cmd,
' ta_morphometry 8 -DEM=\"',
INPUT, '\" -MRVBF=\"',
gsub(".sgrd", "_vbf.sgrd", INPUT),
'\" -T_SLOPE=10 -P_SLOPE=3') ) ) )
}
## Valley depth:
if(any(sel %in% "VDP")){
try( suppressWarnings( system(paste0(saga_cmd,
' ta_channels 7 -ELEVATION=\"',
INPUT, '\" -VALLEY_DEPTH=\"',
gsub(".sgrd", "_vdepth.sgrd",
INPUT), '\"') ) ) )
}
## Openness:
if(any(sel %in% "OPN")){
try( suppressWarnings( system(paste0(saga_cmd,
' ta_lighting 5 -DEM=\"',
INPUT, '\" -POS=\"',
gsub(".sgrd", "_openp.sgrd", INPUT),
'\" -NEG=\"',
gsub(".sgrd", "_openn.sgrd", INPUT),
'\" -METHOD=0' ) ) ) )
}
## Deviation from Mean Value:
if(any(sel %in% "DVM")){
suppressWarnings( system(paste0(saga_cmd,
' statistics_grid 1 -GRID=\"',
INPUT, '\" -DEVMEAN=\"',
gsub(".sgrd", "_devmean.sgrd", INPUT),
'\" -RADIUS=11' ) ) )
}
}
```
To run this function we only need a DEM as input:
```{r, eval=FALSE}
writeGDAL(eberg_grid["DEMSRT6"], "extdata/DEMSRT6.sdat", "SAGA")
saga_DEM_derivatives("DEMSRT6.sgrd")
```
which processes all DEM derivatives at once. We can plot them using:
```{r, eval=FALSE}
dem.lst <- list.files("extdata", pattern=glob2rx("DEMSRT6_*.sdat"), full.names = TRUE)
plot(raster::stack(dem.lst), col=rev(magma(10, alpha = 0.8)))
```
```{r dem-derivatives-plot, echo=FALSE, fig.cap="Some standard DEM derivatives calculated using SAGA GIS.", out.width="90%"}
knitr::include_graphics("figures/dem_derivatives_plot.png")
```
This function can now be used with any DEM to derive a standard set of 7–8 DEM parameters consisting of slope and curvature, TWI and MrVBF, positive and negative openness, valley depth and deviation from mean value. You could easily add more parameters to this function and then test whether some of the other DEM derivatives can help improve mapping soil properties and classes. Note that SAGA GIS will by default optimize computation of DEM derivatives by using most of the available cores (parallelization is turned on automatically).
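You could, for example, append a catchment-area block inside the function, following the same pattern as the existing derivatives (a sketch only: the SAGA module ID and parameter names are assumed here and differ between SAGA versions, so check the output of `saga_cmd ta_hydrology` on your installation first):

```{r, eval=FALSE}
## Catchment area (assumed module/parameter names; verify for your SAGA version):
if(any(sel %in% "CAR")){
  try( suppressWarnings( system(paste0(saga_cmd,
    ' ta_hydrology 0 -ELEVATION=\"', INPUT, '\" -FLOW=\"',
    gsub(".sgrd", "_carea.sgrd", INPUT), '\"') ) ) )
}
```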
### Filtering out missing pixels and artifacts
After bringing all covariates to the same grid definition, remaining problems for using covariates in spatial modelling may include:
* Missing pixels,
* Artifacts and noise,
* Multicollinearity (i.e. data overlap),
In a stack with tens of rasters, the *weakest layer* (i.e. the layer with the greatest number of missing pixels or artifacts) can cause serious problems for producing soil maps, because missing pixels and artifacts propagate to the predictions: if even a single layer in the raster stack has missing values, predictive models may drop whole rows from the predictions, even though data are available for 95% of the rows. Missing pixels occur for various reasons: in the case of remote sensing, missing pixels are typically due to clouds, while noise is often due to atmospheric conditions. As long as we are dealing with only a few patches of missing pixels, these can be efficiently filtered using, for example, the [gap filling functionality](http://saga-gis.org/saga_module_doc/2.2.7/grid_tools_7.html) available in SAGA GIS.
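A minimal sketch of such gap filling (assuming SAGA's *Close Gaps* module, i.e. `grid_tools` module 7, and a hypothetical layer `test` with artificially introduced gaps; the same file is used for input and output):

```{r, eval=FALSE}
## create a test layer with artificial gaps (10% of pixels set to NA):
eberg_grid$test <- eberg_grid$TWISRT6
eberg_grid@data[sample(1:nrow(eberg_grid@data), 1000), "test"] <- NA
writeGDAL(eberg_grid["test"], "extdata/test.sdat", "SAGA")
## close gaps by spline interpolation (SAGA grid_tools module 7):
system(paste0(saga_cmd, ' grid_tools 7 -INPUT \"extdata/test.sgrd\"',
              ' -RESULT \"extdata/test.sgrd\"'))
```

We can then compare the original and the filtered layer by plotting them side by side: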
```{r, eval=FALSE}
par(mfrow=c(1,2))
image(raster(eberg_grid["test"]), col=SAGA_pal[[1]], zlim=zlim, main="Original", asp=1)
image(raster("test.sdat"), col=SAGA_pal[[1]], zlim=zlim, main="Filtered", asp=1)
```
In this example we use the same input and output file for filling in gaps. There are several other gap filling possibilities in SAGA GIS, including Close Gaps with Spline, Close Gaps with Stepwise Resampling and Close One Cell Gaps. Not all of these are equally applicable to all missing pixel problems, but having <10% missing pixels is usually not a serious problem for soil mapping.
Another elegant way to filter out missing pixels, reduce noise and reduce data overlap is to use a [Principal Components](http://www.rdocumentation.org/packages/stats/functions/prcomp) transformation of the original data. This is also available via the GSIF function `spc`:
```{r, eval=FALSE}
data(eberg_grid)
gridded(eberg_grid) <- ~x+y
proj4string(eberg_grid) <- CRS("+init=epsg:31467")
formulaString <- ~ PRMGEO6+DEMSRT6+TWISRT6+TIRAST6
eberg_spc <- GSIF::spc(eberg_grid, formulaString)
names(eberg_spc@predicted) # 11 components at the end
```
```{r, eval=FALSE, echo=FALSE, results='hide', fig.keep='none'}
rd <- range(eberg_spc@predicted@data[,1], na.rm=TRUE)
plot(stack(eberg_spc@predicted[1:11]), zlim=rd, col=rev(viridis(10, alpha = 0.8)))
```
```{r eberg-spc-11-plot, echo=FALSE, fig.cap="11 PCs derived using input Ebergotzen covariates.", out.width="100%"}
knitr::include_graphics("figures/eberg_spc_11_plot.png")
```
The advantages of using the `spc` function are:
* All output soil covariates are numeric (and not a mixture of factors and numeric),
* The last 1–2 PCs often contain mostly noise and can be excluded from modelling,
* In subsequent analysis it becomes easier to remove covariates that do not help in modelling (e.g. by using step-wise selection and similar),
A disadvantage of using `spc` is that these components are often abstract so that interpretation of correlations can become difficult. Also, if one of the layers contains many factor levels, then the number of output covariates might explode, which becomes impractical as we should then have at least 10 observations per covariate to avoid overfitting.
### Overlaying and subsetting raster stacks and points
Now that we have prepared all covariates (resampled them to the same grid and filtered out all problems), we can proceed with running overlays and fitting statistical models. Assuming that we deal with a large number of files, an elegant way to read all those into R is by using the raster package, especially the [`stack`](http://www.rdocumentation.org/packages/raster/functions/stack) and [`raster`](http://www.rdocumentation.org/packages/raster/functions/raster) commands. In the following example we can list all files of interest, and then read them all at once:
```{r, eval=FALSE}
library(raster)
grd.lst <- list.files(pattern="25m")
grd.lst
grid25m <- stack(grd.lst)
grid25m <- as(grid25m, "SpatialGridDataFrame")
str(grid25m)
```
One could now save all of the prepared covariates, stored in a `SpatialGridDataFrame` object, as an RDS data object for future use:
```{r, eval=FALSE}
saveRDS(grid25m, file = "extdata/covariates25m.rds")
```
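In a later session the prepared stack can then simply be re-loaded, without repeating any of the preparation steps:

```{r, eval=FALSE}
grid25m <- readRDS("extdata/covariates25m.rds")
```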
To overlay rasters and points and prepare a regression matrix, we can either use the [`over`](http://www.rdocumentation.org/packages/sp/functions/over) function from the sp package, or the [`extract`](http://www.rdocumentation.org/packages/raster/functions/extract) function from the raster package. Using the raster package, one can run the overlay without even reading the rasters into memory:
```{r, eval=FALSE}
library(sp)
data(eberg)
coordinates(eberg) <- ~X+Y
proj4string(eberg) <- CRS("+init=epsg:31467")
ov <- as.data.frame(extract(stack(grd.lst), eberg))
str(ov[complete.cases(ov),])
```
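To complete the regression matrix, the target soil variable can be bound to the extracted covariate values (a sketch, assuming the topsoil sand content column `SNDMHT_A` of the `eberg` data set):

```{r, eval=FALSE}
## bind target variable and covariates into a regression matrix:
rm.eberg <- cbind(as.data.frame(eberg)["SNDMHT_A"], ov)
str(rm.eberg[complete.cases(rm.eberg),])
```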
If the raster layers cannot be stacked and each layer is available in a different projection system, you can also create a function that reprojects the points to the projection system of each target raster layer:
```{r, eval=FALSE}
overlay.fun <- function(i, y){
raster::extract(raster(i), na.rm=FALSE,
spTransform(y, proj4string(raster(i))))}
```
which can also be run in parallel for example by using the parallel package:
```{r, eval=FALSE}
library(parallel)
ov <- data.frame(mclapply(grd.lst, FUN=overlay.fun, y=eberg))
names(ov) <- basename(grd.lst)
```
In a similar way, one could also make wrapper functions that downscale/upscale grids, then filter missing values and stack all data together so that it becomes available in the working memory (sp grid or pixels object). Overlay and model fitting is also implemented directly in the GSIF package, so any attempt to fit models will automatically perform overlay.
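A skeleton of such a wrapper might look as follows (a sketch only; the function name is hypothetical and the `gdalwarp` flags mirror those used earlier in this chapter):

```{r, eval=FALSE}
## warp any input GeoTIFF to the target grid defined by `r`:
resample2target <- function(tif.in, tif.out, r, pix=25, method="cubicspline"){
  te <- paste(as.vector(extent(r))[c(1,3,2,4)], collapse=" ")
  system(paste0('gdalwarp ', tif.in, ' ', tif.out, ' -r \"', method,
                '\" -te ', te, ' -tr ', pix, ' ', pix, ' -overwrite'))
}
```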
### Working with large(r) rasters
As R is often inefficient in handling large objects in memory (such as large raster images), a good strategy to run raster processing in R is to consider using for example the `clusterR` function from the [raster](https://cran.r-project.org/package=raster) package, which automatically parallelizes use of raster functions. To have full control over parallelization, you can alternatively tile large rasters using the `getSpatialTiles` function from the GSIF package and process them as separate objects in parallel. The following examples show how to run a simple function in parallel on tiles and then mosaic these tiles after all processing has been completed. Consider for example the sample GeoTiff from the rgdal package:
```{r}
fn = system.file("pictures/SP27GTIF.TIF", package = "rgdal")
obj <- rgdal::GDALinfo(fn)
```
We can split that object into 35 tiles, each 5 x 5 km in size, by running:
```{r, eval=FALSE, fig.keep='none'}
tiles <- GSIF::getSpatialTiles(obj, block.x=5000, return.SpatialPolygons = FALSE)
tiles.pol <- GSIF::getSpatialTiles(obj, block.x=5000, return.SpatialPolygons = TRUE)
tile.pol <- SpatialPolygonsDataFrame(tiles.pol, tiles)
plot(raster(fn), col=bpy.colors(20))
lines(tile.pol, lwd=2)
```
```{r rplot-large-raster-tiles, echo=FALSE, fig.cap="Example of a tiling system derived using the GSIF::getSpatialTiles function.", out.width="60%"}
knitr::include_graphics("figures/rplot_large_raster_tiles.png")
```
rgdal further allows us to read only a single tile of the GeoTiff by using the `offset` and `region.dim` arguments:
```{r, eval=FALSE}
x = readGDAL(fn, offset=unlist(tiles[1,c("offset.y","offset.x")]),
region.dim=unlist(tiles[1,c("region.dim.y","region.dim.x")]),
output.dim=unlist(tiles[1,c("region.dim.y","region.dim.x")]), silent = TRUE)
spplot(x)
```
```{r sp27gtif-tile, echo=FALSE, fig.cap="A tile produced from a satellite image in the example in the previous figure.", out.width="60%"}
knitr::include_graphics("figures/sp27gtif_tile.png")
```
We would now like to run a function on this raster in parallel, for example a simple function that converts values to 0/1 based on a threshold:
```{r}
fun_mask <- function(i, tiles, dir="extdata/tiled/", threshold=190){
  ## `fn` (the source GeoTiff) is taken from the global environment
  dir.create(dir, showWarnings=FALSE, recursive=TRUE)
  out.tif = paste0(dir, "T", i, ".tif")
if(!file.exists(out.tif)){
x = readGDAL(fn, offset=unlist(tiles[i,c("offset.y","offset.x")]),
region.dim=unlist(tiles[i,c("region.dim.y","region.dim.x")]),
output.dim=unlist(tiles[i,c("region.dim.y","region.dim.x")]),
silent = TRUE)
x$mask = ifelse(x$band1>threshold, 1, 0)
writeGDAL(x["mask"], type="Byte", mvFlag = 255,
out.tif, options=c("COMPRESS=DEFLATE"))
}
}
```
This can now be run for all tiles via the `mclapply` function from the parallel package (which distributes the tiles over multiple cores; the number of cores can be controlled via its `mc.cores` argument):
```{r, eval=FALSE}
x0 <- mclapply(1:nrow(tiles), FUN=fun_mask, tiles=tiles)
```
If we look in the `extdata/tiled` folder, it should now contain 35 newly produced GeoTiffs. These can be further used to construct a virtual mosaic:
```{r, eval=FALSE}
t.lst <- list.files(path="extdata/tiled", pattern=glob2rx("^T*.tif$"),
full.names=TRUE, recursive=TRUE)
cat(t.lst, sep="\n", file="SP27GTIF_tiles.txt")
system('gdalbuildvrt -input_file_list SP27GTIF_tiles.txt SP27GTIF.vrt')
system(paste0('gdalwarp SP27GTIF.vrt SP27GTIF_mask.tif -ot \"Byte\"',
' -dstnodata 255 -co \"BIGTIFF=YES\" -r \"near\" -overwrite -co \"COMPRESS=DEFLATE\"'))
```
Note that we use a few important GDAL settings here, e.g. `-overwrite -co "COMPRESS=DEFLATE"` to overwrite the GeoTiff and internally compress it to save space, and `-r "near"` to specify that no actual resampling is applied, i.e. the tiles are simply bound together. Also, if the output GeoTiff is very large, you will most likely have to turn on `-co "BIGTIFF=YES"`, otherwise `gdalwarp` will fail. The output mosaic looks like this:
```{r sp27gtif-mask, echo=FALSE, fig.cap="Final processed output.", out.width="60%"}
knitr::include_graphics("figures/sp27gtif_mask.png")
```
This demonstrates that R can be used to compute with large rasters, provided that the operations can be parallelized. The suggested best practice is to: (1) design a tiling system that optimizes use of RAM and disk read/write speed, (2) prepare and test a function that can be run in parallel, and (3) stitch all tiles back together into a single large raster using `gdalwarp`.
Note that such *tiling* and *stitching* cannot be applied universally to all problems, e.g. functions that require a global geographical search or all data in the raster at once. In such cases, tiling should be applied with overlap (to minimize boundary effects) or with irregular tiling systems (e.g. per watershed). Once an optimal tiling system and function are prepared, R is no longer the limiting factor for efficient computing; performance depends mainly on how much RAM and how many cores you have available, i.e. it becomes more of a hardware than a software problem.
## Summary points
Soil covariate layers are one of the key inputs to predictive soil mapping.
Before any spatial layer can be used for modeling, it typically needs to be
preprocessed to remove artifacts, resample to a standard resolution, fill in any
missing values etc. All these operations can be successfully run by combining R
and Open Source GIS software and by careful programming and optimization.
Preparing soil covariates can often be time- and resource-consuming, so careful planning and prioritization of processing is highly recommended.
@Hengl2017SoilGrids250m show that, for soil types and soil textures, DEM-parameters,
i.e. soil forming factors of relief, especially flow-based DEM-indices,
emerge as the second-most dominant covariates. These results largely
correspond with conventional soil survey knowledge (surveyors have been
using relief as a key guideline to delineate soil bodies for decades).
Although lithology is often not in the list of the top 15 most important
predictors for PSM projects, spatial patterns of lithologic classes can often be
distinctly recognized in the output predictions. This is especially true
for soil texture fractions and coarse fragments. In general, for
predicting soil chemical properties, climatic variables (especially
precipitation) and surface reflectance seem to be the most important,
while for soil classes and soil physical properties it is a combination
of relief, vegetation dynamics and parent material. Investing extra time
in preparing a better map of soil parent material is hence often a good idea.
Other potentially useful covariates for predicting soil properties and classes could be maps of paleo-climatic, i.e. pre-historic, conditions of soil formation, e.g. glacial landscapes and processes, past climate conditions and similar. These could likely become
significant predictors of many current soil characteristics. Information
on pre-historic climatic conditions and land use is unfortunately often
not available, especially not at detailed cartographic scales, although
there are now several global products that represent, for example,
dynamics of land use / changes of land cover (see e.g. HYDE data set [@klein2011hyde])
through the past 1500+ years. As the spatial detail and completeness of
such pre-historic maps increases, they will become potentially
interesting covariates for global soil modeling.
The USA’s NASA and USGS, with their SRTM, MODIS, Landsat, ICESat and similar civil-applications missions, will likely remain the main sources of spatial covariate data to support global and local soil mapping initiatives.