sandwichr
R
PackageThe sandwichr
package performs spatial interpolation1 based on
the spatial stratified heterogeneity (SSH) theory2. Three inputs are
required for this model: sampling, SSH, and reporting layers. By
utilizing the known distribution of the population in the SSH layer,
this model seeks to estimate the mean value of the sampling attribute
and its standard error for each reporting unit. The Sandwich model is
still applicable when the spatial dependence is weak because it does not
rely on the spatial dependence, provided that the geospatial surface is
properly stratified.
First, we need to install the sandwichr
package from CRAN (only do this once).
# Install the sandwichr package
# install.packages("sandwichr")
Then use library()
to load the installed package into
your environment. Remember to load the package every time you use
it.
# Import the sandwichr package and other packages
library("sandwichr")
library(ggplot2)
library(ggpubr)
library(dplyr)
library(ape)
Here,we present two case studies to demonstrate the functionality of this package.
In the first case study, interpolation is performed on a simulated geospatial surface. The population data is organized as a 20*20 grid. This population is divided into four continuous strata. We generate 41 sampling units at random within the grid. Based on the sample, we would really like to infer the values of seven reporting units that make up the surface. The sampling, SSH, and reporting layers are all in the shapefile format.
To successfully run the model, the load.data.shp
function is used to convert the shapefiles of sampling, stratification,
and reporting layers into a list of sf
(simple feature; see
Simple Features for
R for references) objects for model input. You can specify the
directory and names for your input files. It should be noted that these
input files must be located in the same directory.
# Input data from shapefiles
<- system.file("extdata", "sim.sampling.shp",
sim.sampling.name package="sandwichr")
<- system.file("extdata", "sim.ssh.shp",
sim.ssh.name package="sandwichr")
<- system.file("extdata", "sim.reporting.shp",
sim.reporting.name package="sandwichr")
<- load.data.shp(sampling.file=sim.sampling.name,
sim.data ssh.file=sim.ssh.name,
reporting.file=sim.reporting.name)
# Sampling
head(sim.data[[1]])
## Simple feature collection with 6 features and 4 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: 0.5549496 ymin: 2.956588 xmax: 3.99286 ymax: 5.2793
## Geodetic CRS: WGS 84
## # A tibble: 6 x 5
## id Value Lon Lat geometry
## <dbl> <dbl> <dbl> <dbl> <POINT [°]>
## 1 0 401. 3.51 4.50 (3.506225 4.504105)
## 2 1 209. 1.48 2.96 (1.478404 2.956588)
## 3 2 412. 3.99 5.28 (3.99286 5.2793)
## 4 3 310. 0.666 3.15 (0.6659313 3.152236)
## 5 4 344. 2.39 3.70 (2.393734 3.699795)
## 6 5 111. 0.555 5.16 (0.5549496 5.161923)
class(sim.data[[1]])
## [1] "sf" "tbl_df" "tbl" "data.frame"
attributes(sim.data[[1]])
## $names
## [1] "id" "Value" "Lon" "Lat" "geometry"
##
## $row.names
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41
##
## $class
## [1] "sf" "tbl_df" "tbl" "data.frame"
##
## $sf_column
## [1] "geometry"
##
## $agr
## id Value Lon Lat
## <NA> <NA> <NA> <NA>
## Levels: constant aggregate identity
# Stratification
head(sim.data[[2]])
## Simple feature collection with 4 features and 1 field
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: 0 ymin: 2 xmax: 4 ymax: 6
## Geodetic CRS: WGS 84
## # A tibble: 4 x 2
## X geometry
## <dbl> <POLYGON [°]>
## 1 1 ((0.2 5, 0 5, 0 5.2, 0 5.4, 0 5.6, 0 5.8, 0 6, 0.2 6, 0.4 6, 0.6 6, 0.8~
## 2 3 ((0.8 3.8, 0.8 3.6, 1 3.6, 1.2 3.6, 1.4 3.6, 1.4 3.4, 1.4 3.2, 1.4 3, 1~
## 3 2 ((3.8 3.6, 3.8 3.4, 3.8 3.2, 3.8 3, 3.6 3, 3.4 3, 3.2 3, 3 3, 2.8 3, 2.~
## 4 4 ((4 2.8, 4 2.6, 4 2.4, 4 2.2, 4 2, 3.8 2, 3.6 2, 3.4 2, 3.2 2, 3 2, 2.8~
class(sim.data[[2]])
## [1] "sf" "tbl_df" "tbl" "data.frame"
attributes(sim.data[[2]])
## $names
## [1] "X" "geometry"
##
## $row.names
## [1] 1 2 3 4
##
## $class
## [1] "sf" "tbl_df" "tbl" "data.frame"
##
## $sf_column
## [1] "geometry"
##
## $agr
## X
## <NA>
## Levels: constant aggregate identity
# Reporting
head(sim.data[[3]])
## Simple feature collection with 6 features and 1 field
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: 5.684342e-14 ymin: 2 xmax: 4 ymax: 6
## Geodetic CRS: WGS 84
## # A tibble: 6 x 2
## Y geometry
## <dbl> <POLYGON [°]>
## 1 1 ((0.6 4, 0.6 5, 0.8 5, 1 5, 1.2 5, 1.4 5, 1.6 5, 1.8 5, 2 5, 2 4.8, 2 4~
## 2 3 ((1 3.4, 1 3.2, 1 3, 1.2 3, 1.4 3, 1.6 3, 1.6 2.8, 1.8 2.8, 2 2.8, 2.2 ~
## 3 2 ((2.4 3.6, 1.6 3.6, 1.6 3.8, 1.6 4, 1.8 4, 2 4, 2 4.2, 2 4.4, 2 4.6, 2 ~
## 4 4 ((4 3.8, 4 3.6, 4 3.4, 4 3.2, 3.8 3.2, 3.8 3, 3.8 2.8, 3.6 2.8, 3.6 3, ~
## 5 5 ((1 3.4, 1 4, 1.2 4, 1.4 4, 1.6 4, 1.6 3.8, 1.6 3.6, 1.6 3.4, 1.4 3.4, ~
## 6 6 ((1.6 3.6, 2.4 3.6, 2.4 3.4, 2.4 3.2, 2.4 3, 2.4 2.8, 2.2 2.8, 2 2.8, 1~
class(sim.data[[3]])
## [1] "sf" "tbl_df" "tbl" "data.frame"
attributes(sim.data[[3]])
## $names
## [1] "Y" "geometry"
##
## $row.names
## [1] 1 2 3 4 5 6 7
##
## $class
## [1] "sf" "tbl_df" "tbl" "data.frame"
##
## $sf_column
## [1] "geometry"
##
## $agr
## Y
## <NA>
## Levels: constant aggregate identity
We use the Moran’s I to evaluate the spatial dependence of the simulated data.
<- as.matrix(dist(cbind(sim.data[[1]]$Lon, sim.data[[1]]$Lat)))
sim.dists <- 1/sim.dists
sim.dists.inv diag(sim.dists.inv) <- 0
Moran.I(sim.data[[1]]$Value, sim.dists.inv)
## $observed
## [1] 0.234028
##
## $expected
## [1] -0.025
##
## $sd
## [1] 0.03428651
##
## $p.value
## [1] 4.196643e-14
The accuracy of SSH-based spatial interpolation is determined by the stratification layer. In an ideal stratification layer, values of the target attribute are expected to be homogeneous within each stratum and to differ between the strata. To determine a proper stratification layer for the Sandwich model, the geographical detector model3 is applied to quantify the SSH of the target attribute with regard to the candidate stratification(s).
First, we combine the sampling and candidate stratification layers into a single data frame.
# Prepare the stratification layer(s) for evaluation
<- ssh.data.shp(object=sim.data[[1]], ssh.lyr=sim.data[[2]], ssh.id="X")
sim.join head(sim.join)
## Simple feature collection with 6 features and 5 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: 0.5549496 ymin: 2.956588 xmax: 3.99286 ymax: 5.2793
## Geodetic CRS: WGS 84
## # A tibble: 6 x 6
## id Value Lon Lat geometry X
## <dbl> <dbl> <dbl> <dbl> <POINT [°]> <dbl>
## 1 0 401. 3.51 4.50 (3.506225 4.504105) 2
## 2 1 209. 1.48 2.96 (1.478404 2.956588) 4
## 3 2 412. 3.99 5.28 (3.99286 5.2793) 2
## 4 3 310. 0.666 3.15 (0.6659313 3.152236) 3
## 5 4 344. 2.39 3.70 (2.393734 3.699795) 2
## 6 5 111. 0.555 5.16 (0.5549496 5.161923) 1
The factor detector q-statistic in the geographical detector
model is applied through the ssh.test
function to measure
the SSH of the sampling data in terms of different stratifications. This
function takes a data frame generated by the ssh.data.shp
(or ssh.data.txt
, which will be introduced later) function
(argument object
), where the sampling attribute (argument
y
) and the strata ID(s) (argument x
) must be
specified. In this example, the output of ssh.test
(q = .81) implies that the candidate stratification layer has a
high determinant power over the attribute. Therefore, it will be
reasonable to select the stratification layer for subsequent
interpolation.
# Calculate the geographical detector q-statistic
ssh.test(object=sim.join, y="Value", x=c("X"), test="factor")
## [[1]]
## q-statistic p-value
## X 0.9212376 3.764869e-10
We can visualize the mean and standard deviation of the sample in each stratum.
= ggerrorplot(sim.join, x = "X", y = "Value",
p desc_stat = "mean_sd", color = "black",
add = "violin", add.params = list(color = "darkgray")
)
+ theme(axis.title.x = element_blank()) p
%>%
sim.join group_by(X) %>%
summarise_at(vars(Value),
list(name = mean))
## Simple feature collection with 4 features and 2 fields
## Geometry type: MULTIPOINT
## Dimension: XY
## Bounding box: xmin: 0.05899652 ymin: 2.041932 xmax: 3.99286 ymax: 5.921005
## Geodetic CRS: WGS 84
## # A tibble: 4 x 3
## X name geometry
## <dbl> <dbl> <MULTIPOINT [°]>
## 1 1 93.8 ((0.4581337 5.024432), (0.5549496 5.161923), (0.4016744 5.921005)~
## 2 2 413. ((2.92014 3.070043), (2.744322 3.482087), (3.207056 3.377478), (3~
## 3 3 312. ((0.4727599 2.430498), (0.4892411 2.141553), (1.774888 2.379797),~
## 4 4 207. ((3.783199 2.041932), (3.495714 2.111997), (3.113503 2.943583), (~
%>%
sim.join group_by(X) %>%
summarise(mean=mean(Value), sd=sd(Value), n=n())
## Simple feature collection with 4 features and 4 fields
## Geometry type: MULTIPOINT
## Dimension: XY
## Bounding box: xmin: 0.05899652 ymin: 2.041932 xmax: 3.99286 ymax: 5.921005
## Geodetic CRS: WGS 84
## # A tibble: 4 x 5
## X mean sd n geometry
## <dbl> <dbl> <dbl> <int> <MULTIPOINT [°]>
## 1 1 93.8 24.4 9 ((0.4581337 5.024432), (0.5549496 5.161923), (0.40167~
## 2 2 413. 38.7 17 ((2.92014 3.070043), (2.744322 3.482087), (3.207056 3~
## 3 3 312. 49.5 11 ((0.4727599 2.430498), (0.4892411 2.141553), (1.77488~
## 4 4 207. 18.8 4 ((3.783199 2.041932), (3.495714 2.111997), (3.113503 ~
The Sandwich mapping model is performed using the
sandwich.model
function, which returns the mean value of
the sampling attribute and its standard error for each reporting
unit.
# Perform the SSH based spatial interpolation
<- sandwich.model(object=sim.data, sampling.attr="Value", type="shp")
sim.sw head(sim.sw$object)
## Simple feature collection with 6 features and 4 fields
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: 5.684342e-14 ymin: 2 xmax: 4 ymax: 6
## Geodetic CRS: WGS 84
## # A tibble: 6 x 5
## Y geometry mean se df
## <dbl> <POLYGON [°]> <dbl> <dbl> <dbl>
## 1 1 ((0.6 4, 0.6 5, 0.8 5, 1 5, 1.2 5, 1.4 5, 1.6 5, 1.8 ~ 381. 2.61 34
## 2 3 ((1 3.4, 1 3.2, 1 3, 1.2 3, 1.4 3, 1.6 3, 1.6 2.8, 1.~ 298. 3.05 29
## 3 2 ((2.4 3.6, 1.6 3.6, 1.6 3.8, 1.6 4, 1.8 4, 2 4, 2 4.2~ 262. 2.27 24
## 4 4 ((4 3.8, 4 3.6, 4 3.4, 4 3.2, 3.8 3.2, 3.8 3, 3.8 2.8~ 401. 3.02 19
## 5 5 ((1 3.4, 1 4, 1.2 4, 1.4 4, 1.6 4, 1.6 3.8, 1.6 3.6, ~ 390. 2.74 26
## 6 6 ((1.6 3.6, 2.4 3.6, 2.4 3.4, 2.4 3.2, 2.4 3, 2.4 2.8,~ 357. 3.02 29
summary(sim.sw)
## [1] "Sample size: 41"
## [1] "Number of SSH strata: 4"
## [1] "Number of reporting units: 7"
## mean se geometry
## Min. :202.8 Min. :1.668 POLYGON :7
## 1st Qu.:280.0 1st Qu.:2.435 epsg:4326 :0
## Median :357.1 Median :2.742 +proj=long...:0
## Mean :327.4 Mean :2.625
## 3rd Qu.:385.8 3rd Qu.:3.022
## Max. :400.6 Max. :3.050
The interpolated values and standard errors can be visualized using
plot.mean
and plot.se()
.
# Plot the estimated mean values and standard errors
::autoplot(object=sim.sw) ggplot2
To evaluate the overall accuracy of the Sandwich mapping model, a
stratified k-fold cross validation is performed using the
sandwich.cv
function. A diagnostic statistic called
cross-validation estimate (CVE) will be calculated. Here, we present the
result of a stratified 5-fold cross validation.
# Perform k-fold cross validation
set.seed(0)
<- sandwich.cv(object=sim.data, sampling.attr="Value", k=5, type="shp", ssh.id.col="X")
sim.cv sim.cv
## [1] 110.4208
The second case study interpolates the breast cancer incidence across mainland China based on 242 samples retrieved from the Chinese Cancer Registry Annual Report. The urban-rural classification (either urban or rural areas are typically not connected in space) is used as the stratification layer. County-level divisions are used as the reporting units. In this example, all the input data are in csv format.
The load.data.txt
function is used to prepare text files
into a list of data frames for model input. Two files are needed for
this function: one linking sampling and stratification layers, and the
other linking reporting and stratification layers.
# Input data from text files
<- system.file("extdata", "bc_sampling_ssh.csv",
bc.sampling_ssh.name package="sandwichr")
<- system.file("extdata", "bc_reporting_ssh.csv",
bc.reporting_ssh.name package="sandwichr")
<- load.data.txt(sampling_ssh.file=bc.sampling_ssh.name,
bc.data reporting_ssh.file=bc.reporting_ssh.name)
# Sampling-stratification
head(bc.data[[1]])
## GBCODE Incidence SSHID X Y
## 1 110100 42 1 115.9642 39.86494
## 2 120100 44 1 117.3913 39.01184
## 3 120225 26 2 117.4309 40.00251
## 4 130129 9 2 114.2797 37.61767
## 5 130181 36 1 115.2870 37.91756
## 6 130227 20 2 118.3576 40.23177
class(bc.data[[1]])
## [1] "data.frame"
# Reporting-stratification
head(bc.data[[2]])
## GBCODE W1 W2
## 1 110100 0.8736000 0.1264000
## 2 110112 0.6115000 0.3885000
## 3 110113 0.5378000 0.4622000
## 4 110221 0.5245597 0.4754403
## 5 110224 0.5245597 0.4754403
## 6 110226 0.5245597 0.4754403
class(bc.data[[2]])
## [1] "data.frame"
We use the Moran’s I to evaluate the SAC of the breast cancer incidence.
<- as.matrix(dist(cbind(bc.data[[1]]$X, bc.data[[1]]$Y)))
bc.dists <- 1/bc.dists
bc.dists.inv diag(bc.dists.inv) <- 0
Moran.I(bc.data[[1]]$Incidence, bc.dists.inv)
## $observed
## [1] 0.04940197
##
## $expected
## [1] -0.004149378
##
## $sd
## [1] 0.008379477
##
## $p.value
## [1] 1.650469e-10
First, we use the ssh.data.txt
function to convert the
input data for evaluation.
# Prepare the stratification layer for evaluation
<- ssh.data.txt(object=bc.data)
bc.join head(bc.join)
## GBCODE Incidence SSHID X Y
## 1 110100 42 1 115.9642 39.86494
## 2 120100 44 1 117.3913 39.01184
## 3 120225 26 2 117.4309 40.00251
## 4 130129 9 2 114.2797 37.61767
## 5 130181 36 1 115.2870 37.91756
## 6 130227 20 2 118.3576 40.23177
The factor detector q-statistic in the geographical detector
model is applied through the ssh.test
function to measure
the SSH of the sample regarding the urban-rural classification. In this
example, the output of ssh.test
implies a relatively high
level of SSH in the distribution of breast cancer incidence concerning
the given stratification (q = .52). Therefore, it is reasonable
to select the urban-rural classification as the stratification layer for
subsequent interpolation.
# Calculate the geographical detector q-statistic
ssh.test(object=bc.join, y="Incidence", x="SSHID", test="factor", type="txt")
## [[1]]
## q-statistic p-value
## SSHID 0.5165114 7.219864e-11
We can visualize the mean and standard deviation of the sampled breast cancer incidence in each stratum.
= ggerrorplot(bc.data[[1]], x = "SSHID", y = "Incidence",
p desc_stat = "mean_sd", color = "black",
add = "violin", add.params = list(color = "darkgray")
)
+ scale_x_discrete(labels=c("1" = "Urban", "2" = "Rural")) +
p theme(axis.title.x = element_blank()) + labs(y="Breast Cancer Incidence (Rate per 100,000)")
1]] %>%
bc.data[[group_by(SSHID) %>%
summarise_at(vars(Incidence),
list(name = mean))
## # A tibble: 2 x 2
## SSHID name
## <int> <dbl>
## 1 1 31.6
## 2 2 17.5
1]] %>%
bc.data[[group_by(SSHID) %>%
summarise(mean=mean(Incidence), sd=sd(Incidence), n=n())
## # A tibble: 2 x 4
## SSHID mean sd n
## <int> <dbl> <dbl> <int>
## 1 1 31.6 8.07 125
## 2 2 17.5 5.12 117
The Sandwich mapping model is performed using the
sandwich.model
function, which outputs the mean value of
the sampling attribute and its standard error for each reporting
unit.
# Perform the SSH based spatial interpolation
<- sandwich.model(object=bc.data, sampling.attr="Incidence", type="txt",
bc.sw ssh.id.col="SSHID", ssh.weights=list(c(1,2), c("W1","W2")))
head(bc.sw$object)
## GBCODE W1 W2 mean se df
## 1 110100 0.8736000 0.1264000 29.78768 0.7965111 240
## 2 110112 0.6115000 0.3885000 26.11263 0.6894776 240
## 3 110113 0.5378000 0.4622000 25.07924 0.6604525 240
## 4 110221 0.5245597 0.4754403 24.89360 0.6552986 240
## 5 110224 0.5245597 0.4754403 24.89360 0.6552986 240
## 6 110226 0.5245597 0.4754403 24.89360 0.6552986 240
summary(bc.sw)
## [1] "Sample size: 242"
## [1] "Number of SSH strata: 2"
## [1] "Number of reporting units: 2352"
## mean se
## Min. :17.54 Min. :0.4732
## 1st Qu.:21.22 1st Qu.:0.5579
## Median :22.38 Median :0.5876
## Mean :22.73 Mean :0.5986
## 3rd Qu.:23.86 3rd Qu.:0.6270
## Max. :31.56 Max. :0.8498
To evaluate the overall accuracy of SSH-based spatial interpolation,
a k-fold cross validation can be performed using the
sandwich.cv
function. A diagnostic statistic called root
mean square error (RMSE) will be calculated. Here, we present the result
of a 5-fold cross validation.
# Perform k-fold cross validation
set.seed(0)
<- sandwich.cv(object=bc.data, sampling.attr="Incidence", k=5, type="txt",
bc.cv ssh.id.col="SSHID", reporting.id.col="GBCODE",
ssh.weights=list(c(1,2), c("W1","W2")))
bc.cv
## [1] 8.680166
Wang, J. F., Haining, R., Liu, T. J., Li, L. F., & Jiang, C. S. (2013). Sandwich estimation for multi-unit reporting on a stratified heterogeneous surface. Environment and Planning A, 45(10), 2515-2534.↩︎
Wang, J. F., Haining, R., Liu, T. J., Li, L. F., & Jiang, C. S. (2013). Sandwich estimation for multi-unit reporting on a stratified heterogeneous surface. Environment and Planning A, 45(10), 2515-2534.↩︎
Wang, J. F., Li, X. H., Christakos, G., Liao, Y. L., Zhang, T., Gu, X., & Zheng, X. Y. (2010). Geographical detectors-based health risk assessment and its application in the neural tube defects study of the Heshun Region, China. International Journal of Geographical Information Science, 24(1), 107-127.↩︎