A Tutorial for the sandwichr R Package

The sandwichr package performs spatial interpolation¹ based on the spatial stratified heterogeneity (SSH) theory². Three inputs are required for this model: sampling, SSH, and reporting layers. By utilizing the known distribution of the population in the SSH layer, this model seeks to estimate the mean value of the sampling attribute and its standard error for each reporting unit. The Sandwich model is still applicable when the spatial dependence is weak because it does not rely on the spatial dependence, provided that the geospatial surface is properly stratified.

First, we need to install the sandwichr package from CRAN (only do this once).

Then use library() to load the installed package into your environment. Remember to load the package every time you use it.

Here,we present two case studies to demonstrate the functionality of this package.

Simulated data

In the first case study, interpolation is performed on a simulated geospatial surface. The population data is organized as a 20*20 grid. This population is divided into four continuous strata. We generate 41 sampling units at random within the grid. Based on the sample, we would really like to infer the values of seven reporting units that make up the surface. The sampling, SSH, and reporting layers are all in the shapefile format.

Loading data

To successfully run the model, the load.data.shp function is used to convert the shapefiles of sampling, stratification, and reporting layers into a list of sf (simple feature; see Simple Features for R for references) objects for model input. You can specify the directory and names for your input files. It should be noted that these input files must be located in the same directory.

# Input data from shapefiles
sim.sampling.name <- system.file("extdata", "sim.sampling.shp", 
                                package="sandwichr")
sim.ssh.name <- system.file("extdata", "sim.ssh.shp", 
                           package="sandwichr")
sim.reporting.name <- system.file("extdata", "sim.reporting.shp", 
                                 package="sandwichr")

sim.data <- load.data.shp(sampling.file=sim.sampling.name, 
                      ssh.file=sim.ssh.name,
                      reporting.file=sim.reporting.name)

# Sampling
head(sim.data[[1]])

## Simple feature collection with 6 features and 4 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 0.5549496 ymin: 2.956588 xmax: 3.99286 ymax: 5.2793
## Geodetic CRS:  WGS 84
## # A tibble: 6 x 5
##      id Value   Lon   Lat             geometry
##   <dbl> <dbl> <dbl> <dbl>          <POINT [°]>
## 1     0  401. 3.51   4.50  (3.506225 4.504105)
## 2     1  209. 1.48   2.96  (1.478404 2.956588)
## 3     2  412. 3.99   5.28     (3.99286 5.2793)
## 4     3  310. 0.666  3.15 (0.6659313 3.152236)
## 5     4  344. 2.39   3.70  (2.393734 3.699795)
## 6     5  111. 0.555  5.16 (0.5549496 5.161923)

class(sim.data[[1]])

## [1] "sf"         "tbl_df"     "tbl"        "data.frame"

attributes(sim.data[[1]])

## $names
## [1] "id"       "Value"    "Lon"      "Lat"      "geometry"
## 
## $row.names
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41
## 
## $class
## [1] "sf"         "tbl_df"     "tbl"        "data.frame"
## 
## $sf_column
## [1] "geometry"
## 
## $agr
##    id Value   Lon   Lat 
##  <NA>  <NA>  <NA>  <NA> 
## Levels: constant aggregate identity

# Stratification
head(sim.data[[2]])

## Simple feature collection with 4 features and 1 field
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: 0 ymin: 2 xmax: 4 ymax: 6
## Geodetic CRS:  WGS 84
## # A tibble: 4 x 2
##       X                                                                 geometry
##   <dbl>                                                            <POLYGON [°]>
## 1     1 ((0.2 5, 0 5, 0 5.2, 0 5.4, 0 5.6, 0 5.8, 0 6, 0.2 6, 0.4 6, 0.6 6, 0.8~
## 2     3 ((0.8 3.8, 0.8 3.6, 1 3.6, 1.2 3.6, 1.4 3.6, 1.4 3.4, 1.4 3.2, 1.4 3, 1~
## 3     2 ((3.8 3.6, 3.8 3.4, 3.8 3.2, 3.8 3, 3.6 3, 3.4 3, 3.2 3, 3 3, 2.8 3, 2.~
## 4     4 ((4 2.8, 4 2.6, 4 2.4, 4 2.2, 4 2, 3.8 2, 3.6 2, 3.4 2, 3.2 2, 3 2, 2.8~

class(sim.data[[2]])

## [1] "sf"         "tbl_df"     "tbl"        "data.frame"

attributes(sim.data[[2]])

## $names
## [1] "X"        "geometry"
## 
## $row.names
## [1] 1 2 3 4
## 
## $class
## [1] "sf"         "tbl_df"     "tbl"        "data.frame"
## 
## $sf_column
## [1] "geometry"
## 
## $agr
##    X 
## <NA> 
## Levels: constant aggregate identity

# Reporting
head(sim.data[[3]])

## Simple feature collection with 6 features and 1 field
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: 5.684342e-14 ymin: 2 xmax: 4 ymax: 6
## Geodetic CRS:  WGS 84
## # A tibble: 6 x 2
##       Y                                                                 geometry
##   <dbl>                                                            <POLYGON [°]>
## 1     1 ((0.6 4, 0.6 5, 0.8 5, 1 5, 1.2 5, 1.4 5, 1.6 5, 1.8 5, 2 5, 2 4.8, 2 4~
## 2     3 ((1 3.4, 1 3.2, 1 3, 1.2 3, 1.4 3, 1.6 3, 1.6 2.8, 1.8 2.8, 2 2.8, 2.2 ~
## 3     2 ((2.4 3.6, 1.6 3.6, 1.6 3.8, 1.6 4, 1.8 4, 2 4, 2 4.2, 2 4.4, 2 4.6, 2 ~
## 4     4 ((4 3.8, 4 3.6, 4 3.4, 4 3.2, 3.8 3.2, 3.8 3, 3.8 2.8, 3.6 2.8, 3.6 3, ~
## 5     5 ((1 3.4, 1 4, 1.2 4, 1.4 4, 1.6 4, 1.6 3.8, 1.6 3.6, 1.6 3.4, 1.4 3.4, ~
## 6     6 ((1.6 3.6, 2.4 3.6, 2.4 3.4, 2.4 3.2, 2.4 3, 2.4 2.8, 2.2 2.8, 2 2.8, 1~

class(sim.data[[3]])

## [1] "sf"         "tbl_df"     "tbl"        "data.frame"

attributes(sim.data[[3]])

## $names
## [1] "Y"        "geometry"
## 
## $row.names
## [1] 1 2 3 4 5 6 7
## 
## $class
## [1] "sf"         "tbl_df"     "tbl"        "data.frame"
## 
## $sf_column
## [1] "geometry"
## 
## $agr
##    Y 
## <NA> 
## Levels: constant aggregate identity

Calculating Moran’s I

We use the Moran’s I to evaluate the spatial dependence of the simulated data.

sim.dists <- as.matrix(dist(cbind(sim.data[[1]]$Lon, sim.data[[1]]$Lat)))
sim.dists.inv <- 1/sim.dists
diag(sim.dists.inv) <- 0
Moran.I(sim.data[[1]]$Value, sim.dists.inv)

## $observed
## [1] 0.234028
## 
## $expected
## [1] -0.025
## 
## $sd
## [1] 0.03428651
## 
## $p.value
## [1] 4.196643e-14

Selection of the stratification layer(s)

The accuracy of SSH-based spatial interpolation is determined by the stratification layer. In an ideal stratification layer, values of the target attribute are expected to be homogeneous within each stratum and to differ between the strata. To determine a proper stratification layer for the Sandwich model, the geographical detector model³ is applied to quantify the SSH of the target attribute with regard to the candidate stratification(s).

First, we combine the sampling and candidate stratification layers into a single data frame.

# Prepare the stratification layer(s) for evaluation
sim.join <- ssh.data.shp(object=sim.data[[1]], ssh.lyr=sim.data[[2]], ssh.id="X")
head(sim.join)

## Simple feature collection with 6 features and 5 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 0.5549496 ymin: 2.956588 xmax: 3.99286 ymax: 5.2793
## Geodetic CRS:  WGS 84
## # A tibble: 6 x 6
##      id Value   Lon   Lat             geometry     X
##   <dbl> <dbl> <dbl> <dbl>          <POINT [°]> <dbl>
## 1     0  401. 3.51   4.50  (3.506225 4.504105)     2
## 2     1  209. 1.48   2.96  (1.478404 2.956588)     4
## 3     2  412. 3.99   5.28     (3.99286 5.2793)     2
## 4     3  310. 0.666  3.15 (0.6659313 3.152236)     3
## 5     4  344. 2.39   3.70  (2.393734 3.699795)     2
## 6     5  111. 0.555  5.16 (0.5549496 5.161923)     1

The factor detector q-statistic in the geographical detector model is applied through the ssh.test function to measure the SSH of the sampling data in terms of different stratifications. This function takes a data frame generated by the ssh.data.shp (or ssh.data.txt, which will be introduced later) function (argument object), where the sampling attribute (argument y) and the strata ID(s) (argument x) must be specified. In this example, the output of ssh.test (q = .81) implies that the candidate stratification layer has a high determinant power over the attribute. Therefore, it will be reasonable to select the stratification layer for subsequent interpolation.

# Calculate the geographical detector q-statistic
ssh.test(object=sim.join, y="Value", x=c("X"), test="factor")

## [[1]]
##   q-statistic      p-value
## X   0.9212376 3.764869e-10

We can visualize the mean and standard deviation of the sample in each stratum.

p = ggerrorplot(sim.join, x = "X", y = "Value", 
            desc_stat = "mean_sd", color = "black",
            add = "violin", add.params = list(color = "darkgray")
            )

p + theme(axis.title.x = element_blank())

sim.join %>%                                        
  group_by(X) %>%                         
  summarise_at(vars(Value),              
               list(name = mean))

## Simple feature collection with 4 features and 2 fields
## Geometry type: MULTIPOINT
## Dimension:     XY
## Bounding box:  xmin: 0.05899652 ymin: 2.041932 xmax: 3.99286 ymax: 5.921005
## Geodetic CRS:  WGS 84
## # A tibble: 4 x 3
##       X  name                                                           geometry
##   <dbl> <dbl>                                                   <MULTIPOINT [°]>
## 1     1  93.8 ((0.4581337 5.024432), (0.5549496 5.161923), (0.4016744 5.921005)~
## 2     2 413.  ((2.92014 3.070043), (2.744322 3.482087), (3.207056 3.377478), (3~
## 3     3 312.  ((0.4727599 2.430498), (0.4892411 2.141553), (1.774888 2.379797),~
## 4     4 207.  ((3.783199 2.041932), (3.495714 2.111997), (3.113503 2.943583), (~

sim.join %>%
  group_by(X) %>%
  summarise(mean=mean(Value), sd=sd(Value), n=n())

## Simple feature collection with 4 features and 4 fields
## Geometry type: MULTIPOINT
## Dimension:     XY
## Bounding box:  xmin: 0.05899652 ymin: 2.041932 xmax: 3.99286 ymax: 5.921005
## Geodetic CRS:  WGS 84
## # A tibble: 4 x 5
##       X  mean    sd     n                                               geometry
##   <dbl> <dbl> <dbl> <int>                                       <MULTIPOINT [°]>
## 1     1  93.8  24.4     9 ((0.4581337 5.024432), (0.5549496 5.161923), (0.40167~
## 2     2 413.   38.7    17 ((2.92014 3.070043), (2.744322 3.482087), (3.207056 3~
## 3     3 312.   49.5    11 ((0.4727599 2.430498), (0.4892411 2.141553), (1.77488~
## 4     4 207.   18.8     4 ((3.783199 2.041932), (3.495714 2.111997), (3.113503 ~

Running the Sandwich mapping model

The Sandwich mapping model is performed using the sandwich.model function, which returns the mean value of the sampling attribute and its standard error for each reporting unit.

# Perform the SSH based spatial interpolation
sim.sw <- sandwich.model(object=sim.data, sampling.attr="Value", type="shp")
head(sim.sw$object)

## Simple feature collection with 6 features and 4 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: 5.684342e-14 ymin: 2 xmax: 4 ymax: 6
## Geodetic CRS:  WGS 84
## # A tibble: 6 x 5
##       Y                                               geometry  mean    se    df
##   <dbl>                                          <POLYGON [°]> <dbl> <dbl> <dbl>
## 1     1 ((0.6 4, 0.6 5, 0.8 5, 1 5, 1.2 5, 1.4 5, 1.6 5, 1.8 ~  381.  2.61    34
## 2     3 ((1 3.4, 1 3.2, 1 3, 1.2 3, 1.4 3, 1.6 3, 1.6 2.8, 1.~  298.  3.05    29
## 3     2 ((2.4 3.6, 1.6 3.6, 1.6 3.8, 1.6 4, 1.8 4, 2 4, 2 4.2~  262.  2.27    24
## 4     4 ((4 3.8, 4 3.6, 4 3.4, 4 3.2, 3.8 3.2, 3.8 3, 3.8 2.8~  401.  3.02    19
## 5     5 ((1 3.4, 1 4, 1.2 4, 1.4 4, 1.6 4, 1.6 3.8, 1.6 3.6, ~  390.  2.74    26
## 6     6 ((1.6 3.6, 2.4 3.6, 2.4 3.4, 2.4 3.2, 2.4 3, 2.4 2.8,~  357.  3.02    29

summary(sim.sw)

## [1] "Sample size:  41"
## [1] "Number of SSH strata:  4"
## [1] "Number of reporting units:  7"

##       mean             se                 geometry
##  Min.   :202.8   Min.   :1.668   POLYGON      :7  
##  1st Qu.:280.0   1st Qu.:2.435   epsg:4326    :0  
##  Median :357.1   Median :2.742   +proj=long...:0  
##  Mean   :327.4   Mean   :2.625                    
##  3rd Qu.:385.8   3rd Qu.:3.022                    
##  Max.   :400.6   Max.   :3.050

The interpolated values and standard errors can be visualized using plot.mean and plot.se().

# Plot the estimated mean values and standard errors
ggplot2::autoplot(object=sim.sw)

Model validation

To evaluate the overall accuracy of the Sandwich mapping model, a stratified k-fold cross validation is performed using the sandwich.cv function. A diagnostic statistic called cross-validation estimate (CVE) will be calculated. Here, we present the result of a stratified 5-fold cross validation.

# Perform k-fold cross validation
set.seed(0)
sim.cv <- sandwich.cv(object=sim.data, sampling.attr="Value", k=5, type="shp", ssh.id.col="X")
sim.cv

## [1] 110.4208

Breast cancer incidence in mainland China

The second case study interpolates the breast cancer incidence across mainland China based on 242 samples retrieved from the Chinese Cancer Registry Annual Report. The urban-rural classification (either urban or rural areas are typically not connected in space) is used as the stratification layer. County-level divisions are used as the reporting units. In this example, all the input data are in csv format.

Loading data

The load.data.txt function is used to prepare text files into a list of data frames for model input. Two files are needed for this function: one linking sampling and stratification layers, and the other linking reporting and stratification layers.

# Input data from text files
bc.sampling_ssh.name <- system.file("extdata", "bc_sampling_ssh.csv", 
                                package="sandwichr")
bc.reporting_ssh.name <- system.file("extdata", "bc_reporting_ssh.csv", 
                                 package="sandwichr")

bc.data <- load.data.txt(sampling_ssh.file=bc.sampling_ssh.name, 
                         reporting_ssh.file=bc.reporting_ssh.name)

# Sampling-stratification
head(bc.data[[1]])

##   GBCODE Incidence SSHID        X        Y
## 1 110100        42     1 115.9642 39.86494
## 2 120100        44     1 117.3913 39.01184
## 3 120225        26     2 117.4309 40.00251
## 4 130129         9     2 114.2797 37.61767
## 5 130181        36     1 115.2870 37.91756
## 6 130227        20     2 118.3576 40.23177

class(bc.data[[1]])

## [1] "data.frame"

# Reporting-stratification
head(bc.data[[2]])

##   GBCODE        W1        W2
## 1 110100 0.8736000 0.1264000
## 2 110112 0.6115000 0.3885000
## 3 110113 0.5378000 0.4622000
## 4 110221 0.5245597 0.4754403
## 5 110224 0.5245597 0.4754403
## 6 110226 0.5245597 0.4754403

class(bc.data[[2]])

## [1] "data.frame"

Calculating Moran’s I

We use the Moran’s I to evaluate the SAC of the breast cancer incidence.

bc.dists <- as.matrix(dist(cbind(bc.data[[1]]$X, bc.data[[1]]$Y)))
bc.dists.inv <- 1/bc.dists
diag(bc.dists.inv) <- 0
Moran.I(bc.data[[1]]$Incidence, bc.dists.inv)

## $observed
## [1] 0.04940197
## 
## $expected
## [1] -0.004149378
## 
## $sd
## [1] 0.008379477
## 
## $p.value
## [1] 1.650469e-10

Selection of the stratification layer(s)

First, we use the ssh.data.txt function to convert the input data for evaluation.

# Prepare the stratification layer for evaluation
bc.join <- ssh.data.txt(object=bc.data)
head(bc.join)

##   GBCODE Incidence SSHID        X        Y
## 1 110100        42     1 115.9642 39.86494
## 2 120100        44     1 117.3913 39.01184
## 3 120225        26     2 117.4309 40.00251
## 4 130129         9     2 114.2797 37.61767
## 5 130181        36     1 115.2870 37.91756
## 6 130227        20     2 118.3576 40.23177

The factor detector q-statistic in the geographical detector model is applied through the ssh.test function to measure the SSH of the sample regarding the urban-rural classification. In this example, the output of ssh.test implies a relatively high level of SSH in the distribution of breast cancer incidence concerning the given stratification (q = .52). Therefore, it is reasonable to select the urban-rural classification as the stratification layer for subsequent interpolation.

# Calculate the geographical detector q-statistic
ssh.test(object=bc.join, y="Incidence", x="SSHID", test="factor", type="txt")

## [[1]]
##       q-statistic      p-value
## SSHID   0.5165114 7.219864e-11

We can visualize the mean and standard deviation of the sampled breast cancer incidence in each stratum.

p = ggerrorplot(bc.data[[1]], x = "SSHID", y = "Incidence", 
            desc_stat = "mean_sd", color = "black",
            add = "violin", add.params = list(color = "darkgray")
            )

p + scale_x_discrete(labels=c("1" = "Urban", "2" = "Rural")) + 
  theme(axis.title.x = element_blank()) + labs(y="Breast Cancer Incidence (Rate per 100,000)")

bc.data[[1]] %>%                                        
  group_by(SSHID) %>%                         
  summarise_at(vars(Incidence),              
               list(name = mean))

## # A tibble: 2 x 2
##   SSHID  name
##   <int> <dbl>
## 1     1  31.6
## 2     2  17.5

bc.data[[1]] %>%
  group_by(SSHID) %>%
  summarise(mean=mean(Incidence), sd=sd(Incidence), n=n())

## # A tibble: 2 x 4
##   SSHID  mean    sd     n
##   <int> <dbl> <dbl> <int>
## 1     1  31.6  8.07   125
## 2     2  17.5  5.12   117

Running the Sandwich mapping model

The Sandwich mapping model is performed using the sandwich.model function, which outputs the mean value of the sampling attribute and its standard error for each reporting unit.

# Perform the SSH based spatial interpolation
bc.sw <- sandwich.model(object=bc.data, sampling.attr="Incidence", type="txt", 
                        ssh.id.col="SSHID", ssh.weights=list(c(1,2), c("W1","W2")))
head(bc.sw$object)

##   GBCODE        W1        W2     mean        se  df
## 1 110100 0.8736000 0.1264000 29.78768 0.7965111 240
## 2 110112 0.6115000 0.3885000 26.11263 0.6894776 240
## 3 110113 0.5378000 0.4622000 25.07924 0.6604525 240
## 4 110221 0.5245597 0.4754403 24.89360 0.6552986 240
## 5 110224 0.5245597 0.4754403 24.89360 0.6552986 240
## 6 110226 0.5245597 0.4754403 24.89360 0.6552986 240

summary(bc.sw)

## [1] "Sample size:  242"
## [1] "Number of SSH strata:  2"
## [1] "Number of reporting units:  2352"

##       mean             se        
##  Min.   :17.54   Min.   :0.4732  
##  1st Qu.:21.22   1st Qu.:0.5579  
##  Median :22.38   Median :0.5876  
##  Mean   :22.73   Mean   :0.5986  
##  3rd Qu.:23.86   3rd Qu.:0.6270  
##  Max.   :31.56   Max.   :0.8498

Model validation

To evaluate the overall accuracy of SSH-based spatial interpolation, a k-fold cross validation can be performed using the sandwich.cv function. A diagnostic statistic called root mean square error (RMSE) will be calculated. Here, we present the result of a 5-fold cross validation.

# Perform k-fold cross validation
set.seed(0)
bc.cv <- sandwich.cv(object=bc.data, sampling.attr="Incidence", k=5, type="txt", 
                     ssh.id.col="SSHID", reporting.id.col="GBCODE", 
                     ssh.weights=list(c(1,2), c("W1","W2")))
bc.cv

## [1] 8.680166

Wang, J. F., Haining, R., Liu, T. J., Li, L. F., & Jiang, C. S. (2013). Sandwich estimation for multi-unit reporting on a stratified heterogeneous surface. Environment and Planning A, 45(10), 2515-2534.↩︎
Wang, J. F., Haining, R., Liu, T. J., Li, L. F., & Jiang, C. S. (2013). Sandwich estimation for multi-unit reporting on a stratified heterogeneous surface. Environment and Planning A, 45(10), 2515-2534.↩︎
Wang, J. F., Li, X. H., Christakos, G., Liao, Y. L., Zhang, T., Gu, X., & Zheng, X. Y. (2010). Geographical detectors-based health risk assessment and its application in the neural tube defects study of the Heshun Region, China. International Journal of Geographical Information Science, 24(1), 107-127.↩︎

A Tutorial for the `sandwichr` R Package

2023-01-08

Simulated data

Loading data

Calculating Moran’s I

Selection of the stratification layer(s)

Running the Sandwich mapping model

Model validation

Breast cancer incidence in mainland China

Loading data

Calculating Moran’s I

Selection of the stratification layer(s)

Running the Sandwich mapping model

Model validation