If the code in this vignette has not been evaluated, a rendered version is available on the documentation site under ‘Articles’.
library(ggplot2)
library(dplyr)
library(sdmTMB)
In this vignette, we describe the basic steps to fitting a spatial or spatiotemporal GLMM with sdmTMB. This type of model can be useful for (dynamic, i.e. changing through time) species distribution models and relative abundance index standardization among many other uses. See the model description for full model structure and equations.
We will use built-in package data for Pacific cod from a fisheries independent trawl survey.
sdmTMB::add_utm_columns()
.depth_scaled
) and depth
squared (depth_scaled2
).glimpse(pcod)
The most basic model structure possible in sdmTMB replicates a GLM as
can be fit with glm()
or a GLMM as can be fit with lme4 or
glmmTMB, for example. The spatial components in sdmTMB are included as
random fields using a triangulated mesh with vertices, known as knots,
used to approximate the spatial variability in observations. Bilinear
interpolation is used to approximate a continuous spatial field (Rue et
al., 2009; Lindgren et al., 2011) from the estimated values of the
spatial surface at these knot locations to other locations including
those of actual observations. These spatial random effects are assumed
to be drawn from Gaussian Markov random fields (e.g., Cressie &
Wikle, 2011; Lindgren et al., 2011) with covariance matrices that are
constrained by Matérn covariance functions (Cressie & Wikle,
2011).
There are different options for creating the spatial mesh (see
sdmTMB::make_mesh()
). We will start with a relatively
coarse mesh for a balance between speed and accuracy
(cutoff = 10
, where cutoff is in the units of X and Y (km
here) and represents the minimum distance between knots before a new
mesh vertex is added). Smaller values create meshes with more knots. You
will likely want to use a higher resolution mesh (more knots) in applied
scenarios, but care must be taken to avoid overfitting. The circles
represent observations and the vertices are the knot locations.
<- make_mesh(pcod, c("X", "Y"), cutoff = 10)
mesh plot(mesh)
We will start with a logistic regression of Pacific cod presence in
tows as a function of depth and depth squared. We will first use
sdmTMB()
without any spatial random effects
(spatial = "off"
):
<- sdmTMB(
m data = pcod,
formula = present ~ depth_scaled + depth_scaled2,
mesh = mesh, # can be omitted for a non-spatial model
family = binomial(link = "logit"),
spatial = "off"
)
mAIC(m)
For comparison, here’s the same model with glm()
:
<- glm(
m0 data = pcod,
formula = present ~ depth_scaled + depth_scaled2,
family = binomial(link = "logit")
)summary(m0)
Notice that the AIC, log likelihood, parameter estimates, and standard errors are all identical.
Next, we can incorporate spatial random effects into the above model
by changing spatial
to "on"
and see that this
changes coefficient estimates:
<- sdmTMB(
m1 data = pcod,
formula = present ~ depth_scaled + depth_scaled2,
mesh = mesh,
family = binomial(link = "logit"),
spatial = "on"
)
m1AIC(m1)
To add spatiotemporal random fields to this model, we need to include
both the time argument that indicates what column of your data frame
contains the time slices at which spatial random fields should be
estimated (e.g., time = "year"
) and we need to choose
whether these fields are independent and identically distributed
(spatiotemporal = "IID"
), first-order autoregressive
(spatiotemporal = "AR1"
), or as a random walk
(spatiotemporal = "RW"
). We will stick with IID for these
examples.
<- sdmTMB(
m2 data = pcod,
formula = present ~ depth_scaled + depth_scaled2,
mesh = mesh,
family = binomial(link = "logit"),
spatial = "on",
time = "year",
spatiotemporal = "IID"
) m2
We can also model biomass density using a Tweedie distribution. We’ll
switch to poly()
notation to make some of the plotting
easier.
<- sdmTMB(
m3 data = pcod,
formula = density ~ poly(log(depth), 2),
mesh = mesh,
family = tweedie(link = "log"),
spatial = "on",
time = "year",
spatiotemporal = "IID"
) m3
We can view the confidence intervals on the fixed effects by using the tidy function:
tidy(m3, conf.int = TRUE)
And similarly for the random effect and variance parameters:
tidy(m3, "ran_pars", conf.int = TRUE)
Note that standard errors are not reported when coefficients are in log space, but the confidence intervals are reported. These parameters are defined as follows:
range
: A derived parameter that defines the distance
at which 2 points are effectively independent (actually about 13%
correlated). If the share_range
argument is changed to
FALSE
then the spatial and spatiotemporal ranges will be
unique, otherwise the default is for both to share the same
range.
phi
: Observation error scale parameter (e.g., SD in
Gaussian).
sigma_O
: SD of the spatial process
(“Omega”).
sigma_E
: SD of the spatiotemporal process
(“Epsilon”).
tweedie_p
: Tweedie p (power) parameter; between 1
and 2.
If the model used AR1 spatiotemporal fields then:
rho
: Spatiotemporal correlation between years; between
-1 and 1.
If the model includes a spatial_varying
predictor
then:
sigma_Z
: SD of spatially varying coefficient field
(“Zeta”).
We can inspect randomized quantile residuals:
$resids <- residuals(m3) # randomized quantile residuals
pcodqqnorm(pcod$resids)
qqline(pcod$resids)
ggplot(pcod, aes(X, Y, col = resids)) +
scale_colour_gradient2() +
geom_point() +
facet_wrap(~year) +
coord_fixed()
Those were fast to calculate but can look ‘off’ even when the model
is consistent with the data. MCMC-based residuals are more reliable but
slow. In practice you would like want more mcmc_iter
and
mcmc_warmup
. Total samples is
mcmc_iter - mcmc_warmup
. We will also just use the spatial
model here so the vignette builds quickly.
<- residuals(m3, "mle-mcmc", mcmc_warmup = 100, mcmc_iter = 101)
r qqnorm(r)
qqline(r)
See ?residuals.sdmTMB()
.
Now, for the purposes of this example (e.g., visualization), we want
to predict on a fine-scale grid on the entire survey domain. There is a
grid built into the package for Queen Charlotte Sound named
qcs_grid
. Our prediction grid also needs to have all the
covariates that we used in the model above.
glimpse(qcs_grid)
Now we will make the predictions on new data:
<- predict(m3, newdata = qcs_grid) predictions
Let’s make a small function to help make maps.
<- function(dat, column) {
plot_map ggplot(dat, aes(X, Y, fill = {{ column }})) +
geom_raster() +
coord_fixed()
}
There are four kinds of predictions that we get out of the model.
First, we will show the predictions that incorporate all fixed effects and random effects:
plot_map(predictions, exp(est)) +
scale_fill_viridis_c(
trans = "sqrt",
# trim extreme high values to make spatial variation more visible
na.value = "yellow", limits = c(0, quantile(exp(predictions$est), 0.995))
+
) facet_wrap(~year) +
ggtitle("Prediction (fixed effects + all random effects)",
subtitle = paste("maximum estimated biomass density =", round(max(exp(predictions$est))))
)
We can also look at just the fixed effects, here only a quadratic effect of depth:
plot_map(predictions, exp(est_non_rf)) +
scale_fill_viridis_c(trans = "sqrt") +
ggtitle("Prediction (fixed effects only)")
We can look at the spatial random effects that represent consistent deviations in space through time that are not accounted for by our fixed effects. In other words, these deviations represent consistent biotic and abiotic factors that are affecting biomass density but are not accounted for in the model.
plot_map(predictions, omega_s) +
scale_fill_gradient2() +
ggtitle("Spatial random effects only")
And finally we can look at the spatiotemporal random effects that represent deviation from the fixed effect predictions and the spatial random effect deviations. These represent biotic and abiotic factors that are changing through time and are not accounted for in the model.
plot_map(predictions, epsilon_st) +
scale_fill_gradient2() +
facet_wrap(~year) +
ggtitle("Spatiotemporal random effects only")
We can also estimate the uncertainty in our spatiotemporal density
predictions using simulations from the joint precision matrix by setting
nsim > 0
in the predict function. Here we generate 100
estimates and use apply()
to calculate upper and lower
confidence intervals, a standard deviation, and a coefficient of
variation (CV).
<- predict(m3, newdata = qcs_grid, nsim = 100)
sim <- sim[qcs_grid$year == max(qcs_grid$year), ] # just plot last year
sim_last <- predictions[predictions$year == max(qcs_grid$year), ]
pred_last $lwr <- apply(exp(sim_last), 1, quantile, probs = 0.025)
pred_last$upr <- apply(exp(sim_last), 1, quantile, probs = 0.975)
pred_last$sd <- round(apply(exp(sim_last), 1, function(x) sd(x)), 2)
pred_last$cv <- round(apply(exp(sim_last), 1, function(x) sd(x) / mean(x)), 2) pred_last
Plot the CV on the estimates:
ggplot(pred_last, aes(X, Y, fill = cv)) +
geom_raster() +
scale_fill_viridis_c()
We can visualize the conditional effect of any covariates by feeding simplified data frames to the predict function that fix covariate values we want fixed (e.g., at means) and vary parameters we want to visualize (across a range of values):
<- data.frame(
nd depth = seq(min(pcod$depth),
max(pcod$depth),
length.out = 100
),year = 2015L # a chosen year
)<- predict(m3, newdata = nd, se_fit = TRUE, re_form = NA)
p
ggplot(p, aes(depth, exp(est),
ymin = exp(est - 1.96 * est_se),
ymax = exp(est + 1.96 * est_se)
+
)) geom_line() +
geom_ribbon(alpha = 0.4) +
scale_x_continuous() +
coord_cartesian(expand = F) +
labs(x = "Depth (m)", y = "Biomass density (kg/km2)")
We could also do this with the visreg package. This version is in
link space and the residuals are partial randomized quantile residuals.
See the scale
argument in visreg for response scale
plots.
::visreg(m3, "depth") visreg
Or the ggeffects package for a marginal effects plot. This will also be faster since it relies on the already estimated coefficients and variance-covariance matrix.
::ggeffect(m3, "depth [0:500 by=1]") %>% plot() ggeffects
We could also let the effect of depth vary through time. In this
example, it helps to give each year a separate mean effect
(as.factor(year)
). The ~ 0
part of the formula
omits the intercept. For models like this that take longer to fit, we
might want to set silent = FALSE
so we can monitor
progress.
<- sdmTMB(
m4 ~ 0 + as.factor(year),
density data = pcod,
time_varying = ~ 0 + depth_scaled + depth_scaled2,
mesh = mesh,
family = tweedie(link = "log"),
spatial = "on",
time = "year",
spatiotemporal = "IID"
)
m4AIC(m4)
To plot these, we make a data frame that contains all combinations of
the time-varying covariate and time. This is easily created using
expand.grid()
or tidyr::expand_grid()
.
<- expand.grid(
nd depth_scaled = seq(min(pcod$depth_scaled) + 0.2,
max(pcod$depth_scaled) - 0.2,
length.out = 50
),year = unique(pcod$year) # all years
)$depth_scaled2 <- nd$depth_scaled^2
nd
<- predict(m4, newdata = nd, se_fit = TRUE, re_form = NA)
p
ggplot(p, aes(depth_scaled, exp(est),
ymin = exp(est - 1.96 * est_se),
ymax = exp(est + 1.96 * est_se),
group = as.factor(year)
+
)) geom_line(aes(colour = year), lwd = 1) +
geom_ribbon(aes(fill = year), alpha = 0.1) +
scale_colour_viridis_c() +
scale_fill_viridis_c() +
scale_x_continuous(labels = function(x) round(exp(x * pcod$depth_sd[1] + pcod$depth_mean[1]))) +
coord_cartesian(expand = F) +
labs(x = "Depth (m)", y = "Biomass density (kg/km2)")