Panel data is common in various fields such as the social sciences. Such data consist of multiple subjects followed over several time points, often with many measurements per individual at each time, for example the family status and income of each individual at each time point of interest. These data can be analyzed in various ways, depending on the research questions and the characteristics of the data, such as the number of individuals and time points and the assumed distribution of the response variables. Popular, somewhat overlapping modelling approaches include dynamic panel models, fixed effect models, cross-lagged panel models (CLPM), and CLPM with fixed or random effects (Mulder and Hamaker 2021). Parameter estimation is typically based on ordinary least squares, the generalized method of moments, or maximum likelihood using structural equation models.
There are several R (R Core Team 2021) packages on CRAN focusing on panel data analysis. `plm` (Croissant and Millo 2008) provides various estimation methods and tests for linear panel data models, while `fixest` (Bergé 2018) supports multiple fixed effects and different distributions of response variables (based on `stats::family`). `panelr` (Long 2020) contains tools for panel data manipulation and estimation methods for “within-between” models which combine fixed effect and random effect models; this is done by using the `lme4` (Bates et al. 2015), `geepack` (Halekoh, Højsgaard, and Yan 2006), and `brms` (Bürkner 2018) packages as backends. Finally, the `lavaan` (Rosseel 2012) package provides methods for general structural equation modelling and can thus be used to estimate various panel data models such as cross-lagged panel models with fixed or random intercepts.
In traditional panel data models such as those provided by the aforementioned packages, the number of time points is often assumed to be relatively small, say less than 10, while the number of individuals can be in the hundreds or thousands. Due to the small number of time points, the effects of predictors are typically assumed to be time-invariant (although some extensions to time-varying effects have emerged (Hayakawa and Hou 2019)). On the other hand, when the number of time points is moderate or large, say hundreds or thousands, it can be reasonable to assume that the dynamics of the system change over time. There are various methods to model time-varying effects in (generalized) linear models based on state space models (Harvey and Phillips 1982; Durbin and Koopman 2012), with various R implementations (J. Helske 2022; Knaus et al. 2021; Brodersen et al. 2014). The state space modelling approaches are highly flexible but often computationally demanding due to the assumed latent processes for the regression coefficients and an analytically intractable marginal likelihood, so they are most useful in settings with only a few subjects. Methods based on varying coefficient models (Hastie and Tibshirani 1993; Eubank et al. 2004) can be computationally more convenient, especially when the varying coefficients are based on splines. R implementations of time-varying coefficient models include (Casas and Fernandez-Casal 2019; Dziak et al. 2021).
The `dynamite` package provides an alternative approach to panel data inference which avoids some of the limitations and drawbacks of the aforementioned methods. First, `dynamite` supports estimating effects which vary smoothly over time according to splines, with penalization based on random walk priors. This allows modelling, for example, effects of interventions which increase or decrease over time. Second, `dynamite` supports a wide variety of distributions for the response variables, such as the categorical distribution commonly encountered in life-course research (S. Helske, Helske, and Eerola 2018). Third, estimation is done in a fully Bayesian fashion using Stan (Stan Development Team 2022b, 2022a), which leads to transparent and interpretable quantification of parameter and predictive uncertainty. Finally, with `dynamite` we can model an arbitrary number of measurements per individual (i.e., multiple channels following S. Helske and Helske (2019)). Jointly modelling all endogenous variables allows us to consider long-term effects of interventions which take into account the interdependencies of several variables of the model, and allows realistic simulation of complex counterfactual scenarios.
Consider an individual \(i\) with observations \(y_{i,t} = (y^1_{i,t}, \ldots, y^c_{i,t}, \ldots, y^C_{i,t})\), \(t = 1, \ldots, T\), \(i = 1, \ldots, N\), i.e., at each time point \(t\) we have \(C\) observations from \(N\) individuals. Assume that each element of \(y_{i,t}\) can depend on the past values \(y_{i,t-1}\), as well as on additional exogenous covariates \(x_{i,t}\). Then, assuming that the elements of \(y_{i,t}\) are independent given \(y_{i,t-1}\) and \(x_{i,t}\), we have \[ \begin{aligned} y_{i,t} &\sim p_t(y_{i,t} | y_{i,t-1}, x_{i,t}) = \prod_{c=1}^C p^c_t(y^c_{i,t} | y^1_{i,t-1}, \ldots, y^C_{i,t-1}, x_{i,t}). \end{aligned} \] Here the parameters of the conditional distributions \(p^c\) depend on time, which allows us to take into account the fact that the dynamics of our system can evolve over time.
For \(p^c_t\), given a suitable link function depending on our distributional assumptions, we define a linear predictor \(\eta^c_{i,t}\) for the channel \(c\) of the following general form: \[ \begin{aligned} \eta^c_{i,t} &= \nu_{i} + \alpha^c_{t} + Z^c_{i,t} \beta^c + X^c_{i,t} \delta^c_{t}, \end{aligned} \] where \(\nu_{i}\) is the individual-specific random intercept, \(\alpha^c_{t}\) is the common (possibly time-varying) intercept term, \(Z^c_{i,t}\) defines the predictors corresponding to the vector of time-invariant coefficients \(\beta^c\), and similarly for the time-varying part \(X^c_{i,t} \delta^c_{t}\). The random intercept terms \(\nu_{i}\) are assumed to follow a zero-mean Gaussian distribution and can be correlated across channels. Note that `dynamite` centers neither the response nor the predictor variables.
For the time-varying coefficients \(\delta\) (and similarly for \(\alpha\)), we use Bayesian P-splines (penalized B-splines) (Lang and Brezger 2004) where \[ \delta^c_{k,t} = B_t \omega_k^c, \quad k = 1, \ldots, K, \] where \(K\) is the number of covariates, \(B_t\) is a vector of B-spline values at time \(t\), and \(\omega_k^c\) is a vector of the corresponding spline coefficients. In general, the number of B-splines \(D\) used for constructing the splines for the study period \(1, \ldots, T\) can be chosen freely, but a too large \(D\) can result in overfitting. To mitigate this, we define a random walk prior for \(\omega^c_k\) as \[ \omega^c_{k,1} \sim p(\omega^c_{k,1}), \quad \omega^c_{k,d} \sim N(\omega^c_{k,d-1}, (\tau^c_k)^2), \quad d = 2, \ldots, D, \] with a user-defined prior \(p(\omega^c_{k,1})\) on the first coefficient, which due to the structure of \(B_1\) corresponds to the prior on \(\delta^c_{k,1}\). Here, the parameter \(\tau^c_k\) controls the smoothness of the spline curves.
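To illustrate this construction (a minimal sketch, not the package internals), the following code draws one realization of a time-varying coefficient from this prior using a B-spline basis from the base R `splines` package; the values of `T_max`, `D`, and `tau` are arbitrary choices for illustration:

```r
library(splines)
# A B-spline basis matrix of dimension T x D over the study period
T_max <- 20
D <- 6
B <- bs(seq_len(T_max), df = D, degree = 3, intercept = TRUE)
# One draw of the spline coefficients omega from the random walk prior,
# here with a standard normal prior on the first coefficient and tau = 0.5
tau <- 0.5
omega <- cumsum(c(rnorm(1), rnorm(D - 1, sd = tau)))
# The implied time-varying coefficient delta_t = B_t omega, t = 1, ..., T
delta <- as.numeric(B %*% omega)
plot(delta, type = "l", xlab = "t", ylab = expression(delta[t]))
```

Larger values of `tau` allow the coefficient to wander more freely over time, while small values shrink it toward a time-constant effect.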
The models in the `dynamite` package are defined by combining channel-specific formulas defined via the function `dynamiteformula`, for which a shorthand alias `obs` is available. The function `obs` takes two arguments: `formula` and `family`, which define how the response variable of the channel depends on the predictor variables in the standard R formula syntax and the family of the response variable, respectively. These channel-specific definitions are then combined into a single model with `+`. For example, the formula

```r
obs(y ~ lag(x), family = "gaussian") +
  obs(x ~ z, family = "poisson")
```

defines a model with two channels: first we declare that `y` is a Gaussian variable depending on the previous value of `x` (`lag(x)`), and then we add a second channel declaring `x` as Poisson distributed depending on some exogenous variable `z` (for which we do not define any distribution). Note that the model formula can be defined without any reference to external data, just as an R formula can. Currently, the `dynamite` package supports the following distributions for the observations:
- `categorical` (with a softmax link using the first category as reference; see the documentation of `categorical_logit_glm` in the Stan function reference manual, https://mc-stan.org/users/documentation/)
- `gaussian` (identity link, parameterized using mean and standard deviation)
- `poisson` (log-link, with an optional known offset variable)
- `negbin` (log-link, using mean and dispersion parameterization, with an optional known offset variable; see the documentation on `NegBinomial2` in the Stan function reference manual)
- `bernoulli` (logit-link)
- `binomial` (logit-link)
- `exponential` (log-link)
- `gamma` (log-link, using mean and shape parameterization)
- `beta` (logit-link, using mean and precision parameterization)

The `dynamite` models do not support contemporaneous dependencies in order to avoid complex cyclic dependencies which would make handling missing data, subsequent predictions, and causal inference challenging or impossible. In other words, response variables can only depend on the previous values of other response variables. This is why we used `lag(x)` in the previous example, a shorthand for `lag(x, k = 1)`, which defines a one-step lag of the variable `x`. Longer lags can also be defined by adjusting the argument `k`.
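For instance, a dependence on the value of `x` two time points earlier could be written as follows (an illustration of the same syntax, not taken from the original example):

```r
# y depends on the value of x from two time points earlier
obs(y ~ lag(x, k = 2), family = "gaussian")
```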
A special model component `lags` can also be used to quickly add lagged responses as predictors. This component adds a lagged value of each response in the model as a predictor to every channel. For example, calling

```r
obs(y ~ z, family = "gaussian") +
  obs(x ~ z, family = "poisson") +
  lags(k = 1)
```

would add `lag(y, k = 1)` as a predictor of `x` and, conversely, `lag(x, k = 1)` as a predictor of `y`. Therefore, the previous code would produce exactly the same model as writing

```r
obs(y ~ z + lag(x, k = 1), family = "gaussian") +
  obs(x ~ z + lag(y, k = 1), family = "poisson")
```

Just as with the function `lag()`, the argument `k` in `lags` can be adjusted to add longer lags of each response to each channel, but for `lags` it can also be a vector, as shown in the sketch below. The inclusion of lagged response variables in the model implies that some time points have to be considered fixed in the estimation. The number of fixed time points in the model is equal to the largest shift value of any observed response variable in the model (defined either via `lag` or the global `lags`).
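For example, given the stated behavior that `k` may be a vector, the following sketch would add both first- and second-order lags of each response to every channel:

```r
# lag(y, 1), lag(y, 2), lag(x, 1), and lag(x, 2) all become predictors,
# so the first two time points are treated as fixed
obs(y ~ z, family = "gaussian") +
  obs(x ~ z, family = "poisson") +
  lags(k = 1:2)
```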
The formula within `obs` can also contain an additional special function `varying`, which defines the time-varying part of the model equation. For example, we could write `obs(x ~ z + varying(~ -1 + w), family = "poisson")`, which defines a model equation with a constant intercept, a time-invariant effect of `z`, and a time-varying effect of `w`. We also remove the duplicate intercept with `-1` within `varying` in order to avoid identifiability issues in the model estimation (we could instead define a time-varying intercept, in which case we would write `obs(x ~ -1 + z + varying(~ w), family = "poisson")`). The part of the formula not wrapped with `varying` is assumed to correspond to the fixed part of the model, so `obs(x ~ z + varying(~ -1 + w), family = "poisson")` is identical to `obs(x ~ -1 + fixed(~ z) + varying(~ -1 + w), family = "poisson")` and to `obs(x ~ fixed(~ z) + varying(~ -1 + w), family = "poisson")`. The use of `fixed` is therefore optional.
When defining varying effects, we also need to define how these time-varying regression coefficients behave. For this, a `splines` component should be added to the model, e.g., `obs(x ~ varying(~ -1 + w), family = "poisson") + splines(df = 10)` defines a cubic B-spline with 10 degrees of freedom for the time-varying coefficient corresponding to `w`. If the model contains multiple time-varying coefficients, the same spline basis is used for all of them, each with its own spline coefficients and standard deviation, as defined in the previous section.
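To illustrate the shared basis (with `u` as a hypothetical additional covariate, not part of the original example), both coefficients below would use the same 10-df cubic B-spline basis but have their own spline weights:

```r
# time-varying effects of w and u share the basis defined by splines()
obs(x ~ varying(~ -1 + w + u), family = "poisson") +
  splines(df = 10)
```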
Defining a random intercept term for each group can be done using the component `random`, where the first argument defines for which channels the intercept should be added, and the second argument defines whether or not these intercepts should be correlated between channels. This leads to a model where, in addition to the common intercept, each individual/group has their own intercept with a zero-mean normal prior and unknown standard deviation (or a multivariate Gaussian prior when `correlated = TRUE`), analogously to typical mixed models. Note, however, that random intercepts are currently not supported for the categorical distribution due to technical reasons.
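For example, correlated random intercepts for the channels of the earlier two-channel model could be requested along these lines (a sketch based on the description above; the first argument is passed positionally since only `correlated` is named in the text, see the package documentation for the exact interface):

```r
obs(y ~ lag(x), family = "gaussian") +
  obs(x ~ z, family = "poisson") +
  # first argument: channels that get a group-specific intercept;
  # correlated = TRUE requests correlation across the channels
  random(c("y", "x"), correlated = TRUE)
```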
In addition to declaring response variables via `obs`, we can also use the function `aux` to define auxiliary channels which are deterministic functions of other variables. Defining these auxiliary variables explicitly, instead of defining them implicitly on the right-hand side of the formulas (i.e., by using the “as is” function `I()`), makes the subsequent prediction steps clearer and allows easier checks of model validity. In fact, we do not allow the use of `I()` within `dynamiteformula`. The values of auxiliary variables are computed dynamically during prediction, making the use of lagged values and other transformations possible. An example of a formula using an auxiliary channel is

```r
obs(y ~ lag(log1x), family = "gaussian") +
  obs(x ~ z, family = "poisson") +
  aux(numeric(log1x) ~ log(1 + x) | init(0))
```
For auxiliary channels, the formula declaration `~` should be understood as a mathematical equality, where the right-hand side provides the defining expression. Thus the example above defines a new auxiliary channel whose response is `log1x`, defined as the logarithm of `1 + x`, and assigns it the type `numeric`. The type declaration is required, because it might not be possible to unambiguously determine the type of the response variable from its expression and the data alone, especially for `factor` responses.
Auxiliary variables can be used directly in the formulas of other channels, just like any other variable. Note that an auxiliary channel can also depend on other response variables without lags, unlike observed channels. The function `aux` does not use the `family` argument, which is automatically set to `deterministic`, a special channel type of `obs`. Note that lagged values of deterministic `aux` channels do not imply fixed time points. Instead, they must be given starting values using one of two special functions, `init` or `past`, placed after the main formula and separated by the `|` symbol.
In the example above, because the channel for `y` contains a lagged value of `log1x` as a predictor, we also need to supply `log1x` with a single initial value that determines the value of the lag at the first time point. Here, `init(0)` defines the initial value of `lag(log1x)` to be zero. In general, if the model contains higher-order lags of an auxiliary channel, then `init` can be supplied with a vector initializing each lag, as in the sketch below.
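For instance, if some channel used `lag(log1x, k = 2)`, two initial values would be needed, one per lag (an illustrative sketch of the vector form described above):

```r
# init(c(0, 0)) initializes lag(log1x, 1) and lag(log1x, 2) at the start
aux(numeric(log1x) ~ log(1 + x) | init(c(0, 0)))
```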
Note that `init` defines the same starting value for all individuals. As an alternative, the function `past` can be used, which takes an R expression as its argument and computes the starting value for each individual based on that expression. The expression is evaluated in the context of the `data` supplied to `dynamite`. For example, instead of `init(0)` in the example above, we could write

```r
obs(y ~ lag(log1x), family = "gaussian") +
  obs(x ~ z, family = "poisson") +
  aux(numeric(log1x) ~ log(1 + x) | past(log(z)))
```

which defines that the value of `lag(log1x)` at the first time point is `log(z)` for each individual, using the value of `z` in the supplied data to compute the actual value of the expression. The function `past` can also be used if the model contains higher-order lags of auxiliary responses; in this case, more observations from the variables bound by the given expression will simply be used.
The declared model formula is supplied to `dynamite`, which has the following arguments and default values:

```r
dynamite(dformula, data, group = NULL, time, priors = NULL, verbose = TRUE,
  debug = NULL, ...)
```

The first argument `dformula` is the `dynamiteformula` object that defines the model, and the second argument `data` is a `data.frame` that contains the variables used in the model formula. The argument `group` is a column name of `data` that specifies the unique groups (individuals), and similarly, `time` is a column name of `data` that specifies the unique time points. Note that `group` is optional: when `group = NULL`, we assume that there is only a single group (or individual). The argument `priors` supplies optional user-defined priors for the model parameters, and `...` supplies additional arguments to `rstan::sampling`, which handles the model estimation. Finally, `verbose` controls the verbosity of the output and `debug` can be used for various debugging options (see `?dynamite` for further information on these parameters). For example,

```r
dynamite(
  dformula = obs(x ~ varying(~ -1 + w), family = "poisson") + splines(df = 10),
  data = d,
  group = "id",
  time = "year",
  chains = 2,
  cores = 2
)
```
estimates the model using the data in the data frame `d`, which contains the variables `year` and `id` (defining the time-index and group-index variables of the data). The arguments `chains` and `cores` are passed to `rstan::sampling`, which then uses two parallel Markov chains in the Markov chain Monte Carlo (MCMC) sampling of the model parameters.
As an example, consider the following simulated multichannel data:

```r
library(dynamite)
head(multichannel_example)
#>   id time          g  p b
#> 1  1    1 -0.6264538  5 1
#> 2  1    2 -0.2660091 12 0
#> 3  1    3  0.4634939  9 1
#> 4  1    4  1.0451444 15 1
#> 5  1    5  1.7131026 10 1
#> 6  1    6  2.1382398  8 1
```
The data contains 50 unique groups (variable `id`) over 20 time points (`time`), a continuous variable `g`, a variable with non-negative integer values `p`, and a binary variable `b`. We define the following model (which actually matches the data-generating process used in simulating the data):

```r
set.seed(1)
f <- obs(g ~ lag(g) + lag(logp), family = "gaussian") +
  obs(p ~ lag(g) + lag(logp) + lag(b), family = "poisson") +
  obs(b ~ lag(b) * lag(logp) + lag(b) * lag(g), family = "bernoulli") +
  aux(numeric(logp) ~ log(p + 1))
```
As a directed acyclic graph (DAG), the model for the first three time points is shown below (the deterministic channel `logp` is omitted for clarity):

[Figure: DAG of the model for the first three time points.]
We fit the model using the `dynamite` function. In order to satisfy the package size requirements of CRAN, we have to use a small number of iterations and additional thinning:

```r
multichannel_example_fit <- dynamite(
  f, multichannel_example, "id", "time",
  chains = 1, cores = 1, iter = 2000, warmup = 1000, init = 0, refresh = 0,
  thin = 5, save_warmup = FALSE)
```
You can update the estimated model which is included in the package as

```r
multichannel_example_fit <- update(multichannel_example_fit,
  iter = 2000,
  warmup = 1000,
  chains = 4,
  cores = 4,
  refresh = 0,
  save_warmup = FALSE
)
```
Note that the `dynamite` call above produces a warning message related to the deterministic channel `logp`:

```
#> Warning: Deterministic channel `logp` has a maximum lag of 1 but you've supplied no
#> initial values:
#> ℹ This may result in NA values for `logp`.
```

This happens because the formulas of the other channels have `lag(logp)` as a predictor, whose value is undefined at the first time point without an initial value. However, in this case we can safely ignore the warning because the model also contains lags of `b` and `g`, meaning that the first time point in the model is fixed and does not enter the model fitting process.
We can obtain posterior samples or summary statistics of the model using the `as.data.frame`, `coef`, and `summary` functions, but here we opt for visualizing the results using the `plot_betas` function:

```r
plot_betas(multichannel_example_fit)
```

Note the naming of the model parameters; for example, `beta_b_g_lag1` corresponds to a fixed coefficient `beta` of the lagged variable `g` in the channel for `b`.
Assume now that we are interested in studying the causal effect of the variable \(b_5\) on the variable \(g_t\) at \(t = 6, \ldots, 20\). There is no direct effect from \(b_5\) to \(g_6\), but since \(b_5\) affects \(b_6\) and \(p_6\), which in turn affect all the variables at \(t = 7\), we should see an indirect effect of \(b_5\) on \(g_t\) from time \(t = 7\) onward.
For this task, we first create a new dataset where the values of our response variables after time 5 are set to NA:
```r
d <- multichannel_example
d <- d |>
  dplyr::mutate(
    g = ifelse(time > 5, NA, g),
    p = ifelse(time > 5, NA, p),
    b = ifelse(time > 5, NA, b))
```
We then predict time points 6 to 20 when \(b\) at time 5 is 0 or 1 for everyone:
```r
newdata0 <- d |>
  dplyr::mutate(
    b = ifelse(time == 5, 0, b))
pred0 <- predict(multichannel_example_fit, newdata = newdata0, type = "mean")
newdata1 <- d |>
  dplyr::mutate(
    b = ifelse(time == 5, 1, b))
pred1 <- predict(multichannel_example_fit, newdata = newdata1, type = "mean")
```
By default, the output from `predict` is a single data frame containing the original new data and samples from the posterior predictive distribution of the new observations. By defining `type = "mean"`, we specify that we are interested in the posterior distribution of the expected values instead. In this case the predicted values are in the columns `g_mean`, `p_mean`, and `b_mean`, where the `NA` values of `newdata` are replaced with posterior samples from the model (the output also contains an additional column corresponding to the auxiliary channel `logp` and an index variable `.draw`). For a more memory-efficient format, the argument `expand = FALSE` can be used, in which case the output is returned as a `list` with two components, `simulated` and `observed`, containing the new samples and the original `newdata`, respectively.
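For instance, the first prediction above could be rerun in the compact format like this (a sketch using the `expand` argument described above):

```r
# list output: $simulated holds the draws, $observed the original newdata
pred0_small <- predict(multichannel_example_fit, newdata = newdata0,
  type = "mean", expand = FALSE)
str(pred0_small, max.level = 1)
```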
```r
head(pred0, n = 10) # first rows are NA as those were treated as fixed
#>      g_mean   p_mean    b_mean      logp id time .draw          g  p  b
#> 1        NA       NA        NA 1.7917595  1    1     1 -0.6264538  5  1
#> 2        NA       NA        NA 2.5649494  1    2     1 -0.2660091 12  0
#> 3        NA       NA        NA 2.3025851  1    3     1  0.4634939  9  1
#> 4        NA       NA        NA 2.7725887  1    4     1  1.0451444 15  1
#> 5        NA       NA        NA 2.3978953  1    5     1  1.7131026 10  0
#> 6  1.825097 3.465729 0.7041404 1.0986123  1    6     1         NA NA NA
#> 7  1.565374 4.009897 0.7634821 0.6931472  1    7     1         NA NA NA
#> 8  1.345664 1.110993 0.6366558 0.0000000  1    8     1         NA NA NA
#> 9  1.092776 2.040169 0.7761519 1.6094379  1    9     1         NA NA NA
#> 10 1.448611 2.162044 0.6623087 1.3862944  1   10     1         NA NA NA
```
We can now compute summary statistics over the individuals and then over the posterior samples to obtain the posterior distribution of the causal effects:
```r
sumr <- dplyr::bind_rows(
    list(b0 = pred0, b1 = pred1), .id = "case") |>
  dplyr::group_by(case, .draw, time) |>
  dplyr::summarise(mean_t = mean(g_mean)) |>
  dplyr::group_by(case, time) |>
  dplyr::summarise(
    mean = mean(mean_t),
    q5 = quantile(mean_t, 0.05, na.rm = TRUE),
    q95 = quantile(mean_t, 0.95, na.rm = TRUE))
#> `summarise()` has grouped output by 'case', '.draw'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'case'. You can override using the
#> `.groups` argument.
```
It is also possible to perform the marginalization over groups within `predict` by using the argument `funs`, which can be used to provide a named list of lists containing functions to be applied to the corresponding channel. This approach can save a considerable amount of memory when the number of groups is large. The names of the outermost list should be channel names, with or without the suffix `_mean`, depending on whether the functions should be applied to the response or to the mean. In our case we can write
```r
pred0b <- predict(multichannel_example_fit, newdata = newdata0, type = "mean",
  funs = list(g_mean = list(mean_t = mean)))$simulated
pred1b <- predict(multichannel_example_fit, newdata = newdata1, type = "mean",
  funs = list(g_mean = list(mean_t = mean)))$simulated
sumrb <- dplyr::bind_rows(
    list(b0 = pred0b, b1 = pred1b), .id = "case") |>
  dplyr::group_by(case, time) |>
  dplyr::summarise(
    mean = mean(mean_t_g_mean),
    q5 = quantile(mean_t_g_mean, 0.05, na.rm = TRUE),
    q95 = quantile(mean_t_g_mean, 0.95, na.rm = TRUE))
```
The resulting tibble `sumrb` is equal to the previous `sumr` (apart from the randomness due to the simulation of new trajectories). We can then visualize our predictions with `ggplot2` (Wickham 2016) as
```r
library(ggplot2)
ggplot(sumr, aes(time, mean)) +
  geom_ribbon(aes(ymin = q5, ymax = q95), alpha = 0.5) +
  geom_line() +
  scale_x_continuous(n.breaks = 10) +
  facet_wrap(~ case) +
  theme_bw()
#> Warning: Removed 5 rows containing missing values (`geom_line()`).
```
Note that these estimates do indeed coincide with the causal effects (assuming of course that our model is correct), as we can apply the backdoor adjustment formula (Pearl 1995) \[
E(g_t | do(b_5 = x)) = \sum_{g_5, p_5} E(g_t | b_5 = x, g_5, p_5) P(g_5, p_5),
\] which is equal to `mean_t` in the code above.
We can also compute the difference \(E(g_t | do(b_5 = 1)) - E(g_t | do(b_5 = 0))\) as

```r
sumr_diff <- dplyr::bind_rows(
    list(b0 = pred0, b1 = pred1), .id = "case") |>
  dplyr::group_by(.draw, time) |>
  dplyr::summarise(
    mean_t = mean(g_mean[case == "b1"] - g_mean[case == "b0"])) |>
  dplyr::group_by(time) |>
  dplyr::summarise(
    mean = mean(mean_t),
    q5 = quantile(mean_t, 0.05, na.rm = TRUE),
    q95 = quantile(mean_t, 0.95, na.rm = TRUE))
#> `summarise()` has grouped output by '.draw'. You can override using the
#> `.groups` argument.
```
```r
ggplot(sumr_diff, aes(time, mean)) +
  geom_ribbon(aes(ymin = q5, ymax = q95), alpha = 0.5) +
  geom_line() +
  scale_x_continuous(n.breaks = 10) +
  theme_bw()
#> Warning: Removed 5 rows containing missing values (`geom_line()`).
```
This shows that there is a short-term effect of \(b\) on \(g\), although the posterior uncertainty is quite large. Note that for multistep causal effects these estimates also contain Monte Carlo variation due to the simulation of the trajectories (e.g., there is also uncertainty in `mean_t` for fixed parameters, given the finite number of individuals we are marginalizing over).