To install the stable version of mcglm
, use
devtools::install_git()
. For more information, visit mcglm/README.
library(devtools)
install_git("wbonat/mcglm")
library(mcglm)
packageVersion("mcglm")
The mcglm
package implements the multivariate covariance
generalized linear models (McGLMs) proposed by Bonat and J\(\o\)rgensen (2016). The core fit function
mcglm
is employed for fitting a set of models. In this
introductory vignette we restrict ourselves to model independent data,
although a simple model for longitudinal data analysis in the Gaussian
case is also presented. We present models to deal with continuous,
binomial/bounded and count univariate response variables. We explore the
specification of different link, variance and covariance functions.
Consider a simple regression model, for univariate and independent Gaussian data: \[Y \sim N(X \beta, \tau_0 Z_0).\]
# Loading extra packages
require(mcglm)
require(Matrix)
require(mvtnorm)
require(tweedie)
# Setting the seed
set.seed(2503)
# Fixed component
x1 <- seq(-1,1, l = 100)
X <- model.matrix(~ x1)
mu1 <- mcglm::mc_link_function(beta = c(1,0.8), X = X, offset = NULL,
link = "identity")
# Random component
y1 <- rnorm(100, mu1$mu, sd = 0.5)
# Data structure
data <- data.frame("y1" = y1, "x1" = x1)
# Matrix linear predictor
Z0 <- mc_id(data)
# Fit
fit1.id <- mcglm(linear_pred = c(y1 ~ x1),
matrix_pred = list(Z0),
data = data)
## Automatic initial values selected.
The mcglm
package offers the following set of
S3-methods
for model summarize.
print(methods(class = "mcglm"))
## [1] anova coef confint fitted plot print
## [7] residuals summary vcov
## see '?methods' for accessing help and source code
The traditional summary
for a fitted model can be
obtained by
summary(fit1.id)
## Call: y1 ~ x1
##
## Link function: identity
## Variance function: constant
## Covariance function: identity
## Regression:
## Estimates Std.error Z value Pr(>|z|)
## (Intercept) 1.0029066 0.05383753 18.62839 1.891184e-77
## x1 0.9488496 0.09232145 10.27767 8.884812e-25
##
## Dispersion:
## Estimates Std.error Z value Pr(>|z|)
## 1 0.2898479 0.04137255 7.005803 2.455725e-12
##
## Algorithm: chaser
## Correction: TRUE
## Number iterations: 2
The function summary.mcglm
was designed to be similar to
the native summary
functions for the classes
lm
and glm
. Extra features were included to
describe the dispersion structure. The mean formula call, along with the
the link
, variance
and covariance
functions are presented. The parameter estimates are presented in two
blocks, the first presents the regression estimates while the second
presents the dispersion estimates. In both cases the parameter estimates
are summarized by point estimates, standard errors, Z-values and
p-values associated with the Wald test whose null hypothesis is defined
as \(\beta = 0\) and \(\tau = 0\) for the mean and dispersion
structures, respectively. Finally, the selected algorithm, if the
correction term is employed or not and the number of iterations is
printed.
The same linear regresion model can be fitted by using a different covariance link function, for example the inverse covariance link function.
# Fit using inverse covariance link function
fit1.inv <- mcglm(linear_pred = c(y1 ~ x1),
matrix_pred = list(Z0),
covariance = "inverse", data = data)
## Automatic initial values selected.
Furthermore, we can use a less conventional covariance link function, as the exponential-matrix.
# Fit using expm covariance link function
fit1.expm <- mcglm(linear_pred = c(y1 ~ x1),
matrix_pred = list(Z0),
covariance = "expm", data = data)
## Automatic initial values selected.
The function mcglm
returns an object of mcglm class, for
which we can use the method coef
to extract the parameters
estimates.
# Comparing estimates using different covariance link functions
cbind(coef(fit1.id)$Estimates,
coef(fit1.inv)$Estimates,
coef(fit1.expm)$Estimates)
## [,1] [,2] [,3]
## [1,] 1.0029066 1.0029066 1.0029066
## [2,] 0.9488496 0.9488496 0.9488496
## [3,] 0.2898479 3.4500851 -1.2383989
# Applying the inverse transformation
c(coef(fit1.id)$Estimates[3],
1/coef(fit1.inv)$Estimates[3],
exp(coef(fit1.expm)$Estimates[3]))
## [1] 0.2898479 0.2898479 0.2898479
Consider an extension of the linear regression models to deal with heteroscedasticity:
\[ Y \sim N(X \beta, \tau_0 Z_0 + \tau_1
Z_1),\] where \(Z_0\) is a
identity matrix and \(Z_1\) is a
diagonal matrix whose elements are given by the values of a known
covariate. Such a model, can be fitted easily using the
mcglm
package.
# Mean model
set.seed(1811)
x1 <- seq(-1,1, l = 100)
X <- model.matrix(~ x1)
mu1 <- mcglm::mc_link_function(beta = c(1,0.8), X = X, offset = NULL,
link = "identity")
z1 <- rnorm(100, mean = 0, sd = 0.25)
data <- data.frame("id" = 1, "x1" = x1, "z1" = z1)
# Matrix linear predictor
Z <- mc_dglm(~ z1, id = 'id', data = data)
# Covariance model
Sigma <- mcglm::mc_matrix_linear_predictor(tau = c(0.2, 0.15), Z = Z)
# Simulating the response variable
y1 <- rnorm(100, mu1$mu, sd = sqrt(diag(Sigma)))
data$y <- y1
# Fitting
fit2.id <- mcglm(linear_pred = c(y1 ~ x1), matrix_pred = list(Z), data = data)
## Automatic initial values selected.
We can also extend the linear regression model to deal with longitudinal data analysis. The code below presents an example of such a model.
# Mean model
x1 <- seq(-1,1, l = 100)
X <- model.matrix(~ x1)
mu1 <- mcglm::mc_link_function(beta = c(1,0.8), X = X, offset = NULL,
link = "identity")
# Data structure
data <- data.frame("id" = as.factor(rep(1:10, each = 10)), "x1" = x1)
# Covariance model
Z0 <- mc_id(data)
Z1 <- mc_mixed(~ 0 + id, data = data)
Sigma <- mcglm::mc_matrix_linear_predictor(tau = c(0.2, 0.15), Z = c(Z0,Z1))
# Simulating the Response variable
y1 <- as.numeric(rmvnorm(1, mean = mu1$mu, sigma = as.matrix(Sigma)))
data <- data.frame("y1" = y1, "x1" = x1)
# Fit
fit3.id <- mcglm(linear_pred = c(y1 ~ x1),
matrix_pred = list("resp1" = c(Z0,Z1)), data = data)
## Automatic initial values selected.
The model summary
summary(fit3.id)
## Call: y1 ~ x1
##
## Link function: identity
## Variance function: constant
## Covariance function: identity
## Regression:
## Estimates Std.error Z value Pr(>|z|)
## (Intercept) 0.8547952 0.05768440 14.818482 1.112705e-49
## x1 0.6100739 0.09849275 6.194099 5.861920e-10
##
## Dispersion:
## Estimates Std.error Z value Pr(>|z|)
## 1 0.1775319 0.02291873 7.746151 9.471961e-15
## 2 0.0155217 0.01511449 1.026942 3.044476e-01
##
## Algorithm: chaser
## Correction: TRUE
## Number iterations: 5
Note that, the dispersion structure now has two parameters. In that case, the parameter \(\tau_1\) represents the longitudinal structure for which we are assuming a compound symmetry model. This model is an equivalent to a random intercept model in the context of Linear Mixed Models (LMMs).
The mcglm
package offers a rich set of models to deal
with binomial and bounded response variables. The logit
,
probit
, cauchit
, cloglog
, and
loglog
link functions along with the extended binomial
variance function combined with the linear covariance structure, provide
a flexible class of models for handling binomial and bounded response
variables. The extended binomial variance function is given by \(\mu^p (1- \mu)^q\) where the two extra
power parameters offer more flexibility to model the relationship
between mean and variance. Consider the following simulated dataset.
# Mean model
x1 <- seq(-1,1, l = 500)
X <- model.matrix(~ x1)
mu1 <- mcglm::mc_link_function(beta = c(1,0.8), X = X, offset = NULL, link = "logit")
# Data structure
data <- data.frame("x1" = x1)
# Covariance model
Z0 <- mc_id(data)
# Simulating the response variable
set.seed(123)
data$y <- rbinom(500, prob = mu1$mu, size = 10)/10
The most traditional regression model to deal with binomial data is
the logistic regression model that can be fitted using the
mcglm
package using the following code:
# Fit
fit4.logit <- mcglm(linear_pred = c(y ~ x1),
matrix_pred = list(Z0),
link = "logit", variance = "binomialP",
power_fixed = TRUE,
Ntrial = list(rep(10,500)), data = data)
## Automatic initial values selected.
It is important to highlight that the response variable collumn
should be between \(0\) and \(1\) and in the case of more than one trial
the argument Ntrial
should be used for fitting the model.
The argument link
specifies the link function whereas the
argument variance
specifies the variance function in that
case binomialP
. The variance function
binomialP
represents a simplification of the extended
binomial variance function given by \(\mu^p
(1- \mu)^p\). Note that, in this example the argument
‘power_fixed = TRUE’ specifies that the power parameter \(p\) will not be estimated, but fixed at the
initial value \(p = 1\) corresponding
to the orthodox binomial variance function. We can easily fit the model
using a different link function, for example the
cauchit
.
fit4.cauchit <- mcglm(linear_pred = c(y ~ x1),
matrix_pred = list(Z0),
link = "cauchit", variance = "binomialP",
Ntrial = list(rep(10,250)), data = data)
## Automatic initial values selected.
We can also estimate the extra power parameter \(p\).
fit4.logitP <- mcglm(linear_pred = c(y ~ x1),
matrix_pred = list(Z0),
link = "logit", variance = "binomialP",
power_fixed = FALSE,
Ntrial = list(rep(10,500)), data = data)
## Automatic initial values selected.
Furthermore, we can estimate the two extra power parameters involved in the extended binomial variance function.
fit4.logitPQ <- mcglm(linear_pred = c(y ~ x1),
matrix_pred = list(Z0),
link = "logit", variance = "binomialPQ",
power_fixed = FALSE,
Ntrial = list(rep(10,500)),
control_algorithm = list(tuning = 0.5, max_iter = 100),
data = data)
## Automatic initial values selected.
The estimation of the extra power parameters involved in the extended
binomial variance function is challenging mainly for small data sets.
Note that, in this simulated example, we have to control the step-length
of the chaser
algorithm to avoid unrealistic values for the
parameters involved in the dispersion structure. To do that, we used the
extra argument control_algorithm
that should be a named
list. For a detailed description of the arguments that can be passed to
the control_algorithm
function see
?fit_mcglm
.
The analysis of count data in the mcglm
package relies
on the structure of the Poisson-Tweedie distribution. Such a
distribution is characterized by the following dispersion function:
\[ \nu(\mu, p) = \mu + \tau_0 \mu^p. \] The power parameter is an index that identify different distributions, examples include the Hermite (\(p = 0\)), Neyman-Type A (\(p = 1\)) and the negative binomial (\(p = 2\)).
The orthodox Poisson model can be fitted using the mcglm
package using the Tweedie
variance function \(\nu(\mu, p ) = \mu^p\) where the power
parameter \(p\) is fixed at \(1\). For example,
# Mean model
x1 <- seq(-2,2, l = 200)
X <- model.matrix(~ x1)
mu <- mcglm::mc_link_function(beta = c(1,0.8), X = X, offset = NULL, link = "log")
# Data structure
data <- data.frame("x1" = x1)
# Covariance model
Z0 <- mc_id(data)
# Data structure
data$y <- rpois(200, lambda = mu$mu)
# Fit
fit.poisson <- mcglm(linear_pred = c(y ~ x1),
matrix_pred = list(Z0),
link = "log", variance = "tweedie",
power_fixed = TRUE, data = data)
## Automatic initial values selected.
Another very useful model for count data is the negative binomial.
# Simulating negative binomial models
set.seed(1811)
x <- rtweedie(200, mu = mu$mu, power = 2, phi = 0.5)
y <- rpois(200, lambda = x)
data <- data.frame("y1" = y, "x1" = x1)
fit.pt <- mcglm(linear_pred = c(y ~ x1), matrix_pred = list(Z0),
link = "log", variance = "poisson_tweedie",
power_fixed = FALSE, data = data)
## Automatic initial values selected.
summary(fit.pt)
## Call: y ~ x1
##
## Link function: log
## Variance function: poisson_tweedie
## Covariance function: identity
## Regression:
## Estimates Std.error Z value Pr(>|z|)
## (Intercept) 0.8551973 0.07095214 12.05316 1.866611e-33
## x1 0.9285076 0.06336780 14.65267 1.295191e-48
##
## Power:
## Estimates Std.error Z value Pr(>|z|)
## 1 2.233986 0.3291973 6.78616 1.151575e-11
##
## Dispersion:
## Estimates Std.error Z value Pr(>|z|)
## 1 0.2827984 0.1906259 1.483525 0.1379349
##
## Algorithm: chaser
## Correction: TRUE
## Number iterations: 9
Note that, we estimate the power parameter rather than fix it at \(p = 2\).