GLInvCI is a package that provides a framework for computing the maximum-likelihood estimates and asymptotic confidence intervals of a class of continuous-time Gaussian branching processes, including the Ornstein-Uhlenbeck branching process, which is commonly used in phylogenetic comparative methods. The framework is designed to be flexible enough that the user can easily specify their own parameterisation and obtain the maximum-likelihood estimates and confidence intervals of their own parameters.
The model in concern is the family of continuous-trait continuous-time Gaussian evolution processes along a known phylogeny, in which each species’ traits evolve independently of each others after branching off from their common ancestor and for every non-root node. Let k be a child node of i, and zₖ, zᵢ denotes the corresponding multivariate traits. We assume that zₖ|zᵢ is a Gaussian distribution with expected value wₖ+Φₖzᵢ and variance Vₖ, where the matrices (Φₖ,wₖ,Vₖ) are parameters independent of zₖ but can depend other parameters including tₖ. The traits zₖ and zᵢ can have different number of dimension.
The following command should install the latest version of the package:
install.packages('devtools')
devtools::install_url(
'https://git.sr.ht/~hckiang/glinvci/blob/latest-tarballs/glinvci_latest_main.tar.gz')
The package contains two levels of user interfaces. The high-level
interface, accessible through the glinv
function, provides
facilities for handling missing traits, lost traits, multiple
evolutionary regimes, and most importantly, the calculus chain rule. The
lower-level interface, accessible through the glinv_gauss
function, allows the users to operate purely in the (Φₖ,wₖ,Vₖ)ₖ
parameter space. The latter parameter model obviously has huge number of
parameters because k corresponds to all the nodes and tips (except the
root).
Most users should be satisfied with the high-level interface, even if they intend to write their own custom models.
To fit a model using this package, generally you will need two main pieces of input data: a rooted phylogenetic tree and a matrix of trait values. The phylogenetic tree can be non-ultrametric and can potentially contain multifurcation. The matrix of trait values should have the same number of columns as the number of tips.
For the purpose of demonstrating the functionality of the package, we will first generate a random tree.
library(glinvci)
set.seed(1)
ntips = 300 # No. of tips
k = 2 # No. of trait dimensions
tr = ape::rtree(ntips) # Random non-ultrametric tree
x0 = rnorm(k) # Root value
With the above material, we are ready to make a model object. We use the OU model as an example. The OU model is generally parameterised in three matrix parameters: the drift matrix H, the evolutionary optimum θ, and the symmetric positively definite Brownian motion covariance matrix Σ. In this example, we restrict H to be a symmetric positively definite matrix while leaving θ and Σ unrestricted:
repar = get_restricted_ou(H='logspd', theta=NULL, Sig=NULL, lossmiss=NULL)
mod = glinv(tr, x0, X=NULL, repar = repar)
print(mod)
# An alternative way to make the model is:
# mod = glinv(tr, x0, X=NULL,
# pardims = repar$nparams(k),
# parfns = repar$par,
# parjacs = repar$jac,
# parhess = repar$hess)
A GLInv model with 1 regimes and 8 parameters divided into 1 blocks,
all of which are associated to the only one existing regime.
The phylogeny has 300 tips and 299 internal nodes. Tip values are empty,
meaning that `fit()` and `lik()` etc. will not work. See `?set_tips`.
The first line above constructs a “re-parameterization object”
repar
, in which 'logspd'
means that H should
be symmetric positively definite with its diagonals parametrized in its
log scale. There are 8 parameters because the symmetric positive
definite matrix H corresponds to 3 parameters (instead of 4 because of
symmetry), θ has 2 elements, and Σ has 3 parameters, again, because of
symmetry.
Notice that we don’t have any trait vectors yet, and this is why the package says you cannot fit the model nor compute the likelihood of the model. To be able to fit the model, of course, we need some trait values as our data, and now we will generate some random trait vectors from the model and use it to fit the model.
But before we can generate data, we need to make some ground-truth parameters:
H = matrix(c(1,0,0,1), k)
theta = c(0,0)
sig = matrix(c(0.5,0,0,0.5), k)
sig_x = t(chol(sig))
diag(sig_x) = log(diag(sig_x)) # Pass the diagonal to log
sig_x = sig_x[lower.tri(sig_x,diag=T)] # Trim away upper-tri. part and flatten.
In the above, the first three lines defines the actual parameters
that we want, but notice that we performed a Cholesky decomposition on
sig_x
and took the logarithm of the diagonal. The
package always accept the variance-covariance matrix of the Brownian
motion term in this form, unless it is restricted to be a diagonal
matrix. The Cholesky decomposition ensures that, during numerical
optimisation in the model fitting, the matrix remain positively
definite; and logarithm further constrain the diagonal of the Cholesky
factor to be positive, hence eliminating multiple optima.
Because we have also constrained H
to be symmetric
positively definite (by passing H='logspd'
to
get_restricted_ou
), we need to transform H
in
the same manner:
H_input = t(chol(H))
diag(H_input) = log(diag(H_input))
H_input = H_input[lower.tri(H_input,diag=T)]
This transformation depends on how you restrict your H
matrix. For example, if you do not put any constrains on H
,
by passing H=NULL
to get_restricted_ou
, the
above transformation is not needed.
Then we need to concatenate all parameters into a single vector
par_truth. All OU-related functions in the package assume that the
parameters are concatenated in the (H, theta, sig_x)
order,
as follows:
par_truth = c(H=H_input,theta=theta,sig_x=sig_x)
Now let’s simulate the some trait data and add the these trait data
into the model object mod
. The first line below simulates a
set of trait data using the parameters par_truth. The second line adds
the simulated data into the mod object.
X = rglinv(mod, par_truth, Nsamp=1)
set_tips(mod, X[[1]])
print(mod)
A GLInv model with 1 regimes and 8 parameters divided into 1 blocks,
all of which are associated to the only one existing regime.
The phylogeny has 300 tips and 299 internal nodes. Tip values are already set.
As you can see, after calling set_tips
, the warning that
fit
etc. will not work is gone.
Now let’s compute the likelihood, gradient, and Hessian of this model.
cat('Ground-truth parameters:\n')
print(par_truth)
cat('Likelihood:\n')
print(lik(mod)(par_truth))
cat('Gradient:\n')
print(grad(mod)(par_truth))
cat('Hessian:\n')
print(hess(mod)(par_truth))
Ground-truth parameters:
H1 H2 H3 theta1 theta2 sig_x1 sig_x2 sig_x3
0.00 0.00 0.00 0.00 0.00 -0.35 0.00 -0.35
Likelihood:
[1] -400
Gradient:
[1] 4.3 -6.2 -27.7 -9.9 -14.9 -12.8 -5.9 35.2
Hessian:
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] -545 -29.4 0.0 99 0 432.2 13.8 0.0
[2,] -29 -289.0 -23.1 -13 50 2.7 328.3 9.7
[3,] 0 -23.1 -546.9 0 -26 0.0 3.9 496.3
[4,] 99 -13.0 0.0 -505 0 19.8 21.1 0.0
[5,] 0 49.5 -26.0 0 -505 0.0 14.0 29.9
[6,] 432 2.7 0.0 20 0 -574.3 5.9 0.0
[7,] 14 328.3 3.9 21 14 5.9 -574.3 11.9
[8,] 0 9.7 496.3 0 30 0.0 11.9 -670.4
The maximum likelihood estimates can be obtained by calling the
fit.glinv
method. We use the zero vector as the
optimisation routine’s initialisation:
par_init = par_truth
par_init[] = 0.
fitted = fit(mod, par_init)
print(fitted)
$mlepar
H1 H2 H3 theta1 theta2 sig_x1 sig_x2 sig_x3
-0.0247 -0.1219 -0.0058 -0.0350 -0.0485 -0.3860 -0.0783 -0.2966
$score
H1 H2 H3 theta1 theta2 sig_x1 sig_x2 sig_x3
-0.001171 -0.000099 -0.000025 0.000358 0.000461 0.000175 -0.000051 0.000109
$loglik
[1] -398
$counts
[1] 31 31
$convergence
[1] 0
$message
[1] "CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH"
Once the model is fitted, one can estimate the variance-covariance
matrix of the maximum-likelihood estimator using
varest
.
v_estimate = varest(mod, fitted)
The marginal confidence interval can be obtained by calling
marginal_ci
on the object returned by
varest
.
print(marginal_ci(v_estimate, lvl=0.99))
Lower Upper
H1 -0.21 0.161
H2 -0.42 0.172
H3 -0.23 0.214
theta1 -0.17 0.097
theta2 -0.18 0.083
sig_x1 -0.56 -0.215
sig_x2 -0.27 0.118
sig_x3 -0.49 -0.104
It seems that these confidence intervals is indeed covering the ground truth.
Let’s assume we have the same data k
, tr
,
X
, x0
as generated before but we fit a simple
Brownian motion model instead. To make a Brownian motion model, we can
call the following:
repar_brn = get_restricted_ou(H='zero', theta='zero', Sig=NULL, lossmiss=NULL)
mod_brn = glinv(tr, x0, X[[1]], repar=repar_brn)
print(mod_brn)
A GLInv model with 1 regimes and 3 parameters divided into 1 blocks,
all of which are associated to the only one existing regime.
The phylogeny has 300 tips and 299 internal nodes. Tip values are already set.
As you may might have already guessed, H='zero'
above
means that we restrict the drift matrix term of the OU to be a zero
matrix. In this case, theta
, the evolutionary optimum, is
meaningless. In this case, the package has a convention that in a
Brownian motion we always have H='zero', theta='zero'
.
The following calls demonstrates how to compute the likelihood:
par_init_brn = c(sig_x=sig_x)
print(c(likelihood=lik(mod_brn)(par_init_brn)))
likelihood
-536
The user can obtain the an MLE fit by calling
fit(mod_brn, par_init_brn)
. The marginal CI and the
estimator’s variance can be obtained in exactly the same way as in the
OU example. But just in this example, keep in mind that we have
previously generated X
using an OU process rather than from
a Brownian motion.
Out of box, the package allows missing data in the tip trait matrix, as well as allowing multiple revolutionary regimes.
A ‘missing’ trait refers to a trait value whose data is missing due
to data collection problems. Fundamentally, they evolves in the same
manner as other traits. An NA
entry in the trait matrix
X
is deemed ‘missing’. A lost trait is a trait dimension
which had ceased to exists during the evolutionary process. An
NaN
entry in the data indicates a ‘lost’ trait. The package
provides two different ways of handling lost traits. For more details
about how missingness is handled, the users should read
?ou_haltlost
.
In this example, we demonstrate how to fit a model with two regimes,
and some missing data and lost traits. Assume the phylogeny is the same
as before but some entries of X
is NA
or
NaN
. First, let’s arbitrarily set some entries of
X
to missingness, just for the purpose of
demonstration.
X[[1]][2,c(1,2,80,95,130)] = NA
X[[1]][1,c(180:200)] = NaN
The following call constructs a model object in which three evolutionary regimes are present: the first regime starts at the root and it has a Brownian motion evolution; the second regime starts at the beginning of the edge that leads to internal node number 390 and it has an OU process; and the third regime starts at the beginning of the edge that leads to internal node number 502 and it has an OU process that shares the same underlying ground-truth parameters as the second regime. In other words, there are three blocks of parameters. In a binary tree with N tips, the root’s node number is N+1; in other words, in our case, the root node number is 301. The following code constructs such a model:
repar_a = get_restricted_ou(H='logdiag', theta=NULL, Sig=NULL, lossmiss='halt')
repar_b = get_restricted_ou(H='zero', theta='zero', Sig=NULL, lossmiss='halt')
mod_tworeg = glinv(tr, x0, X[[1]], repar=list(repar_a, repar_b),
regimes = list(c(start=301, fn=2),
c(start=390, fn=1),
c(start=502, fn=1)))
# A long-form alternative way to do the same is this:
# mod_tworeg = glinv(tr, x0, X[[1]],
# pardims = list(repar_a$nparams(k), repar_b$nparams(k)),
# parfns = list(repar_a$par, repar_b$par),
# parjacs = list(repar_a$jac, repar_b$jac),
# parhess = list(repar_a$hess, repar_b$hess),
# regimes = list(c(start=301, fn=2),
# c(start=390, fn=1),
# c(start=502, fn=1)))
print(mod_tworeg)
A GLInv model with 3 regimes and 10 parameters divided into 2 blocks, whose
1-7th parameters are asociated with regime no. {2,3};
8-10th parameters are asociated with regime no. {1},
where
regime #1 starts from the branch that leads to node #301, which is the root;
regime #2 starts from the branch that leads to node #390;
regime #3 starts from the branch that leads to node #502.
The phylogeny has 300 tips and 299 internal nodes. Tip values are already set.
In the above, we have defined three regimes and two stochastic
processes. The repar
argument, or alternatively the
pardims
, parfns
, parjacs
, and
parhess
argument, specifies the two stochastic processes
and the regime
parameter can be thought of as ‘drawing the
lines’ to match each regime to a seperately defined stochastic process.
The start
element in the list specifies the node number at
which a regime starts, and the fn
element is an index to
the list passed to repar
. In this example, the first regime
starts at the root and uses repar_b
. If multiple regimes
share the same fn
index then it means that they shares both
the underlying stochastic process and the parameters.
lossmiss='halt'
specifies how the lost traits (the
NaN
) are handled.
To compute the likelihood and initialize for optimisation, one need
to notice the input parameters’ format. When repar
(or
alternatively parfns
etc.) has more than one elements, the
parameter vector that lik
and fit
etc. accept
is simply assumed to be the concatenation of all of its elements’
parameters. The following example should illustrate this.
logdiagH = c(0,0) # Meaning H is the identity matrix
theta = c(1,0)
Sig_x_ou = c(0,0,0) # Meaning Sigma is the identity matrix
Sig_x_brn = c(0,0,0) # The Brownian motion's parameters
print(lik(mod_tworeg)(c(logdiagH, theta, Sig_x_ou, Sig_x_brn)))
[1] -623
To write custom models, we cannot use the
glinv(..., repar=repar)
shortcut anymore. Instead, one
should use the parfns
, parjacs
, and
parhess
arguments.
It is important to note that the parfns
,
parjacs
, and parhess
arguments to
glinv()
are simply R functions, which the user can either
create themselves or obtain from calling
get_restricted_ou()
, which is simply a convenient helper
function for making the likelihood, Jacobian, Hessian functions
automatically. Rather than writing all these from scratch by yourself,
it is often possible to customize a model is to take the functions
returned by get_restricted_ou()
and extending them. In this
example, we familiarize ourselves with these functions’s input and
output format and write a custom OU model with diagonal drift matrix and
a diagonal additive measurement error on each tips. The additive
measurement error could be estimated together if one wish (but
estimating all these together may require a fairly large tree).
Nonetheless, to investigate the format of the mentioned three
functions, first, we obtain the reparametrization likelihood, Jacobian,
Hessian, and number of parameter functions using
get_restricted_ou()
:
repar = get_restricted_ou(H='diag', theta=NULL, Sig=NULL, lossmiss='halt')
print(sapply(repar, class))
par jac hess nparams
"function" "function" "function" "function"
We will deal with nparams later and let’s look at the other three
first. Recall that in the long-form calls in the previous example,
repar$par
is always passed to parfns
,
repar$jac
always to parjacs
and
repar$hess
to parhess
.Mathematically, the
three functions map the OU process parameters to (Φₖ,wₖ,Vₖ)ₖ, where k
are the nodes. The input format of all three of them are the same. They
accepts four parameters, and as an example, we could call
print(repar$par(c(1,1,0,0,0,0,0), 0.1, c('OK','OK'), c('OK','OK')))
[1] 0.905 0.000 0.000 0.905 0.000 0.000 0.091 0.000 0.091
In the call above:
The first argument passed to repar$par
is the
parameters of the OU model, with H being the identity matrix,
represented by (1,1), the evolutionary optimum θ being the 2D zero
vector, represented by (0,0), and Σ being the identity matrix,
represented by (0,0,0). Therefore concatenated together we have
c(1,1,0,0,0,0,0)
.
The second argument is the branch length leading to node k.
The third argument is a vector of factors or string with three
levels OK
, LOST
, MISSING
,
indicating which dimensions are missing or lost in the mother node of
node k. In our case, the length of this vector is two because the we
have two trait dimensions; two OK
’s means that both the
traits are “normal”, neither missing nor lost.
The fourth argument is a vector of factors or string with the same three levels indicating the missingness of the node k. The format is the same as the third argument.
The return value is a concatenation of (Φₖ,wₖ,Vₖ), flattened in column-major order, which is the R default. This means that Φ is 0.905 times the identity matrix; w is the 2D zero vector and V is 0.091 times the identity matrix.
The repar$jac
function simply returns the Jacobian
matrix of repar$par
:
print(repar$jac(c(1,1,0,0,0,0,0), 0.1, c('OK','OK'), c('OK','OK')))
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] -0.0905 0.0000 0.000 0.000 0.00 0.000 0.00
[2,] 0.0000 0.0000 0.000 0.000 0.00 0.000 0.00
[3,] 0.0000 0.0000 0.000 0.000 0.00 0.000 0.00
[4,] 0.0000 -0.0905 0.000 0.000 0.00 0.000 0.00
[5,] 0.0000 0.0000 0.095 0.000 0.00 0.000 0.00
[6,] 0.0000 0.0000 0.000 0.095 0.00 0.000 0.00
[7,] -0.0088 0.0000 0.000 0.000 0.18 0.000 0.00
[8,] 0.0000 0.0000 0.000 0.000 0.00 0.091 0.00
[9,] 0.0000 -0.0088 0.000 0.000 0.00 0.000 0.18
The repar$hess
function also accepts the same argument
but its return values have a slightly different format:
tmp = repar$hess(c(1,1,0,0,0,0,0), 0.1, c('OK','OK'), c('OK','OK'))
print(names(tmp))
[1] "V" "w" "Phi"
print(sapply(tmp, dim))
V w Phi
[1,] 3 2 4
[2,] 7 7 7
[3,] 7 7 7
Notice that repar$hess
returns a list containing three
elements, V
, w
, and Phi
, each
being a three-dimensional array. They contain all the second-order
partial derivatives of the repar$par
function, with
tmp$V[m,i,j]
containing ∂²Vₘ/∂ηᵢ∂ηⱼ,
tmp$w[m,i,j]
containing ∂²wₘ/∂ηᵢ∂ηⱼ and
tmp$Phi[m,i,j]
containing ∂²Φₘ/∂ηᵢ∂ηⱼ, where η denotes the
vector of parameters that repar$par
accepts and m means the
index of the matrices but not the node numbers. For example, in our
situation, tmp$w[2,3,4]
contains ∂²w₂/∂θ₁∂θ₂ and
tmp$Phi[3,2,3]
is ∂²Φ₂₁/∂ H₂₂∂θ₁.
Having understood their input and output, we are now ready to make a
custom model. In this custom model, we assume that all species evolve
exactly the same as specified in repar, but we cannot measure the traits
at the tip accurately. To take into account this measurement error, we
add an uncorrelated Gaussian error at each tip. We assume that the tree
has a 800 tips in this example for simplicity. First, we extend
repar$par
to accept our additional parameters:
my_par = function (par, ...) {
phiwV = repar$par(par[1:7], ...)
if (INFO__$node_id > 800) # If not tip just return the original
return(phiwV)
Sig_e = diag(par[8:9]) # Our measurement error matrix
phiwV[7:9] = phiwV[7:9] + Sig_e[lower.tri(Sig_e, diag=T)]
phiwV
}
Note that we have accessed the node ID using
INFO__$node_id
. In our package, “node IDs” means the same
thing as the node numbers in the ape
package, hence the
nodes with ID 1-300 are the tips and the rest are the internal nodes.
The INFO__
object is neither a global variable nor an
argument but a variable that lives in function’s enclosing environment.
Now let’s define the Jacobian function, which contains the same Jacobian
as the original no-measurement-error Jacobian, but with some extra
entries that are either one or zero.
my_jac = function (par, ...) {
new_jac = matrix(0.0, 9, 9)
new_jac[,1:7] = repar$jac(par[1:7], ...)
if (INFO__$node_id <= 800)
new_jac[7,8] = new_jac[9,9] = 1.0
new_jac
}
The Hessian matrix of our modified model is actually unchanged except that there are more zero entries, because the new parameters are simply a linear sum.
my_hess = function (par, ...)
lapply(repar$hess(par[1:7], ...), function (H) {
newH = array(0.0, dim=c(dim(H)[1], 9, 9))
newH[,1:7,1:7] = H[,,] # Copy the original part
newH # Other entries are just zero
})
Finally, we actually do not need to write our own
repar$nparams
, which accepts the number of trait dimensions
and returns the number of parameters, beacuase we know exactly we have 9
parameters in our example. Now we can construct our custom model:
# Simulate a tree with 800 tips
set.seed(777)
tr = ape::rtree(800)
mod_measerr = glinv(tr, x0, NULL,
pardims = 9,
parfns = my_par,
parjacs = my_jac,
parhess = my_hess)
print(mod_measerr)
Now let’s make a ground truth parameter value, generate some random data and fit the model:
par_measerr_truth = c(H1=0.2, H2=0.5,
theta1=-1, theta2=1,
sig_x1=0, sig_x2=0, sig_x3=0,
sig_e1=0.4, sig_e2=0.8)
set.seed(999)
X_measerr = rglinv(mod_measerr, par_measerr_truth, Nsamp=1)
set_tips(mod_measerr, X_measerr[[1]])
## Fit the model
par_measerr_init = par_measerr_truth
par_measerr_init[] = 1.0
fitted_measerr = fit(mod_measerr, par_measerr_init,
method='BB', ## Try out different opt. methods
lower=c(H1=-Inf, H2=-Inf,
theta1=-Inf, theta2=-Inf,
sig_x1=-Inf, sig_x2=-Inf, sig_x3=-Inf,
sig_e1=1e-9, sig_e2=1e-9))
vest_measerr = varest(mod_measerr, fitted_measerr$mlepar)
confint = marginal_ci(vest_measerr, lvl=0.95)
cat('-- ESTIMATES --\n')
print(fitted_measerr)
cat('-- CONF. INTERVALS --\n')
print(confint)
-- ESTIMATES --
$mlepar
H1 H2 theta1 theta2 sig_x1 sig_x2 sig_x3 sig_e1 sig_e2
0.1467 0.3374 -1.8662 0.9153 -0.0407 -0.0039 -0.2291 0.4107 0.9276
$loglik
[1] -2614
$fn.reduction
[1] 498
$iter
[1] 207
$feval
[1] 208
$convergence
[1] 0
$message
[1] "Successful convergence"
$cpar
method M
2 50
$score
[1] -0.000042 -0.000207 -0.000413 0.000126 -0.000024 0.000062 -0.000058
[8] 0.000023 -0.000348
-- CONF. INTERVALS --
Lower Upper
H1 0.07 0.22
H2 0.14 0.54
theta1 -3.06 -0.67
theta2 0.62 1.21
sig_x1 -0.19 0.11
sig_x2 -0.11 0.10
sig_x3 -0.56 0.11
sig_e1 0.25 0.57
sig_e2 0.69 1.17