BCClong
is an R package for performing Bayesian
Consensus Clustering (BCC) model for clustering continuous, discrete and
categorical longitudinal data, which are commonly seen in many clinical
studies. This document gives a tour of BCClong package.
see help(package = "BCClong")
for more information and
references provided by citation("BCClong")
To download BCClong, use the following commands:
require("devtools")
::install_github("ZhiwenT/BCClong", build_vignettes = TRUE)
devtoolslibrary("BCClong")
To list all functions available in this package:
ls("package:BCClong")
Currently, there are 5 function in this package which are BCC.multi, BayesT, model.selection.criteria, traceplot, trajplot.
BCC.multi function performs clustering on mixed-type (continuous, discrete and categorical) longitudinal markers using Bayesian consensus clustering method with MCMC sampling and provide a summary statistics for the computed model. This function will take in a data set and multiple parameters and output a BCC model with summary statistics.
BayesT function assess the model goodness of fit by calculate the discrepancy measure T(, ) with following steps (a) Generate T.obs based on the MCMC samples (b) Generate T.rep based on the posterior distribution of the parameters (c) Compare T.obs and T.rep, and calculate the P values.
model.selection.criteria function calculates DIC and WAIC for the fitted model traceplot function visualize the MCMC chain for model parameters trajplot function plot the longitudinal trajectory of features by local and global clustering
In this example, the epileptic.qol
data set from
joinrRML
package was used. The variables used here include
anxiety score
, depress score
and
AEP score
. All of the variables are continuous.
library(BCClong)
library(joineRML)
library(ggplot2)
library(cowplot)
# import data from joineRML library (use ?epileptic.qol to see details)
data(epileptic.qol)
# convert days to months
$time_month <- epileptic.qol$time/30.25
epileptic.qol# Sort by ID and time
<- epileptic.qol[order(epileptic.qol$id,epileptic.qol$time_month),]
epileptic.qol
## Make Spaghetti Plots to Visualize
<- ggplot(data =epileptic.qol, aes(x =time_month, y = anxiety, group = id))+
p1 geom_point() + geom_line() +
geom_smooth(method = "loess", size = 1.5,group =1,se = FALSE, span=2) +
theme(legend.position = "none",
plot.title = element_text(size = 20, face = "bold"),
axis.text=element_text(size=20),
axis.title=element_text(size=20),
axis.text.x = element_text(angle = 0 ),
strip.text.x = element_text(size = 20, angle = 0),
strip.text.y = element_text(size = 20,face="bold")) +
xlab("Time (months)") + ylab("anxiety")
<- ggplot(data =epileptic.qol, aes(x =time_month, y = depress, group = id))+
p2 geom_point() +
geom_line() +
geom_smooth(method = "loess", size = 1.5,group =1,se = FALSE, span=2) +
theme(legend.position = "none",
plot.title = element_text(size = 20, face = "bold"),
axis.text=element_text(size=20),
axis.title=element_text(size=20),
axis.text.x = element_text(angle = 0 ),
strip.text.x = element_text(size = 20, angle = 0),
strip.text.y = element_text(size = 20,face="bold")) +
xlab("Time (months)") + ylab("depress")
<- ggplot(data =epileptic.qol, aes(x =time_month, y = aep, group = id))+
p3 geom_point() +
geom_line() +
geom_smooth(method = "loess", size = 1.5,group =1,se = FALSE, span=2) +
theme(legend.position = "none",
plot.title = element_text(size = 20, face = "bold"),
axis.text=element_text(size=20),
axis.title=element_text(size=20),
axis.text.x = element_text(angle = 0 ),
strip.text.x = element_text(size = 20, angle = 0),
strip.text.y = element_text(size = 20,face="bold")) +
xlab("Time (months)") + ylab("aep")
plot_grid(p1,NULL,p2,NULL,p3,NULL,labels=c("(A)","", "(B)","","(C)",""), nrow = 1,
align = "v", rel_widths = c(1,0.1,1,0.1,1,0.1))
Spaghtti plot for each marker
$anxiety_scale <- scale(epileptic.qol$anxiety)
epileptic.qol$depress_scale <- scale(epileptic.qol$depress)
epileptic.qol$aep_scale <- scale(epileptic.qol$aep)
epileptic.qol<- epileptic.qol dat
We can compute the mean adjusted adherence to determine the number of clusters using the code below. Since this program takes a long time to run, this chunk of code will not run in this tutorial file.
# computed the mean adjusted adherence to determine the number of clusters
set.seed(20220929)
<- NULL
alpha.adjust <- WAIC <- NULL
DIC for (k in 1:5){
<- BCC.multi (
fit.BCC mydat = list(dat$anxiety_scale,dat$depress_scale,dat$aep_scale),
dist = c("gaussian"),
id = list(dat$id),
time = list(dat$time),
formula =list(y ~ time + (1 |id)),
num.cluster = k,
initials= NULL,
print.info="FALSE",
burn.in = 1000,
thin = 10,
per = 100,
max.iter = 2000)
<- c(alpha.adjust, fit.BCC$alpha.adjust)
alpha.adjust <- model.selection.criteria(fit.BCC, fast_version=0)
res <- c(DIC,res$DIC)
DIC <- c(WAIC,res$WAIC)}
WAIC
<- 1:5
num.cluster par(mfrow=c(1,3))
plot(num.cluster[2:5], alpha.adjust, type="o",cex.lab=1.5,cex.axis=1.5,cex.main=1.5,lwd=2,
xlab="Number of Clusters",
ylab="mean adjusted adherence",main="mean adjusted adherence")
plot(num.cluster, DIC, type="o",cex=1.5, cex.lab=1.5,cex.axis=1.5,cex.main=1.5,lwd=2,
xlab="Number of Clusters",ylab="DIC",main="DIC")
plot(num.cluster, WAIC, type="o",cex=1.5, cex.lab=1.5,cex.axis=1.5,cex.main=1.5,lwd=2,
xlab="Number of Clusters",ylab="WAIC",main="WAIC")
Here, We used gaussian distribution for all three markers. The number of clusters was set to 2 because it has highest mean adjusted adherence. All hyper parameters were set to default.
For the purpose of this tutorial, the MCMC iteration will be set to a
small number to minimize the compile time and the result will be read
from the pre-compiled RDS file.(The pre-compiled data file can be found
here (./inst/extdata/epil*.rds
))
# Fit the final model with the number of cluster 2 (largest mean adjusted adherence)
set.seed(20220929)
<- BCC.multi (
fit.BCC2 mydat = list(dat$anxiety_scale,dat$depress_scale,dat$aep_scale),
dist = c("gaussian"),
id = list(dat$id),
time = list(dat$time),
formula =list(y ~ time + (1|id)),
num.cluster = 2,
print.info="FALSE",
burn.in = 10, # number of samples discarded
thin = 1, # thinning
per = 10, # output information every "per" iteration
max.iter = 30) # maximum number of iteration
set.seed(20220929)
<- BCC.multi (
fit.BCC2b mydat = list(dat$anxiety_scale,dat$depress_scale,dat$aep_scale),
dist = c("gaussian"),
id = list(dat$id),
time = list(dat$time),
formula =list(y ~ time + (1 + time|id)),
num.cluster = 2,
print.info="FALSE",
burn.in = 10,
thin = 1,
per = 10,
max.iter = 30)
set.seed(20220929)
<- BCC.multi (
fit.BCC2c mydat = list(dat$anxiety_scale,dat$depress_scale,dat$aep_scale),
dist = c("gaussian"),
id = list(dat$id),
time = list(dat$time),
formula =list(y ~ time + time2 + (1 + time|id)),
num.cluster = 2,
print.info="FALSE",
burn.in = 10,
thin = 1,
per = 10,
max.iter = 30)
Load the pre-compiled results
<- readRDS(file = "../inst/extdata/epil1.rds")
fit.BCC2 <- readRDS(file = "../inst/extdata/epil2.rds")
fit.BCC2b <- readRDS(file = "../inst/extdata/epil3.rds")
fit.BCC2c $cluster.global <- factor(fit.BCC2b$cluster.global,
fit.BCC2blabels=c("Cluster 1","Cluster 2"))
table(fit.BCC2$cluster.global, fit.BCC2b$cluster.global)
#>
#> Cluster 1 Cluster 2
#> 1 231 4
#> 2 14 291
$cluster.global <- factor(fit.BCC2c$cluster.global,
fit.BCC2clabels=c("Cluster 1","Cluster 2"))
table(fit.BCC2$cluster.global, fit.BCC2c$cluster.global)
#>
#> Cluster 1 Cluster 2
#> 1 228 7
#> 2 6 299
To print the summary statistics for all parameters
$summary.stat fit.BCC2
To print the proportion for each cluster (mean, sd, 2.5% and 97.5% percentile) geweke statistics (geweke.stat) between -2 and 2 suggests the parameters converge
$summary.stat$PPI fit.BCC2
The code below prints out all major parameters
print(fit.BCC2$N)
#> [1] 540
print(fit.BCC2$summary.stat$PPI)
#> [,1] [,2]
#> mean 0.43791938 0.56208062
#> sd 0.01845804 0.01845804
#> 2.5% 0.41071270 0.53151902
#> 97.5% 0.46848098 0.58928730
#> geweke.stat 0.32753256 -0.32753256
print(fit.BCC2$summary.stat$ALPHA)
#> [,1] [,2] [,3]
#> mean 0.96693800 0.92505013 0.925372568
#> sd 0.01272697 0.01195727 0.008556003
#> 2.5% 0.94624413 0.90531576 0.910213046
#> 97.5% 0.98628979 0.94417533 0.940983911
#> geweke.stat -0.36227985 -2.12568038 -0.600639851
print(fit.BCC2$summary.stat$GA)
#> [[1]]
#> , , 1
#>
#> [,1] [,2]
#> mean 0.87729642 -0.64459070
#> sd 0.01927627 0.01666083
#> 2.5% 0.85349682 -0.66739415
#> 97.5% 0.91306409 -0.61823693
#> geweke.stat -0.19942773 0.29949578
#>
#> , , 2
#>
#> [,1] [,2]
#> mean -1.074188e-04 -1.576562e-04
#> sd 7.759632e-05 5.338211e-05
#> 2.5% -2.203863e-04 -2.533276e-04
#> 97.5% 3.290686e-05 -8.484365e-05
#> geweke.stat -2.814307e+00 7.024379e-01
#>
#>
#> [[2]]
#> , , 1
#>
#> [,1] [,2]
#> mean 0.88166260 -0.5921965
#> sd 0.02881597 0.0219339
#> 2.5% 0.83165270 -0.6358678
#> 97.5% 0.93839108 -0.5559157
#> geweke.stat -1.50786195 -1.5791791
#>
#> , , 2
#>
#> [,1] [,2]
#> mean -7.530200e-05 -2.009088e-04
#> sd 7.605123e-05 6.210093e-05
#> 2.5% -1.841543e-04 -3.131674e-04
#> 97.5% 7.958110e-05 -1.112733e-04
#> geweke.stat -9.565325e-01 3.748582e-01
#>
#>
#> [[3]]
#> , , 1
#>
#> [,1] [,2]
#> mean 0.79904742 -0.71860049
#> sd 0.01328303 0.01553592
#> 2.5% 0.78206752 -0.74887774
#> 97.5% 0.82017453 -0.69286688
#> geweke.stat -5.73392753 -1.63767793
#>
#> , , 2
#>
#> [,1] [,2]
#> mean 5.700862e-05 -2.019033e-04
#> sd 3.521174e-05 2.627128e-05
#> 2.5% 5.222274e-06 -2.475186e-04
#> 97.5% 1.338561e-04 -1.615889e-04
#> geweke.stat 1.888959e+00 -2.243578e+00
print(fit.BCC2$summary.stat$SIGMA.SQ.U)
#> [[1]]
#> , , 1
#>
#> [,1]
#> mean 2.936789e-05
#> sd 3.400319e-05
#> 2.5% 1.309234e-05
#> 97.5% 1.228279e-04
#> geweke.stat 3.123410e+00
#>
#> , , 2
#>
#> [,1]
#> mean 2.667787e-05
#> sd 3.324741e-05
#> 2.5% 1.177405e-05
#> 97.5% 1.177560e-04
#> geweke.stat 2.894896e+00
#>
#>
#> [[2]]
#> , , 1
#>
#> [,1]
#> mean 2.472793e-05
#> sd 1.884985e-05
#> 2.5% 1.377823e-05
#> 97.5% 7.222922e-05
#> geweke.stat 3.155393e+00
#>
#> , , 2
#>
#> [,1]
#> mean 2.111768e-05
#> sd 1.730134e-05
#> 2.5% 1.125047e-05
#> 97.5% 6.841069e-05
#> geweke.stat 3.381524e+00
#>
#>
#> [[3]]
#> , , 1
#>
#> [,1]
#> mean 3.120109e-05
#> sd 3.311745e-05
#> 2.5% 1.271328e-05
#> 97.5% 1.121231e-04
#> geweke.stat 4.103521e+00
#>
#> , , 2
#>
#> [,1]
#> mean 3.214810e-05
#> sd 3.918276e-05
#> 2.5% 1.320204e-05
#> 97.5% 1.399506e-04
#> geweke.stat 3.717880e+00
print(fit.BCC2$summary.stat$SIGMA.SQ.E)
#> [[1]]
#> [,1] [,2]
#> mean 0.43798839 0.43798839
#> sd 0.02039686 0.02039686
#> 2.5% 0.40803863 0.40803863
#> 97.5% 0.47957980 0.47957980
#> geweke.stat 1.32597079 1.32597079
#>
#> [[2]]
#> [,1] [,2]
#> mean 0.47252164 0.47252164
#> sd 0.01239170 0.01239170
#> 2.5% 0.45219075 0.45219075
#> 97.5% 0.49519381 0.49519381
#> geweke.stat 0.06409189 0.06409189
#>
#> [[3]]
#> [,1] [,2]
#> mean 0.42012860 0.42012860
#> sd 0.01331219 0.01331219
#> 2.5% 0.39710081 0.39710081
#> 97.5% 0.44412052 0.44412052
#> geweke.stat 1.20076131 1.20076131
table(fit.BCC2$cluster.global)
#>
#> 1 2
#> 235 305
table(fit.BCC2$cluster.local[[1]])
#>
#> 1 2
#> 233 307
table(fit.BCC2$cluster.local[[2]])
#>
#> 1 2
#> 224 316
table(fit.BCC2$cluster.local[[3]])
#>
#> 1 2
#> 260 280
We can use the traceplot function to plot the MCMC process and the trajplot function to plot the trajectory for each feature.
#=====================================================#
# Trace-plot for key model parameters
#=====================================================#
traceplot(fit=fit.BCC2, parameter="PPI",ylab="pi",xlab="MCMC samples")
traceplot(fit=fit.BCC2, parameter="ALPHA",ylab="alpha",xlab="MCMC samples")
traceplot(fit=fit.BCC2,cluster.indx = 1, feature.indx=1,parameter="GA",ylab="GA",xlab="MCMC samples")
traceplot(fit=fit.BCC2,cluster.indx = 1, feature.indx=2,parameter="GA",ylab="GA",xlab="MCMC samples")
traceplot(fit=fit.BCC2,cluster.indx = 1, feature.indx=3,parameter="GA",ylab="GA",xlab="MCMC samples")
traceplot(fit=fit.BCC2,cluster.indx = 2, feature.indx=1,parameter="GA",ylab="GA",xlab="MCMC samples")
traceplot(fit=fit.BCC2,cluster.indx = 2, feature.indx=2,parameter="GA",ylab="GA",xlab="MCMC samples")
traceplot(fit=fit.BCC2,cluster.indx = 2, feature.indx=3,parameter="GA",ylab="GA",xlab="MCMC samples")
#=====================================================#
# Trajectory plot for features
#=====================================================#
<- trajplot(fit=fit.BCC2,feature.ind=1,
gp1 which.cluster = "local.cluster",
title= bquote(paste("Local Clustering (",hat(alpha)[1] ==.(round(fit.BCC2$alpha[1],2)),")")),
xlab="time (months)",ylab="anxiety",color=c("#00BA38", "#619CFF"))
<- trajplot(fit=fit.BCC2,feature.ind=2,
gp2 which.cluster = "local.cluster",
title= bquote(paste("Local Clustering (",hat(alpha)[2] ==.(round(fit.BCC2$alpha[2],2)),")")),
xlab="time (months)",ylab="depress",color=c("#00BA38", "#619CFF"))
<- trajplot(fit=fit.BCC2,feature.ind=3,
gp3 which.cluster = "local.cluster",
title= bquote(paste("Local Clustering (",hat(alpha)[3] ==.(round(fit.BCC2$alpha[3],2)),")")),
xlab="time (months)",ylab="aep",color=c("#00BA38", "#619CFF"))
<- trajplot(fit=fit.BCC2,feature.ind=1,
gp4 which.cluster = "global.cluster",
title="Global Clustering",xlab="time (months)",ylab="anxiety",color=c("#00BA38", "#619CFF"))
<- trajplot(fit=fit.BCC2,feature.ind=2,
gp5 which.cluster = "global.cluster",
title="Global Clustering",xlab="time (months)",ylab="depress",color=c("#00BA38", "#619CFF"))
<- trajplot(fit=fit.BCC2,feature.ind=3,
gp6 which.cluster = "global.cluster",
title="Global Clustering",
xlab="time (months)",ylab="aep",color=c("#00BA38", "#619CFF"))
library(cowplot)
plot_grid(gp1, gp2,gp3,gp4,gp5,gp6,
labels=c("(A)", "(B)", "(C)", "(D)", "(E)", "(F)"),
ncol = 3, align = "v" )
plot_grid(gp1,NULL,gp2,NULL,gp3,NULL,
NULL,gp5,NULL,gp6,NULL,
gp4,labels=c("(A)","", "(B)","","(C)","","(D)","","(E)","","(F)",""), nrow = 2,
align = "v", rel_widths = c(1,0.1,1,0.1,1,0.1))
The BayesT function will be used for
posterior check. Here we used the pre-compiled results, un-comment the
line res <- BayesT(fit=fit.BCC2)
to try your own. The
pre-compiled data file can be found here
(./inst/extdata/conRes.rds
) For this function, the p-value
between 0.3 to 0.7 was consider reasonable. In the scatter plot, the
data pints should be evenly distributed around y = x.
#res <- BayesT(fit=fit.BCC2)
<- readRDS(file = "../inst/extdata/conRes.rds")
res plot(log(res$T.obs),log(res$T.rep),xlim=c(8.45,8.7), cex=1.5,
ylim=c(8.45,8.7),xlab="Observed T statistics (in log scale)", ylab = "Predicted T statistics (in log scale)")
abline(0,1,lwd=2,col=2)
<- sum(res$T.rep > res$T.obs)/length(res$T.rep)
p.value
p.value#> [1] 0.55
$cluster.global <- factor(fit.BCC2$cluster.global,labels=c("Cluster 1","Cluster 2"))
fit.BCC2boxplot(fit.BCC2$postprob ~ fit.BCC2$cluster.global,ylim=c(0,1),xlab="",ylab="Posterior Cluster Probability")
sessionInfo()
#> R version 4.2.2 (2022-10-31 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 22621)
#>
#> Matrix products: default
#>
#> locale:
#> [1] LC_COLLATE=C
#> [2] LC_CTYPE=English_United States.utf8
#> [3] LC_MONETARY=English_United States.utf8
#> [4] LC_NUMERIC=C
#> [5] LC_TIME=English_United States.utf8
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] cowplot_1.1.1 ggplot2_3.4.0 joineRML_0.4.5 survival_3.4-0 nlme_3.1-160
#> [6] BCClong_1.0.0
#>
#> loaded via a namespace (and not attached):
#> [1] sass_0.4.2 LaplacesDemon_16.1.6 jsonlite_1.8.3
#> [4] splines_4.2.2 foreach_1.5.2 label.switching_1.8
#> [7] bslib_0.4.1 rngWELL_0.10-8 mixAK_5.5
#> [10] highr_0.9 stats4_4.2.2 yaml_2.3.6
#> [13] truncdist_1.0-2 randtoolbox_2.0.3 pillar_1.8.1
#> [16] lattice_0.20-45 quantreg_5.94 glue_1.6.2
#> [19] digest_0.6.30 minqa_1.2.5 colorspace_2.0-3
#> [22] htmltools_0.5.4 Matrix_1.5-1 lpSolve_5.6.17
#> [25] pkgconfig_2.0.3 SparseM_1.81 mvtnorm_1.1-3
#> [28] scales_1.2.1 lme4_1.1-31 MatrixModels_0.5-1
#> [31] tibble_3.1.8 combinat_0.0-8 mgcv_1.8-41
#> [34] gmp_0.6-9 farver_2.1.1 generics_0.1.3
#> [37] withr_2.5.0 cachem_1.0.6 nnet_7.3-18
#> [40] Rmpfr_0.8-9 cli_3.4.1 fastGHQuad_1.0.1
#> [43] mnormt_2.1.1 magrittr_2.0.3 mclust_6.0.0
#> [46] mcmc_0.9-7 evaluate_0.18 fansi_1.0.3
#> [49] doParallel_1.0.17 MASS_7.3-58.1 tools_4.2.2
#> [52] lifecycle_1.0.3 stringr_1.4.1 MCMCpack_1.6-3
#> [55] munsell_0.5.0 cluster_2.1.4 compiler_4.2.2
#> [58] jquerylib_0.1.4 evd_2.3-6.1 rlang_1.0.6
#> [61] grid_4.2.2 nloptr_2.0.3 iterators_1.0.14
#> [64] rstudioapi_0.14 labeling_0.4.2 cobs_1.3-5
#> [67] rmarkdown_2.18 boot_1.3-28 gtable_0.3.1
#> [70] codetools_0.2-18 R6_2.5.1 knitr_1.41
#> [73] dplyr_1.0.10 fastmap_1.1.0 utf8_1.2.2
#> [76] stringi_1.7.8 parallel_4.2.2 Rcpp_1.0.9
#> [79] vctrs_0.5.0 tidyselect_1.2.0 xfun_0.34
#> [82] coda_0.19-4