BCClong is an R package that fits the Bayesian Consensus Clustering (BCC) model
to cluster continuous, discrete and categorical longitudinal data, which are
common in clinical studies. This document gives a tour of the BCClong package.
See help(package = "BCClong") for more information and citation("BCClong")
for references.
To download BCClong, use the following commands:
require("devtools")
devtools::install_github("ZhiwenT/BCClong", build_vignettes = TRUE)
library("BCClong")
To list all functions available in this package:
ls("package:BCClong")
Currently, there are 5 functions in this package: BCC.multi, BayesT, model.selection.criteria, traceplot and trajplot.
The BCC.multi function clusters mixed-type (continuous, discrete and categorical) longitudinal markers using the Bayesian consensus clustering method with MCMC sampling, and provides summary statistics for the fitted model. It takes a data set and several parameters as input and returns a BCC model with summary statistics.
The BayesT function assesses the goodness of fit of the model by calculating the discrepancy measure T in the following steps: (a) generate T.obs based on the MCMC samples; (b) generate T.rep based on the posterior distribution of the parameters; (c) compare T.obs and T.rep and calculate the P values.
The model.selection.criteria function calculates DIC and WAIC for the fitted model.
The traceplot function visualizes the MCMC chain for model parameters.
The trajplot function plots the longitudinal trajectory of features by local and global clustering.
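To make step (c) of the BayesT description concrete, the sketch below compares two vectors of discrepancy values and reports the share of draws in which the replicated value exceeds the observed one. The vectors T.obs and T.rep here are simulated placeholders, not output from BayesT, and the P value convention used (the proportion of draws with T.rep > T.obs) is the usual posterior predictive definition, which is an assumption about the package rather than something confirmed from its source.

```r
# Hypothetical discrepancy values, one per retained MCMC draw.
# These are simulated placeholders, NOT values produced by BayesT().
set.seed(1)
n.draws <- 1000
T.obs <- rchisq(n.draws, df = 50)  # stands in for observed-data discrepancies
T.rep <- rchisq(n.draws, df = 50)  # stands in for replicated-data discrepancies

# Posterior predictive P value (assumed convention):
# proportion of draws where the replicated discrepancy exceeds the observed one.
ppp <- mean(T.rep > T.obs)
print(ppp)
```

Under this convention, a value near 0.5 means the replicated data look like the observed data, while values close to 0 or 1 point to lack of fit.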
In this example, the PBC910 data set (a subset of the PBCseq data) in the
mixAK package is used because it is publicly available. The variables used
here are lbili, platelet and spiders. Of these three variables, lbili and
platelet are continuous, while spiders is categorical.
library(BCClong)
library(mixAK)
data(PBC910)
Here, we used a binomial distribution for the spiders marker, a Gaussian distribution for the lbili marker and a Poisson distribution for the platelet marker. The number of clusters was set to 2. All hyperparameters were left at their defaults.
We ran the model for 12,000 iterations, discarded the first 2,000 samples as burn-in, and kept every 10th sample. This resulted in 1,000 samples for each model parameter. The MCMC sampling took about 30 minutes on an AMD Ryzen™ 5 5600X desktop computer.
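The count of 1,000 retained samples follows directly from the three settings just described. The short calculation below reproduces it; the variable names mirror the BCC.multi arguments, and the exact iteration indices that the sampler keeps are an assumption made for illustration.

```r
max.iter <- 12000  # total MCMC iterations
burn.in  <- 2000   # iterations discarded as burn-in
thin     <- 10     # keep every 10th iteration after burn-in

# Number of samples retained per parameter
n.kept <- (max.iter - burn.in) / thin
print(n.kept)
#> [1] 1000

# One possible set of retained iteration indices (assumed, for illustration)
kept.iter <- seq(burn.in + thin, max.iter, by = thin)
length(kept.iter)
```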
Since the full run takes a long time, this example uses a pre-computed
result. The pre-computed data file can be found at ./inst/extdata/PBCseq.rds.
Note that the call shown next uses a much shorter chain (burn.in = 100,
max.iter = 200) than the run described above.
set.seed(89)
fit.BCC2 <- BCC.multi(
  mydat = list(PBC910$lbili, PBC910$platelet, PBC910$spiders),
  dist = c("gaussian", "poisson", "binomial"),
  id = list(PBC910$id),
  time = list(PBC910$month),
  formula = list(y ~ time + (1|id), y ~ time + (1|id), y ~ time + (1|id)),
  num.cluster = 2,
  burn.in = 100,
  thin = 10,
  per = 10,
  max.iter = 200)
To use the pre-computed result, download the PBCseq.rds object from the
inst/extdata/ folder on GitHub, then run the following code.
# pre-computed result
fit.BCC2 <- readRDS("../inst/extdata/PBCseq.rds")
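The relative path above only works from a checkout of the package source. As an alternative sketch, files under inst/ are normally copied to the top level of an installed R package, so the same file may be reachable through system.file(). This assumes that the installed BCClong actually ships extdata/PBCseq.rds, which is not confirmed here; the variable name rds.path is only illustrative.

```r
# Look for the file inside an installed copy of BCClong.
# system.file() returns "" when the package or the file is not found.
rds.path <- system.file("extdata", "PBCseq.rds", package = "BCClong")

if (nzchar(rds.path)) {
  # File found in the installed package: load the stored fit.
  fit.BCC2 <- readRDS(rds.path)
} else {
  # Assumption failed: fall back to the manual download described above.
  message("PBCseq.rds not found in an installed BCClong; download it from GitHub instead.")
}
```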
To print the summary statistics for all parameters:
fit.BCC2$summary.stat
To print the proportion for each cluster (mean, sd, 2.5% and 97.5% percentiles), use the code below. A Geweke statistic (geweke.stat) between -2 and 2 suggests that the parameter has converged.
fit.BCC2$summary.stat$PPI
The code below prints out all major parameters
print(fit.BCC2$summary.stat$PPI)
#>                     [,1]        [,2]
#> mean          0.92156956  0.07843044
#> sd            0.05106657  0.05106657
#> 2.5%          0.85047520  0.01171743
#> 97.5%         0.98828257  0.14952480
#> geweke.stat -10.34604373 10.34604373
print(fit.BCC2$summary.stat$ALPHA)
#>                    [,1]       [,2]       [,3]
#> mean        0.504219437 0.51748095 0.80783581
#> sd          0.003763516 0.01699388 0.02903671
#> 2.5%        0.500782965 0.50037179 0.77781701
#> 97.5%       0.511449666 0.55117093 0.85592659
#> geweke.stat 1.895661030 4.42684423 0.80798946
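The rule of thumb stated earlier (a Geweke statistic between -2 and 2 suggests convergence) can be applied to the PPI and ALPHA output printed above. The sketch below copies the geweke.stat rows by hand, rounded to three decimals, and flags which entries fall inside the band; it does not read from fit.BCC2, and the object names are illustrative only.

```r
# Geweke statistics copied (rounded) from the printed PPI and ALPHA tables.
geweke <- list(
  PPI   = c(-10.346, 10.346),
  ALPHA = c(1.896, 4.427, 0.808)
)

# TRUE where the statistic lies strictly inside the (-2, 2) band.
within.band <- lapply(geweke, function(z) abs(z) < 2)
print(within.band)
```

For this particular output, both PPI entries and the second ALPHA entry fall outside the band, so by that rule these parameters would not be judged to have converged.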
print(fit.BCC2$cluster.global)
#> [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#> [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#> [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#> [112] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#> [149] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#> [186] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#> [223] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#> [260] 1
print(fit.BCC2$cluster.local[[1]])
#> [1] 2 2 1 1 2 2 2 1 2 2 2 2 2 2 1 2 1 2 1 2 1 2 2 2 1 2 1 2 2 1 1 2 1 2 1 2 2
#> [38] 2 2 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 1 1 1 2 2 2 2 2 1 1 2 2 1 1
#> [75] 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 2 2 1 1 1 2 1 2 1 2 1 2 1 2 2 2 2 2 2 1 2
#> [112] 2 2 2 1 2 2 2 1 2 2 1 2 2 2 1 1 2 1 2 1 2 2 2 2 2 2 2 1 2 2 1 2 2 2 1 2 2
#> [149] 2 1 2 1 2 2 1 2 2 1 2 2 2 1 1 2 2 2 2 1 2 2 2 2 2 2 2 2 1 2 2 2 1 2 2 2 2
#> [186] 1 2 1 1 2 2 2 1 2 2 2 2 1 1 1 2 1 1 2 2 2 2 2 1 2 2 2 2 1 1 2 2 2 1 1 2 1
#> [223] 1 2 2 2 1 1 1 2 2 1 2 2 2 2 2 2 1 1 1 2 1 1 1 2 2 1 2 2 2 2 1 2 2 2 2 2 2
#> [260] 1
print(fit.BCC2$cluster.local[[2]])
#> [1] 2 2 2 2 1 2 1 2 1 2 2 2 1 2 1 1 2 1 1 1 2 2 2 2 1 1 2 1 2 2 1 2 2 1 2 1 1
#> [38] 2 1 1 2 2 2 1 1 1 2 2 2 2 2 1 2 1 2 2 1 1 2 1 2 2 2 1 1 1 2 2 1 2 1 2 1 1
#> [75] 1 1 2 1 2 1 1 2 1 2 2 2 1 2 2 2 1 1 1 2 1 2 2 2 2 1 2 1 1 2 2 1 1 2 1 2 2
#> [112] 1 1 2 1 1 2 2 1 2 1 1 2 1 1 1 2 1 1 1 2 2 2 1 1 1 1 2 1 1 1 2 1 1 2 2 2 1
#> [149] 1 1 2 2 2 2 1 1 1 2 2 2 1 1 2 2 1 2 1 2 2 2 1 1 2 2 1 1 2 2 1 2 2 2 2 1 1
#> [186] 1 2 2 2 1 1 1 1 1 2 1 1 1 2 2 1 2 1 1 2 1 2 1 2 2 2 2 2 1 2 1 2 1 2 2 1 2
#> [223] 2 1 1 2 2 1 1 2 1 2 1 2 2 1 1 2 2 2 1 1 1 2 2 2 2 2 1 1 1 2 1 2 1 1 2 1 1
#> [260] 2
print(fit.BCC2$cluster.local[[3]])
#> [1] 2 2 2 2 1 1 1 1 1 2 1 1 1 1 1 2 1 1 2 1 1 1 1 1 1 1 2 1 1 1 2 1 1 2 1 1 1
#> [38] 1 1 1 1 2 1 2 2 1 2 1 1 1 1 1 1 1 1 2 1 1 1 1 1 2 1 1 2 1 1 1 1 2 1 1 1 1
#> [75] 1 1 1 1 1 1 1 1 1 1 1 2 1 2 2 1 1 2 2 1 2 1 2 1 1 1 2 1 2 1 1 1 1 1 1 1 1
#> [112] 1 1 1 2 2 1 1 1 2 1 1 1 2 1 1 1 2 1 1 2 1 1 1 1 1 1 2 2 1 1 2 1 1 1 1 2 1
#> [149] 1 1 1 2 1 1 2 1 1 1 2 1 2 1 2 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#> [186] 1 1 2 2 1 2 1 2 2 1 1 1 2 1 1 1 1 1 1 2 1 1 2 2 2 2 1 1 1 1 2 1 1 1 1 1 1
#> [223] 1 2 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 2 1 2 1 1 1 1 1 1
#> [260] 2
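Long label vectors like the ones printed above are easier to compare as a cross-tabulation. The sketch below uses two short made-up label vectors to show the idea. With a real fit, one would presumably substitute fit.BCC2$cluster.local[[1]] and fit.BCC2$cluster.global. It assumes that local and global labels share the same numbering.

```r
# Made-up cluster labels for 10 subjects (NOT taken from fit.BCC2).
global.lab <- c(1, 1, 1, 2, 2, 1, 2, 1, 1, 2)
local.lab  <- c(1, 2, 1, 2, 2, 1, 1, 1, 1, 2)

# Cross-tabulate local labels against global labels.
tab <- table(local = local.lab, global = global.lab)
print(tab)

# Share of subjects whose local and global labels coincide.
agreement <- mean(local.lab == global.lab)
print(agreement)
```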
We can use the traceplot function to plot the MCMC process and the trajplot function to plot the trajectory for each feature.
gp1 <- trajplot(fit=fit.BCC2,feature.ind=1,which.cluster = "local.cluster",
                title= bquote(paste("Local Clustering (",hat(alpha)[1] ==.(round(fit.BCC2$alpha[1],2)),")")),
                xlab="months",ylab="lbili",color=c("#00BA38", "#619CFF"))
gp2 <- trajplot(fit=fit.BCC2,feature.ind=2,which.cluster = "local.cluster",
                title= bquote(paste("Local Clustering (",hat(alpha)[2] ==.(round(fit.BCC2$alpha[2],2)),")")),
                xlab="months",ylab="platelet",color=c("#00BA38", "#619CFF"))
gp3 <- trajplot(fit=fit.BCC2,feature.ind=3,which.cluster = "local.cluster",
                title= bquote(paste("Local Clustering (",hat(alpha)[3] ==.(round(fit.BCC2$alpha[3],2)),")")),
                xlab="months",ylab="spiders",color=c("#00BA38", "#619CFF"))
gp4 <- trajplot(fit=fit.BCC2,feature.ind=1,which.cluster = "global.cluster",
                title="Global Clustering",
                xlab="months",ylab="lbili",color=c("#00BA38", "#619CFF"))
gp5 <- trajplot(fit=fit.BCC2,feature.ind=2,which.cluster = "global.cluster",
                title="Global Clustering",
                xlab="months",ylab="platelet",color=c("#00BA38", "#619CFF"))
gp6 <- trajplot(fit=fit.BCC2,feature.ind=3,which.cluster = "global.cluster",
                title="Global Clustering",
                xlab="months",ylab="spiders",color=c("#00BA38", "#619CFF"))
library(cowplot)
#dev.new(width=180, height=120)
plot_grid(gp1, gp2,gp3,gp4,gp5,gp6,
labels=c("(A)", "(B)", "(C)", "(D)", "(E)", "(F)"), ncol = 3, align = "v" )
sessionInfo()
#> R version 4.2.2 (2022-10-31 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 22621)
#>
#> Matrix products: default
#>
#> locale:
#> [1] LC_COLLATE=C
#> [2] LC_CTYPE=English_United States.utf8
#> [3] LC_MONETARY=English_United States.utf8
#> [4] LC_NUMERIC=C
#> [5] LC_TIME=English_United States.utf8
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] mixAK_5.5 lme4_1.1-31 Matrix_1.5-1 colorspace_2.0-3
#> [5] cowplot_1.1.1 ggplot2_3.4.0 joineRML_0.4.5 survival_3.4-0
#> [9] nlme_3.1-160 BCClong_1.0.0
#>
#> loaded via a namespace (and not attached):
#> [1] sass_0.4.2 LaplacesDemon_16.1.6 jsonlite_1.8.3
#> [4] splines_4.2.2 foreach_1.5.2 label.switching_1.8
#> [7] bslib_0.4.1 rngWELL_0.10-8 highr_0.9
#> [10] stats4_4.2.2 yaml_2.3.6 truncdist_1.0-2
#> [13] randtoolbox_2.0.3 pillar_1.8.1 lattice_0.20-45
#> [16] quantreg_5.94 glue_1.6.2 digest_0.6.30
#> [19] minqa_1.2.5 htmltools_0.5.4 lpSolve_5.6.17
#> [22] pkgconfig_2.0.3 SparseM_1.81 mvtnorm_1.1-3
#> [25] scales_1.2.1 MatrixModels_0.5-1 tibble_3.1.8
#> [28] combinat_0.0-8 mgcv_1.8-41 gmp_0.6-9
#> [31] farver_2.1.1 generics_0.1.3 withr_2.5.0
#> [34] cachem_1.0.6 nnet_7.3-18 Rmpfr_0.8-9
#> [37] cli_3.4.1 fastGHQuad_1.0.1 mnormt_2.1.1
#> [40] magrittr_2.0.3 mclust_6.0.0 mcmc_0.9-7
#> [43] evaluate_0.18 fansi_1.0.3 doParallel_1.0.17
#> [46] MASS_7.3-58.1 tools_4.2.2 lifecycle_1.0.3
#> [49] stringr_1.4.1 MCMCpack_1.6-3 munsell_0.5.0
#> [52] cluster_2.1.4 compiler_4.2.2 jquerylib_0.1.4
#> [55] evd_2.3-6.1 rlang_1.0.6 grid_4.2.2
#> [58] nloptr_2.0.3 iterators_1.0.14 rstudioapi_0.14
#> [61] labeling_0.4.2 cobs_1.3-5 rmarkdown_2.18
#> [64] boot_1.3-28 gtable_0.3.1 codetools_0.2-18
#> [67] R6_2.5.1 knitr_1.41 dplyr_1.0.10
#> [70] fastmap_1.1.0 utf8_1.2.2 stringi_1.7.8
#> [73] parallel_4.2.2 Rcpp_1.0.9 vctrs_0.5.0
#> [76] tidyselect_1.2.0 xfun_0.34 coda_0.19-4