This package implements an exhaustive framework to perform post-selection inference with kernels.
It uses quadratic kernel association scores to measure the association between a given kernel and an outcome of interest. These scores are used for the selection of the kernels in a forward fashion. If kernels are defined on sets of features, this allows for non-linear feature selection; if the kernels used all belong to the same family, but with different hyperparameters, this allows for hyperparameter selection.
The selection procedure allows the modeling of the selection event as a succession of quadratic constraints of the outcome. Finally, under the selection event, we derive empirical p-values to measure the significance of the effect of the selected kernels on the outcome.
You can install the released version of kernelPSI from CRAN with:
install.packages("kernelPSI")
The latest development version is directly available from GitHub:
install.packages("devtools")
::install_github("EpiSlim/kernelPSI") devtools
We illustrate the use of kernelPSI
on a toy example. For
the sake of simplicity, we use linear kernels. Other commonly-used
kernels are also available from the R package kernlab
.
require("kernelPSI")
require("kernlab")
require("bindata")
set.seed(64)
# Generation of the covariates, the similarity matrices and the outcome
<- 10 # total number of kernels
n_kernels <- 3 # number of kernels used to generate the outcome
m_kernels <- 5 # dimensionality of the data associated to each kernel
size_kernels
<- .05 # effect size
theta
<- 100 # sample size
n <- 0.6 # correlation parameter (comprised between -1 and +1)
rho
# Correlation matrix
<- outer(seq_len(size_kernels), seq_len(size_kernels),
corr function(i, j) return(rho^(abs(i-j))))
<- 0.4 # parameter for for the Bernoulli margin distributions
marg
# Design matrix
<- replicate(n_kernels,
X rmvbin(n, margprob = rep(marg, size_kernels), bincorr = corr),
simplify = FALSE)
<- replicate(n_kernels, vanilladot()) # full set of kernels
K
# Gram matrices
<- sapply(seq_len(n_kernels),
Kmat function(i) {kMatrix <- kernelMatrix(K[[i]], X[[i]]); return(as.kernelMatrix(kMatrix, center = TRUE))},
simplify = FALSE)
<- Reduce(`+`, Kmat[seq_len(m_kernels)]) # Gram matrix for the sum kernel of the first m_kernels kernels
Ksum <- eigen(Ksum) # eigenvalue decomposition of the Ksum matrix
decompK
<- as.matrix(theta * decompK$values[1] * decompK$vectors[, 1] + rnorm(n), ncol = 1) # response vector
Y <- kernelMatrix(new("vanillakernel"), Y) # linear kernel of the response Lmat
We can now proceed to the selection of the kernels, using either the fixed or adaptive variants.
<- 4 # number of selected kernels for the fixed variant
candidate_kernels
<- FOHSIC(Kmat, Lmat, mKernels = candidate_kernels) # Fixed variant
selectFOHSIC <- adaFOHSIC(Kmat, Lmat) # adaptive variant selectAHSIC
Before drawing replicates under the selection event, we first need to model the corresponding constraints.
<- forwardQ(Kmat, selectFOHSIC)
constraintFO
<- adaQ(Kmat, selectAHSIC[["selection"]], selectAHSIC[["n"]])
adaFO <- selectAHSIC$selection[seq_len(selectAHSIC$n)] # indices of selected kernels adaS
The wrapper function kernelPSI
computes p-values for
three different statistics (see documentation).
<- 1000 # number of replicates (to be increased for real use cases)
n_replicates <- 100 # number of burn-in iterations
burn_in
# Fixed variant ------------------
# selected methods: 'ridge' for the kernel ridge regression prototype
# and 'hsic' for the HSIC unbiased estimator
kernelPSI(Y, K_select = Kmat[selectFOHSIC], constraintFO, method = c("ridge", "hsic"),
n_replicates = n_replicates, burn_in = burn_in)
#> $ridge
#> [1] 0.022
#>
#> $hsic
#> [1] 0.06
# Adaptive variant ------------------
# selected method: 'pca' for the kernel principal component regression prototype
kernelPSI(Y, K_select = Kmat[adaS], constraintFO, method = c("pca"),
n_replicates = n_replicates, burn_in = burn_in)
#> $pca
#> [1] 0.192
Lotfi Slim, Clément Chatelain, Chloé-Agathe Azencott, and Jean-Philippe Vert. kernelPSI: a post-selection inference framework for nonlinear variable selection, Proceedings of the Thirty-Sixth International Conference on Machine Learning (ICML), 2019.