Abstract
In this vignette, we use repeated cross-validation to tune the hyperparameters of a custom model function with cross_validate_fn(). Additionally, we learn to preprocess the training and test folds within the cross-validation. Examples of model functions, predict functions, and preprocess functions are available in model_functions(), predict_functions(), and preprocess_functions(). These can be used directly or as starting points.
Contact the author at r-pkgs@ludvigolsen.dk
Where cross_validate() only allows us to cross-validate the lm(), lmer(), glm(), and glmer() model functions, cross_validate_fn() can cross-validate any model function that can be wrapped in a specific function and predict new data. This means we can cross-validate support-vector machines, neural networks, naïve Bayes, etc., and compare them.

As these model functions are all a bit different and return predictions in different formats, we need to wrap them in a specific set of functions that cross_validate_fn() knows how to deal with. That requires a bit more work than using cross_validate() but is very flexible.

In this vignette, we will learn how to convert our model to a model function and an associated predict function. After doing this once, you can save them in a separate R script and reuse them in future projects.
Once you’ve completed this tutorial, you will be able to convert your own models to model functions and predict functions, tune their hyperparameters with grid search, and preprocess the training and test folds within the cross-validation.
We start by attaching the needed packages and setting a random seed for reproducibility:
library(cvms)
library(groupdata2) # fold()
library(dplyr)
library(knitr) # kable() : formats the output as a table
library(e1071) # svm()
set.seed(1)
We could enable parallelization to speed up the fold() and cross_validate_fn() functions. Uncomment the code below and set parallel = TRUE when calling those functions.
# Enable parallelization by uncommenting
# library(doParallel)
# registerDoParallel(4) # 4 cores
For the regression and binary classification examples, we will use the participant.scores dataset from cvms. It features 10 participants who partook in an incredibly fictional study with three sessions of a task. Some of the participants have the diagnosis 1 (sounds scary!) and they all got a score after each session.
In order to use the data for cross-validation, we need to divide it into folds. For this we use the fold() function from groupdata2. We create 3 folds as it is a small dataset. With more data, it is common to use 10 folds (although that is a bit arbitrary).

As the randomness in this data splitting process can influence the model comparison, we do it multiple times and average the results. This is called repeated cross-validation. By setting num_fold_cols = 5, fold() will create 5 unique fold columns. Unless your model takes a long time to fit, it’s common to use 10-100 repetitions, but we pick 5 to speed up the process for this tutorial. Remember that you can enable parallelization to utilize multiple CPU cores.

By setting cat_col = "diagnosis", we ensure a similar ratio of participants with and without the diagnosis in all the folds.

By setting id_col = "participant", we ensure that all the rows belonging to a participant are put in the same fold. If we do not ensure this, we could be testing on a participant we also trained on, which is cheating! :)
# Prepare dataset
data <- participant.scores
data$diagnosis <- factor(data$diagnosis)

# Create 5 fold columns with 3 folds each
data <- fold(
  data,
  k = 3,
  cat_col = "diagnosis",
  id_col = "participant",
  num_fold_cols = 5,
  parallel = FALSE # set to TRUE to run in parallel
)

# Order by participant
data <- data %>%
  dplyr::arrange(participant)

# Look at the first 12 rows
# Note: kable() just formats the table
data %>% head(12) %>% kable()
participant | age | diagnosis | score | session | .folds_1 | .folds_2 | .folds_3 | .folds_4 | .folds_5 |
---|---|---|---|---|---|---|---|---|---|
1 | 20 | 1 | 10 | 1 | 3 | 3 | 3 | 2 | 3 |
1 | 20 | 1 | 24 | 2 | 3 | 3 | 3 | 2 | 3 |
1 | 20 | 1 | 45 | 3 | 3 | 3 | 3 | 2 | 3 |
2 | 23 | 0 | 24 | 1 | 1 | 3 | 3 | 3 | 2 |
2 | 23 | 0 | 40 | 2 | 1 | 3 | 3 | 3 | 2 |
2 | 23 | 0 | 67 | 3 | 1 | 3 | 3 | 3 | 2 |
3 | 27 | 1 | 15 | 1 | 2 | 1 | 3 | 1 | 1 |
3 | 27 | 1 | 30 | 2 | 2 | 1 | 3 | 1 | 1 |
3 | 27 | 1 | 40 | 3 | 2 | 1 | 3 | 1 | 1 |
4 | 21 | 0 | 35 | 1 | 3 | 3 | 3 | 1 | 3 |
4 | 21 | 0 | 50 | 2 | 3 | 3 | 3 | 1 | 3 |
4 | 21 | 0 | 78 | 3 | 3 | 3 | 3 | 1 | 3 |
We can check that the cat_col and id_col arguments did their thing:
# Check the distribution of 'diagnosis' in the first fold column
# Note: this would be more even for a larger dataset
data %>%
  dplyr::count(.folds_1, diagnosis) %>%
  kable()
.folds_1 | diagnosis | n |
---|---|---|
1 | 0 | 3 |
1 | 1 | 6 |
2 | 0 | 3 |
2 | 1 | 6 |
3 | 0 | 6 |
3 | 1 | 6 |
# Check the distribution of 'participant' in the first fold column
# Note that all rows for a participant are in the same fold
data %>%
  dplyr::count(.folds_1, participant) %>%
  kable()
.folds_1 | participant | n |
---|---|---|
1 | 2 | 3 |
1 | 5 | 3 |
1 | 7 | 3 |
2 | 3 | 3 |
2 | 6 | 3 |
2 | 10 | 3 |
3 | 1 | 3 |
3 | 4 | 3 |
3 | 8 | 3 |
3 | 9 | 3 |
Now that we have the data ready, we can try to predict the score with a support-vector machine (SVM). While building the analysis, we will use the third fold in the first fold column as the test set and the rest as the training set. The SVM will be fitted on the training set and used to predict the score in the test set. Finally, we will evaluate the predictions with evaluate():
# Split into train and test sets
test_set <- data %>%
  dplyr::filter(.folds_1 == 3)
train_set <- data %>%
  dplyr::filter(.folds_1 != 3)

# Fit SVM model
svm_model <- e1071::svm(
  formula = score ~ diagnosis + age + session,
  data = train_set,
  kernel = "linear",
  cost = 10,
  type = "eps-regression"
)

# Predict scores in the test set
predicted_scores <- predict(
  svm_model,
  newdata = test_set,
  allow.new.levels = TRUE)

predicted_scores
#> 1 2 3 4 5 6 7 8
#> 9.412735 27.545911 45.679086 28.305468 46.438644 64.571820 9.652563 27.785739
#> 9 10 11 12
#> 45.918915 31.423234 49.556410 67.689586
# Add predictions to test set
test_set[["predicted score"]] <- predicted_scores

# Evaluate the predictions
evaluate(
  data = test_set,
  target_col = "score",
  prediction_cols = "predicted score",
  type = "gaussian"
)
#> # A tibble: 1 × 8
#> RMSE MAE `NRMSE(IQR)` RRSE RAE RMSLE Predictions Process
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list>
#> 1 5.25 3.97 0.253 0.277 0.256 0.171 <tibble [12 × 2]> <prcss_n_>
The Mean Absolute Error (MAE) metric tells us that the predictions are, on average, 3.97 off the observed scores. That’s only for one fold though, so next we will convert the model to a model function and the predict() call to a predict function that can be used within cross_validate_fn().
A model function for cross_validate_fn() has the arguments train_data, formula, and hyperparameters. We don’t need to use all of them, but those are the inputs it will receive when called inside cross_validate_fn().

To convert our model, we wrap it like this:
svm_model_fn <- function(train_data, formula, hyperparameters) {
  e1071::svm(
    formula = formula,
    data = train_data,
    kernel = "linear",
    cost = 10,
    type = "eps-regression"
  )
}
Note: In R, the value of the last expression in a function body is returned. This means we don’t need to use return() in the above. Feel free to add it though.
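For illustration, an equivalent version with an explicit return() (just a sketch with a hypothetical name) could look like this:

svm_model_fn_explicit <- function(train_data, formula, hyperparameters) {
  # Fit the model and assign it to a variable
  model <- e1071::svm(
    formula = formula,
    data = train_data,
    kernel = "linear",
    cost = 10,
    type = "eps-regression"
  )
  # Return the fitted model explicitly
  return(model)
}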
We can test the model function on the training set:
# Try the model function
# Due to "lazy evaluation" in R, we don't have to pass
# the arguments that are not used inside the function
m0 <- svm_model_fn(train_data = train_set,
                   formula = score ~ diagnosis + age + session)

m0
#>
#> Call:
#> svm(formula = formula, data = train_data, kernel = "linear", cost = 10,
#> type = "eps-regression")
#>
#>
#> Parameters:
#> SVM-Type: eps-regression
#> SVM-Kernel: linear
#> cost: 10
#> gamma: 0.25
#> epsilon: 0.1
#>
#>
#> Number of Support Vectors: 17
The predict() function varies a bit for different models, so we need to supply a predict function that works with our model function and returns the predictions in the right format. Which format is correct depends on the kind of task we are performing. For regression ('gaussian'), the predictions should be a single vector with the predicted values.
The predict function must have the arguments test_data, model, formula, hyperparameters, and train_data. Again, we don’t need to use all of them inside the function, but they must be there.
We can convert our predict() call to a predict function like so:
svm_predict_fn <- function(test_data, model, formula, hyperparameters, train_data) {
  predict(object = model,
          newdata = test_data,
          allow.new.levels = TRUE)
}

# Try the predict function
svm_predict_fn(test_data = test_set, model = m0)
#> 1 2 3 4 5 6 7 8
#> 9.412735 27.545911 45.679086 28.305468 46.438644 64.571820 9.652563 27.785739
#> 9 10 11 12
#> 45.918915 31.423234 49.556410 67.689586
Now, we can cross-validate our model function!
We will cross-validate a couple of formulas at once. These are passed as strings and converted to formula objects internally. We also supply the dataset with the fold columns and the names of the fold columns (".folds_1", ".folds_2", etc.).
cv_1 <- cross_validate_fn(
  data = data,
  formulas = c("score ~ diagnosis + age + session",
               "score ~ diagnosis + age",
               "score ~ diagnosis"),
  type = "gaussian",
  model_fn = svm_model_fn,
  predict_fn = svm_predict_fn,
  fold_cols = paste0(".folds_", 1:5),
  parallel = FALSE # set to TRUE to run in parallel
)
#> Will cross-validate 3 models. This requires fitting 45 model instances.

cv_1
#> # A tibble: 3 × 17
#> Fixed RMSE MAE NRMSE…¹ RRSE RAE RMSLE Predic…² Results Coeffi…³ Folds
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list> <list> <int>
#> 1 diagno… 10.5 8.25 0.543 0.571 0.561 0.298 <tibble> <tibble> <tibble> 15
#> 2 diagno… 17.9 15.0 0.905 0.973 1.01 0.506 <tibble> <tibble> <tibble> 15
#> 3 diagno… 16.7 14.3 0.828 0.902 0.947 0.486 <tibble> <tibble> <tibble> 15
#> # … with 6 more variables: `Fold Columns` <int>, `Convergence Warnings` <int>,
#> # `Other Warnings` <int>, `Warnings and Messages` <list>, Process <list>,
#> # Dependent <chr>, and abbreviated variable names ¹`NRMSE(IQR)`,
#> # ²Predictions, ³Coefficients
The first formula has the lowest Root Mean Square Error (RMSE). Before learning how to inspect the output, let’s enable hyperparameters, so we can try different kernels and costs.
Hyperparameters are settings for the model function that we can tweak to get better performance. The SVM model has several such settings that we can play with, like the kernel and cost arguments.

The hyperparameters passed to the model function can be indexed with [["name"]]. This means we can get the kernel for the current model instance with hyperparameters[["kernel"]]. In case the user forgets to pass a kernel in the hyperparameters, we need to check that it’s available though. Here’s an example:
svm_model_fn <- function(train_data, formula, hyperparameters) {

  # Required hyperparameters:
  # - kernel
  # - cost
  if (!"kernel" %in% names(hyperparameters))
    stop("'hyperparameters' must include 'kernel'")
  if (!"cost" %in% names(hyperparameters))
    stop("'hyperparameters' must include 'cost'")

  e1071::svm(
    formula = formula,
    data = train_data,
    kernel = hyperparameters[["kernel"]],
    cost = hyperparameters[["cost"]],
    scale = FALSE,
    type = "eps-regression"
  )
}

# Try the model function
svm_model_fn(
  train_data = train_set,
  formula = score ~ diagnosis + age + session,
  hyperparameters = list(
    "kernel" = "linear",
    "cost" = 5
  )
)
#>
#> Call:
#> svm(formula = formula, data = train_data, kernel = hyperparameters[["kernel"]],
#> cost = hyperparameters[["cost"]], type = "eps-regression", scale = FALSE)
#>
#>
#> Parameters:
#> SVM-Type: eps-regression
#> SVM-Kernel: linear
#> cost: 5
#> gamma: 0.25
#> epsilon: 0.1
#>
#>
#> Number of Support Vectors: 18
When we have a lot of hyperparameters, we quickly get a lot of if statements for performing those checks. We may also wish to provide default values when a hyperparameter is not passed by the user. For this purpose, the update_hyperparameters() function is available. It checks that the required hyperparameters are present and sets default values for those that are not required and were not passed by the user.
There are three parts to the inputs to update_hyperparameters():

1. The default hyperparameter values (passed first). E.g. if we wish to set the default for the kernel parameter to "radial", we simply pass kernel = "radial". When the user doesn’t pass a kernel setting, this default value is used.
2. The list of hyperparameters. Remember to name the argument when passing it, i.e.: hyperparameters = hyperparameters.
3. The names of the required hyperparameters. If any of these are not in the hyperparameters, an error is thrown. Remember to name the argument when passing it, e.g.: .required = c("cost", "scale").

It returns the updated list of hyperparameters.
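As a quick standalone illustration (a sketch run outside any model function; the user_hparams name is made up), the call below fills in the default kernel because the user only passed a cost, and would throw an error if cost were missing:

# Hypothetical hyperparameters passed by the user (no 'kernel')
user_hparams <- list("cost" = 10)

# Fill in the default kernel and require that 'cost' is present
update_hyperparameters(
  kernel = "radial",               # default value, passed first
  hyperparameters = user_hparams,  # must be a named argument
  .required = "cost"               # must be a named argument
)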
Let’s specify the model function such that cost must be passed, while kernel is optional and has the default value "radial":
svm_model_fn <- function(train_data, formula, hyperparameters) {

  # Required hyperparameters:
  # - cost
  # Optional hyperparameters:
  # - kernel

  # 1) If 'cost' is not present in hyperparameters, throw error
  # 2) If 'kernel' is not present in hyperparameters, set to "radial"
  hyperparameters <- update_hyperparameters(
    kernel = "radial",
    hyperparameters = hyperparameters,
    .required = "cost"
  )

  e1071::svm(
    formula = formula,
    data = train_data,
    kernel = hyperparameters[["kernel"]],
    cost = hyperparameters[["cost"]],
    type = "eps-regression"
  )
}
In order to find the best combination of hyperparameters for our model, we can simply try all of them. This is called grid search.
We specify the different values we wish to try per hyperparameter in a list of named vectors, like so:
hparams <- list(
  "kernel" = c("linear", "radial"),
  "cost" = c(1, 5, 10)
)
cross_validate_fn() will then cross-validate every combination of the values.
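To see which combinations that implies, we can sketch the grid ourselves with expand.grid() (purely for illustration; cross_validate_fn() builds it internally):

# The 2 kernels x 3 costs = 6 combinations that grid search covers
expand.grid(
  "kernel" = c("linear", "radial"),
  "cost" = c(1, 5, 10),
  stringsAsFactors = FALSE
)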
If we want to randomly sample 4 of the combinations (e.g. to save time), we can pass .n = 4 at the beginning of the list:
hparams <- list(
  ".n" = 4,
  "kernel" = c("linear", "radial"),
  "cost" = c(1, 5, 10)
)
Alternatively, we can supply a data frame with the exact combinations we want. Each column should be a hyperparameter and each row a combination of the values to cross-validate. E.g.:
df_hparams <- data.frame(
  "kernel" = c("linear", "radial", "radial"),
  "cost" = c(10, 1, 10)
)

df_hparams
#> kernel cost
#> 1 linear 10
#> 2 radial 1
#> 3 radial 10
We will use the hparams list.
Now, we can cross-validate our hyperparameters:
# Set seed for the sampling of the hyperparameter combinations
set.seed(1)

cv_2 <- cross_validate_fn(
  data = data,
  formulas = c("score ~ diagnosis + age + session",
               "score ~ diagnosis"),
  type = "gaussian",
  model_fn = svm_model_fn,
  predict_fn = svm_predict_fn,
  hyperparameters = hparams, # Pass the list of values to test
  fold_cols = paste0(".folds_", 1:5)
)
#> Will cross-validate 8 models. This requires fitting 120 model instances.

cv_2
#> # A tibble: 8 × 18
#> Fixed RMSE MAE NRMSE…¹ RRSE RAE RMSLE Predic…² Results Coeffi…³ Folds
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list> <list> <int>
#> 1 diagno… 10.2 8.03 0.522 0.556 0.544 0.286 <tibble> <tibble> <tibble> 15
#> 2 diagno… 14.5 11.9 0.738 0.789 0.803 0.407 <tibble> <tibble> <tibble> 15
#> 3 diagno… 10.5 8.26 0.543 0.571 0.562 0.298 <tibble> <tibble> <tibble> 15
#> 4 diagno… 14.9 12.5 0.761 0.813 0.841 0.422 <tibble> <tibble> <tibble> 15
#> 5 diagno… 18.6 15.2 0.930 1.00 1.01 0.514 <tibble> <tibble> <tibble> 15
#> 6 diagno… 17.6 14.8 0.880 0.948 0.982 0.494 <tibble> <tibble> <tibble> 15
#> 7 diagno… 17.3 14.7 0.866 0.935 0.978 0.492 <tibble> <tibble> <tibble> 15
#> 8 diagno… 17.1 14.6 0.850 0.921 0.969 0.489 <tibble> <tibble> <tibble> 15
#> # … with 7 more variables: `Fold Columns` <int>, `Convergence Warnings` <int>,
#> # `Other Warnings` <int>, `Warnings and Messages` <list>, Process <list>,
#> # HParams <list<tibble[,2]>>, Dependent <chr>, and abbreviated variable names
#> # ¹`NRMSE(IQR)`, ²Predictions, ³Coefficients
The output has a lot of information and can be a bit hard to read. The first thing we wish to know is which model performed the best. We will use the RMSE metric to determine this. Lower is better.

We order the data frame by the RMSE and use the select_definitions() function to extract the formulas and hyperparameters. To recognize the models after the sorting, we create a Model ID column and include it along with the RMSE column:
cv_2 %>%
  # Create Model ID with values 1:8
  dplyr::mutate(`Model ID` = 1:nrow(cv_2)) %>%
  # Order by RMSE
  dplyr::arrange(RMSE) %>%
  # Extract formulas and hyperparameters
  select_definitions(additional_includes = c("RMSE", "Model ID")) %>%
  # Pretty table
  kable()
Dependent | Fixed | kernel | cost | RMSE | Model ID |
---|---|---|---|---|---|
score | diagnosis+age+session | linear | 1 | 10.19112 | 1 |
score | diagnosis+age+session | linear | 5 | 10.45016 | 3 |
score | diagnosis+age+session | radial | 5 | 14.53008 | 2 |
score | diagnosis+age+session | radial | 10 | 14.93221 | 4 |
score | diagnosis | radial | 10 | 17.08093 | 8 |
score | diagnosis | linear | 5 | 17.33319 | 7 |
score | diagnosis | radial | 5 | 17.56135 | 6 |
score | diagnosis | linear | 1 | 18.56190 | 5 |
The best model uses kernel = "linear" and cost = 1. Our Model ID column tells us this was the first row in the output. We can use this to access the predictions, fold results, warnings, and more:
# Extract fold results for the best model
cv_2$Results[[1]] %>% kable()
Fold Column | Fold | RMSE | MAE | NRMSE(IQR) | RRSE | RAE | RMSLE |
---|---|---|---|---|---|---|---|
.folds_1 | 1 | 13.385262 | 10.403463 | 0.4461754 | 0.7605651 | 0.7190107 | 0.3841163 |
.folds_1 | 2 | 8.541764 | 6.418853 | 0.5694509 | 0.4291060 | 0.4146149 | 0.1991230 |
.folds_1 | 3 | 5.087366 | 4.001948 | 0.2451743 | 0.2688444 | 0.2581902 | 0.1367535 |
.folds_2 | 1 | 8.541764 | 6.418853 | 0.5694509 | 0.4291060 | 0.4146149 | 0.1991230 |
.folds_2 | 2 | 11.641150 | 7.999899 | 0.5543405 | 0.6775330 | 0.5408947 | 0.3356195 |
.folds_2 | 3 | 6.257352 | 4.526802 | 0.2812293 | 0.3278149 | 0.2968395 | 0.1876330 |
.folds_3 | 1 | 13.222029 | 10.549869 | 0.5085396 | 0.6736734 | 0.6283378 | 0.3611420 |
.folds_3 | 2 | 17.846276 | 15.780440 | 1.1153922 | 1.0791058 | 1.2220035 | 0.5094823 |
.folds_3 | 3 | 6.257701 | 4.599791 | 0.2812450 | 0.3260698 | 0.3032829 | 0.1830861 |
.folds_4 | 1 | 14.771646 | 13.059488 | 1.3428769 | 0.7970060 | 0.9776511 | 0.4303192 |
.folds_4 | 2 | 8.072439 | 5.624487 | 0.3844018 | 0.4681478 | 0.3954717 | 0.2366059 |
.folds_4 | 3 | 12.301286 | 9.895444 | 0.4241823 | 0.6415395 | 0.5997239 | 0.3373387 |
.folds_5 | 1 | 4.385315 | 3.529620 | 0.3132368 | 0.2825336 | 0.2797449 | 0.1903896 |
.folds_5 | 2 | 13.385262 | 10.403463 | 0.4461754 | 0.7605651 | 0.7190107 | 0.3841163 |
.folds_5 | 3 | 9.170253 | 7.229994 | 0.3460473 | 0.4179586 | 0.3919876 | 0.2207415 |
# Extract 10 predictions from the best model
cv_2$Predictions[[1]] %>% head(10) %>% kable()
Fold Column | Fold | Observation | Target | Prediction |
---|---|---|---|---|
.folds_1 | 1 | 4 | 24 | 36.77192 |
.folds_1 | 1 | 5 | 40 | 52.30052 |
.folds_1 | 1 | 6 | 67 | 67.82912 |
.folds_1 | 1 | 13 | 24 | 11.87638 |
.folds_1 | 1 | 14 | 54 | 27.40497 |
.folds_1 | 1 | 15 | 62 | 42.93357 |
.folds_1 | 1 | 19 | 11 | 10.58407 |
.folds_1 | 1 | 20 | 35 | 26.11267 |
.folds_1 | 1 | 21 | 41 | 41.64127 |
.folds_1 | 2 | 7 | 15 | 14.94300 |
In a moment, we will go through a set of classification examples.
First, we will learn to preprocess the dataset inside cross_validate_fn().
If we wish to preprocess the data, e.g. standardizing the numeric columns, we can do so within cross_validate_fn(). The point is to extract the preprocessing parameters (mean, sd, min, max, etc.) from the training data and apply the transformations to both the training data and test data.
cvms has built-in examples of preprocess functions (see ?preprocess_functions()). They use the recipes package and are good starting points for writing your own preprocess function.
A preprocess function should have these arguments: train_data, test_data, formula, and hyperparameters. Again, we can choose to only use some of them.
It should return a named list with the preprocessed training data ("train") and test data ("test"). We can also include a data frame with the preprocessing parameters we used ("parameters"), so we can extract those later from the cross_validate_fn() output.
The form should be like this:
# NOTE: Don't run this
preprocess_fn <- function(train_data, test_data, formula, hyperparameters) {

  # Do preprocessing

  # Create data frame with applied preprocessing parameters

  # Return list with these names
  list("train" = train_data,
       "test" = test_data,
       "parameters" = preprocess_parameters)
}
Our preprocess function will standardize the age column:
preprocess_fn <- function(train_data, test_data, formula, hyperparameters) {

  # Standardize the age column

  # Get the mean and standard deviation from the train_data
  mean_age <- mean(train_data[["age"]])
  sd_age <- sd(train_data[["age"]])

  # Standardize both train_data and test_data
  train_data[["age"]] <- (train_data[["age"]] - mean_age) / sd_age
  test_data[["age"]] <- (test_data[["age"]] - mean_age) / sd_age

  # Create data frame with applied preprocessing parameters
  preprocess_parameters <- data.frame(
    "Measure" = c("Mean", "SD"),
    "age" = c(mean_age, sd_age)
  )

  # Return list with these names
  list("train" = train_data,
       "test" = test_data,
       "parameters" = preprocess_parameters)
}

# Try the preprocess function
prepped <- preprocess_fn(train_data = train_set, test_data = test_set)

# Inspect preprocessed training set
# Note that the age column has changed
prepped$train %>% head(5) %>% kable()
participant | age | diagnosis | score | session | .folds_1 | .folds_2 | .folds_3 | .folds_4 | .folds_5 |
---|---|---|---|---|---|---|---|---|---|
2 | -1.3215082 | 0 | 24 | 1 | 1 | 3 | 3 | 3 | 2 |
2 | -1.3215082 | 0 | 40 | 2 | 1 | 3 | 3 | 3 | 2 |
2 | -1.3215082 | 0 | 67 | 3 | 1 | 3 | 3 | 3 | 2 |
3 | -0.6871843 | 1 | 15 | 1 | 2 | 1 | 3 | 1 | 1 |
3 | -0.6871843 | 1 | 30 | 2 | 2 | 1 | 3 | 1 | 1 |
# Inspect preprocessing parameters
prepped$parameters %>% kable()
Measure | age |
---|---|
Mean | 31.333333 |
SD | 6.305926 |
Now, we add the preprocess function to our cross_validate_fn() call. We will only use the winning hyperparameters from the previous cross-validation, to save time:
cv_3 <- cross_validate_fn(
  data = data,
  formulas = c("score ~ diagnosis + age + session",
               "score ~ diagnosis"),
  type = "gaussian",
  model_fn = svm_model_fn,
  predict_fn = svm_predict_fn,
  preprocess_fn = preprocess_fn,
  hyperparameters = list(
    "kernel" = "linear",
    "cost" = 1
  ),
  fold_cols = paste0(".folds_", 1:5)
)
#> Will cross-validate 2 models. This requires fitting 30 model instances.

cv_3
#> # A tibble: 2 × 19
#> Fixed RMSE MAE NRMSE…¹ RRSE RAE RMSLE Predic…² Results Coeffi…³
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list> <list>
#> 1 diagnosis+ag… 10.2 8.03 0.522 0.556 0.544 0.286 <tibble> <tibble> <tibble>
#> 2 diagnosis 18.6 15.2 0.930 1.00 1.01 0.514 <tibble> <tibble> <tibble>
#> # … with 9 more variables: Preprocess <list>, Folds <int>,
#> # `Fold Columns` <int>, `Convergence Warnings` <int>, `Other Warnings` <int>,
#> # `Warnings and Messages` <list>, Process <list>, HParams <list<tibble[,2]>>,
#> # Dependent <chr>, and abbreviated variable names ¹`NRMSE(IQR)`,
#> # ²Predictions, ³Coefficients
This didn’t change the results but may do so in other contexts and for other model types.
We can check the preprocessing parameters for the different folds:
# Extract first 10 rows of the preprocess parameters
# for the first and best model
cv_3$Preprocess[[1]] %>% head(10) %>% kable()
Fold Column | Fold | Measure | age |
---|---|---|---|
.folds_1 | 1 | Mean | 26.571429 |
.folds_1 | 1 | SD | 5.608667 |
.folds_1 | 2 | Mean | 27.714286 |
.folds_1 | 2 | SD | 8.337523 |
.folds_1 | 3 | Mean | 31.333333 |
.folds_1 | 3 | SD | 6.305926 |
.folds_2 | 1 | Mean | 27.714286 |
.folds_2 | 1 | SD | 8.337523 |
.folds_2 | 2 | Mean | 25.000000 |
.folds_2 | 2 | SD | 4.743417 |
As mentioned, cvms has a couple of preprocess functions available. Here’s the code for the standardizer. If you haven’t used the recipes package before, it might not be that easy to read, but it basically does the same as ours, just to every numeric predictor. If we were to use it, we would need to make sure that the participant, diagnosis, and (perhaps) session columns were factors, as they would otherwise be standardized as well (see the sketch after the code below).
# Get built-in preprocess function
preprocess_functions("standardize")
#> function (train_data, test_data, formula, hyperparameters)
#> {
#> formula <- simplify_formula(formula, train_data)
#> recipe_object <- recipes::recipe(formula = formula, data = train_data) %>%
#> recipes::step_center(recipes::all_numeric(), -recipes::all_outcomes()) %>%
#> recipes::step_scale(recipes::all_numeric(), -recipes::all_outcomes()) %>%
#> recipes::prep(training = train_data)
#> train_data <- recipes::bake(recipe_object, train_data)
#> test_data <- recipes::bake(recipe_object, test_data)
#> means <- recipe_object$steps[[1]]$means
#> sds <- recipe_object$steps[[2]]$sds
#> tidy_parameters <- tibble::tibble(Measure = c("Mean", "SD")) %>%
#> dplyr::bind_cols(dplyr::bind_rows(means, sds))
#> list(train = train_data, test = test_data, parameters = tidy_parameters)
#> }
#> <bytecode: 0x12a7bbe40>
#> <environment: 0x12a7ca3f0>
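If we wanted to use the built-in standardizer here, a sketch (not run in this vignette; data_std is a made-up name) could first convert those columns to factors:

# Sketch: make sure the categorical columns are factors so the
# built-in standardizer leaves them alone
data_std <- data %>%
  dplyr::mutate(
    participant = as.factor(participant),
    diagnosis = as.factor(diagnosis),
    session = as.factor(session)
  )

# We could then pass preprocess_fn = preprocess_functions("standardize")
# and data = data_std to cross_validate_fn()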
Given that the preprocess function also receives the hyperparameters, we could write a preprocess function that gets the names of the columns to preprocess via the hyperparameters.
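Here is a sketch of that idea. The "standardize_col" hyperparameter name is something we make up; it would have to be included in the hyperparameters list we pass to cross_validate_fn():

# Sketch: preprocess function that standardizes the single column
# named in the (made-up) "standardize_col" hyperparameter
flexible_preprocess_fn <- function(train_data, test_data, formula, hyperparameters) {
  col <- hyperparameters[["standardize_col"]]

  # Get the parameters from the training data only
  col_mean <- mean(train_data[[col]])
  col_sd <- sd(train_data[[col]])

  # Apply the same transformation to both partitions
  train_data[[col]] <- (train_data[[col]] - col_mean) / col_sd
  test_data[[col]] <- (test_data[[col]] - col_mean) / col_sd

  list("train" = train_data,
       "test" = test_data,
       "parameters" = data.frame("Measure" = c("Mean", "SD"),
                                 "value" = c(col_mean, col_sd)))
}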
Next, we will go through an example of cross-validating a custom binary classifier.
For the binomial example, we will be predicting the diagnosis column with an SVM. For this, we need to tweak our functions a bit. First, we set type = "C-classification" and probability = TRUE in the svm() call. The first setting makes it perform classification instead of regression. The second allows us to extract the probabilities in our predict function.

This model function also works for multiclass classification, which is why we will reuse it for that in a moment.
# SVM model function for classification
clf_svm_model_fn <- function(train_data, formula, hyperparameters) {

  # Optional hyperparameters:
  # - kernel
  # - cost

  # Update missing hyperparameters with default values
  hyperparameters <- update_hyperparameters(
    kernel = "radial",
    cost = 1,
    hyperparameters = hyperparameters
  )

  e1071::svm(
    formula = formula,
    data = train_data,
    kernel = hyperparameters[["kernel"]],
    cost = hyperparameters[["cost"]],
    type = "C-classification",
    probability = TRUE # Must enable probability here
  )
}

# Try the model function
m1 <- clf_svm_model_fn(train_data = data, formula = diagnosis ~ score,
                       hyperparameters = list("kernel" = "linear"))

m1
#>
#> Call:
#> svm(formula = formula, data = train_data, kernel = hyperparameters[["kernel"]],
#> cost = hyperparameters[["cost"]], type = "C-classification",
#> probability = TRUE)
#>
#>
#> Parameters:
#> SVM-Type: C-classification
#> SVM-Kernel: linear
#> cost: 1
#>
#> Number of Support Vectors: 20
The predict function should return the probability of the second class (alphabetically). For the SVM, this is a bit tricky, but we will break it down in steps:

1. We set probability = TRUE in the predict() call. This stores the probabilities as an attribute of the predictions. Note that this won’t work if we forget to enable the probabilities in the model function!
2. We extract the probabilities with attr(predictions, "probabilities").
3. We convert the probabilities to a tibble (a kind of data frame) with dplyr::as_tibble().
4. At this point, we have a tibble with two columns with the probabilities for each of the two classes. As we need the probability of the second class, we select and return the second column (the probability of diagnosis being 1).

In most cases, the predict function will be simpler to write than this. The main take-away is that we predict the test set and extract the probabilities of the second class.
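For comparison, if we were cross-validating a logistic regression fitted with glm() instead, the predict function could be as short as this sketch (predict() with type = "response" already returns the probability of the second class):

# Sketch: predict function for a binomial glm() model
glm_predict_fn <- function(test_data, model, formula, hyperparameters, train_data) {
  predict(object = model,
          newdata = test_data,
          type = "response")
}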
# Predict function for binomial SVM
bnml_svm_predict_fn <- function(test_data, model, formula, hyperparameters, train_data) {
  # Predict test set
  predictions <- predict(
    object = model,
    newdata = test_data,
    allow.new.levels = TRUE,
    probability = TRUE
  )

  # Extract probabilities
  probabilities <- dplyr::as_tibble(attr(predictions, "probabilities"))

  # Return second column
  probabilities[[2]]
}

p1 <- bnml_svm_predict_fn(test_data = data, model = m1)

# Vector with probabilities that diagnosis is 1
p1
#> [1] 0.1120966 0.2131934 0.4600344 0.2131934 0.3934195 0.7388358 0.1422477
#> [8] 0.2731945 0.3934195 0.3305457 0.5281085 0.8375338 0.2131934 0.5819439
#> [15] 0.6829082 0.1357207 0.2224870 0.2731945 0.1176421 0.3305457 0.4065106
#> [22] 0.1490345 0.2953850 0.4465165 0.3068642 0.5686161 0.7281734 0.2624979
#> [29] 0.5951531 0.8585921
Now, we can cross-validate the model function:
cv_4 <- cross_validate_fn(
  data = data,
  formulas = c("diagnosis ~ score",
               "diagnosis ~ age"),
  type = "binomial",
  model_fn = clf_svm_model_fn,
  predict_fn = bnml_svm_predict_fn,
  hyperparameters = list(
    "kernel" = c("linear", "radial"),
    "cost" = c(1, 5, 10)
  ),
  fold_cols = paste0(".folds_", 1:5)
)
#> Will cross-validate 12 models. This requires fitting 180 model instances.

cv_4
#> # A tibble: 12 × 28
#> Fixed Balance…¹ F1 Sensi…² Speci…³ Pos P…⁴ Neg P…⁵ AUC Lower…⁶ Upper…⁷
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 score 0.394 0.364 0.322 0.467 0.453 0.306 0.350 0.144 0.555
#> 2 score 0.397 0.293 0.244 0.55 0.41 0.327 0.389 0.164 0.615
#> 3 score 0.403 0.367 0.322 0.483 0.460 0.316 0.352 0.143 0.561
#> 4 score 0.456 0.407 0.378 0.533 0.491 0.386 0.438 0.214 0.663
#> 5 score 0.403 0.369 0.322 0.483 0.453 0.324 0.405 0.190 0.620
#> 6 score 0.456 0.368 0.311 0.6 0.516 0.365 0.426 0.205 0.648
#> 7 age 0.617 0.701 0.733 0.5 0.693 0.633 0.625 0.406 0.844
#> 8 age 0.5 NaN 0.3 0.7 NaN 0.403 0.592 0.374 0.810
#> 9 age 0.533 NaN 0.367 0.7 NaN 0.451 0.467 0.257 0.676
#> 10 age 0.542 NaN 0.333 0.75 0.617 0.441 0.6 0.371 0.829
#> 11 age 0.55 0.536 0.5 0.6 0.653 0.537 0.525 0.301 0.749
#> 12 age 0.608 0.545 0.467 0.75 0.777 0.495 0.633 0.416 0.851
#> # … with 18 more variables: Kappa <dbl>, MCC <dbl>, `Detection Rate` <dbl>,
#> # `Detection Prevalence` <dbl>, Prevalence <dbl>, Predictions <list>,
#> # ROC <list>, `Confusion Matrix` <list>, Results <list>, Coefficients <list>,
#> # Folds <int>, `Fold Columns` <int>, `Convergence Warnings` <int>,
#> # `Other Warnings` <int>, `Warnings and Messages` <list>, Process <list>,
#> # HParams <list<tibble[,2]>>, Dependent <chr>, and abbreviated variable names
#> # ¹`Balanced Accuracy`, ²Sensitivity, ³Specificity, ⁴`Pos Pred Value`, …
Let’s order the models by the Balanced Accuracy metric (in descending order) and extract the formulas and hyperparameters:
cv_4 %>%
  dplyr::mutate(`Model ID` = 1:nrow(cv_4)) %>%
  dplyr::arrange(dplyr::desc(`Balanced Accuracy`)) %>%
  select_definitions(additional_includes = c("Balanced Accuracy", "F1", "MCC", "Model ID")) %>%
  kable()
Dependent | Fixed | kernel | cost | Balanced Accuracy | F1 | MCC | Model ID |
---|---|---|---|---|---|---|---|
diagnosis | age | linear | 1 | 0.6166667 | 0.7005128 | 0.2695860 | 7 |
diagnosis | age | radial | 10 | 0.6083333 | 0.5448196 | 0.2372335 | 12 |
diagnosis | age | linear | 10 | 0.5500000 | 0.5357576 | 0.1351019 | 11 |
diagnosis | age | radial | 5 | 0.5416667 | NaN | 0.0739342 | 10 |
diagnosis | age | linear | 5 | 0.5333333 | NaN | 0.0735712 | 9 |
diagnosis | age | radial | 1 | 0.5000000 | NaN | -0.0034522 | 8 |
diagnosis | score | radial | 5 | 0.4555556 | 0.4073420 | -0.1038522 | 4 |
diagnosis | score | radial | 10 | 0.4555556 | 0.3683114 | -0.1014684 | 6 |
diagnosis | score | linear | 5 | 0.4027778 | 0.3667013 | -0.2078914 | 3 |
diagnosis | score | linear | 10 | 0.4027778 | 0.3686869 | -0.2073524 | 5 |
diagnosis | score | radial | 1 | 0.3972222 | 0.2928246 | -0.2315451 | 2 |
diagnosis | score | linear | 1 | 0.3944444 | 0.3638492 | -0.2243185 | 1 |
Next, we will go through a short multiclass classification example.
For our multiclass classification example, we will use the musicians dataset from cvms. It has 60 musicians, grouped in four classes (A, B, C, D). The origin of the dataset is classified, so don’t ask too many questions about it!

Let’s create 5 fold columns with 5 folds each. We set cat_col = "Class" to ensure a similar ratio of the classes in all the folds and num_col = "Age" to get a similar average age in the folds. The latter is not required but could be useful if we had an important hypothesis regarding age.

We will not need the id_col as there is only one row per musician ID.
# Set seed for reproducibility
set.seed(1)

# Prepare dataset
data_mc <- musicians
data_mc[["ID"]] <- as.factor(data_mc[["ID"]])

# Create 5 fold columns with 5 folds each
data_mc <- fold(
  data = data_mc,
  k = 5,
  cat_col = "Class",
  num_col = "Age",
  num_fold_cols = 5
)

data_mc %>% head(10) %>% kable()
ID | Age | Class | Height | Drums | Bass | Guitar | Keys | Vocals | .folds_1 | .folds_2 | .folds_3 | .folds_4 | .folds_5 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 41 | A | 178 | 1 | 0 | 1 | 0 | 1 | 3 | 2 | 5 | 2 | 5 |
2 | 62 | A | 168 | 1 | 1 | 1 | 1 | 0 | 4 | 4 | 1 | 5 | 1 |
3 | 19 | A | 163 | 1 | 0 | 0 | 1 | 1 | 1 | 3 | 2 | 1 | 2 |
4 | 52 | A | 177 | 1 | 0 | 0 | 0 | 1 | 1 | 4 | 2 | 1 | 2 |
5 | 32 | A | 155 | 1 | 1 | 0 | 0 | 1 | 3 | 3 | 4 | 4 | 4 |
6 | 44 | A | 164 | 0 | 0 | 1 | 1 | 1 | 2 | 1 | 5 | 4 | 5 |
7 | 51 | A | 173 | 1 | 1 | 1 | 1 | 0 | 4 | 3 | 4 | 3 | 4 |
8 | 42 | A | 176 | 1 | 0 | 1 | 1 | 0 | 3 | 2 | 3 | 2 | 3 |
9 | 25 | A | 171 | 0 | 1 | 0 | 1 | 0 | 5 | 1 | 1 | 3 | 3 |
10 | 60 | A | 167 | 1 | 0 | 0 | 1 | 0 | 5 | 1 | 1 | 3 | 3 |
# You can use skimr to get a better overview of the dataset
# Uncomment:
# library(skimr)
# skimr::skim(data_mc)
As the model function from the binomial example also works with more than 2 classes, we only need to change the predict function. In multinomial classification, it should return a data frame with one column per class with the probabilities of that class. Hence, we copy the predict function from before and remove the [[2]] from the last line:
# Predict function for multinomial SVM
mc_svm_predict_fn <- function(test_data, model, formula, hyperparameters, train_data) {
  predictions <- predict(
    object = model,
    newdata = test_data,
    allow.new.levels = TRUE,
    probability = TRUE
  )

  # Extract probabilities
  probabilities <- dplyr::as_tibble(attr(predictions, "probabilities"))

  # Return all columns
  probabilities
}
With this, we can cross-validate a few formulas for predicting the Class. Remember that it’s possible to run this in parallel!
cv_5 <- cross_validate_fn(
  data = data_mc,
  formulas = c("Class ~ Age + Height",
               "Class ~ Age + Height + Bass + Guitar + Keys + Vocals"),
  type = "multinomial",
  model_fn = clf_svm_model_fn,
  predict_fn = mc_svm_predict_fn,
  hyperparameters = list(
    "kernel" = c("linear", "radial"),
    "cost" = c(1, 5, 10)
  ),
  fold_cols = paste0(".folds_", 1:5)
)
#> Will cross-validate 12 models. This requires fitting 300 model instances.

cv_5
#> # A tibble: 12 × 26
#> Fixed Overa…¹ Balan…² F1 Sensi…³ Speci…⁴ Pos P…⁵ Neg P…⁶ Kappa
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Age+Height 0.28 0.52 0.271 0.28 0.76 0.271 0.761 0.0364
#> 2 Age+Height 0.257 0.504 0.251 0.257 0.752 0.254 0.753 0.00746
#> 3 Age+Height 0.293 0.529 0.280 0.293 0.764 0.283 0.766 0.0528
#> 4 Age+Height 0.23 0.487 0.225 0.23 0.743 0.231 0.742 -0.0249
#> 5 Age+Height 0.303 0.536 0.294 0.303 0.768 0.298 0.769 0.0682
#> 6 Age+Height 0.24 0.493 0.238 0.24 0.747 0.243 0.746 -0.0109
#> 7 Age+Height+… 0.403 0.602 0.393 0.403 0.801 0.412 0.804 0.200
#> 8 Age+Height+… 0.367 0.578 NaN 0.367 0.789 0.341 0.793 0.144
#> 9 Age+Height+… 0.4 0.6 0.389 0.4 0.8 0.393 0.803 0.193
#> 10 Age+Height+… 0.38 0.587 0.364 0.38 0.793 0.371 0.796 0.166
#> 11 Age+Height+… 0.423 0.616 0.410 0.423 0.808 0.433 0.811 0.225
#> 12 Age+Height+… 0.363 0.576 0.348 0.363 0.788 0.345 0.790 0.143
#> # … with 17 more variables: MCC <dbl>, `Detection Rate` <dbl>,
#> # `Detection Prevalence` <dbl>, Prevalence <dbl>, Predictions <list>,
#> # `Confusion Matrix` <list>, Results <list>, `Class Level Results` <list>,
#> # Coefficients <list>, Folds <int>, `Fold Columns` <int>,
#> # `Convergence Warnings` <int>, `Other Warnings` <int>,
#> # `Warnings and Messages` <list>, Process <list>, HParams <list<tibble[,2]>>,
#> # Dependent <chr>, and abbreviated variable names ¹`Overall Accuracy`, …
Let’s order the results by the Balanced Accuracy metric and extract the formulas and hyperparameters:
cv_5 %>%
  dplyr::mutate(`Model ID` = 1:nrow(cv_5)) %>%
  dplyr::arrange(dplyr::desc(`Balanced Accuracy`)) %>%
  select_definitions(additional_includes = c(
    "Balanced Accuracy", "F1", "Model ID")) %>%
  kable()
Dependent | Fixed | kernel | cost | Balanced Accuracy | F1 | Model ID |
---|---|---|---|---|---|---|
Class | Age+Height+Bass+Guitar+Keys+Vocals | linear | 10 | 0.6155556 | 0.4100794 | 11 |
Class | Age+Height+Bass+Guitar+Keys+Vocals | linear | 1 | 0.6022222 | 0.3925407 | 7 |
Class | Age+Height+Bass+Guitar+Keys+Vocals | linear | 5 | 0.6000000 | 0.3886541 | 9 |
Class | Age+Height+Bass+Guitar+Keys+Vocals | radial | 5 | 0.5866667 | 0.3642633 | 10 |
Class | Age+Height+Bass+Guitar+Keys+Vocals | radial | 1 | 0.5777778 | NaN | 8 |
Class | Age+Height+Bass+Guitar+Keys+Vocals | radial | 10 | 0.5755556 | 0.3479595 | 12 |
Class | Age+Height | linear | 10 | 0.5355556 | 0.2940289 | 5 |
Class | Age+Height | linear | 5 | 0.5288889 | 0.2799472 | 3 |
Class | Age+Height | linear | 1 | 0.5200000 | 0.2706326 | 1 |
Class | Age+Height | radial | 1 | 0.5044444 | 0.2508571 | 2 |
Class | Age+Height | radial | 10 | 0.4933333 | 0.2383805 | 6 |
Class | Age+Height | radial | 5 | 0.4866667 | 0.2245827 | 4 |
In multinomial evaluation, we perform one-vs-all evaluations and average them (macro metrics). These evaluations are stored in the output as Class Level Results:
# Extract Class Level Results for the best model
cv_5$`Class Level Results`[[11]]
#> # A tibble: 4 × 14
#> Class Balanced …¹ F1 Sensi…² Speci…³ Pos P…⁴ Neg P…⁵ Kappa Detec…⁶ Detec…⁷
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 A 0.664 0.493 0.493 0.836 0.525 0.833 0.330 0.123 0.247
#> 2 B 0.62 0.413 0.4 0.84 0.471 0.810 0.245 0.1 0.22
#> 3 C 0.662 0.494 0.587 0.738 0.428 0.843 0.289 0.147 0.343
#> 4 D 0.516 0.240 0.213 0.818 0.306 0.757 0.0360 0.0533 0.19
#> # … with 4 more variables: Prevalence <dbl>, Support <int>,
#> # Results <named list>, `Confusion Matrix` <named list>, and abbreviated
#> # variable names ¹`Balanced Accuracy`, ²Sensitivity, ³Specificity,
#> # ⁴`Pos Pred Value`, ⁵`Neg Pred Value`, ⁶`Detection Rate`,
#> # ⁷`Detection Prevalence`
We also have the fold results:
# Extract fold results for the best model
cv_5$Results[[11]]
#> # A tibble: 5 × 13
#> Fold Colum…¹ Overa…² Balan…³ F1 Sensi…⁴ Speci…⁵ Pos P…⁶ Neg P…⁷ Kappa MCC
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 .folds_1 0.4 0.6 0.378 0.4 0.8 0.367 0.804 0.185 0.203
#> 2 .folds_2 0.383 0.589 0.377 0.383 0.794 0.374 0.796 0.173 0.178
#> 3 .folds_3 0.467 0.644 0.448 0.467 0.822 0.488 0.826 0.281 0.297
#> 4 .folds_4 0.383 0.589 0.378 0.383 0.794 0.418 0.796 0.180 0.183
#> 5 .folds_5 0.483 0.656 0.470 0.483 0.828 0.516 0.832 0.307 0.321
#> # … with 3 more variables: `Detection Rate` <dbl>,
#> # `Detection Prevalence` <dbl>, Prevalence <dbl>, and abbreviated variable
#> # names ¹`Fold Column`, ²`Overall Accuracy`, ³`Balanced Accuracy`,
#> # ⁴Sensitivity, ⁵Specificity, ⁶`Pos Pred Value`, ⁷`Neg Pred Value`
And a set of multiclass confusion matrices (one per fold column):
# Extract multiclass confusion matrices for the best model
# One per fold column
cv_5$`Confusion Matrix`[[11]]
#> # A tibble: 80 × 4
#> `Fold Column` Prediction Target N
#> <chr> <chr> <chr> <int>
#> 1 .folds_1 A A 9
#> 2 .folds_1 B A 4
#> 3 .folds_1 C A 1
#> 4 .folds_1 D A 1
#> 5 .folds_1 A B 4
#> 6 .folds_1 B B 6
#> 7 .folds_1 C B 3
#> 8 .folds_1 D B 2
#> 9 .folds_1 A C 2
#> 10 .folds_1 B C 0
#> # … with 70 more rows
We can add these together (or average them) and plot the result:
# Sum the fold column confusion matrices
# to one overall confusion matrix
overall_confusion_matrix <- cv_5$`Confusion Matrix`[[11]] %>%
  dplyr::group_by(Prediction, Target) %>%
  dplyr::summarise(N = sum(N))
#> `summarise()` has grouped output by 'Prediction'. You can override using the
#> `.groups` argument.

overall_confusion_matrix %>% kable()
Prediction | Target | N |
---|---|---|
A | A | 37 |
A | B | 15 |
A | C | 9 |
A | D | 13 |
B | A | 15 |
B | B | 30 |
B | C | 7 |
B | D | 14 |
C | A | 15 |
C | B | 12 |
C | C | 44 |
C | D | 32 |
D | A | 8 |
D | B | 18 |
D | C | 15 |
D | D | 16 |
# Plot the overall confusion matrix
plot_confusion_matrix(overall_confusion_matrix, add_sums = TRUE)
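If we preferred averaging the counts across the five fold columns instead of summing them (both options were mentioned above), a minimal sketch of that variant:

# Average (instead of sum) the counts across the fold columns
average_confusion_matrix <- cv_5$`Confusion Matrix`[[11]] %>%
  dplyr::group_by(Prediction, Target) %>%
  dplyr::summarise(N = mean(N), .groups = "drop")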
This concludes the vignette. If elements are unclear or you need help converting your model, you can leave feedback by mail or in a GitHub issue :-)