library(ldt)
library(kableExtra)
It is recommended to read the following vignettes first:
In ldt, we automatically explain or predict one or more random variables. In this vignette, we design a model set for predicting loan default, with a focus on comparing the performance of logit and probit models. We use AUC as a performance measure:
measureOptions <- GetMeasureOptions(
  typesIn = c("aucIn", "costMatrixIn"),
  typesOut = c("aucOut", "costMatrixOut")
)
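For readers who want a concrete picture of the AUC measure, the following is a minimal base-R sketch of the rank-based (Mann-Whitney) formula. The helper name `auc_rank` and the toy vectors are hypothetical, the sketch ignores observation weights, and it is not taken from the ldt implementation.

```r
# Unweighted AUC from the rank-sum (Mann-Whitney) identity.
# 'auc_rank' is a hypothetical helper for illustration, not an ldt function.
auc_rank <- function(actual, prob) {
  r <- rank(prob)                      # average ranks are used for ties
  n_pos <- sum(actual == 1)
  n_neg <- sum(actual == 0)
  (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy data (made up): two negatives and two positives
actual <- c(0, 0, 1, 1)
prob   <- c(0.1, 0.4, 0.35, 0.8)
auc_rank(actual, prob)  # 0.75
```

A value of 0.5 corresponds to random ranking and 1 to a perfect separation of the two classes.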
Note that we calculate both in-sample and out-of-sample AUC. We also calculate the error ratio by using the following cost matrix:
\[\begin{equation} \label{eq:error-ratio} \begin{bmatrix} 0.5 & 1 & 1\\ 1.0 & 0 & 0 \end{bmatrix} \end{equation}\]
which is:
costMatrix <- matrix(c(0.5, 1, 1, 0, 1, 0), 2, 3)
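Since `matrix()` fills its values column by column, the order of the numbers in the call can be confusing. The short check below builds the same matrix row by row and confirms that both versions agree. The names `cm` and `by_row` and the column labels are hypothetical and are only used for display.

```r
# Column-major fill, as in the vignette
cm <- matrix(c(0.5, 1, 1, 0, 1, 0), nrow = 2, ncol = 3)

# The same matrix written row by row, matching the displayed equation
by_row <- matrix(c(0.5, 1, 1,
                   1.0, 0, 0), nrow = 2, ncol = 3, byrow = TRUE)

all(cm == by_row)  # TRUE

# Hypothetical labels, added only to make the printout easier to read
colnames(cm) <- c("threshold", "cost_if_negative", "cost_if_positive")
print(cm)
```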
Note that this is not a favorable cost matrix in an actual application. One might want to define more thresholds or increase the costs in the third column.
What is a cost matrix in ldt? In the binary case, the general form of a cost matrix is:
\[\begin{equation} \begin{bmatrix} t_1 & c_{11} & c_{12}\\ t_2 & c_{21} & c_{22}\\ \vdots & \vdots&\vdots\\ t_n & c_{n1} & c_{n2}\\ \end{bmatrix} \end{equation}\]
In this presentation, \(t_i\) for \(i=1,\ldots,n\) is the threshold for the predicted probability. If the actual value is negative (i.e., \(y_i=0\)), the cost is determined by the second column. Otherwise (i.e., \(y_i=1\)), the cost is determined by the third column.
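To make the role of the thresholds more tangible, here is a rough sketch of how an average cost could be computed from such a matrix. The lookup rule is an assumption made for illustration: each observation is scored in the first row whose threshold is not exceeded by the probability that the model assigned to the observed outcome. The helper `average_cost` and the toy vectors are hypothetical, and ldt's internal computation may differ.

```r
# Illustration only: the lookup rule below is an assumption, not ldt's documented algorithm.
cost_matrix <- matrix(c(0.5, 1, 1, 0, 1, 0), 2, 3)

# Hypothetical helper that averages the cost over observations.
#   prob_actual: probability the model assigned to the observed outcome
#   actual:      observed outcome (0 = negative, 1 = positive)
average_cost <- function(prob_actual, actual, cost_matrix) {
  thresholds <- cost_matrix[, 1]
  costs <- vapply(seq_along(actual), function(i) {
    row <- which(prob_actual[i] <= thresholds)[1]  # first threshold not exceeded
    cost_matrix[row, 2 + actual[i]]                # 2nd column if y = 0, 3rd if y = 1
  }, numeric(1))
  mean(costs)
}

# Toy data (made up): two observations fall at or below 0.5 and cost 1 each
prob_actual <- c(0.9, 0.4, 0.7, 0.2)
actual      <- c(0, 1, 1, 0)
average_cost(prob_actual, actual, cost_matrix)  # 0.5
```

Under this assumed rule, the matrix used in this vignette charges a cost of 1 whenever the observed outcome received a probability of at most 0.5, so the average cost reads as a ratio of misclassified observations.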
In this vignette, we use the Berka and Sochorova (1993) data set and the ldt::Data_BerkaLoan() function to get a sample for the dependent variable and the potential predictors (or features):
data <- Data_BerkaLoan(positive = c("B", "D"), negative = c("A", "C"))
#data <- Data_BerkaLoan(positive = c("B"), negative = c("A"))
The data set has a loan table with 682 observations, each labeled as one of the following: finished without default (A), finished with default (B), running without default (C), or running with default (D). Each loan observation has an account identification that links it to information in other tables, such as the characteristics of the loan's account and its transactions. Furthermore, each account has a district identification that provides information about the demographic characteristics of the location of its branch. The combined table contains the label and the resulting features for each observation.
For this example, the finished and running classes without default are treated as negative observations, and the finished and running classes with default as positive observations. Note that the observations labeled running might introduce measurement errors, because their final outcome is not yet known. Without them, the table has fewer rows. If you do not want to use this part of the data, uncomment the second call in the previous chunk.
The dependent and potential exogenous variables are:
y <- data[, c("label"), drop = FALSE]
x <- data[, 4:ncol(data)]
Note that the first 2 columns of data are id and status.
We cannot load the full data in this vignette because it requires external data set files that are not shipped with this package. However, a part of the data set is saved in the ldt package, and we load it instead:
x <- ldt::vig_data$berka$x
y <- ldt::vig_data$berka$y
If you have downloaded the data set files, do not run this code. Since only about 22.3% of the observations are positive, we define and use the following weight vector to balance the data:
weight <- as.numeric((y == 1) * (nrow(y) / sum(y == 1)) + (y == 0))
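The weight formula is easier to read on a small example. In the sketch below, `y_toy` is a made-up label column (not the Berka data) with 2 positives among 8 observations. Each positive observation receives the weight 8 / 2 = 4, and each negative observation keeps the weight 1.

```r
# Made-up label column: 2 positives out of 8 observations
y_toy <- matrix(c(1, 0, 0, 0, 1, 0, 0, 0), ncol = 1)

# Same formula as in the vignette, applied to the toy labels
w_toy <- as.numeric((y_toy == 1) * (nrow(y_toy) / sum(y_toy == 1)) + (y_toy == 0))
w_toy  # 4 1 1 1 4 1 1 1
```

The rarer the positive class, the larger the weight that each positive observation receives.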
There are 10 potential predictors and the size of the potential predicting models is relatively large. We follow a step-wise search approach by defining the following two arguments for the DcSearch_s() function:
xSizes <- list(c(1, 2), c(3))
xCounts <- c(NA, 4)
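The following back-of-the-envelope count suggests why a step-wise search helps. It rests on an assumed reading of the two arguments: each element of `xSizes` lists the model sizes searched in one step, and the matching element of `xCounts` gives the number of predictors available in that step, with `NA` meaning all of them. The variable names `n_predictors` and `per_step` are hypothetical.

```r
# Assumed interpretation: sizes searched per step, and predictors available per step
n_predictors <- 10
xSizes  <- list(c(1, 2), c(3))
xCounts <- c(NA, 4)

# Number of predictor combinations in each step
per_step <- mapply(function(sizes, count) {
  pool <- if (is.na(count)) n_predictors else count  # NA means "use all predictors"
  sum(choose(pool, sizes))
}, xSizes, xCounts)

per_step                        # 55 4
sum(choose(n_predictors, 1:3))  # 175 combinations in an exhaustive search
```

Under this reading, the second step only examines models with 3 predictors built from 4 predictors, instead of all 120 combinations of 3 out of 10.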
We also need a seed for the RNG and some other options to define the out-of-sample prediction:
measureOptions$seed <- 340
measureOptions$simFixSize <- 10
measureOptions$trainRatio <- 0.75
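As a rough picture of what these options imply, the sketch below draws 10 random train/test splits with 75% of the observations used for training. It assumes that simFixSize is the number of simulations and trainRatio is the training share. The names `n_obs`, `n_sims`, `train_ratio`, and `splits` are hypothetical, the size 682 is the loan table size mentioned earlier, and ldt's own sampling and rounding may differ.

```r
# Illustration only: base-R sampling, not ldt's internal random number generator
set.seed(340)
n_obs       <- 682   # size of the loan table
n_sims      <- 10    # assumed meaning of simFixSize
train_ratio <- 0.75  # assumed meaning of trainRatio

splits <- lapply(seq_len(n_sims), function(s) {
  train_idx <- sample.int(n_obs, size = floor(train_ratio * n_obs))
  list(train = train_idx, test = setdiff(seq_len(n_obs), train_idx))
})

length(splits[[1]]$train)  # 511
length(splits[[1]]$test)   # 171
```

Each simulation estimates the model on the training part and evaluates the measures on the remaining part.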
Note that the out-of-sample simulation depends on random sampling. Finally, we start the search function:
berka_res <- list(
  logit = DcSearch_s(
    x = x, y = y, w = weight, costMatrices = list(costMatrix),
    xSizes = xSizes, counts = xCounts,
    searchLogit = TRUE, searchProbit = FALSE,
    searchItems = GetSearchItems(bestK = 20, inclusion = TRUE),
    measureOptions = measureOptions,
    searchOptions = GetSearchOptions(printMsg = FALSE),
    savePre = NULL
  ),
  probit = DcSearch_s(
    x = x, y = y, w = weight, costMatrices = list(costMatrix),
    xSizes = xSizes, counts = xCounts,
    searchLogit = FALSE, searchProbit = TRUE,
    searchItems = GetSearchItems(bestK = 20, inclusion = TRUE),
    measureOptions = measureOptions,
    searchOptions = GetSearchOptions(printMsg = FALSE),
    savePre = NULL
  )
)
Since we want to compare the performance of logit and probit models, we run two discrete choice searches. All other options are identical: one search uses searchLogit = TRUE and searchProbit = FALSE, and the other uses searchLogit = FALSE and searchProbit = TRUE.
The results are reported in the following plot:
Figure: Comparing the performance of the best logit model and the best probit model (cost matrix and AUC)