library(ldt)
It is recommended to read the following vignettes first:
In this vignette, we discuss fraud detection. The data in this example was published as part of a competition in which the winning model achieved an AUC of 0.945884 (see Vesta Corporation (2018)). We do not participate in that competition, use its specific test sample, or compare the performance of discrete choice modeling with machine learning approaches; that is beyond the scope of this vignette (see, e.g., Clarke, Fokoue, and Zhang (2009) for a discussion). Instead, we use this large data set to discuss the performance of ldt when the data is big.
We use the data of Vesta Corporation (2018), loaded with the Data_VestaFraud() function:
vestadata <- Data_VestaFraud(training = TRUE)
This data set contains two samples: train and test. We use the training sample in this vignette. Each observation is labeled as fraud or not fraud. There are 393 features in the files. Furthermore, each observation has an identifier that links a part of the observations to another data file with 40 identity-related features. The combined data set has 476 features and 281 million data points, of which 46.1% are NA.
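As a rough illustration of how a figure such as the NA share can be obtained, the following base R sketch counts the cells of a table and the proportion that are missing. The data frame `df_toy` is hypothetical and is not part of the Vesta data.

```r
# Toy table (hypothetical), standing in for the combined data set
df_toy <- data.frame(
  a = c(1, NA, 3, 4),
  b = c(NA, NA, "x", "y"),
  c = c(TRUE, FALSE, NA, TRUE)
)

# Number of data points = rows times columns
n_cells <- nrow(df_toy) * ncol(df_toy)

# is.na() on a data frame returns a logical matrix, so mean() gives the NA share
na_share <- mean(is.na(df_toy))

# Share of missing values per feature
na_by_column <- colMeans(is.na(df_toy))

print(n_cells)
print(na_share)
print(na_by_column)
```

The same two calls could be applied to vestadata$data, although on a table of this size they take noticeably longer.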
The dependent variable, the potential exogenous variables, and the observation weights are:
y <- as.matrix(vestadata$data[, c("isFraud")])
x <- as.matrix(vestadata$data[, 3:length(vestadata$data)])
weight <- as.numeric((y == 1) * (nrow(y) / sum(y == 1)) + (y == 0))
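The weight expression above gives every fraud observation a weight equal to the number of observations divided by the number of fraud cases, and every non-fraud observation a weight of 1. The following sketch applies the same expression to a small label vector; `y_toy` and `w_toy` are hypothetical names used only for illustration.

```r
# Toy 0/1 labels (hypothetical): 2 positives out of 10 observations
y_toy <- matrix(c(1, 0, 0, 0, 0, 0, 0, 1, 0, 0), ncol = 1)

# Same expression as in the vignette, applied to the toy labels
w_toy <- as.numeric(
  (y_toy == 1) * (nrow(y_toy) / sum(y_toy == 1)) + (y_toy == 0)
)

# Positives receive 10 / 2 = 5, negatives receive 1
print(w_toy)

# Total weight carried by each class
print(tapply(w_toy, as.numeric(y_toy), sum))
```

With this scheme the rare class is up-weighted, so its total weight is no longer dominated by the majority class.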
Because the data set is large, we change the default optimization options to speed up the calculations:
optimOptions <- GetNewtonOptions(maxIterations = 10, functionTol = 1e-2)
We also choose to search a small subset of the model set:
xSizes <- list(c(1), c(2), c(3), c(4:10))
xCounts <- c(NA, 20, 15, 10)
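To give a sense of why only a subset of the model set is searched, the sketch below counts regressor combinations. It rests on two assumptions that are not stated in this vignette: that there are 474 candidate regressors (columns 3 onward of a 476-column table), and that each entry of the counts vector caps the number of variables considered in the corresponding step, with NA meaning all of them. The names `n_all`, `x_sizes`, and `x_counts` are hypothetical.

```r
# Assumed number of candidate regressors (hypothetical)
n_all <- 474

# Number of models of each size from 1 to 10 if every combination were tried
full_counts <- choose(n_all, 1:10)

# Stepwise search: model sizes per step and the assumed cap on variables
x_sizes <- list(1, 2, 3, 4:10)
x_counts <- c(NA, 20, 15, 10)
n_step <- ifelse(is.na(x_counts), n_all, x_counts)

# Number of models estimated in each step under the assumed interpretation
step_counts <- sapply(
  seq_along(x_sizes),
  function(i) sum(choose(n_step[i], x_sizes[[i]]))
)

print(full_counts[2])    # all pairs of regressors
print(step_counts)       # models per step
print(sum(step_counts))  # total models in the reduced search
```

Under these assumptions the reduced search estimates a few thousand models, whereas the unrestricted set already contains more than a hundred thousand models of size two.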
And, a relatively small out-of-sample simulation:
simFixSize <- 4
And finally, we search for the best model:
vestaRes <- DcSearch_s(
  x = x, y = y, w = weight,
  xSizes = xSizes, counts = xCounts,
  optimOptions = optimOptions,
  searchItems = GetSearchItems(bestK = 20),
  modelCheckItems = GetModelCheckItems(
    maxConditionNumber = 1e15, minDof = 1e5, minOutSim = simFixSize / 2
  ),
  measureOptions = GetMeasureOptions(
    typesIn = c("aucIn"),
    typesOut = c("aucOut"),
    simFixSize = 4,
    trainRatio = 0.9,
    seed = 340
  ),
  searchOptions = GetSearchOptions(printMsg = FALSE),
  printMsg = FALSE,
  savePre = "data/dc_vesta_"
)
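The measure options above ask for 4 out-of-sample simulations with a training ratio of 0.9 and a fixed seed. The following base R sketch shows one common way to draw such repeated random splits. It is an illustration only and is not claimed to reproduce how ldt draws its samples internally; `n_obs`, `train_ratio`, `n_sim`, and `splits` are hypothetical names.

```r
set.seed(340)        # a fixed seed makes the splits reproducible

n_obs <- 1000        # toy sample size (hypothetical)
train_ratio <- 0.9   # share of observations used for estimation
n_sim <- 4           # number of out-of-sample simulations

n_train <- round(train_ratio * n_obs)

# Each simulation draws a fresh training sample; the rest is the test sample
splits <- lapply(seq_len(n_sim), function(i) {
  train_idx <- sort(sample.int(n_obs, size = n_train))
  list(train = train_idx, test = setdiff(seq_len(n_obs), train_idx))
})

# Sizes of the training and test samples in each simulation
print(sapply(splits, function(s) c(train = length(s$train), test = length(s$test))))
```

An out-of-sample measure such as AUC would then be computed on each test sample and summarized across the simulations.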
We can get the in-sample and out-of-sample AUC by the following code:
print(paste0("Best In-Sample AUC: ", vestaRes$aucIn$target1$model$bests$best1$weight))
print(paste0("Best Out-Of-Sample AUC: ", vestaRes$aucOut$target1$model$bests$best1$weight))
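For readers unfamiliar with the measure, AUC can be read as the probability that a randomly chosen fraud case receives a higher score than a randomly chosen non-fraud case. The sketch below computes an unweighted AUC from ranks on toy data. The function `auc_rank` is hypothetical and is not part of ldt, whose weighted computation may differ.

```r
# Unweighted AUC via the rank-sum (Mann-Whitney) formula.
# score: predicted scores; label: 0/1 outcomes.
auc_rank <- function(score, label) {
  r <- rank(score)  # ties receive average ranks
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy labels and scores (hypothetical)
label_toy <- c(0, 0, 1, 0, 1, 1)
score_toy <- c(0.1, 0.4, 0.35, 0.2, 0.8, 0.7)

# 8 of the 9 positive-negative pairs are ordered correctly
auc_toy <- auc_rank(score_toy, label_toy)
print(auc_toy)
```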
Note that the code in this vignette is not evaluated because the data set is large.
The computations are relatively time-consuming even if we choose a small subset of the model set. One might be able to improve the current performance by making better use of the categorical features (e.g., by studying them and grouping some items). Furthermore, on the development side, any improvement in the speed of the calculations would allow us to search a larger proportion of the data set. In the presence of categorical variables, a more efficient algorithm for handling dummy variables or sparse matrices might be helpful.
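The idea of grouping items in a categorical feature can be sketched in base R as follows. The feature values and the threshold `min_count` are hypothetical; the sketch only shows that merging rare levels reduces the number of dummy columns, and that most entries of a dummy matrix are zero, which is why a sparse representation might help.

```r
# Toy categorical feature (hypothetical values)
f <- factor(c("visa", "visa", "mastercard", "amex",
              "visa", "discover", "mastercard", "visa"))

# Levels observed fewer than min_count times are merged into "other"
min_count <- 2
tab <- table(f)
rare <- names(tab)[tab < min_count]

f_grouped <- as.character(f)
f_grouped[f_grouped %in% rare] <- "other"
f_grouped <- factor(f_grouped)

# One dummy column per remaining level (no intercept)
dummies <- model.matrix(~ f_grouped - 1)

# Share of zero entries in the dummy matrix
zero_share <- mean(dummies == 0)

print(nlevels(f))          # levels before grouping
print(nlevels(f_grouped))  # levels after grouping
print(zero_share)
```

With many levels, the zero share approaches 1, so storing only the nonzero entries would save both memory and computation.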