DC: Credit card fraud detection

library(ldt)

It is recommended to read the following vignettes first:

Introduction

In this vignette, we discuss fraud detection. The data in this example was published as part of a competition in which the winning model achieved an AUC of 0.945884 (see Vesta Corporation (2018)). We do not take part in that competition, use its specific test sample, or compare the performance of discrete choice modeling with that of machine learning approaches; this is beyond the scope of this vignette (see, e.g., Clarke, Fokoue, and Zhang (2009) for a discussion). Instead, we use this large data set to discuss the performance of ldt when the data is big.

Data

We use the data of Vesta Corporation (2018) and the Data_VestaFraud() function to get the required data:

vestadata <- Data_VestaFraud(training = TRUE)

This data set contains two samples: train and test. We use the training sample in this vignette. Each observation is labeled as fraud or not fraud. There are 393 features in the files. Furthermore, each observation has an identifier that can link some of the observations to another data file with 40 identity-related features. The combined data set has 476 features and 281 million data points, of which 46.1% are NA.
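The linking step and the NA share can be illustrated on a toy example. The following is a minimal sketch in base R, not the code used by Data_VestaFraud(); the two small data frames and their column names (id, amount, device) are hypothetical stand-ins for the transaction and identity files.

```r
# Toy stand-ins for the transaction and identity files (hypothetical values).
transactions <- data.frame(
  id      = 1:5,
  isFraud = c(0, 0, 1, 0, 0),
  amount  = c(10.5, 99.0, NA, 42.0, 7.25)
)
identity <- data.frame(
  id     = c(2L, 3L),
  device = c("mobile", "desktop")
)

# Left join: keep every transaction and attach identity features where available.
# Transactions without an identity record get NA in the identity columns.
combined <- merge(transactions, identity, by = "id", all.x = TRUE)

# Total number of data points and the share of them that is missing.
nPoints <- nrow(combined) * ncol(combined)
naShare <- mean(is.na(combined))

print(dim(combined))
print(nPoints)
print(naShare)
```

Because only part of the observations have an identity record, the join itself is one source of missing values in the combined data.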

The dependent variable and the potential exogenous variables are:

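# dependent variable: the fraud label
y <- as.matrix(vestadata$data[, c("isFraud")])
# potential exogenous variables: all columns from the third one onward
x <- as.matrix(vestadata$data[, 3:length(vestadata$data)])
# weights: n / (number of fraud cases) for fraud cases, 1 otherwise
weight <- as.numeric((y == 1) * (nrow(y) / sum(y == 1)) + (y == 0))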
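To see what the weight formula does, it can be applied to a small label vector. This is a sketch with hypothetical labels (2 fraud cases among 10 observations); only the weight expression is taken from the code above.

```r
# Hypothetical labels: 2 fraud cases among 10 observations.
y <- matrix(c(0, 0, 1, 0, 0, 0, 1, 0, 0, 0), ncol = 1)

# Same formula as in the vignette: each fraud case gets weight
# n / (number of fraud cases), each non-fraud case keeps weight 1.
weight <- as.numeric((y == 1) * (nrow(y) / sum(y == 1)) + (y == 0))

print(weight)

# Total weight carried by each class.
print(tapply(weight, y[, 1], sum))
```

With this scheme the rare fraud class as a whole carries a total weight equal to the sample size, so it is not dominated by the much larger non-fraud class during estimation.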

Estimation

Because the data set is large, we change the default optimization options to speed up the calculations:

optimOptions <- GetNewtonOptions(maxIterations = 10, functionTol = 1e-2)
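The trade-off behind these two options can be sketched with a generic Newton-Raphson routine for a logit model. This is not ldt's implementation: the function newtonLogit is hypothetical, and the stopping rule (absolute change in the log-likelihood below functionTol) is an assumption about what such a tolerance typically means.

```r
# Generic Newton-Raphson for a logit model (illustrative; not ldt's code).
newtonLogit <- function(x, y, maxIterations = 10, functionTol = 1e-2) {
  X <- cbind(1, x)                       # add an intercept column
  beta <- rep(0, ncol(X))
  logLik <- function(b) {
    eta <- as.numeric(X %*% b)
    sum(y * eta - log1p(exp(eta)))
  }
  previous <- logLik(beta)
  current <- previous
  iterations <- 0
  for (i in seq_len(maxIterations)) {
    p <- as.numeric(plogis(X %*% beta))
    gradient <- crossprod(X, y - p)
    information <- crossprod(X, X * (p * (1 - p)))  # negative Hessian
    beta <- beta + as.numeric(solve(information, gradient))
    current <- logLik(beta)
    iterations <- i
    # Assumed stopping rule: small change in the objective function.
    if (abs(current - previous) < functionTol) break
    previous <- current
  }
  list(coef = beta, logLik = current, iterations = iterations)
}

# Simulated data (hypothetical coefficients).
set.seed(1)
x <- matrix(rnorm(2000), ncol = 2)
y <- rbinom(1000, 1, plogis(-1 + x[, 1] - 0.5 * x[, 2]))

loose <- newtonLogit(x, y, maxIterations = 10, functionTol = 1e-2)
tight <- newtonLogit(x, y, maxIterations = 100, functionTol = 1e-10)

print(loose$iterations)
print(tight$iterations)
print(max(abs(loose$coef - tight$coef)))
```

A looser tolerance and a lower iteration cap stop the optimizer earlier. When a search estimates a very large number of models, a small loss of precision per model can be an acceptable price for the time saved.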

We also choose to search a small subset of the model set:

xSizes <- list(c(1), c(2), c(3), c(4:10))
xCounts <- c(NA, 20, 15, 10)
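A rough count shows why such a restriction matters. The sketch below assumes that each entry of xCounts is the number of candidate variables available in the corresponding step, with NA meaning all of them; this reading of the arguments, as well as the number of regressors p, is an assumption made for illustration.

```r
p <- 400   # hypothetical number of candidate regressors
xSizes <- list(c(1), c(2), c(3), c(4:10))
xCounts <- c(NA, 20, 15, 10)

# Number of candidate models per step: choose k regressors out of the
# variables available in that step (all p of them when the count is NA).
available <- ifelse(is.na(xCounts), p, xCounts)
stepwise <- mapply(function(sizes, n) sum(choose(n, sizes)), xSizes, available)

# Number of models of the same sizes when no restriction is applied.
unrestricted <- sapply(xSizes, function(sizes) sum(choose(p, sizes)))

print(data.frame(step = seq_along(xSizes), stepwise, unrestricted))
```

Under these assumptions the restricted search estimates at most a few hundred models per step, while the unrestricted number of combinations grows combinatorially with the model size.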

We also use a relatively small out-of-sample simulation:

simFixSize <- 4

Finally, we search for the best model:


vestaRes <- DcSearch_s(
  x = x, y = y, w = weight,
  xSizes = xSizes, counts = xCounts,
  optimOptions = optimOptions,
  searchItems = GetSearchItems(bestK = 20),
  modelCheckItems = GetModelCheckItems(
    maxConditionNumber = 1e15, minDof = 1e5, minOutSim = simFixSize / 2
  ),
  measureOptions = GetMeasureOptions(
    typesIn = c("aucIn"),
    typesOut = c("aucOut"),
    simFixSize = simFixSize,
    trainRatio = 0.9,
    seed = 340
  ),
  searchOptions = GetSearchOptions(printMsg = FALSE),
  printMsg = FALSE,
  savePre = "data/dc_vesta_"
)

We can get the in-sample and out-of-sample AUC by the following code:

print(paste0("Best In-Sample AUC:     ", vestaRes$aucIn$target1$model$bests$best1$weight))
print(paste0("Best Out-Of-Sample AUC: ", vestaRes$aucOut$target1$model$bests$best1$weight))

Note that the code in this vignette is not evaluated because the data set is large.
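To make the out-of-sample AUC concrete, the following sketch repeats a random train/test split a few times on simulated data and averages the AUC on the held-out part. It uses base R glm() rather than ldt, the data are simulated, and the helper auc() is hypothetical; the way ldt draws its samples and aggregates the results may differ.

```r
# Rank-based AUC: the probability that a randomly chosen positive case
# receives a higher score than a randomly chosen negative case.
auc <- function(score, label) {
  r <- rank(score)
  nPos <- sum(label == 1)
  nNeg <- sum(label == 0)
  (sum(r[label == 1]) - nPos * (nPos + 1) / 2) / (nPos * nNeg)
}

# Simulated data (hypothetical coefficients).
set.seed(340)
n <- 2000
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- rbinom(n, 1, plogis(-2 + x1 + 0.5 * x2))
toy <- data.frame(y, x1, x2)

simFixSize <- 4      # number of out-of-sample rounds
trainRatio <- 0.9    # share of observations used for estimation
aucOut <- numeric(simFixSize)
for (i in seq_len(simFixSize)) {
  trainIdx <- sample(n, size = floor(trainRatio * n))
  fit <- glm(y ~ x1 + x2, family = binomial(), data = toy[trainIdx, ])
  score <- predict(fit, newdata = toy[-trainIdx, ], type = "response")
  aucOut[i] <- auc(score, toy$y[-trainIdx])
}

print(aucOut)
print(mean(aucOut))
```

With only four rounds the average is a noisy estimate, which is the price of keeping the simulation small on a data set of this size.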

Discussion

The computations are relatively time-consuming even when we choose a small subset of the model set. One might be able to improve on the current performance by making better use of the categorical features (e.g., by studying them and grouping some of their items). Furthermore, from a development perspective, any improvement in the speed of the calculations allows us to search a larger proportion of the data set. In the presence of categorical variables, a more efficient algorithm for dealing with dummy variables or sparse matrices might be helpful.
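The idea of grouping items of a categorical feature can be sketched as follows. The feature, its levels, and the 5% threshold are all hypothetical; the example only shows how lumping rare levels reduces the number of dummy columns and how sparse the dummy matrix is.

```r
# Hypothetical categorical feature with two frequent and eight rare levels.
set.seed(1)
levelsAll <- c("visa", "mastercard", paste0("rare", 1:8))
card <- factor(
  sample(levelsAll, 1000, replace = TRUE, prob = c(0.6, 0.32, rep(0.01, 8))),
  levels = levelsAll
)

# Lump every level observed in less than 5% of the rows into "other".
share <- table(card) / length(card)
rare <- names(share)[share < 0.05]
grouped <- as.character(card)
grouped[grouped %in% rare] <- "other"
grouped <- factor(grouped)

# Dummy (one-hot) expansion before and after grouping.
dummiesRaw <- model.matrix(~ card - 1)
dummiesGrouped <- model.matrix(~ grouped - 1)

print(ncol(dummiesRaw))
print(ncol(dummiesGrouped))

# Share of zero entries in the raw dummy matrix.
print(mean(dummiesRaw == 0))
```

Each row of a dummy matrix contains a single 1, so most entries are zero; this is the structure that a sparse matrix representation would exploit.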

References

Clarke, Bertrand, Ernest Fokoue, and Hao Helen Zhang. 2009. Principles and Theory for Data Mining and Machine Learning. Springer New York, NY. https://doi.org/10.1007/978-0-387-98135-2.
Vesta Corporation. 2018. “IEEE-CIS Fraud Detection.”