library(ldt)
library(kableExtra)
library(MASS)
It is recommended to read the following vignette first:
In this vignette, we draw a sample from a known population and talk about the automatic variable selection in ldt (In another vignette we use an actual data set with a similar goal). In this vignette, we focus on a simulation problem with two endogenous variables and the following DGP: \[\begin{align} \label{eq:sur-sim-model} \begin{pmatrix}y_1\\y_2\end{pmatrix}=& \begin{bmatrix} 1 & 2 & 0\\ 0 & 3 & 4 \end{bmatrix}\begin{pmatrix}x_1\\x_2\\x_3\end{pmatrix} + \begin{pmatrix}e_1\\e_2\end{pmatrix}\\ &\begin{pmatrix}e_1\\e_2\end{pmatrix}=s \begin{bmatrix} 2 & 1\\ 0 & 2 \end{bmatrix}\begin{pmatrix}\epsilon_1\\\epsilon_2\end{pmatrix},\quad \epsilon_i\sim N(0,1),\\ &x_i\sim N(\frac{i}{3},1), \end{align}\]
where \(y_1\) and \(y_2\) are endogenous variables, \(x_i\) for \(i=1,\dots,3\) are exogenous variables, \(s\) is a constant parameter, \(\epsilon_1\) and \(\epsilon_2\) are independent, and \(e_1\) and \(e_2\) are disturbances. It is easy to see that \(\operatorname{var}(e_1)=5s^2\), \(\operatorname{var}(e_2)=4s^2\), and \(\operatorname{cov}(e_1,e_2)=2s^2\). Also, note that the model is restricted and each dependent variable is explained by 2 variables.
Furthermore, we assume that there is a sample of size \(n\) for \(y_1\) and \(y_2\) and 18 other variables denoted by \(z_i\), where, \[\begin{equation} z_i\sim N(\frac{i}{18},1),\quad \text{ for } i=1,\dots,18. \end{equation}\] We assume that \(\{z_i\}\) is the set of the potential regressors. Note that \(z_{6}\), \(z_{12}\) and \(z_{18}\) are respectively equal in distribution to \(x_1\), \(x_2\) and \(x_3\) (the means are \(1/3\), \(0.5\) and \(1\), respectively).
In this example, we focus on the ability to select the true random variables and to estimate the parameters properly, in some scenarios on the \(s\) and \(n\):
<- c(30, 50)
sampleSizes <- c(0.1, 2) sMultipliers
Note that a larger value for \(s\)
(element of sMultipliers
) means more dispersed and more
correlated disturbances and a larger value for \(n\) (element of sampleSizes
)
means more degrees of freedom.
We assume that the number of endogenous variables is known.
Furthermore, to decrease the size of the model set, we assume that we
know that the number of regressors is not larger than 3. In other words,
we will use the following data (If you are not familiar with the
arguments of the SurSearch_s()
function, please read this vignette first):
<- list(c(1, 2), c(3))
xSizes <- c(NA, 5)
xCounts <- list(c(1, 2)) yGroups
The following function can draw a sample from the described population:
<- function(n, s) {
GetSample <- NULL
z for (i in (1:18)) {
<- cbind(z, rnorm(n, i / 18, 1))
z
}<- as.matrix(z)
z <- z[, c(6, 12, 18)]
x <- as.matrix(mvrnorm(n, c(0, 0), s * s * matrix(c(5, 2, 2, 4), 2, 2)))
e <- x %*% t(matrix(c(1, 2, 0, 0, 3, 4), 2, 3, byrow = TRUE)) + e
y colnames(y) <- c("y1", "y2")
colnames(z) <- paste0("z", seq(1, 18, 1))
return(list(y = y, z = z))
}
Note that with x = z[,c(6,12,18)]
we are assuming
equality of random variables (not just equality in distribution).
In each simulation, a sample is drawn randomly. Therefore, we need a seed for the random number generator:
<- 340
seed set.seed(seed)
Given the sample, we use SurSearch_s()
function to find
the best model. This function will select one best mode for each of the
evaluation methods. We use the results for AIC, SIC, RMSE, and CRPS
measures:
<- GetMeasureOptions(
measureOptions typesIn = c("aic", "sic"),
typesOut = c("rmse", "crps")
)<- c("AIC", "SIC", "RMSE", "CRPS") measureNames
If you are not familiar with the arguments of the
SurSearch_s()
function, please read this vignette first. The
measureNames
vector is just a display name that we will use
later.
In order to compare the performance of different measures, we use the following scoring schema:
<- function(indices1, indices2) {
GetScore if (length(indices1) == 2 && length(indices2) == 2 &&
all(indices1 == c(6, 12)) && all(indices2 == c(12, 18))) {
return(1)
}<- 0
score if (any(indices1 == 6)) score <- score + 0.1
if (any(indices1 == 12)) score <- score + 0.1
if (any(indices2 == 12)) score <- score + 0.1
if (any(indices2 == 18)) score <- score + 0.1
return(score)
}
This schema uses the indices of the regressors in the (restricted)
best model and calculates the score. If the indices belong to the true
model (i.e., indices1=c(6,12)
and
indices2=c(12,18)
), the score is 1. otherwise each correct
index (i.e., 6 and 12 for the first equation and 12 and 18 for the
second equation) adds 0.1 points to the score. For example, this means
the maximum score for an incorrect model is 0.4. Such a model has all
the true explanatory variables and some irrelevant ones. The minimum
score is zero. This is the score of a model with no correct explanatory
variable.
The SurSearch_s()
function also reports the inclusion
weights. The previous function is used to get the score of such a list.
In this case, indices1
and indices2
are the
index of the first 2 explanatory variables in the inclusion list.
Before we continue, we need a helper function to get the required indices from the estimations:
<- function(measureRes, measureSumRes) {
helper_GetIndices <- measureSumRes$target1$model$bests$best1$coefs$isRestricted
isRest1 <- measureSumRes$target2$model$bests$best1$coefs$isRestricted
isRest2
if (is.null(isRest1)) {
<- sort(measureRes$target1$model$bests$best1$exoIndices)
mindices1 else {
} <- sort(measureRes$target1$model$bests$best1$exoIndices[isRest1[, 1] != 1])
mindices1
}if (is.null(isRest2)) {
<- sort(measureRes$target2$model$bests$best1$exoIndices)
mindices2 else {
} <- sort(measureRes$target2$model$bests$best1$exoIndices[isRest2[, 2] != 1])
mindices2
}
<- nrow(measureRes$target1$model$inclusion)
end <- sort(measureRes$target1$model$inclusion[3:end, 1, drop = F],
iindices1 index.return = TRUE, decreasing = TRUE
)<- sort(measureRes$target2$model$inclusion[3:end, 1, drop = F],
iindices2 index.return = TRUE, decreasing = TRUE
)
return(list(
model1 = sort(mindices1), model2 = sort(mindices2),
inclusion1 = sort(iindices1$ix[1:2]), inclusion2 = sort(iindices2$ix[1:2])
)) }
The central piece of this experiment is the following function in which we generate a sample, search for the best models for each performance measure, and calculate the score of the best models:
<- function(i, res, n, s, xSizes, xCounts, yGroups,
SampleAndEstimate
measureOptions, searchItems, checkItems) {<- GetSample(n, s)
sample <- SurSearch_s(
search_res y = sample$y, x = sample$z, numTargets = 2,
xSizes = xSizes, counts = xCounts,
yGroups = yGroups,
searchSigMaxIter = 1,
searchSigMaxProb = 0.1,
searchItems = searchItems,
measureOptions = measureOptions,
modelCheckItems = checkItems,
searchOptions = GetSearchOptions(printMsg = FALSE),
savePre = NULL
)
<- summary(search_res, y = sample$y, x = sample$z, test = FALSE, printMsg = FALSE)
result_sum if (i < 0) {
return(list(search_res = search_res, result_sum = result_sum))
}
<- 0
j for (mea in c("aic", "sic", "rmse", "crps")) {
<- j + 1
j <- helper_GetIndices(search_res[[mea]], result_sum[[mea]])
h $model[[i, j]] <- res$model[[i, j]] + GetScore(h$model1, h$model2)
res$inclusion[[i, j]] <- res$inclusion[[i, j]] + GetScore(h$inclusion1, h$inclusion2)
res
}return(res)
}
Finally, we run the simulation:
<- 2
simCount <- list()
sur_sim_res for (n in sampleSizes) {
<- list(model = matrix(0, length(sMultipliers), length(measureNames),
res dimnames = list(paste0("sigma=", sMultipliers), measureNames)
))$inclusion <- res$model
res
<- 0
i for (s in sMultipliers) {
<- i + 1
i for (iter in seq(1, simCount)) {
<- seed + iter
seedi $seed <- seedi
measureOptions$simFixSize <- 6
measureOptions$trainRatio <- 0.75
measureOptions<- GetSearchItems(model = TRUE, type1 = FALSE, bestK = 10, inclusion = TRUE)
searchItems <- SampleAndEstimate(
res
i, res, n, s, xSizes, xCounts, yGroups, measureOptions,NULL
searchItems,
)
}
}$model <- res$model / simCount
res$inclusion <- res$inclusion / simCount
reslength(sur_sim_res) + 1]] <- list(n = n, sMultipliers = sMultipliers, res = res)
sur_sim_res[[ }
You can increase the number of simulations (We choose a small number
to decrease the computation time). Note that for each
n
\(\in\)sampleSizes
and each
s
\(\in\)sMultipliers
, we sample
and estimate 2 times. This is relatively time-consuming.
Comparing the performance of different measures in selecting the true explanatory variables can be interesting. Therefore, in the following plot we draw the scores of the best models:
Scores in finding the true model (n=30, 50)
There is a similar result for the inclusion weights:
Scores in finding the true inclusion variables (n=30, 100, 100)
Apart from finding the correct variables, another important task is estimating the unknown parameters. ldt provides 4 types of information about a parameter: - estimated coefficient and its estimated variance; - the value of the cumulative distribution function at some specific points; - extreme bounds; and - first four moments of the combined distribution. We specify these options by the following code:
<- GetSearchItems(
searchItems model = FALSE, type1 = TRUE,
bestK = 1, cdfs = seq(0.8, 1.8, 0.03),
extremeMultiplier = 2, mixture4 = TRUE
)
In SurSearch
, type1
information is the
estimated parameters. Note that in order to set a proper value for
cdfs
, we need to run the code (at least) twice. This is
because we need to get some general idea about the range of values of
estimated coefficients (e.g., by studying the extreme bounds analysis
result).
Next, we select the values of \(n\) and \(s\):
<- 1
s <- 50 n
Next, we want to estimate two types of coefficients: an unrestricted search and a restricted one in which we discard a part of the model set with low. We assess the performance by using AIC:
<- GetModelCheckItems(maxAic = 10) checkItems
Note that we adjusted the value of maxAic
by running the
code multiple times. This is because, at the first stage, we have no
information about the magnitude of the AICs.
<- list(
sur_sim_one unr = SampleAndEstimate(
-1, NULL, n, s, c(3), c(NA), list(c(1, 2)), measureOptions,
searchItems, checkItems
),res = SampleAndEstimate(
-1, NULL, n, s, c(3), c(NA), list(c(1, 2)), measureOptions,
searchItems, checkItems
) )
The number of estimated regressions for the unrestricted and restricted model sets are 136 and 18, respectively.
In this example, we present the results for one of the estimated
parameters: \(\frac{\rho y_1}{\rho
x_1}\) in the following figure. Note that the actual value of
this parameter is 1.0 (see the equation in the first section). Also,
note that we use ldt::PlotCoefs()
and
ldt::GetGldFromMoments()
functions.
Estimated coefficient (searches with and without restricting AIC)