SUR: Determinants of long-run economic growth

library(ldt)
library(kableExtra)

Introduction

This vignette is an example that discusses the determinants of long-run GDP per capita growth in different economies. In a typical application of ldt, the goal is to automatically explain or automatically predict one or more random variables. This example is about the former.

We try to explain the long-run yearly growth rates of GDP per capita by using the long-run yearly growth rates of other variables. These potential explanatory variables (or potential regressors) are generally related to macroeconomics. More information is provided in the other sections.

What is the long-run yearly growth rate of a variable? Let \(x\) be a level variable for which a time-series sample \(\{x_t\}_{t=0}^{s}\) with yearly frequency is available. We define the (continuous) long-run yearly growth rate to be \(\log(x_s/x_0)/s\times 100\).
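
As a quick illustration (with a made-up series, independent of the WDI data used below), the following sketch computes this quantity directly; in the actual application the calculation is delegated to the ldt::LongrunGrowth() function, which also handles missing data points:

# hypothetical level series x_0, ..., x_s with yearly frequency (made-up numbers)
x_level <- c(100, 104, 109, 113, 120, 127)
s <- length(x_level) - 1

# continuous long-run yearly growth rate: log(x_s / x_0) / s * 100
log(x_level[length(x_level)] / x_level[1]) / s * 100   # about 4.78 percent per year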

One complication that arises is the endogeneity problem. To avoid it, one approach is to split the data and try to explain the growth after a specific point in time by using the information before that point. Due to the properties of the data (which are explained in the next section), we choose the following split year:

splitYear <- 2005

This means that the goal is to explain the long-run GDP per capita growth after 2005 by using the information before this year. By information before 2005, we mean the long-run growth rate of all available data, including the GDP per capita itself.

Data

One of the main ideas behind ldt is to minimize the user’s discretion. Therefore, like many other applications in ldt, this example is based on a data set and a set of rules that converts this data set to a list of potential regressors and/or predictors. Note that, apart from avoiding discretion, a rule-based approach to selecting the data is to be expected given the word automatically (see the previous section).

We use the World Development Indicators data set (World Bank 2022) and automatically create the table of potential regressors. For this particular data set, we use the ldt::Data_Wdi() function. Note that for replication, you should download the data set from the WDI website. See the documentation of the Data_Wdi() function for more information. Also note that the WDI data set is large and the Data_Wdi() function is generally time-consuming.

wdi_lr_2005 <- Data_Wdi(
  dirPath = path_to_wdi_directory,
  minYear = 1960, maxYear = splitYear,
  aggFunction = function(data, code, name, unit, definition, aggMethod) {
    # exclude series that are already percentages or growth rates
    isPerc <- unit == "%" || grepl(".ZG", code)
    if (isPerc) NA else LongrunGrowth(data, 20, 2, FALSE, TRUE, isPerc)
  },
  keepFunction = function(X) {
    # keep series with some variation and at least 50 non-missing observations
    var(X, na.rm = TRUE) > 1e-12 && sum(!is.na(X)) >= 50
  }
)

These lines of code aggregate the data between 1960 and 2005. The aggFunction argument defines the aggregation type. It uses different types of information (e.g., the unit or the code) and excludes some series from the analysis. In this implementation, if the unit of a series is a percentage, it is excluded from the analysis. Note that, in order to calculate the long-run growth rates, we use the ldt::LongrunGrowth() function. Since we are dealing with an unclean data set, this function helps us handle missing data points (see the documentation). Furthermore, the keepFunction argument excludes series with a relatively large number of NAs or a relatively small variance.
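
To see what the keepFunction rule does in isolation, consider three made-up columns: one that is constant, one with enough variation and non-missing values, and one with too many NAs:

keepFun <- function(X) {
  var(X, na.rm = TRUE) > 1e-12 && sum(!is.na(X)) >= 50
}

set.seed(1)
constant_col <- rep(2.5, 60)              # zero variance: dropped
varying_col  <- c(rnorm(55), rep(NA, 5))  # enough variation and observations: kept
sparse_col   <- c(rnorm(10), rep(NA, 50)) # too many NAs: dropped

sapply(list(constant_col, varying_col, sparse_col), keepFun)
  > [1] FALSE  TRUE FALSE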

As pointed out in the previous section, we need two data tables, one with data before 2005 and the other with data after this year. The dependent variable is the long-run GDP per capita growth after 2005. We use similar code to create it:

wdi_lr_2006_21 <- Data_Wdi(
  dirPath = path_to_wdi_directory,
  minYear = splitYear + 1, maxYear = 2021,
  aggFunction = function(data, code, name, unit, definition, aggMethod) {
    # exclude series that are already percentages or growth rates
    isPerc <- unit == "%" || grepl(".ZG", code)
    if (isPerc) NA else LongrunGrowth(data, 2, 4, FALSE, TRUE, isPerc)
  },
  keepFunction = function(X) {
    # keep series with some variation and at least 50 non-missing observations
    var(X, na.rm = TRUE) > 1e-12 && sum(!is.na(X)) >= 50
  }
)

What remains is cleaning and rearranging the data. We use the ldt::Data_WdiSearchFor() function to find the target variable:

serCode <- Data_WdiSearchFor(wdi_lr_2005$series, c("GDP", "per capita", "constant", "us"))[[1]]$code
y <- wdi_lr_2006_21$data[, which(colnames(wdi_lr_2006_21$data) %in% c(serCode)), drop = FALSE]
x <- as.data.frame(wdi_lr_2005$data)

We remove highly correlated columns and non-country observations:

x <- x[, which(sapply(x, function(v) abs(cor(y, v, use = "complete.obs")) < 0.999)), drop = FALSE]
y <- y[wdi_lr_2005$countries$isCountry, , drop = FALSE]
x <- x[wdi_lr_2005$countries$isCountry, , drop = FALSE]
# sanity check: the rows of x and y must refer to the same countries in the same order
if (any(rownames(x) != rownames(y))) {
  stop("Invalid operation.")
}

We rearrange the data such that the lag of the dependent variable is in the first column. Finally, we add an intercept as one of the potential regressors:

ind <- which(colnames(x) %in% c(serCode))
x <- cbind(x[, ind, drop = F], x[, 1:(ind - 1), drop = F], x[, ((ind + 1):ncol(x)), drop = F])
x <- cbind(matrix(rep(1, nrow(x)), nrow(x), dimnames = list(NULL, "Intercept")), x)

At this point, x is a table with countries in the rows and variables in the columns.

Model-set

In this example, there is one target variable and (originally) 571 potential regressors. Different combinations of the variables create different (potential) regression models. The set of all linear regression models is relatively large. For example, the following code approximates the size of such a set when the regressions have at most 6 explanatory variables:

n0 <- vig_data$wdi$nxcol_orig
n0Choose6 <- sum(sapply(1:6, function(i) choose(n0, i)))
print(prettyNum(n0Choose6, big.mark = ","))
  > [1] "4.738693e+13"

In other words, our generosity in defining new potential regressors is computationally (and exponentially) costly. We can restrict ourselves to small regression models or use theoretical information to exclude a part of the potential regressors from the analysis.
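
To get a sense of how fast this cost grows, the following sketch repeats the count for several maximum model sizes (using the same n0 as above):

# number of linear models with at most k potential regressors
sapply(c(2, 4, 6, 8), function(k) sum(sapply(1:k, function(i) choose(n0, i))))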

Which subset of the model set should we estimate?

Since ldt tries to be an atheoretical tool, we restrict the model set. The ldt::SurSearch_s() function implements a step-wise search algorithm. It estimates small regressions and, given a list of criteria, omits regressors with poor performance. In other words, larger models are not built from variables that show relatively low explanatory power in the smaller models.

The steps of the SurSearch_s() function are defined by a list of model sizes and a vector of counts:

xSizes <- list(c(2,3), c(4))
xCounts <- c(NA, 5)

This means that at the first step, we use all the data and estimate models with 2 or 3 potential regressors; next, we select the 5 best potential regressors and estimate models with 4 potential regressors.
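
For illustration only, a hypothetical three-step search could be specified in the same way (these values are not used below):

# step 1: all regressors, models with 1 or 2 regressors
# step 2: the 10 best regressors, models with 3 regressors
# step 3: the 5 best regressors, models with 4 or 5 regressors
xSizes_alt  <- list(c(1, 2), c(3), c(4, 5))
xCounts_alt <- c(NA, 10, 5)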

What do we mean by best potential regressors?

For a given regression model, ldt calculates two types of scores: in-sample and out-of-sample. The latter is calculated in a pseudo-out-of-sample simulation (e.g., by calculating the distance between the projected value and the actual one), and the former is calculated by using all the sample data (e.g., AIC of the regression). We select a set of options by the following code:

measureOptions <- GetMeasureOptions(
  typesIn = c("aic", "sic"), typesOut = c("rmse", "crps"),
  simFixSize = 60, trainRatio = 0.75, seed = 340
)

In other words, we are asking ldt to use AIC and SIC as the two in-sample criteria, and RMSE and CRPS as the two out-of-sample ones. There are 60 iterations in the simulation, and 340 is used as the seed. In each iteration, 75 percent of the sample is used for estimation and the rest is used for testing.
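
Conceptually, each out-of-sample iteration works like the following sketch, here written with a toy lm() regression on simulated data; it only mirrors the idea of the training/testing split and is not ldt's internal code:

set.seed(340)
n <- 80
z <- rnorm(n)
w <- 2 + 0.5 * z + rnorm(n)                 # toy dependent variable

train <- sample(n, size = round(0.75 * n))  # 75 percent of the sample for estimation
fit   <- lm(w ~ z, data = data.frame(w, z)[train, ])

pred <- predict(fit, newdata = data.frame(z = z[-train]))
sqrt(mean((w[-train] - pred)^2))            # out-of-sample RMSE of this iteration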

What type of information are we looking for?

Given a regression model, one might discuss the values of the estimated coefficients, their signs, or their statistical significance, or one might use the model for projection. When model uncertainty is present, one might also look for inclusion weights. While ldt provides a range of options, in this example we focus on the following:

searchItems <- GetSearchItems(model = TRUE, bestK = 50, inclusion = TRUE)

In other words, we are asking ldt to keep the 50 best models and also the inclusion weights.

How can we deal with high levels of multicollinearity?

When it comes to an automatic approach toward modeling, we might want to exclude regressions with high levels of multicollinearity, low degrees of freedom (in the presence of NAs), relatively large AIC, etc. ldt provides some options in this regard. In the current example, we use the following code:

checkItems <- GetModelCheckItems(maxConditionNumber = 1e15, minDof = 35, minOutSim = 50)

In other words, we are asking ldt to omit a regression with degrees of freedom less than 35, a condition number larger than \(10^{15}\), or fewer than 50 valid out-of-sample simulations. We hope that restricting the condition number can deal with the multicollinearity problem.
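
As a sketch of what the condition-number restriction guards against (this uses base R's kappa() on a made-up design matrix; ldt performs its own check internally):

set.seed(1)
a <- rnorm(100)
b <- a + rnorm(100, sd = 1e-9)   # b is almost an exact copy of a
kappa(cbind(1, a, b))            # enormous: severe multicollinearity
kappa(cbind(1, a, rnorm(100)))   # moderate: unrelated columns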

Another approach to dealing with multicollinearity is to partition the regressors and skip the regressions with at least two regressors in the same partition. We can use theory to create the partitions, or we can cluster the variables based on a specific statistical distance. Note that such partitioning reduces the number of potential models and therefore speeds up the calculations. Of course, in this example each partition has just one variable.
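
As a sketch of the clustering idea (made-up data; this is not the grouping ldt would produce), one could cluster the variables with a correlation-based distance and treat each cluster as a partition:

set.seed(1)
toy <- matrix(rnorm(200 * 10), 200, 10)
toy[, 2] <- toy[, 1] + rnorm(200, sd = 0.01)  # make two columns nearly collinear
colnames(toy) <- paste0("v", 1:10)

d  <- as.dist(1 - abs(cor(toy)))              # highly correlated variables are close
cl <- hclust(d, method = "average")
split(colnames(toy), cutree(cl, k = 9))       # v1 and v2 fall in the same partition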

How can we estimate all models with an intercept?

We would like to include the intercept and the lag of the dependent variable in all the regressions. ldt allows us to fix the first \(k\) partitions. Since the first two variables are the intercept and the lag of the dependent variable, in the next section we set the numFixXPartitions argument to 2.

Estimation

We cannot build the data in this vignette, because it requires an external data set that is not available in this package. However, a part of the data set is saved in the ldt package, and we load it:

x = ldt::vig_data$wdi$x
y = ldt::vig_data$wdi$y

If you have downloaded the data set files, do not run this code.

search_res <- SurSearch_s(
  x = x, y = y, numFixXPartitions = 2,
  xSizes = xSizes, counts = xCounts,
  searchItems = searchItems, modelCheckItems = checkItems, measureOptions = measureOptions,
  searchOptions = GetSearchOptions(parallel = TRUE, printMsg = FALSE)
)

The arguments are discussed in the previous sections. Note that SurSearch_s() can save the result of different steps in the working directory.

Status

We can print the results to get some general information about the search process. It prints the number of valid or failed estimations and the weights of the best models:

print(search_res)
  > method: sur 
  > expected: 12, searched: 12 (100%), failed: 0 (0%)
  > --------
  > 1. aic:
  >  NY.GDP.PCAP.KD (best=4.042)
  > 2. sic:
  >  NY.GDP.PCAP.KD (best=4.085)
  > 3. rmse:
  >  NY.GDP.PCAP.KD (best=0.047)
  > 4. crps:
  >  NY.GDP.PCAP.KD (best=0.016)
  > NULL

Note that the weights are related to the in-sample and out-of-sample criteria discussed before; see the documentation of the ldt::GetWeightFromMeasure() and GetMeasureFromWeight() functions. Also, an error occurred in 0% of the searched models. This number may increase if we apply stricter options in the modelCheckItems argument.
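
For intuition, a common convention for turning information criteria into comparable weights is the Akaike-weight transformation sketched below; the exact mapping that ldt uses is documented in GetWeightFromMeasure() and may differ:

# hypothetical AIC values of three competing models
aic <- c(m1 = 4.04, m2 = 4.16, m3 = 4.21)

delta <- aic - min(aic)          # differences from the best model
w <- exp(-0.5 * delta)
w / sum(w)                       # relative weights summing to one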

Best models

While the best weights are informative, we generally want more information about the best models. The returned value (which is an ldtsearch object) provides some general information such as the indices of the exogenous variables. The summary function can be used to estimate the best models:

search_res_sum <- summary(search_res, y = y, x = x, test = FALSE, printMsg = FALSE)

However, this returned value contains a lot of information and we need a way of summarizing it. If the goal is to discuss the coefficients, the ldt::CoefTable() function can be used to create a table. First, we need to create a list of models:

best_models <- list(
  aic = search_res_sum$aic$target1$model$bests$best1,
  rmse = search_res_sum$rmse$target1$model$bests$best1,
  crps = search_res_sum$crps$target1$model$bests$best1
)

Next, we choose the types of information from the regressions that we would like to display. This is done based on keywords from the documentation of the CoefTable() function. For example:

regInfo <- list(
  c("", ""), c("num_obs", "No. Obs."), c("num_x", "No. Exo."), c("sigma2", "S.E. of Reg."), 
  c("r2", "R Sq."), c("aic", "AIC"), c("sic", "SIC"), c("rmse", "RMSE"), c("crps", "CRPS")
)

Next, we retrieve the names of the variables from the data set. We also need some adjustments to escape some characters and to get a more informative table:

vnamefun_wdi <- function(x) {
  tryCatch(
    {
      Data_WdiSearchFor(wdi_lr_2005$series, keywords = c(x), searchName = FALSE, findOne = TRUE)$name
    },
    error = function(e) x
  )
}
hnameFun <- function(x) {
  paste0((if (x == "aic") "AIC" else if (x == "sic") "SIC" else if (x == "rmse") "RMSE" else if (x == "crps") "CRPS"), "-based")
}
replaces <- list(c("US$", "US\\$"))

Finally, we can build the table and render it by using the kableExtra package:

res <- CoefTable(best_models,
                 tableFun = "coef_star", hnameFun = hnameFun, vnamesFun = vnamefun_wdi, regInfo = regInfo,
                 vnamesFun_sub = replaces, vnamesFun_max = 40, formatNumFun = function(j,x)round(x,3),
                 formatLatex = FALSE
)
Long-run GDP per capita growth: best regressions

                    AIC-based   RMSE-based   CRPS-based
AG.AGR.TRAC.NO      0.087       0.14**       0.14**
AG.CON.FERT.PT.ZS   0.197**     0.189*       0.189*
AG.CON.FERT.ZS      0.194**
AG.LND.CROP.ZS      0.073       0.265        0.265
No. Obs.            39          40           40
No. Exo.            4           3            3
S.E. of Reg.        3.169       3.58         3.58
R Sq.               0.149       0.018        0.018
AIC                 4.042       4.163        4.163
SIC                 4.085       4.205        4.205
RMSE                0.08        0.053        0.053
CRPS                0.02        0.017        0.017
Note: The dependent variable is the long-run yearly growth rate of GDP per capita of different countries after 2006. Regressors are selected automatically. Column headers indicate the criterion used to select the best regression. Data is from WDI. Parameters of the equations are reported at the top and other properties at the bottom. ***, **, and * denote significance levels of 1%, 5%, and 10%, respectively.

Note that the goal of this vignette is not to interpret or discuss the coefficients of the best models.

Principal component analysis

Another practical approach to dealing with a large number of potential regressors in a regression analysis is to use Principal Component Analysis (PCA). The estimation process is much faster than the previous search approach. However, there are some drawbacks. For example, the results might be less instructive, and missing observations or determining the number of principal components might become a problem.
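
As an illustration of how the number of components can be chosen (with prcomp() on simulated data; ldt's own behavior is configured through GetPcaOptions() below), one keeps as many components as are needed to reach a target share of the total variance:

set.seed(1)
z <- matrix(rnorm(200 * 8), 200, 8)
pc <- prcomp(z, center = TRUE, scale. = TRUE)

cumvar <- cumsum(pc$sdev^2) / sum(pc$sdev^2)  # cumulative variance ratios
which(cumvar >= 0.95)[1]                      # components needed for a 0.95 cutoff
which(cumvar >= 0.60)[1]                      # fewer components for a 0.6 cutoff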

To calculate the principal components, we should omit incomplete observations (rows with at least one NA). However, before removing the rows, we can omit variables with a relatively large number of missing data points (columns with many NAs). In other words, to prepare the data for PCA, there is a trade-off between the number of variables and the number of observations. One approach is to try to maximize the overall number of observations. In this example, we use ldt::RemoveNaStrategies() and estimate two types of models: one with a relatively large number of observations, and another with a relatively large number of explanatory variables. First, we get the available strategies:

x_ <- x[, 3:ncol(x), drop = FALSE]
strategies <- RemoveNaStrategies(x_)
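
The trade-off described above can be illustrated on a small made-up matrix (independent of RemoveNaStrategies()): dropping the column with the most NAs first leaves many more complete rows:

set.seed(1)
m <- matrix(rnorm(100 * 6), 100, 6)
m[sample(100, 60), 6] <- NA      # one column with many missing values
m[sample(100, 5), 3]  <- NA      # another column with only a few

sum(complete.cases(m))           # complete rows if all columns are kept
sum(complete.cases(m[, -6]))     # many more complete rows after dropping column 6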

Each strategy specifies the number of rows or columns that should be removed and also the order. Note that we want to exclude the intercept and the lag of the dependent variable from PCA. The NA strategies are sorted by the overall number of observations. We can print the number of variables and select some of them:

cat(unlist(sapply(strategies, function(s) s$nCols))[1:20], " ...\n")
  > 7 8 8 6 8 8 8 8 8 8 8 8 8 8 8 8 5 7 8 8  ...

These numbers are the numbers of variables in each strategy. While the choice is not straightforward, we choose the first strategy (with 7 variables and 1001 observations) and the second strategy (with 8 variables and 848 observations) to fulfill our goal. Of course, since we also want a sensitivity analysis on the number of principal components, we select two values as the thresholds of the cumulative variance ratio. In other words, we use the following combination:

inds <- c(1, 2, 1, 2)
pcaCutoffRates <- c(0.95, 0.95, 0.6, 0.6)
xPcaOptions <- GetPcaOptions(ignoreFirst = 2, exactCount = 0, max = 8)

This means we will estimate 4 models by combining the first two NA strategies and two cumulative variance ratios: 0.95 and 0.6. The third line of code is just a setup for ignoring the first two regressors and choosing the number of PCs while estimating the model. We restrict the number of PCs to be at most 8. This is a relatively large number and therefore, as the next step, we ask ldt to omit insignificant coefficients, re-estimate the model, and repeat this process 2 times:

searchSigMaxIter <- 2
searchSigMaxProb <- 0.2

Note that we choose a relatively large value for the type-I error (i.e., 0.2). Next, we prepare the data for each case and use the ldt::SurEstim() function to estimate the models and create a list of the estimations:

pca_models <- list()
i <- 0
for (ind in inds) {
  i <- i + 1
  strat <- strategies[[ind]]
  y0 <- y[-strat$rowRemove, , drop = FALSE]
  # remove columns and rows in the order specified by the strategy
  if (strat$colFirst) {
    x0 <- if (length(strat$colRemove) > 0) x_[, -strat$colRemove, drop = FALSE] else x_
    x0 <- x0[-strat$rowRemove, , drop = FALSE]
  } else {
    x0 <- x_[-strat$rowRemove, , drop = FALSE]
    if (length(strat$colRemove) > 0)
      x0 <- x0[, -strat$colRemove, drop = FALSE]
  }
  # put the intercept and the lag of the dependent variable back (they are excluded from PCA)
  x0 <- cbind(x[, c(1, 2), drop = FALSE][-strat$rowRemove, , drop = FALSE], x0)
  xPcaOptions$cutoffRate <- pcaCutoffRates[[i]]
  pca_models[[length(pca_models) + 1]] <- SurEstim(y0, x0,
                                                   addIntercept = FALSE,
                                                   searchSigMaxIter = searchSigMaxIter,
                                                   searchSigMaxProb = searchSigMaxProb,
                                                   pcaOptionsX = xPcaOptions,
                                                   simFixSize = measureOptions$simFixSize,
                                                   simTrainRatio = measureOptions$trainRatio, simSeed = abs(measureOptions$seed),
                                                   simMaxConditionNumber = checkItems$maxConditionNumber,
                                                   printMsg = FALSE
  )
}

Note that the main difficulty here is preparing the data. It arises because we want to exclude the first two variables from the PCA. Also, note that some options, such as the number of simulations, are defined in the previous section.

Finally, similar to the previous section, we use the list of estimations and create a table:

regInfo <- append(regInfo, 
                  list(c("num_exo", "No. Exo. (orig.)"), 
                       c("pca_x_cutoff", "PCA Cutoff")), 2)
regInfo <- append(regInfo, list(c("", "[. . .]")), 0)
regInfo <- append(regInfo, list(c("num_rest", "No. Rest.")), 6)
res <- CoefTable(pca_models,
                 tableFun = "coef_star", hnameFun = function(x) x, vnamesFun_sub = replaces,
                 vnamesFun = vnamefun_wdi, vnamesFun_max = 16, regInfo = regInfo, 
                 numCoefs = 2, formatNumFun = function(j,x)round(x,3),
                 formatLatex = FALSE
)
colnames(res) <- paste0("mod.", seq(1, length(pca_models))) 
Long-run GDP per capita growth: regressions with PCA

                   mod.1      mod.2      mod.3      mod.4
AG.AGR.TRAC.NO     0.142***   0.149***   0.182***   0.182***
AG.CON.FERT.P…     0(r)       0(r)       0.148      0.148
[. . .]
No. Obs.           39         39         39         39
No. Exo. (orig.)   9          10         9          10
PCA Cutoff         0.95       0.95       0.6        0.6
No. Exo.           6          7          4          4
No. Rest.          5          3          2          2
S.E. of Reg.       4.129      3.414      3.883      3.883
R Sq.              -0.11      0.083      -0.043     -0.043
AIC                4.307      4.117      4.246      4.246
SIC                4.35       4.16       4.288      4.288
RMSE               0.1        0.097      0.104      0.104
CRPS               0.022      0.022      0.023      0.023
Note: The dependent variable is the long-run yearly growth rate of GDP per capita of different countries after 2006. Regressors are the principal components of different combinations of the variables. Each column belongs to a specific strategy for removing the NAs and selecting the number of PCs. The intercept and the lag of the dependent variable are excluded from PCA. Other coefficients are not reported. Data is from WDI. ***, **, and * denote significance levels of 1%, 5%, and 10%, respectively. (r) means the coefficient is restricted.

Again, the goal of this vignette is presentation and therefore we do not compare or discuss the results.

References

World Bank. 2022. “World Development Indicators.”