library(fRLR)
This R package aims to fit Repeated Linear Regressions in which there are some same terms.
Let’s start with the simplest situation, we want to fit a set of regressions which only differ in one variable. Specifically, denote the response variable as y, and these regressions are as follows.
y∼x1+cov1+cov2+…+covmy∼x2+cov1+cov2+…+covm⋅∼⋯y∼xn+cov1+cov2+…+covm
where covi,i=1,…,m are the same variables among these regressions.
Intuitively, we can finish this task by using a simple loop.
= vector(mode='list', length=n)
model for (i in 1:n)
{
...= lm(y~x)
model[[i]]
... }
However, it is not efficient in that situation. As we all know, in the linear regression, the main goal is to estimate the parameter β. And we have
ˆβ=(X′X)−1X′Y
where X is the design matrix and Y is the observation of response variable.
It is obvious that there are some same elements in the design matrix, and the larger m is, the more elements are the same. So I want to reduce the cost of computation by separating the same part in the design matrix.
For the above example, the design matrix can be denoted as X=(x,cov). If we consider intercept, it also can be seen as the same variable among these regression, so it can be included in cov naturally. Then we have
(X′X)−1=[x′xx′covcov′xcov′cov]=[av′vB]
Woodbury formula tells us
(A+UCV)−1=A−1−A−1U(C−1+VA−1U)−1VA−1
Let
A=[aOOB],U=[10Ov],V=[0v′1O]
and C=I2×2. Then we can apply woodbury formula,
(X′X)−1=(A+UCV)−1=A−1−A−1U(C−1+VA−1U)−1VA−1
where
A−1=[a−1OOB−1]
We can do further calculations to simplify and obtain the following result
(X′X)−1=[1/a+aa−v′B−1vv′B−1v−v′B−1a−v′B−1v−B−1va−v′B−1vB−1+−B−1vv′B−1a−v′B−1v]
Notice that matrix B is the same for all regression, the identical terms for each regression are just a and v, which are very easy to calculate. So theoretically, we can reduce the cost of computation significantly.
Now test two simulation examples by using the functions in this package.
## use fRLR package
set.seed(123)
= matrix(rnorm(50), 10, 5)
X = rnorm(10)
Y = matrix(rnorm(40), 10, 4)
COV frlr1(X, Y, COV)
#> r r.p.value
#> 1 0 0.4380128
#> 2 1 0.7791076
#> 3 2 0.2212869
#> 4 3 0.9495018
#> 5 4 0.6729983
## use simple loop
= matrix(nrow = 0, ncol = 2)
res for (i in 1:ncol(X))
{= cbind(X[,i], COV)
mat = as.data.frame(mat)
df = lm(Y~., data = df)
model = c(i, summary(model)$coefficients[2, 4])
tmp = rbind(res, tmp)
res
}
res#> [,1] [,2]
#> tmp 1 0.4380128
#> tmp 2 0.7791076
#> tmp 3 0.2212869
#> tmp 4 0.9495018
#> tmp 5 0.6729983
As we can see in the above output, these p-values for the identical variable in each regression are equal between two methods.
Similarly, we can test another example
set.seed(123)
= matrix(rnorm(50), 10, 5)
X = rnorm(10)
Y = matrix(rnorm(40), 10, 4)
COV
= c(1, 2, 3, 4, 1, 1, 1, 2, 2, 3)
idx1 = c(2, 3, 4, 5, 3, 4, 5, 4, 5, 5)
idx2
frlr2(X, idx1, idx2, Y, COV)
#> r1 r2 r1.p.value r2.p.value
#> 1 1 2 0.53021406 0.895719578
#> 2 2 3 0.01812006 0.009833047
#> 3 3 4 0.29895922 0.963995969
#> 4 4 5 0.91749181 0.712075464
#> 5 1 3 0.33761507 0.210331456
#> 6 1 4 0.51074586 0.966484642
#> 7 1 5 0.12479380 0.152802911
#> 8 2 4 0.79302893 0.902402294
#> 9 2 5 0.73153760 0.663392258
#> 10 3 5 0.32367303 0.877154122
#> 11 2 3 0.01812006 0.009833047
#> 12 1 2 0.53021406 0.895719578
#> 13 3 4 0.29895922 0.963995969
#> 14 4 5 0.91749181 0.712075464
#> 15 1 3 0.33761507 0.210331456
#> 16 1 4 0.51074586 0.966484642
#> 17 1 5 0.12479380 0.152802911
#> 18 2 4 0.79302893 0.902402294
#> 19 2 5 0.73153760 0.663392258
#> 20 3 5 0.32367303 0.877154122
= matrix(nrow=0, ncol=4)
res for (i in 1:length(idx1))
{= cbind(X[, idx1[i]], X[,idx2[i]], COV)
mat = as.data.frame(mat)
df = lm(Y~., data = df)
model = c(idx1[i], idx2[i], summary(model)$coefficients[2,4], summary(model)$coefficients[3,4])
tmp = rbind(res, tmp)
res }
Again, we obtain the same results by different methods.
The main aim of this new method is to reduce the computation cost. Now let’s compare its speed with the simple-loop method.
We can obtain the following time cost for 99×100/2=4950 linear regressions.
set.seed(123)
= 100
n = matrix(rnorm(10*n), 10, n)
X = rnorm(10)
Y = matrix(rnorm(40), 10, 4)
COV
#idx1 = c(1, 2, 3, 4, 1, 1, 1, 2, 2, 3)
#idx2 = c(2, 3, 4, 5, 3, 4, 5, 4, 5, 5)
= combn(n, 2)
id = id[1, ]
idx1 = id[2, ]
idx2
system.time(frlr2(X, idx1, idx2, Y, COV))
#> user system elapsed
#> 2.048 0.109 0.965
<- function()
simpleLoop
{= matrix(nrow=0, ncol=4)
res for (i in 1:length(idx1))
{= cbind(X[, idx1[i]], X[,idx2[i]], COV)
mat = as.data.frame(mat)
df = lm(Y~., data = df)
model = c(idx1[i], idx2[i], summary(model)$coefficients[2,4], summary(model)$coefficients[3,4])
tmp = rbind(res, tmp)
res
}
}
system.time(simpleLoop())
#> user system elapsed
#> 9.275 0.011 9.413