Question
I need to compute the sum-of-squares-and-crossproducts (SSCP) matrix (actually, only the trace of this matrix) in a multivariate linear model, with Y (n x q) and X (n x p). Standard R code for doing that is:
require(MASS)
require(car)
# Example data
q <- 10
n <- 1000
p <- 10
Y <- mvrnorm(n, mu = rep(0, q), Sigma = diag(q))
X <- as.data.frame(mvrnorm(n, mu = rnorm(p), Sigma = diag(p)))
# Fit the multivariate linear model
fit <- lm(Y ~ ., data = X)
# Type I sums of squares
summary(manova(fit))$SS
# Type III sums of squares
type <- 3 # could also be 2 (type II)
car::Anova(fit, type = type)$SSP
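Since only the trace matters here, a short sketch pulling the per-term traces out of the SSP list returned above (assuming, as in the call above, that $SSP is a named list of per-term SSCP matrices):
ssp <- car::Anova(fit, type = 3)$SSP
sapply(ssp, function(S) sum(diag(S)))  # trace of each term's SSCP matrix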
This has to be done thousands of times and, unfortunately, it gets slow when the number of predictors is relatively large. As I am often interested in only a subset of s predictors, I tried to re-implement this calculation. My implementation, which directly translates the linear algebra for s = 1 (below), is faster for small sample sizes (n):
# Hat matrix (X here stands for the actual design matrix)
H <- tcrossprod(tcrossprod(X, solve(crossprod(X))), X)
# Remove the predictor of interest (e.g. column 2)
X.r <- X[, -2]
H1 <- tcrossprod(tcrossprod(X.r, solve(crossprod(X.r))), X.r)
# Compute e.g. the type III SSCP matrix: t(Y) %*% (H - H1) %*% Y
SS <- crossprod(Y, H - H1) %*% Y
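Since only the trace of SS is needed in the end, the q x q product can be avoided via the identity tr(Y'AY) = sum(Y * (A %*% Y)); a minimal sketch (trSS is a hypothetical name):
trSS <- sum(Y * ((H - H1) %*% Y))  # trace of t(Y) %*% (H - H1) %*% Y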
Even so, car still goes faster for large n.
I already tried an Rcpp implementation, without much success, as these matrix products in R already use very efficient code.
Any hint on how to do this faster?
UPDATE
After reading the answers, I tried the solution proposed in this post, which relies on a QR/SVD/Cholesky factorization for the hat matrix calculation. However, it seems that car::Anova is still faster at computing all p = 30 matrices than my code is at computing just one (s = 1)! For example, with n = 5000 and q = 10:
Unit: milliseconds
expr min lq mean median uq max neval
ME 1137.5692 1202.9888 1257.8979 1251.6834 1318.9282 1398.9343 10
QR 1005.9082 1031.9911 1084.5594 1037.5659 1095.7449 1364.9508 10
SVD 1026.8815 1065.4629 1152.6631 1087.9585 1241.4977 1446.8318 10
Chol 969.9089 1056.3093 1115.9608 1102.1169 1210.7782 1267.1274 10
CAR 205.1665 211.8523 218.6195 214.6761 222.0973 242.4617 10
UPDATE 2
The best solution for now was to go through the car::Anova code (i.e. the functions car:::Anova.III.mlm and subsequently car:::linearHypothesis.mlm) and re-implement them to handle a subset of the predictors instead of all of them. The relevant car code is as follows (I skipped checks and simplified a bit):
B <- coef(fit)                    # Model coefficients
M <- model.matrix(fit)            # Model matrix M
V <- solve(crossprod(M))          # (M'M)^-1
p <- ncol(M)                      # Number of columns of M (incl. intercept)
I.p <- diag(p)                    # Identity (p x p)
terms <- labels(terms(fit))       # Term labels (add the intercept)
terms <- c("(Intercept)", terms)
n.terms <- length(terms)
assign <- fit$assign              # Maps each of the p columns to its term
SSP <- as.list(rep(0, n.terms))   # Empty list for the SSCP matrices
names(SSP) <- terms
for (term in 1:n.terms) {
  subs <- which(assign == term - 1)
  L <- I.p[subs, , drop = FALSE]  # Hypothesis matrix for this term
  # SSP_H = (L B)' (L (M'M)^-1 L')^-1 (L B)
  SSP[[term]] <- t(L %*% B) %*% solve(L %*% V %*% t(L)) %*% (L %*% B)
}
Then it is just a matter of restricting the loop to the subset of terms of interest, as sketched below.
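A minimal sketch of that restriction; keep is a hypothetical character vector naming the terms of interest (here two of the generated predictors):
keep <- c("V1", "V2")  # hypothetical subset of term labels
SSP.sub <- setNames(vector("list", length(keep)), keep)
for (term in which(terms %in% keep)) {
  subs <- which(assign == term - 1)
  L <- I.p[subs, , drop = FALSE]
  SSP.sub[[terms[term]]] <- t(L %*% B) %*% solve(L %*% V %*% t(L)) %*% (L %*% B)
}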
Answer 1:
This line, and the similar one below it for H1, could probably be improved:
H <- tcrossprod(tcrossprod(X, solve(crossprod(X))), X)
The general idea is that you should rarely use solve(Y) %*% Z, because it is the same as solve(Y, Z) but slower. I haven't fully expanded your tcrossprod calls to see what the best equivalent formulation of the expressions for H and H1 would be.
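For instance, a minimal sketch of that substitution applied to the hat matrix (same result, one explicit inverse fewer):
# solve(crossprod(X), t(X)) computes (X'X)^-1 X' in one step
H <- X %*% solve(crossprod(X), t(X))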
You could also look at this question for a description of doing it via a QR decomposition: https://stats.stackexchange.com/questions/139969/speeding-up-hat-matrices-like-xxx-1x-projection-matrices-and-other-as
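A minimal sketch of the QR route described there (assuming X has full column rank, so the hat matrix is Q Q' with Q from the thin QR of X):
qrX <- qr(X)
Q <- qr.Q(qrX)          # n x p, orthonormal columns
H <- tcrossprod(Q)      # hat matrix H = Q Q'
# If only t(Y) %*% H %*% Y is needed, skip the n x n matrix:
QtY <- crossprod(Q, Y)  # p x q
SSH <- crossprod(QtY)   # equals t(Y) %*% H %*% Y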
Source: https://stackoverflow.com/questions/64534242/faster-alternative-to-r-caranova-for-sum-of-square-crossproduct-matrix-calcula