问题
I want to be able to regression coefficients from multiple linear regression by supplying a correlation or covariance matrix instead of a data.frame. I realise you lose some information relevant to determining the intercept and so on, but it should even the correlation matrix should be sufficient for getting standardised coefficients and estimates of variance explained.
So for example, if you had the following data
# get some data
library(MASS)
data("Cars93")
x <- Cars93[,c("EngineSize", "Horsepower", "RPM")]
You could run a regression as follows:
lm(EngineSize ~ Horsepower + RPM, x)
but what if instead of having data you had the correlation matrix or the covariance matrix:
corx <- cor(x)
covx <- cov(x)
- What function in R allows you to run a regression based on the correlation or covariance matrix? Ideally it should be similar to
lm
so that you can easily obtain things like r-squared, adjusted r-squared, predicted values and so on. Presumably, for some of these things, you would need to also provide the sample size and possibly a vector of means. But that would also be fine.
I.e., something like:
lm(EngineSize ~ Horsepower + RPM, cov = covx) # obviously this doesn't work
Note that this answer on Stats.SE provides a theoretical explanation for why it's possible, and provides an example of some custom R code for calculating coefficients?
回答1:
Using lavaan you could do the following:
library(MASS)
data("Cars93")
x <- Cars93[,c("EngineSize", "Horsepower", "RPM")]
lav.input<- cov(x)
lav.mean <- colMeans(x)
library(lavaan)
m1 <- 'EngineSize ~ Horsepower+RPM'
fit <- sem(m1, sample.cov = lav.input,sample.nobs = nrow(x), meanstructure = TRUE, sample.mean = lav.mean)
summary(fit, standardize=TRUE)
Results are:
Regressions:
Estimate Std.Err Z-value P(>|z|) Std.lv Std.all
EngineSize ~
Horsepower 0.015 0.001 19.889 0.000 0.015 0.753
RPM -0.001 0.000 -15.197 0.000 -0.001 -0.576
Intercepts:
Estimate Std.Err Z-value P(>|z|) Std.lv Std.all
EngineSize 5.805 0.362 16.022 0.000 5.805 5.627
Variances:
Estimate Std.Err Z-value P(>|z|) Std.lv Std.all
EngineSize 0.142 0.021 6.819 0.000 0.142 0.133
回答2:
Remember that:
$beta=(X'X)^-1. X'Y$
Try:
(bs<-solve(covx[-1,-1],covx[-1,1]))
Horsepower RPM
0.01491908 -0.00100051
For the Intercept you will need averages of the variables. For example:
ms=colMeans(x)
(b0=ms[1]-bs%*%ms[-1])
[,1]
[1,] 5.805301
回答3:
I think lavaan sounds like a good option, I note that @Philip pointed me in the right direction. I just mention here how to extract a few extra model features using lavaan (particularly, r-squared and adjusted r-squared) that you might want.
For the latest version see: https://gist.github.com/jeromyanglim/9f766e030966eaa1241f10bd7d6e2812 :
# get data
library(MASS)
data("Cars93")
x <- Cars93[,c("EngineSize", "Horsepower", "RPM")]
# define sample statistics
covx <- cov(x)
n <- nrow(x)
means <- sapply(x, mean) # this is optional
fit <- lavaan::sem("EngineSize ~ Horsepower + RPM", sample.cov = covx,
sample.mean = means,
sample.nobs = n)
coef(fit) # unstandardised coefficients
standardizedSolution(fit) # Standardised coefficients
inspect(fit, 'r2') # r-squared
# adjusted r-squared
adjr2 <- function(rsquared, n, p) 1 - (1-rsquared) * ((n-1)/(n-p-1))
# update p below with number of predictor variables
adjr2(inspect(fit, 'r2'), n = inspect(fit, "nobs"), p = 2)
Custom function
And here is a bit of a function that supplies the fit from lavaan along with a few features of relevance (i.e., basically packaging most of the above). It assumes a case where you don't have the means.
covlm <- function(dv, ivs, n, cov) {
# Assumes lavaan package
# library(lavaan)
# dv: charcter vector of length 1 with name of outcome variable
# ivs: character vector of names of predictors
# n: numeric vector of length 1: sample size
# cov: covariance matrix where row and column names
# correspond to dv and ivs
# Return
# list with lavaan model fit
# and various other features of the model
results <- list()
eq <- paste(dv, "~", paste(ivs, collapse = " + "))
results$fit <- lavaan::sem(eq, sample.cov = cov,
sample.nobs = n)
# coefficients
ufit <- parameterestimates(results$fit)
ufit <- ufit[ufit$op == "~", ]
results$coef <- ufit$est
names(results$coef) <- ufit$rhs
sfit <- standardizedsolution(results$fit)
sfit <- sfit[sfit$op == "~", ]
results$standardizedcoef <- sfit$est.std
names(results$standardizedcoef) <- sfit$rhs
# use unclass to not limit r2 to 3 decimals
results$r.squared <- unclass(inspect(results$fit, 'r2')) # r-squared
# adjusted r-squared
adjr2 <- function(rsquared, n, p) 1 - (1-rsquared) * ((n-1)/(n-p-1))
results$adj.r.squared <- adjr2(unclass(inspect(results$fit, 'r2')),
n = n, p = length(ivs))
results
}
For example:
x <- Cars93[,c("EngineSize", "Horsepower", "RPM")]
covlm(dv = "EngineSize", ivs = c("Horsepower", "RPM"),
n = nrow(x), cov = cov(x))
This all produces:
$fit
lavaan (0.5-20) converged normally after 27 iterations
Number of observations 93
Estimator ML
Minimum Function Test Statistic 0.000
Degrees of freedom 0
Minimum Function Value 0.0000000000000
$coef
Horsepower RPM
0.01491908 -0.00100051
$standardizedcoef
Horsepower RPM
0.7532350 -0.5755326
$r.squared
EngineSize
0.867
$adj.r.squared
EngineSize
0.864
回答4:
Another kind of funky solution is to generate a data set that has the same variance-covariance matrix as the original data. You can do this with mvrnorm()
in the MASS
package. Using lm()
on this new data set will yield parameter estimates and standard errors identical to those that would have been estimated from the original data set (except for the intercept, which is inaccessible unless you have the means of each variable). Here's an example of what this would look like:
#Assuming the variance covariance matrix is called VC
n <- 100 #sample size
nvar <- ncol(VC)
fake.data <- mvrnorm(n, mu = rep(0, nvar), sigma = VC, empirical = TRUE)
lm(Y~., data = fake.data)
来源:https://stackoverflow.com/questions/38558278/how-to-get-regression-coefficients-and-model-fits-using-correlation-or-covarianc