I have a huge matrix with a lot of missing values. I want to get the correlations between its variables.
1. Is the solution
cor(na.omit(matr
I think the second option makes more sense,
You might consider using the rcorr function in the Hmisc package.
It is very fast and uses only pairwise complete observations. The returned object contains a matrix of correlations (r), a matrix of the number of observations used for each pair (n), and a matrix of p-values (P).
This means that you can ignore correlation values based on a small number of observations (whatever that threshold is for you) or based on the p-value.
library(Hmisc)
x<-matrix(nrow=10,ncol=10,data=runif(100))
x[x>0.5]<-NA
result<-rcorr(x)
result$r[result$n<5]<-0 # ignore correlations based on fewer than five observations
result$r
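The same idea extends to filtering on the p-value. The following is a minimal sketch, not part of the original answer: the simulated data, the thresholds (30 pairs, p < 0.05) and the name r_filtered are assumptions chosen for illustration. It masks entries with NA rather than 0 so that a discarded pair is not mistaken for a genuine zero correlation.

```r
library(Hmisc)

set.seed(1)                                  # arbitrary seed, for reproducibility only
x <- matrix(rnorm(200 * 5), nrow = 200, ncol = 5)
x[sample(length(x), 200)] <- NA              # remove 20% of the cells at random

result <- rcorr(x)                           # list with components r, n and P

# Mask correlations that rest on too few pairs or are not significant.
# Both thresholds are illustrative assumptions, not recommendations.
r_filtered <- result$r
too_few  <- result$n < 30
not_sig  <- !is.na(result$P) & result$P >= 0.05
r_filtered[too_few | not_sig] <- NA
diag(r_filtered) <- 1                        # the diagonal has no p-value; keep it at 1
r_filtered
```

Whether to mask with NA or 0 depends on what consumes the matrix afterwards; NA keeps "unknown" distinct from "uncorrelated".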
I would vote for the second option. It sounds like you have a fair amount of missing data, so you would be looking for a sensible multiple imputation strategy to fill in the gaps. See Harrell's text "Regression Modeling Strategies" for a wealth of guidance on how to do this properly.
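As a rough sketch of what a multiple-imputation route could look like: the mice package used below is an assumption (the thread does not name an imputation tool), as are the simulated data and all variable names. The code imputes m completed data sets, computes a correlation matrix on each, and averages them on Fisher's z scale.

```r
library(mice)   # assumption: not mentioned in the thread; any MI tool could be used

set.seed(123)
n <- 200
z <- rnorm(n)                                # shared factor so the columns correlate
dat <- data.frame(a = z + rnorm(n), b = z + rnorm(n), c = z + rnorm(n))
for (v in names(dat)) dat[sample(n, 40), v] <- NA   # 20% missing per column

m <- 5
imp <- mice(dat, m = m, printFlag = FALSE, seed = 123)

# Correlation matrix of each completed data set, transformed to Fisher's z.
# The diagonal is zeroed first because atanh(1) is infinite.
z_list <- lapply(seq_len(m), function(i) {
  r <- cor(complete(imp, i))
  diag(r) <- 0
  atanh(r)
})

# Average on the z scale, transform back, and restore the diagonal.
r_pooled <- tanh(Reduce(`+`, z_list) / m)
diag(r_pooled) <- 1
r_pooled
```

This only pools the point estimates. Proper standard errors need the between-imputation variance as well, and the choice of imputation model deserves the care the answer above refers to.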
For future readers, the post "Pairwise-complete correlation considered dangerous" may be valuable. It argues that cor(matrix, use = "pairwise.complete.obs") is dangerous and suggests alternatives such as use = "complete.obs".
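To see how the two settings differ in practice, here is a small base R sketch on simulated data (the data and variable names are assumptions for illustration). It computes both matrices, counts how many rows each entry actually rests on, and checks the smallest eigenvalue of the pairwise matrix, since a matrix assembled from different subsets of rows is not guaranteed to be positive semidefinite.

```r
set.seed(42)
x <- matrix(rnorm(50 * 4), nrow = 50, ncol = 4)
x[sample(length(x), 40)] <- NA               # remove 20% of the cells at random

# Each entry uses whatever rows are complete for that particular pair of columns.
r_pairwise <- cor(x, use = "pairwise.complete.obs")

# Every entry uses the same rows: those with no missing value in any column.
r_complete <- cor(x, use = "complete.obs")

# How many rows each approach actually relies on.
n_complete <- sum(complete.cases(x))
obs <- (!is.na(x)) * 1                       # 1 where observed, 0 where missing
n_pairwise <- crossprod(obs)                 # [i, j] = rows where columns i and j are both observed

n_complete
n_pairwise

# A negative value here would mean r_pairwise is not a valid correlation matrix.
min(eigen(r_pairwise, symmetric = TRUE)$values)
```

The trade-off is visible in the counts: "complete.obs" discards more rows but yields one coherent matrix, while "pairwise.complete.obs" keeps more data per entry at the cost of mixing different samples.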