Dealing with missing values for correlations calculation

走远了吗. Posted on 2019-12-29 19:26:04

Question


I have a huge matrix with a lot of missing values, and I want to compute the correlations between its variables.

1. Is the solution

cor(na.omit(matrix))

better than the one below?

cor(matrix, use = "pairwise.complete.obs")

I have already selected only the variables with more than 20% missing values.

2. Which method makes the most sense?


Answer 1:


I would vote for the second option. It sounds like you have a fair amount of missing data, so you should look for a sensible multiple imputation strategy to fill in the gaps. See Harrell's text "Regression Modeling Strategies" for a wealth of guidance on how to do this properly.




Answer 2:


I think the second option makes more sense.

You might consider using the rcorr function in the Hmisc package.

It is very fast and uses only pairwise complete observations. The returned object contains three matrices:

  1. the correlation coefficients (r)
  2. the number of observations used for each correlation (n)
  3. the p-value for each correlation (P)

This means that you can ignore correlation values based on a small number of observations (whatever that threshold is for you) or based on the p-value.

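library(Hmisc)
x <- matrix(runif(100), nrow = 10, ncol = 10)  # 10 x 10 matrix of random values
x[x > 0.5] <- NA                               # introduce missing values
result <- rcorr(x)                             # pairwise complete observations
result$r[result$n < 5] <- 0                    # ignore correlations based on fewer than five observations
result$r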
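For readers who cannot install Hmisc, the same filtering idea can be sketched in base R. This is only an illustration of the approach above, not part of the original answer: the names observed, n, r and min_n are made up here, the seed is arbitrary, and the threshold of five is the same placeholder used above. Unlike the snippet above, it marks unreliable pairs as NA rather than 0, which is a judgment call.

```r
set.seed(1)                      # arbitrary seed, only for reproducibility
x <- matrix(runif(100), nrow = 10, ncol = 10)
x[x > 0.5] <- NA                 # knock out roughly half of the values

# n[i, j] = number of rows in which columns i and j are both observed
observed <- !is.na(x)
n <- crossprod(observed * 1)

# Pairwise-complete Pearson correlations; pairs with too little data give NA
r <- suppressWarnings(cor(x, use = "pairwise.complete.obs"))

min_n <- 5                       # hypothetical threshold, choose your own
r[n < min_n] <- NA               # flag pairs backed by too few observations
r
```

Inspecting n alongside r makes it obvious how much data stands behind each coefficient, which is easy to overlook when calling cor() alone.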



Answer 3:


Future readers may find the article "Pairwise-complete correlation considered dangerous" valuable. It argues that cor(matrix, use = "pairwise.complete.obs") is dangerous and suggests alternatives such as use = "complete.obs".
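One way to see the kind of problem involved is a small, deliberately extreme example. The sketch below is constructed for illustration and is not taken from the linked article; the matrix m and its column names are hypothetical. Each pair of columns is observed on a different set of rows, so the pairwise correlations contradict one another and the resulting matrix is not a valid correlation matrix (it has a negative eigenvalue).

```r
# Each pair of variables overlaps on a different block of four rows
m <- cbind(
  a = c(1:4, rep(NA, 4), 1:4),
  b = c(1:4, 1:4, rep(NA, 4)),
  c = c(rep(NA, 4), 1:4, 4:1)
)

# a and b agree perfectly, b and c agree perfectly, yet a and c are opposed
r_pair <- cor(m, use = "pairwise.complete.obs")
r_pair

# A proper correlation matrix has no negative eigenvalues; this one does
ev <- eigen(r_pair, symmetric = TRUE, only.values = TRUE)$values
ev

# Listwise deletion exposes the real issue: no row is fully observed
nrow(na.omit(m))
```

With use = "complete.obs" the call fails on this data because there are no complete rows, which is arguably more honest than returning a matrix that cannot come from any real data set. Real data is rarely this extreme, but the same inconsistency can appear in milder form.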



Source: https://stackoverflow.com/questions/7445639/dealing-with-missing-values-for-correlations-calculation
