Filling NA using linear regression in R

会有一股神秘感。 提交于 2020-12-30 04:12:58

问题


I have a data with one time column and 2 variables.(example below)

df <- structure(list(time = c(15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 
                              25, 26), var1 = c(20.4, 31.5, NA, 53.7, 64.8, NA, NA, NA, NA, 
                              120.3, NA, 142.5), var2 = c(30.6, 47.25, 63.9, 80.55, 97.2, 113.85, 
                              130.5, 147.15, 163.8, 180.45, 197.1, 213.75)), .Names = c("time", 
                              "var1", "var2"), row.names = c(NA, -12L), class = c("tbl_df", 
                              "tbl", "data.frame"))

The var1 has few NA and I want to fill the NA with linear regression between remaining values in var1 and var2.

Please Help!! And let me know if you need more information


回答1:


Here is an example using lm to predict values in R.

library(dplyr)

# Construct linear model based on non-NA pairs
df2 <- df %>% filter(!is.na(var1))

fit <- lm(var1 ~ var2, data = df2)

# See the result
summary(fit)

# Call:
#   lm(formula = var1 ~ var2, data = df2)
# 
# Residuals:
#   1          2          3          4          5          6 
# 8.627e-15 -2.388e-15  1.546e-16 -9.658e-15 -2.322e-15  5.587e-15 
# 
# Coefficients:
#   Estimate Std. Error   t value Pr(>|t|)    
# (Intercept) 2.321e-14  5.619e-15 4.130e+00   0.0145 *  
#   var2        6.667e-01  4.411e-17 1.511e+16   <2e-16 ***
#   ---
#   Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 7.246e-15 on 4 degrees of freedom
# Multiple R-squared:      1,   Adjusted R-squared:      1 
# F-statistic: 2.284e+32 on 1 and 4 DF,  p-value: < 2.2e-16
# 
# Warning message:
#   In summary.lm(fit) : essentially perfect fit: summary may be unreliable

# Use fit to predict the value
df3 <- df %>% 
  mutate(pred = predict(fit, .)) %>%
  # Replace NA with pred in var1
  mutate(var1 = ifelse(is.na(var1), pred, var1))

# See the result
df3 %>% as.data.frame()

#    time  var1   var2  pred
# 1    15  20.4  30.60  20.4
# 2    16  31.5  47.25  31.5
# 3    17  42.6  63.90  42.6
# 4    18  53.7  80.55  53.7
# 5    19  64.8  97.20  64.8
# 6    20  75.9 113.85  75.9
# 7    21  87.0 130.50  87.0
# 8    22  98.1 147.15  98.1
# 9    23 109.2 163.80 109.2
# 10   24 120.3 180.45 120.3
# 11   25 131.4 197.10 131.4
# 12   26 142.5 213.75 142.5



回答2:


Here is a one liner using the approx function from base R:

newvar1<-approx(df$time, df$var1, xout=df$time)

This function will apply a linear approximation between neighboring points was opposed to "www" answer which applies the linear approximation across all of the points. With this data, both solutions provide the same results since time and var1 has a perfect linear relationship, may not always be the case.
The xout option specifies the location where to estimate the new values, in this case I am passing the original time vector.

Related: See the spline function for a cubic approximation.




回答3:


I realize this is an old question but this might be a useful brute-force technique

generate your linear model

fit <- lm(var1 ~ var2, data = df)

Save the coefficients into an object using coef()

fit.c <- coef(fit)
fit.c

Use those coefficient to generate a predicted value as a new variable. The bracketed numbers indicate the position of the coefficient in the vector fit.c. fit.c[1] is the intercept.

df$pred <- fit.c[1] + fit.c[2]*df$var2

You may at this time replace NA values in the original variable

df$var1[is.na(df$var1)] <- df$pred 

However my instincts say to not overwrite values in your original variable and instead use pred for whatever purpose you planned for var1.



来源:https://stackoverflow.com/questions/49634504/filling-na-using-linear-regression-in-r

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!