Filling NA using linear regression in R

Question

I have a data set with one time column and two variables (example below).

df <- structure(list(time = c(15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 
                              25, 26), var1 = c(20.4, 31.5, NA, 53.7, 64.8, NA, NA, NA, NA, 
                              120.3, NA, 142.5), var2 = c(30.6, 47.25, 63.9, 80.55, 97.2, 113.85, 
                              130.5, 147.15, 163.8, 180.45, 197.1, 213.75)), .Names = c("time", 
                              "var1", "var2"), row.names = c(NA, -12L), class = c("tbl_df", 
                              "tbl", "data.frame"))

var1 has several NA values, and I want to fill them using a linear regression of var1 on var2, fitted to the rows where var1 is present.

Please help, and let me know if you need more information.

Answer 1:

Here is an example using lm to predict values in R.

library(dplyr)

# Construct linear model based on non-NA pairs
df2 <- df %>% filter(!is.na(var1))

fit <- lm(var1 ~ var2, data = df2)

# See the result
summary(fit)

# Call:
#   lm(formula = var1 ~ var2, data = df2)
# 
# Residuals:
#   1          2          3          4          5          6 
# 8.627e-15 -2.388e-15  1.546e-16 -9.658e-15 -2.322e-15  5.587e-15 
# 
# Coefficients:
#   Estimate Std. Error   t value Pr(>|t|)    
# (Intercept) 2.321e-14  5.619e-15 4.130e+00   0.0145 *  
#   var2        6.667e-01  4.411e-17 1.511e+16   <2e-16 ***
#   ---
#   Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 7.246e-15 on 4 degrees of freedom
# Multiple R-squared:      1,   Adjusted R-squared:      1 
# F-statistic: 2.284e+32 on 1 and 4 DF,  p-value: < 2.2e-16
# 
# Warning message:
#   In summary.lm(fit) : essentially perfect fit: summary may be unreliable

# Use fit to predict the value
df3 <- df %>% 
  mutate(pred = predict(fit, .)) %>%
  # Replace NA with pred in var1
  mutate(var1 = ifelse(is.na(var1), pred, var1))

# See the result
df3 %>% as.data.frame()

#    time  var1   var2  pred
# 1    15  20.4  30.60  20.4
# 2    16  31.5  47.25  31.5
# 3    17  42.6  63.90  42.6
# 4    18  53.7  80.55  53.7
# 5    19  64.8  97.20  64.8
# 6    20  75.9 113.85  75.9
# 7    21  87.0 130.50  87.0
# 8    22  98.1 147.15  98.1
# 9    23 109.2 163.80 109.2
# 10   24 120.3 180.45 120.3
# 11   25 131.4 197.10 131.4
# 12   26 142.5 213.75 142.5

Answer 2:

Here is a one-liner using the approx function from base R:

newvar1<-approx(df$time, df$var1, xout=df$time)

This function interpolates linearly between neighboring points, as opposed to www's answer, which fits a single regression line across all of the points. With this data, both solutions give the same results because time and var1 have a perfect linear relationship; that will not always be the case.
The xout argument specifies where to estimate the new values; in this case I am passing the original time vector.
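
As a minimal sketch of the one-liner in use: approx returns a list with components x and y, so the filled values have to be taken from y. The column name var1_interp is only an illustrative choice and not part of the original answer.

```r
# Rebuild the relevant columns of the question's example data
df <- data.frame(
  time = 15:26,
  var1 = c(20.4, 31.5, NA, 53.7, 64.8, NA, NA, NA, NA, 120.3, NA, 142.5)
)

# approx() ignores the NA pairs and interpolates linearly
# between the neighboring non-missing points
interp <- approx(df$time, df$var1, xout = df$time)

# The result is a list; the interpolated values are in the y component
df$var1_interp <- interp$y  # var1_interp is an illustrative column name
```

Note that with the default rule = 1, approx returns NA for any xout outside the range of the observed points, so leading or trailing NA values would stay missing; rule = 2 uses the nearest observed value instead.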

Related: See the spline function for a cubic approximation.
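
A rough sketch of what the spline variant could look like, assuming the same example data. The NA rows are filtered out explicitly before the call, and var1_spline is an illustrative column name.

```r
# Rebuild the relevant columns of the question's example data
df <- data.frame(
  time = 15:26,
  var1 = c(20.4, 31.5, NA, 53.7, 64.8, NA, NA, NA, NA, 120.3, NA, 142.5)
)

# Keep only the rows where var1 is observed
ok <- !is.na(df$var1)

# Fit a cubic spline through the observed points and evaluate it
# at every time value, including those where var1 is missing
sp <- spline(df$time[ok], df$var1[ok], xout = df$time)

# Like approx(), spline() returns a list with x and y components
df$var1_spline <- sp$y  # var1_spline is an illustrative column name
```

Because the example data are exactly linear, the spline should reproduce the same values as approx here; the two would differ on data with curvature.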

Answer 3:

I realize this is an old question, but this might be a useful brute-force technique.

Generate your linear model:

fit <- lm(var1 ~ var2, data = df)

Save the coefficients into an object using coef():

fit.c <- coef(fit)
fit.c

Use those coefficients to generate predicted values as a new variable. The bracketed numbers index into the vector fit.c: fit.c[1] is the intercept and fit.c[2] is the slope for var2.

df$pred <- fit.c[1] + fit.c[2]*df$var2

At this point you can replace the NA values in the original variable:

# Index pred with the same mask so each NA gets the prediction for its own row
df$var1[is.na(df$var1)] <- df$pred[is.na(df$var1)]

However, my instinct is not to overwrite values in your original variable, and instead to use pred for whatever purpose you had planned for var1.
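
Following that suggestion, here is a small sketch that leaves var1 untouched and writes the filled values to a separate column. It uses predict rather than the manual coefficient arithmetic, and var1_filled is an illustrative column name, not something from the question.

```r
# Rebuild the question's example data as a plain data frame
df <- data.frame(
  time = 15:26,
  var1 = c(20.4, 31.5, NA, 53.7, 64.8, NA, NA, NA, NA, 120.3, NA, 142.5),
  var2 = c(30.6, 47.25, 63.9, 80.55, 97.2, 113.85,
           130.5, 147.15, 163.8, 180.45, 197.1, 213.75)
)

# lm() drops the rows with NA in var1 under the default na.action
fit <- lm(var1 ~ var2, data = df)

# Predict for every row, including those where var1 is missing
df$pred <- predict(fit, newdata = df)

# Keep observed values and use the prediction only where var1 is NA;
# the original var1 column is left unchanged
df$var1_filled <- ifelse(is.na(df$var1), df$pred, df$var1)
```

This keeps the observed and imputed values distinguishable: is.na(df$var1) still marks which rows were filled.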

Source: https://stackoverflow.com/questions/49634504/filling-na-using-linear-regression-in-r

Tags

linear-regression