Question
I have a data set with one time column and two variables (example below).
df <- structure(list(time = c(15, 16, 17, 18, 19, 20, 21, 22, 23, 24,
25, 26), var1 = c(20.4, 31.5, NA, 53.7, 64.8, NA, NA, NA, NA,
120.3, NA, 142.5), var2 = c(30.6, 47.25, 63.9, 80.55, 97.2, 113.85,
130.5, 147.15, 163.8, 180.45, 197.1, 213.75)), .Names = c("time",
"var1", "var2"), row.names = c(NA, -12L), class = c("tbl_df",
"tbl", "data.frame"))
var1 has a few NA values, and I want to fill them using a linear regression between the remaining values of var1 and var2.
Please help! Let me know if you need more information.
Answer 1:
Here is an example using lm to predict values in R.
library(dplyr)
# Construct linear model based on non-NA pairs
df2 <- df %>% filter(!is.na(var1))
fit <- lm(var1 ~ var2, data = df2)
# See the result
summary(fit)
# Call:
# lm(formula = var1 ~ var2, data = df2)
#
# Residuals:
# 1 2 3 4 5 6
# 8.627e-15 -2.388e-15 1.546e-16 -9.658e-15 -2.322e-15 5.587e-15
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 2.321e-14 5.619e-15 4.130e+00 0.0145 *
# var2 6.667e-01 4.411e-17 1.511e+16 <2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 7.246e-15 on 4 degrees of freedom
# Multiple R-squared: 1, Adjusted R-squared: 1
# F-statistic: 2.284e+32 on 1 and 4 DF, p-value: < 2.2e-16
#
# Warning message:
# In summary.lm(fit) : essentially perfect fit: summary may be unreliable
# Use fit to predict the value
df3 <- df %>%
  mutate(pred = predict(fit, .)) %>%
  # Replace NA with pred in var1
  mutate(var1 = ifelse(is.na(var1), pred, var1))
# See the result
df3 %>% as.data.frame()
#    time  var1   var2  pred
# 1    15  20.4  30.60  20.4
# 2    16  31.5  47.25  31.5
# 3    17  42.6  63.90  42.6
# 4    18  53.7  80.55  53.7
# 5    19  64.8  97.20  64.8
# 6    20  75.9 113.85  75.9
# 7    21  87.0 130.50  87.0
# 8    22  98.1 147.15  98.1
# 9    23 109.2 163.80 109.2
# 10   24 120.3 180.45 120.3
# 11   25 131.4 197.10 131.4
# 12   26 142.5 213.75 142.5
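For readers who would rather not depend on dplyr, the same idea can be sketched in base R. This is a minimal sketch, not part of the original answer. It assumes R's default na.action (na.omit), so that lm silently drops the rows where var1 is NA. The names df_filled and missing_rows are made up for illustration.

```r
# Same data as in the question, rebuilt as a plain data.frame
df <- data.frame(
  time = 15:26,
  var1 = c(20.4, 31.5, NA, 53.7, 64.8, NA, NA, NA, NA, 120.3, NA, 142.5),
  var2 = c(30.6, 47.25, 63.9, 80.55, 97.2, 113.85,
           130.5, 147.15, 163.8, 180.45, 197.1, 213.75)
)

# With the default na.action, rows with NA in var1 are excluded from the fit
fit <- lm(var1 ~ var2, data = df)

# Predict only for the rows that are missing, and fill a copy of the data
missing_rows <- is.na(df$var1)
df_filled <- df
df_filled$var1[missing_rows] <- predict(fit, newdata = df[missing_rows, ])

df_filled
```

Predicting only for the missing rows avoids the intermediate pred column, and working on a copy leaves the original df untouched.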
Answer 2:
Here is a one-liner using the approx function from base R:
newvar1<-approx(df$time, df$var1, xout=df$time)
This function applies a linear approximation between neighboring points, as opposed to www's lm-based answer, which fits a single line across all of the points. With this data, both solutions give the same results because time and var1 have a perfect linear relationship, but that will not always be the case.
The xout option specifies the locations at which to estimate the new values; in this case I am passing the original time vector.
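Note that approx returns a list with components x and y rather than a plain vector, so the interpolated values still have to be pulled out before they can be used as a column. A minimal sketch of that step follows; the column name var1_interp is made up for illustration.

```r
# Same data as in the question, rebuilt as a plain data.frame
df <- data.frame(
  time = 15:26,
  var1 = c(20.4, 31.5, NA, 53.7, 64.8, NA, NA, NA, NA, 120.3, NA, 142.5),
  var2 = c(30.6, 47.25, 63.9, 80.55, 97.2, 113.85,
           130.5, 147.15, 163.8, 180.45, 197.1, 213.75)
)

# approx() drops the (time, var1) pairs where var1 is NA, then interpolates
# linearly between the remaining neighbours at every location in xout
interp <- approx(df$time, df$var1, xout = df$time)

# interp$x echoes xout; interp$y holds the interpolated values
df$var1_interp <- interp$y

df
```

With the default rule = 1, any xout value outside the range of the observed points comes back as NA. That does not happen here because the first and last values of var1 are both present.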
Related: see the spline function for a cubic approximation.
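A spline version could look like the sketch below. It is an illustration rather than part of the original answer: the NA rows are filtered out explicitly before calling spline so that it does not rely on how spline handles missing values, and the names ok and var1_spline are hypothetical.

```r
# Same data as in the question, rebuilt as a plain data.frame
df <- data.frame(
  time = 15:26,
  var1 = c(20.4, 31.5, NA, 53.7, 64.8, NA, NA, NA, NA, 120.3, NA, 142.5),
  var2 = c(30.6, 47.25, 63.9, 80.55, 97.2, 113.85,
           130.5, 147.15, 163.8, 180.45, 197.1, 213.75)
)

# Keep only the observed (time, var1) pairs as spline knots
ok <- !is.na(df$var1)

# Evaluate the cubic spline at every original time point
sp <- spline(df$time[ok], df$var1[ok], xout = df$time)

# Like approx(), spline() returns a list with components x and y
df$var1_spline <- sp$y

df
```

On data that is not perfectly linear, the spline and approx results will generally differ between the observed points.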
Answer 3:
I realize this is an old question, but this might be a useful brute-force technique.
Generate your linear model:
fit <- lm(var1 ~ var2, data = df)
Save the coefficients into an object using coef():
fit.c <- coef(fit)
fit.c
Use those coefficients to generate a predicted value as a new variable. The bracketed numbers indicate the position of each coefficient in the vector fit.c: fit.c[1] is the intercept and fit.c[2] is the slope for var2.
df$pred <- fit.c[1] + fit.c[2]*df$var2
You may at this point replace the NA values in the original variable:
# Index pred with the same NA mask so each missing row gets its own prediction
df$var1[is.na(df$var1)] <- df$pred[is.na(df$var1)]
However, my instincts say not to overwrite values in your original variable, and instead to use pred for whatever purpose you had planned for var1.
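One way to follow that advice is to write the filled values into a separate column and keep a flag marking which rows were imputed. This is only a sketch of the idea. The column names var1_filled and imputed are made up for illustration.

```r
# Same data as in the question, rebuilt as a plain data.frame
df <- data.frame(
  time = 15:26,
  var1 = c(20.4, 31.5, NA, 53.7, 64.8, NA, NA, NA, NA, 120.3, NA, 142.5),
  var2 = c(30.6, 47.25, 63.9, 80.55, 97.2, 113.85,
           130.5, 147.15, 163.8, 180.45, 197.1, 213.75)
)

# Fit the model and compute predictions from the coefficients, as above
fit <- lm(var1 ~ var2, data = df)
fit.c <- coef(fit)
df$pred <- fit.c[1] + fit.c[2] * df$var2

# Record which rows were missing, then build a filled copy of var1
df$imputed <- is.na(df$var1)
df$var1_filled <- ifelse(df$imputed, df$pred, df$var1)

df
```

The original var1 keeps its NA values, so observed and imputed rows can still be told apart later.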
Source: https://stackoverflow.com/questions/49634504/filling-na-using-linear-regression-in-r