Linear model singular because of large integer datetime in R?

Posted by 前提是你 on 2020-01-11 12:09:11

Question


A simple regression of random normal values on a datetime fails, but the same data with small integers in place of the datetimes works as expected.

# Example dataset with 100 observations at 2 second intervals.
set.seed(1)
df <- data.frame(x=as.POSIXct("2017-03-14 09:00:00") + seq(0, 199, 2),
                 y=rnorm(100))

#> head(df)
#                     x          y
# 1 2017-03-14 09:00:00 -0.6264538
# 2 2017-03-14 09:00:02  0.1836433
# 3 2017-03-14 09:00:04 -0.8356286

# Simple regression model.
m <- lm(y ~ x, data=df)

The slope is missing due to singularities in the data. Calling the summary demonstrates this:

summary(m)

# Coefficients: (1 not defined because of singularities)
#             Estimate Std. Error t value Pr(>|t|)
# (Intercept)  0.10889    0.08982   1.212    0.228
# x                 NA         NA      NA       NA

Could this be because of the POSIXct class?

# Convert date variable to integer.
df$x2 <- as.integer(df$x)
lm(y ~ x2, data=df)

# Coefficients:
# (Intercept)           x2  
#      0.1089           NA

Nope, coefficient for x2 still missing.

What if we make the baseline of x2 zero?

# Subtract minimum of x.
df$x3 <- df$x2 - min(df$x2)
lm(y ~ x3, data=df)

# Coefficients:
# (Intercept)           x3  
#   0.1312147   -0.0002255

This works!

One more example to rule out that this is due to the datetime class itself.

# Subtract large constant from date (data is now from 1985).
df$x4 <- df$x - 1000000000
lm(y ~ x4, data=df)

# Coefficients:
# (Intercept)           x4  
#   1.104e+05   -2.255e-04

Not expected (why would an identical dataset with 30 years difference cause different behaviour?), but this works too.

Could be that .Machine$integer.max (2147483647 on my PC) has something to do with it, but I can't figure it out. It would be greatly appreciated if someone could explain what's going on here.


Answer 1:


Yes, the large values could be the cause. QR factorization is numerically stable, but it is not all-powerful.

X <- cbind(1, 1e+11 + 1:10000)
qr(X)$rank
# 1

Here X is like the model matrix for your linear regression model: there is an all-ones column for the intercept and a sequence for the datetime (note the large offset).
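One way to see why the rank comes out as 1 is to compare the second column's norm before and after the intercept is projected out. As far as I understand R's default (LINPACK-based) QR, a column is treated as linearly dependent when its reduced norm falls below tol (1e-07 by default) times its original norm. The sketch below is only an illustration of that criterion, not the internal code; rel_norm is a hypothetical helper, and tz = "UTC" is an assumption added for reproducibility.

```r
# Relative size of a column once the intercept has been projected out:
# ||v - mean(v)|| / ||v||. (rel_norm is a hypothetical helper name.)
rel_norm <- function(v) sqrt(sum((v - mean(v))^2)) / sqrt(sum(v^2))

# Same timestamps as in the question, as seconds since the epoch.
# tz = "UTC" is an assumption; the question used the local time zone.
x <- as.numeric(as.POSIXct("2017-03-14 09:00:00", tz = "UTC")) + seq(0, 199, 2)

rel_norm(x)        # about 3.9e-08, below the default tol of 1e-07
rel_norm(x - 1e9)  # about 1.2e-07, just above it

# Tightening the tolerance lets qr() keep the second column.
X <- cbind(1, 1e+11 + 1:10000)
qr(X, tol = 1e-10)$rank
```

Under this reading, the 1985 version of the data only works because subtracting 1e9 pushes the ratio barely past the default tolerance, so it has nothing to do with .Machine$integer.max.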

If you center the datetime column, these two columns will be orthogonal and hence very stable (even when solving the normal equations directly!).
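A minimal sketch of the centering fix, applied to the data from the question. The column name xc and tz = "UTC" are assumptions introduced here for illustration.

```r
# Rebuild the example data from the question.
# tz = "UTC" is an assumption; the question used the local time zone.
set.seed(1)
df <- data.frame(x = as.POSIXct("2017-03-14 09:00:00", tz = "UTC") + seq(0, 199, 2),
                 y = rnorm(100))

# Center the datetime (in seconds) around its mean; xc is a hypothetical name.
secs <- as.numeric(df$x)
df$xc <- secs - mean(secs)

# The centered column is orthogonal to the all-ones intercept column.
sum(df$xc)

# Both coefficients are now estimated; the slope is per second,
# and the intercept is the mean of y.
m <- lm(y ~ xc, data = df)
coef(m)
```

Centering only moves the intercept: the slope should agree with the x3 fit in the question, while the intercept becomes the fitted value at the mean datetime.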



Source: https://stackoverflow.com/questions/42781741/linear-model-singular-because-of-large-integer-datetime-in-r
