Question
I have 54 points. They represent offer and demand for products. I would like to show that there is a break point in the offer.
First, I sort the x-axis values (offer) and remove the duplicates. That leaves 47 values, from which I drop the first and last ones (it doesn't make sense to consider them as break points), so Break has length 45:
Break<-(sort(unique(offer))[2:46])
Then, for each of these potential break points, I estimate a model and I keep in "d" the residual standard error (sixth element in model summary object).
d<-numeric(45)
for (i in 1:45) {
model<-lm(demand~(offer<Break[i])*offer + (offer>=Break[i])*offer)
d[i]<-summary(model)[[6]] }
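A compact variant of the same search, offered only as a sketch: it uses sapply instead of a preallocated loop and reads the residual standard error by name (summary(fit)$sigma) rather than by position. The names breaks, rse and best are illustrative, and the formula demand ~ offer * (offer < b) is assumed here as a shorter way of writing the same two-line model.

```r
# Data from the question
demand <- c(1155, 362, 357, 111, 703, 494, 410, 63, 616, 468, 973, 235,
            180, 69, 305, 106, 155, 422, 44, 1008, 225, 321, 1001, 531, 143,
            251, 216, 57, 146, 226, 169, 32, 75, 102, 4, 68, 102, 462, 295,
            196, 50, 739, 287, 226, 706, 127, 85, 234, 153, 4, 373, 54, 81,
            18)
offer <- c(39.3, 23.5, 22.4, 6.1, 35.9, 35.5, 23.2, 9.1, 27.5, 28.6, 41.3,
           16.9, 18.2, 9, 28.6, 12.7, 11.8, 27.9, 21.6, 45.9, 11.4, 16.6,
           40.7, 22.4, 17.4, 14.3, 14.6, 6.6, 10.6, 14.3, 3.4, 5.1, 4.1,
           4.1, 1.7, 7.5, 7.8, 22.6, 8.6, 7.7, 7.8, 34.7, 15.6, 18.5, 35,
           16.5, 11.3, 7.7, 14.8, 2, 12.4, 9.2, 11.8, 3.9)

# Candidate break points: unique sorted offers, without the two extremes
breaks <- sort(unique(offer))
breaks <- breaks[-c(1, length(breaks))]

# Residual standard error of the two-line model at each candidate break
rse <- sapply(breaks, function(b) {
  fit <- lm(demand ~ offer * (offer < b))
  summary(fit)$sigma
})

# Candidate with the smallest residual standard error
best <- breaks[which.min(rse)]

plot(breaks, rse, type = "b",
     xlab = "break point", ylab = "residual standard error")
abline(v = best, lty = 2)
```

Plotting against the break value rather than its index makes the chosen break readable directly from the x-axis.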
Plotting d, I notice that the smallest residual standard error is at index 34, which corresponds to "Break[34]": 22.4. So I write my model with this final break point:
model<-lm(demand~(offer<22.4)*offer + (offer>=22.4)*offer)
Finally, I'm happy with my new model. It's significantly better than the simple linear one. And I want to draw it:
plot(demand~offer)
i <- order(offer)
lines(offer[i], predict(model,list(offer))[i])
But I have a warning message:
Warning message:
In predict.lm(model, list(offer)) :
prediction from a rank-deficient fit may be misleading
More importantly, the lines on my plot look really strange.
Here are my data:
demand <- c(1155, 362, 357, 111, 703, 494, 410, 63, 616, 468, 973, 235,
180, 69, 305, 106, 155, 422, 44, 1008, 225, 321, 1001, 531, 143,
251, 216, 57, 146, 226, 169, 32, 75, 102, 4, 68, 102, 462, 295,
196, 50, 739, 287, 226, 706, 127, 85, 234, 153, 4, 373, 54, 81,
18)
offer <- c(39.3, 23.5, 22.4, 6.1, 35.9, 35.5, 23.2, 9.1, 27.5, 28.6, 41.3,
16.9, 18.2, 9, 28.6, 12.7, 11.8, 27.9, 21.6, 45.9, 11.4, 16.6,
40.7, 22.4, 17.4, 14.3, 14.6, 6.6, 10.6, 14.3, 3.4, 5.1, 4.1,
4.1, 1.7, 7.5, 7.8, 22.6, 8.6, 7.7, 7.8, 34.7, 15.6, 18.5, 35,
16.5, 11.3, 7.7, 14.8, 2, 12.4, 9.2, 11.8, 3.9)
Answer 1:
Here is an easier approach using ggplot2.
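library(ggplot2)
dat <- data.frame(offer, demand)  # the data from the question
# One straight-line fit on each side of the break at 22.4
ggplot(dat, aes(offer, demand, group = offer > 22.4)) +
geom_point() + geom_smooth(method = 'lm', se = FALSE)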
EDIT. I would also recommend taking a look at the package segmented, which supports automatic detection and estimation of segmented regression models.
UPDATE:
Here is an example that makes use of the R package segmented to automatically detect the breaks
library(segmented)
set.seed(12)
xx <- 1:100
zz <- runif(100)
yy <- 2 + 1.5*pmax(xx - 35, 0) - 1.5*pmax(xx - 70, 0) + 15*pmax(zz - .5, 0) +
rnorm(100,0,2)
dati <- data.frame(x = xx, y = yy, z = zz)
out.lm <- lm(y ~ x, data = dati)
o <- segmented(out.lm, seg.Z = ~x, psi = list(x = c(30,60)),
control = seg.control(display = FALSE)
)
dat2 = data.frame(x = xx, y = broken.line(o)$fit)
library(ggplot2)
ggplot(dati, aes(x = x, y = y)) +
geom_point() +
geom_line(data = dat2, color = 'blue')
Answer 2:
Vincent has you on the right track. The only thing "weird" about the lines in your resulting plot is that lines draws a line between each successive point, which means that the "jump" you see is simply the connector between the two ends of each segment.
If you don't want that connector, you have to split the lines call into two separate pieces.
Also, I feel like you can simplify your regression a bit. Here's what I did:
#After reading your data into dat
Break <- 22.4
dat$grp <- dat$offer < Break
#Note the addition of the grp variable makes this a bit easier to read
m <- lm(demand~offer*grp,data = dat)
dat$pred <- predict(m)
plot(dat$offer,dat$demand)
dat <- dat[order(dat$offer),]
with(subset(dat,offer < Break),lines(offer,pred))
with(subset(dat,offer >= Break),lines(offer,pred))
This draws the two fitted segments without the connecting line.
Answer 3:
The weird lines are simply due to the order in which the points are plotted. The following should look better:
i <- order(offer)
lines(offer[i], predict(model,list(offer))[i])
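The warning comes from the fact that the * character is interpreted by lm: in a formula, a*b expands to a + b + a:b. Since offer>=22.4 is just the complement of offer<22.4, its main effect and its interaction with offer are linear combinations of terms already in the model, so their coefficients cannot be estimated (the NA values below).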
> lm(demand~(offer<22.4)*offer + (offer>=22.4)*offer)
Call:
lm(formula = demand ~ (offer < 22.4) * offer + (offer >= 22.4) * offer)
Coefficients:
            (Intercept)         offer < 22.4TRUE                    offer
                -309.46                   356.08                    29.86
      offer >= 22.4TRUE   offer < 22.4TRUE:offer  offer:offer >= 22.4TRUE
                     NA                   -20.79                       NA
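The rank deficiency can be checked directly on the design matrix. The following is a minimal sketch on made-up data (the eight offer and demand values are hypothetical and only serve to illustrate the collinearity); it compares the number of columns the formula requests with the rank of the matrix.

```r
# Made-up data, used only to illustrate the collinearity
offer  <- c(5, 10, 15, 20, 25, 30, 35, 40)
demand <- c(60, 95, 150, 170, 400, 520, 700, 810)

# Design matrix implied by the formula from the question
X <- model.matrix(~ (offer < 22.4) * offer + (offer >= 22.4) * offer)
ncol(X)     # number of coefficients the formula asks for
qr(X)$rank  # number of linearly independent columns

# The second indicator is one minus the first, so it adds no new information
all((offer >= 22.4) == 1 - (offer < 22.4))

# lm() drops the redundant columns and reports their coefficients as NA
fit <- lm(demand ~ (offer < 22.4) * offer + (offer >= 22.4) * offer)
sum(is.na(coef(fit)))
```

The gap between ncol(X) and the rank equals the number of NA coefficients, which is what predict warns about.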
In addition, (offer<22.4)*offer is a discontinuous function: this is where the discontinuity comes from.
The following should be closer to what you want.
model <- lm(
demand ~ ifelse(offer<22.4,offer-22.4,0)
+ ifelse(offer>=22.4,offer-22.4,0) )
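To draw this continuous fit, one option is to predict on a fine grid of offer values instead of the observed points. The sketch below assumes the break at 22.4 from the question and writes the two hinge terms with pmin and pmax, which give the same values as the ifelse calls; the names brk, fit and grid are illustrative.

```r
# Data from the question
demand <- c(1155, 362, 357, 111, 703, 494, 410, 63, 616, 468, 973, 235,
            180, 69, 305, 106, 155, 422, 44, 1008, 225, 321, 1001, 531, 143,
            251, 216, 57, 146, 226, 169, 32, 75, 102, 4, 68, 102, 462, 295,
            196, 50, 739, 287, 226, 706, 127, 85, 234, 153, 4, 373, 54, 81,
            18)
offer <- c(39.3, 23.5, 22.4, 6.1, 35.9, 35.5, 23.2, 9.1, 27.5, 28.6, 41.3,
           16.9, 18.2, 9, 28.6, 12.7, 11.8, 27.9, 21.6, 45.9, 11.4, 16.6,
           40.7, 22.4, 17.4, 14.3, 14.6, 6.6, 10.6, 14.3, 3.4, 5.1, 4.1,
           4.1, 1.7, 7.5, 7.8, 22.6, 8.6, 7.7, 7.8, 34.7, 15.6, 18.5, 35,
           16.5, 11.3, 7.7, 14.8, 2, 12.4, 9.2, 11.8, 3.9)

brk <- 22.4  # break point chosen in the question

# Left hinge is non-zero below the break, right hinge above it
fit <- lm(demand ~ pmin(offer - brk, 0) + pmax(offer - brk, 0))

# Predict on a sorted grid so lines() draws from left to right
grid <- data.frame(offer = seq(min(offer), max(offer), length.out = 200))
grid$pred <- predict(fit, newdata = grid)

plot(demand ~ offer)
lines(grid$offer, grid$pred)
abline(v = brk, lty = 2)  # the two segments meet at the break
```

Because both hinge terms are zero at the break, the fitted value there is just the intercept, so the two segments join without a jump.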
Source: https://stackoverflow.com/questions/8758646/piecewise-regression-with-r-plotting-the-segments