Question
Original question
I want to smooth my explanatory variable, for example the speed data of a vehicle, and then use the smoothed values. I searched a lot but found nothing that directly answers this.
I know how to calculate a kernel density estimate (density() or KernSmooth::bkde()), but I don't know how to then calculate the smoothed values of speed.
Re-edited question
Thanks to @ZheyuanLi, I am able to better explain what I have and what I want to do. So I have re-edited my question as below.
I have speed measurements of a vehicle over time, stored as a data frame vehicle:
t speed
1 0 0.0000000
2 1 0.0000000
3 2 0.0000000
4 3 0.0000000
5 4 0.0000000
. . .
. . .
1031 1030 4.8772222
1032 1031 4.4525000
1033 1032 3.2261111
1034 1033 1.8011111
1035 1034 0.2997222
1036 1035 0.2997222
Here is a scatter plot:
I want to smooth speed against t, and I want to use kernel smoothing for this purpose. According to @Zheyuan's advice, I should use ksmooth():
fit <- ksmooth(vehicle$t, vehicle$speed)
However, I found that the smoothed values are exactly the same as my original data:
sum(abs(fit$y - vehicle$speed)) # 0
Why is this happening? Thanks!
Answer 1:
Answer to old question
You need to distinguish between "kernel density estimation" and "kernel smoothing".
Density estimation works with a single variable only. It aims to estimate how that variable is distributed over its domain. For example, if we have 1000 normal samples:
x <- rnorm(1000, 0, 1)
We can assess its distribution with a kernel density estimator:
k <- density(x)
plot(k); rug(x)
The rug marks on the x-axis show the locations of your x values, while the curve measures the density of those marks.
Kernel smoothing is actually a regression problem, or scatter plot smoothing problem. You need two variables: a response variable y and an explanatory variable x. Let's use the x we have above as the explanatory variable. For the response variable y, we generate some toy values from
y <- sin(x) + rnorm(1000, 0, 0.2)
Given the scatter plot of y against x, we want to find a smooth function to approximate those scattered dots.
The Nadaraya-Watson kernel regression estimate, via the R function ksmooth(), will help you:
s <- ksmooth(x, y, kernel = "normal")
plot(x,y, main = "kernel smoother")
lines(s, lwd = 2, col = 2)
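To make the idea of a locally weighted average concrete, here is a minimal hand-rolled sketch of a Nadaraya-Watson estimate with a Gaussian kernel. The function name nw_smooth and the parameter h are hypothetical, and h is simply the kernel's standard deviation; ksmooth() scales its bandwidth differently (its kernels have quartiles at ±0.25 * bandwidth), so this is not meant to reproduce ksmooth() output exactly.

```r
# Hand-rolled Nadaraya-Watson estimate with a Gaussian kernel.
# `nw_smooth` and `h` are hypothetical names; `h` is the kernel's standard
# deviation, which is NOT on the same scale as ksmooth()'s `bandwidth`.
nw_smooth <- function(x, y, x0, h) {
  sapply(x0, function(p) {
    w <- dnorm((x - p) / h)   # weight of each observation for target point p
    sum(w * y) / sum(w)       # locally weighted average of y
  })
}

set.seed(1)
x <- rnorm(1000, 0, 1)
y <- sin(x) + rnorm(1000, 0, 0.2)

grid <- seq(-2, 2, length.out = 50)        # points at which to evaluate the fit
yhat <- nw_smooth(x, y, grid, h = 0.2)

plot(x, y, main = "hand-rolled kernel smoother")
lines(grid, yhat, lwd = 2, col = 2)
```

Each fitted value is just a weighted mean of the y values, with weights that decay as observations get further from the target point.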
If you want to interpret everything in terms of prediction:
- kernel density estimation: given x, predict the density of x; that is, we have an estimate of the probability P(grid[n] < x < grid[n+1]), where grid is some grid points;
- kernel smoothing: given x, predict y; that is, we have an estimate of the function f(x), which approximates y.
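The two kinds of "prediction" in the list above can be compared side by side. The sketch below regenerates toy data like the data used earlier (the seed is an arbitrary choice) and evaluates both estimators at x = 0; the variable names are only for illustration.

```r
set.seed(1)
x <- rnorm(1000, 0, 1)
y <- sin(x) + rnorm(1000, 0, 0.2)

# Kernel density estimation: the "prediction" at x = 0 is a density value.
k <- density(x)
dens_at_0 <- approx(k$x, k$y, xout = 0)$y   # interpolate on the density grid

# Density times grid spacing approximates the probability of each grid cell,
# so summing over all cells should give roughly 1.
total_prob <- sum(k$y) * diff(k$x[1:2])

# Kernel smoothing: the prediction at x = 0 is an estimate of f(0).
s <- ksmooth(x, y, kernel = "normal", bandwidth = 0.5, x.points = 0)
y_at_0 <- s$y

dens_at_0    # should be close to dnorm(0), about 0.4
total_prob   # should be close to 1
y_at_0       # should be close to sin(0) = 0
```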
In both cases, you have no smoothed value of the explanatory variable x. So your question, "I want to smooth my explanatory variable", makes no sense.
Do you actually have a time series?
"Speed of a vehicle" sounds like you are monitoring speed over time t. If so, get a scatter plot of speed against t, and use ksmooth().
Other smoothing approaches like loess() and smooth.spline() are not of the kernel smoothing class, but you can compare them.
Answer 2:
Answer on re-edited question
The default bandwidth for ksmooth() is 0.5:
ksmooth(x, y, kernel = c("box", "normal"), bandwidth = 0.5,
range.x = range(x),
n.points = max(100L, length(x)), x.points)
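For your time series data, sampled at unit intervals, this means there is no other speed data in the neighbourhood of time t = i except speed[i]: the default box kernel only reaches bandwidth / 2 = 0.25 on either side, i.e. (i - 0.25, i + 0.25). As a result, no local weighted averaging is done!
You need to choose a larger bandwidth. For the box kernel, bandwidth is the full width of the window (the kernels are scaled so that their quartiles are at ±0.25 * bandwidth), so bandwidth = 10 averages over roughly 10 neighbouring values, about 5 on either side. This is what we get: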
fit <- ksmooth(vehicle$t, vehicle$speed, bandwidth = 10)
plot(vehicle, cex = 0.5)
lines(fit,col=2,lwd = 2)
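The effect is easy to reproduce without the original data. The sketch below uses a hypothetical stand-in for vehicle (one noisy reading per time unit; the names t_sec and speed and the signal shape are assumptions) and compares the default bandwidth with a wider one.

```r
set.seed(42)
# Hypothetical stand-in for `vehicle`: one noisy reading per time unit.
t_sec <- 0:199
speed <- 10 + 5 * sin(t_sec / 20) + rnorm(length(t_sec), 0, 1)

# Default bandwidth = 0.5: no neighbour falls inside the window,
# so every fitted value is just the observation itself.
fit_default <- ksmooth(t_sec, speed)
sum(abs(fit_default$y - speed))      # essentially zero

# Wider window: each fitted value averages several neighbouring readings.
fit_wide <- ksmooth(t_sec, speed, bandwidth = 10)
sum(abs(fit_wide$y - speed))         # clearly positive

plot(t_sec, speed, cex = 0.5)
lines(fit_wide, col = 2, lwd = 2)
```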
Smoothness selection
One problem with ksmooth() is that you must set bandwidth yourself. You can see that this parameter shapes the fitted curve greatly. A large bandwidth makes the curve smooth but far from the data, while a small bandwidth does the reverse.
Is there an optimal bandwidth? Is there a way to select the best one?
Yes, use sm.regression() from the sm package, with the cross-validation method for selecting the bandwidth.
fit <- sm.regression(vehicle$t, vehicle$speed, method = "cv", eval.points = 0:1035)
## plot will be automatically generated!
You can check that fit$h is 18.7.
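To see what cross-validation for a bandwidth involves, here is a base-R sketch of leave-one-out CV for a Gaussian-kernel smoother. It illustrates the idea only and is not a description of how sm.regression() is implemented; the function loo_cv, the grid h_grid, and the synthetic data are all assumptions.

```r
set.seed(42)
# Hypothetical stand-in for `vehicle`
t_sec <- 0:199
speed <- 10 + 5 * sin(t_sec / 20) + rnorm(length(t_sec), 0, 1)

# Leave-one-out CV score for a Gaussian-kernel smoother with kernel sd `h`
loo_cv <- function(h, x, y) {
  w <- dnorm(outer(x, x, "-") / h)          # pairwise kernel weights
  diag(w) <- 0                              # leave each point out of its own fit
  pred <- as.vector(w %*% y) / rowSums(w)   # leave-one-out predictions
  mean((y - pred)^2)                        # mean squared prediction error
}

h_grid <- seq(1, 30, by = 0.5)              # candidate bandwidths
cv <- sapply(h_grid, loo_cv, x = t_sec, y = speed)
h_best <- h_grid[which.min(cv)]
h_best

plot(h_grid, cv, type = "b", xlab = "h", ylab = "leave-one-out CV score")
abline(v = h_best, lty = 2)
```

A bandwidth that is too small gives noisy predictions and one that is too large gives biased ones, so the CV score is typically lowest somewhere in between.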
Other approach
Perhaps you think sm.regression() oversmooths your data? Well, use loess(), or my favourite: smooth.spline().
I have answered related questions:
- regarding smooth.spline(), at "smooth.spline(): fitted model does not match user-specified degree of freedom"; this one is very technical!
- regarding smooth.spline(), at "R smooth.spline(): smoothing spline is not smooth but overfitting my data"; this one is practical modelling.
- regarding loess(), at "Problems displaying LOESS regression line and confidence interval"; this one is about general use of loess().
Here, I will demonstrate the use of smooth.spline():
fit <- smooth.spline(vehicle$t, vehicle$speed, all.knots = TRUE, control.spar = list(low = -2, high = 2))
# Call:
# smooth.spline(x = vehicle$t, y = vehicle$speed, all.knots = TRUE,
# control.spar = list(low = -2, high = 2))
# Smoothing Parameter spar= 0.2519922 lambda= 4.379673e-11 (14 iterations)
# Equivalent Degrees of Freedom (Df): 736.0882
# Penalized Criterion: 3.356859
# GCV: 0.03866391
plot(vehicle, cex = 0.5)
lines(fit$x, fit$y, col = 2, lwd = 2)
Or using its regression spline version:
fit <- smooth.spline(vehicle$t, vehicle$speed, nknots = 200)
plot(vehicle, cex = 0.5)
lines(fit$x, fit$y, col = 2, lwd = 2)
You really need to read my first link above to understand why I use control.spar in the first case but not in the second.
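loess() is recommended above but not demonstrated, so here is a minimal sketch on synthetic data. The data frame vehicle_demo and the choice span = 0.3 are assumptions for illustration; a suitable span for real data would need to be tuned.

```r
set.seed(42)
# Hypothetical stand-in for `vehicle`
vehicle_demo <- data.frame(t = 0:199)
vehicle_demo$speed <- 10 + 5 * sin(vehicle_demo$t / 20) + rnorm(200, 0, 1)

# `span` is the fraction of the data used in each local fit:
# larger values give a smoother curve.
fit_lo <- loess(speed ~ t, data = vehicle_demo, span = 0.3)
smoothed <- predict(fit_lo)    # fitted values at the observed t

plot(vehicle_demo, cex = 0.5)
lines(vehicle_demo$t, smoothed, col = 2, lwd = 2)
```

The smoothed vector has one value per observation, so it can be stored alongside the original speed column and used in later modelling.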
More powerful package
I would definitely recommend mgcv. I have several answers regarding mgcv, but I don't want to overwhelm you, so I will not expand on it here. Learn to use ksmooth(), smooth.spline() and loess() well. In future, when you meet a more complicated problem, come back to Stack Overflow and ask for help!
Source: https://stackoverflow.com/questions/37952793/scatter-plot-kernel-smoothing-ksmooth-does-not-smooth-my-data-at-all