Question
Following the advice of this post on Analyzing trends in data with pandas, I have used numpy's polyfit on several of my datasets. However, it does not let me tell when there is a trend and when there isn't. I wonder what I am misunderstanding.
First, the code is the following:
import pandas
import matplotlib.pyplot as plt
import numpy as np
file="data.csv"
df= pandas.read_csv(file,delimiter=',',header=0)
selected=df.loc[(df.index>25)&(df.index<613)]
xx=np.arange(25,612)
y= selected[selected.columns[1]].values
df.plot()
plt.plot(xx,y)
plt.xlabel("seconds")
coefficients, residuals, _, _, _ = np.polyfit(range(25,25+len(y)),y,1,full=True)
plt.plot(xx,[coefficients[0]*x + coefficients[1] for x in range(25,25+len(y))])
mse = residuals[0]/(len(y))
nrmse = np.sqrt(mse)/(y.max() - y.min())
print('Slope ' + str(coefficients[0]))
print('Degree '+str(np.degrees(np.arctan(coefficients[0]))))
print('NRMSE: ' + str(nrmse))
print('Max-Min '+str((y.max()-y.min())))
I trimmed the first and last 25 points of data. As a result I got the following:
I can clearly see an increasing trend in the data. The results I got were:
Slope 397.78399534197837
Degree 89.85596288567513
NRMSE: 0.010041127178789659
Max-Min 257824
and with this data
I got
Slope 349.74410929666203
Degree 89.83617844631047
NRMSE: 0.1482879344688465
Max-Min 430752
However with this data
I got
Slope 29.414468649823373
Degree 88.05287249703134
NRMSE: 0.3752760050624873
Max-Min 673124
As you can see, this one shows much less of a tendency to increase, so the slope is smaller.
However, here
the slope is large again:
Slope 228.34551214653814
Degree 89.74908456620851
NRMSE: 0.3094116937517223
Max-Min 581600
I can't understand why the slope does not clearly indicate the trends (and the angle in degrees even less so).
A second thing that disconcerts me is that the slope depends on how much the data varies along the Y axis. For example, with data that varies very little, the slope is close to 0:
Slope 0.00017744046645062043
Degree 0.010166589735754468
NRMSE: 0.07312155589459704
Max-Min 11.349999999999998
What is a good way to detect a trend in data, independent of its magnitude?
Answer 1:
The idea is that you compare whether the linear fit shows a significant increase compared to the fluctuation of the data around the fit:
In the bottom panel, you can see that the trend (the fit minus the constant part) exceeds the residuals (defined as the difference between data and fit). What counts as a good criterion for 'significant increase' depends on the type of data and also on how many values you have along the x axis. I suggest taking the root mean square (RMS) of the residuals. If the trend in the fit exceeds some threshold (relative to the RMS of the residuals), you call it a significant trend. A suitable value for the threshold needs to be established by trial and error.
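Written out, and assuming a straight-line fit with slope a_1 and intercept a_0 over N points ordered along x, the criterion can be stated as follows. The symbols r_i, \Delta y_trend and \theta are notation introduced here for illustration; \theta is the trial-and-error threshold.

```latex
\begin{aligned}
r_i &= y_i - (a_1 x_i + a_0), \\
\mathrm{RMS} &= \sqrt{\frac{1}{N}\sum_{i=1}^{N} r_i^{2}}, \\
\Delta y_{\text{trend}} &= a_1\,(x_N - x_1), \\
\text{significant trend} &\iff \Delta y_{\text{trend}} > \theta \cdot \mathrm{RMS}.
\end{aligned}
```

Both sides of the inequality carry the units of y, so multiplying all y values by a constant factor leaves the outcome unchanged, unlike the raw slope or its arctangent.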
Here is the code generating the plots above:
import numpy as np
import matplotlib.pyplot as plt
# example data
x = np.arange(25, 600)
y = 1.76e7 + 3e5/600*x + 1e5*np.sin(x*0.2)
y += np.random.normal(scale=3e4, size=x.shape)
# process
a1, a0 = np.polyfit(x, y, 1)
resid = y - (a1*x + a0) # array
rms = np.sqrt((resid**2).mean())
plt.close('all')
fig, ax = plt.subplots(2, 1)
ax[0].plot(x, y, label='data')
ax[0].plot(x, a1*x+a0, label='fit')
ax[0].legend()
ax[1].plot(x, resid, label='residual')
ax[1].plot(x, a1*(x-x[0]), label='trend')
ax[1].legend()
dy_trend = a1*(x[-1] - x[0])
threshold = 0.3
print(f'dy_trend={dy_trend:.3g}; rms={rms:.3g}')
if dy_trend > threshold*rms:
    print('Significant trend')
Output:
dy_trend=2.87e+05; rms=7.76e+04
Significant trend
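To address the "independent of its magnitude" part of the question, the same criterion can be wrapped in a small function that returns the dimensionless ratio of trend to RMS. This is only a sketch: the name trend_ratio and the synthetic data are made up for illustration, x is assumed to be sorted in increasing order, and the residuals are assumed not to be all zero.

```python
import numpy as np

def trend_ratio(x, y):
    """Rise of the linear fit over the x range, divided by the RMS of the residuals.

    Hypothetical helper, not part of numpy. Assumes x is sorted in increasing order.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    a1, a0 = np.polyfit(x, y, 1)         # slope and intercept of the linear fit
    resid = y - (a1 * x + a0)            # fluctuation of the data around the fit
    rms = np.sqrt(np.mean(resid ** 2))   # undefined ratio if the fit is perfect (rms == 0)
    dy_trend = a1 * (x[-1] - x[0])       # total rise of the fit over the data range
    return dy_trend / rms

# Synthetic example data (assumed values, for illustration only)
rng = np.random.default_rng(0)
x = np.arange(25, 600)
y_small = 0.01 * x + rng.normal(scale=1.0, size=x.shape)
y_large = 1e5 * y_small                  # same shape, much larger magnitude

print(trend_ratio(x, y_small), trend_ratio(x, y_large))
```

The two printed ratios agree up to floating-point error, because rescaling y multiplies both the fitted rise and the residuals by the same factor. The ratio is negative for a decreasing trend, so compare its absolute value against the threshold if trends in both directions matter.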
Source: https://stackoverflow.com/questions/62533013/why-is-slope-not-a-good-measure-of-trends-for-data