Question
I have somewhere between 10k and 20k different time series (24-dimensional data, one column for each hour of the day), and I'm interested in clustering the ones that exhibit roughly the same patterns of activity.
I had originally started to implement Dynamic Time Warping (DTW) because:
- Not all of my time series are perfectly aligned
- Two slightly shifted time series for my purposes should be considered similar
- Two time series with the same shape but different scales should be considered similar
The only problem I ran into with DTW was that it did not appear to scale well: fastdtw on a 500x500 distance matrix took ~30 minutes.
What other methods exist that would help me satisfy conditions 2 & 3?
Answer 1:
ARIMA can do the job if you decompose the time series into trend, seasonality and residuals. After that, use a K-Nearest Neighbors algorithm to find similar series. However, the computational cost may be high, mainly because of ARIMA.
In ARIMA:
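import numpy as np
# Import paths for recent statsmodels releases (the old statsmodels.tsa.arima_model module was removed)
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.seasonal import seasonal_decompose

# X is a single 1-D time series
model0 = ARIMA(X, order=(2, 1, 0))
model1 = model0.fit()
# Set 'period' to the seasonality of your data (older statsmodels versions called it 'freq')
decomposition = seasonal_decompose(np.asarray(X).ravel(), period=100)
trend = decomposition.trend
seasonal = decomposition.seasonal
residual = decomposition.resid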
To complement @Sushant's comment: once you have decomposed the time series, you can check for similarity in one or all of the four plots: data, seasonality, trend and residuals.
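To illustrate comparing series on a single component, here is a minimal sketch. It assumes an additive model and a known period, and it uses a hand-rolled NumPy moving-average decomposition as a stand-in for `seasonal_decompose` so that it runs without statsmodels. The `decompose` helper, the period of 24 and the three synthetic series are all assumptions made for this illustration, not part of the original answer.

```python
import numpy as np

def decompose(series, period):
    """Classical additive decomposition: series = trend + seasonal + residual."""
    series = np.asarray(series, dtype=float)
    n = len(series)
    half = period // 2
    # Centered moving average spanning one full period; for an even period
    # the two end points get half weight so the window stays symmetric.
    if period % 2 == 0:
        kernel = np.r_[0.5, np.ones(period - 1), 0.5] / period
    else:
        kernel = np.ones(period) / period
    trend = np.full(n, np.nan)  # edges stay NaN where the window does not fit
    trend[half:n - half] = np.convolve(series, kernel, mode="valid")
    # Seasonal component: mean detrended value at each phase, centered on zero.
    detrended = series - trend
    phase_means = np.array([np.nanmean(detrended[k::period]) for k in range(period)])
    phase_means -= phase_means.mean()
    seasonal = np.tile(phase_means, n // period + 1)[:n]
    residual = series - trend - seasonal
    return trend, seasonal, residual

# Hypothetical data: four days of hourly values (period 24).
t = np.arange(96)
a = np.sin(2 * np.pi * t / 24) + 0.05 * t        # daily cycle, steep trend
b = np.sin(2 * np.pi * t / 24) + 0.01 * t + 3.0  # same cycle, different trend and level
c = np.sin(2 * np.pi * t / 12) + 0.05 * t        # different cycle, same trend as a

seasonal_a = decompose(a, 24)[1]
seasonal_b = decompose(b, 24)[1]
seasonal_c = decompose(c, 24)[1]

# Compare the seasonal components only, ignoring trend and level.
dist_ab = np.linalg.norm(seasonal_a - seasonal_b)
dist_ac = np.linalg.norm(seasonal_a - seasonal_c)
print(dist_ab, dist_ac)
```

Here `a` and `b` end up very close on the seasonal component even though their trends differ, while `a` and `c` are far apart despite sharing a trend. Which component to compare depends on what "similar" should mean for your data.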
Here is an example with synthetic data:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

# Three synthetic series: 28 points on [0, 90], each a sine wave plus a linear trend
x = np.linspace(0, 30 * 3, 14 * 2)
sin1 = np.sin(x) + x / 7
sin2 = np.sin(0.8 * x) + x / 5
sin3 = np.sin(1.3 * x) + x / 5
plt.plot(sin1, label='sin1')
plt.plot(sin2, label='sin2')
plt.plot(sin3, label='sin3')
plt.legend(loc=2)
plt.show()
# One row per series
X = np.array([sin1, sin2, sin3])
# For each series, find its 2 nearest neighbors (Euclidean distance); the first is the series itself
nbrs = NearestNeighbors(n_neighbors=2, algorithm='ball_tree').fit(X)
distances, indices = nbrs.kneighbors(X)
distances
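You will get the distance from each series to its nearest neighbors (the first column is each series' zero distance to itself; a smaller value means more similar):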
array([[ 0. , 16.39833107],
[ 0. , 5.2312092 ],
[ 0. , 5.2312092 ]])
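The question also asks that series with the same shape but different scales count as similar. Plain Euclidean distance, which `NearestNeighbors` uses by default, does not do that on its own. One possible approach, sketched here as an assumption and not as part of the original answer, is to z-normalize each series (subtract its mean, divide by its standard deviation) before computing distances. The `znormalize` helper and the toy 24-hour profiles are hypothetical.

```python
import numpy as np

def znormalize(X):
    """Scale each row to zero mean and unit standard deviation."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=1, keepdims=True)
    std = X.std(axis=1, keepdims=True)
    std[std == 0] = 1.0  # leave constant rows at zero instead of dividing by zero
    return (X - mean) / std

# Hypothetical 24-hour profiles: rows 0 and 1 share a shape but differ in scale and offset.
hours = np.arange(24)
base = np.sin(2 * np.pi * hours / 24)
X = np.array([
    base,
    10 * base + 50,
    np.cos(2 * np.pi * hours / 12),
])

Z = znormalize(X)
# Pairwise Euclidean distances between the normalized rows.
diff = Z[:, None, :] - Z[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=2))
print(dist.round(3))
```

After normalization, rows 0 and 1 are essentially at distance zero, while row 2 stays far from both. The normalized matrix `Z` could be passed to `NearestNeighbors` in place of `X`. This handles scale and offset only; it does not address time shifts.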
Source: https://stackoverflow.com/questions/58358110/clustering-similar-time-series