Clustering similar time series?

Question

I have somewhere between 10-20k different time-series (24 dimensional data -- a column for each hour of the day) and I'm interested in clustering time series that exhibit roughly the same patterns of activity.

I had originally started to implement Dynamic Time Warping (DTW) because:

1. Not all of my time series are perfectly aligned
2. Two slightly shifted time series should, for my purposes, be considered similar
3. Two time series with the same shape but different scales should be considered similar

The only problem I had run into with DTW was that it did not appear to scale well -- fastdtw on a 500x500 distance matrix took ~30 minutes.

What other methods exist that would help me satisfy conditions 2 & 3?

Answer 1:

ARIMA can do the job if you decompose each time series into trend, seasonality and residuals. After that, run a K-Nearest Neighbors search on the result. The computational cost may be high, though, mainly because of ARIMA.

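With statsmodels:

import numpy as np
# Recent statsmodels releases expose ARIMA here; the old arima_model module was removed
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.seasonal import seasonal_decompose

# X is a single univariate time series (1-D array-like)
model0 = ARIMA(X, order=(2, 1, 0))
model1 = model0.fit()

# Set 'period' to the seasonality of your data (older releases called this argument 'freq')
decomposition = seasonal_decompose(np.asarray(X).reshape(len(X)), period=100)

trend = decomposition.trend
seasonal = decomposition.seasonal
residual = decomposition.resid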

To complement @Sushant's comment: after decomposing the time series, you can check for similarity in one or all of the four plots (data, seasonality, trend and residuals).

Here is an example with three synthetic series and a nearest-neighbor search:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

# 28 evenly spaced sample points on [0, 90]
t = np.linspace(0, 30 * 3, 14 * 2)
sin1 = np.sin(t) + t / 7
sin2 = np.sin(0.8 * t) + t / 5
sin3 = np.sin(1.3 * t) + t / 5
plt.plot(sin1, label='sin1')
plt.plot(sin2, label='sin2')
plt.plot(sin3, label='sin3')
plt.legend(loc=2)
plt.show()

# One row per time series
X = np.array([sin1, sin2, sin3])

# For each series, find the two closest series (itself included) by Euclidean distance
nbrs = NearestNeighbors(n_neighbors=2, algorithm='ball_tree').fit(X)
distances, indices = nbrs.kneighbors(X)
distances

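This returns the Euclidean distances, where a smaller value means more similar. The first column is each series' distance to itself (0) and the second is the distance to its closest other series, so sin2 and sin3 are each other's nearest neighbors: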

array([[ 0.        , 16.39833107],
       [ 0.        ,  5.2312092 ],
       [ 0.        ,  5.2312092 ]])
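Plain Euclidean distance is sensitive to both scale and time shifts, the two conditions raised in the question. Below is a minimal sketch of one common workaround, which is not part of the original answer: z-normalize each series so that only its shape matters, then take the smallest distance over a few circular shifts. The helper names `znorm` and `shift_tolerant_distance`, the `max_shift` parameter and the synthetic 24-hour profiles are all hypothetical, and the circular shift assumes that activity wraps around midnight.

```python
import numpy as np

def znorm(series):
    """Scale each row to zero mean and unit variance so only the shape remains."""
    series = np.asarray(series, dtype=float)
    mean = series.mean(axis=1, keepdims=True)
    std = series.std(axis=1, keepdims=True)
    std[std == 0] = 1.0  # constant rows become all zeros instead of dividing by 0
    return (series - mean) / std

def shift_tolerant_distance(a, b, max_shift=2):
    """Smallest Euclidean distance between a and circular shifts of b."""
    return min(
        float(np.linalg.norm(a - np.roll(b, s)))
        for s in range(-max_shift, max_shift + 1)
    )

hours = np.arange(24)
base = np.sin(2 * np.pi * hours / 24)
profiles = np.array([
    base,                            # reference daily pattern
    10 * np.roll(base, 1) + 50,      # same shape, one hour later, different scale
    np.cos(4 * np.pi * hours / 24),  # different pattern
])

Z = znorm(profiles)
n = len(Z)
# Pairwise distance matrix; entry [i, j] compares series i with series j
D = np.array([[shift_tolerant_distance(Z[i], Z[j]) for j in range(n)]
              for i in range(n)])
print(np.round(D, 3))
```

The nested loop is quadratic in the number of series, so for 10-20k series it would probably need to be vectorized or restricted to candidate pairs. Z-normalization on its own keeps the data compatible with ordinary Euclidean tools such as NearestNeighbors.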

Source: https://stackoverflow.com/questions/58358110/clustering-similar-time-series

Tags: python, machine-learning, time-series, cluster-analysis, dtw