In order to cluster a set of time series I'm looking for a smart distance metric. I've tried some well-known metrics, but none fits my case.
ex: Let's assume t
Pietro P's answer is just a special case of applying a convolution to your time series.
If I used the kernel:
[1,1,...,1,1,1,0,0,0,0,...0,0]
I would get a cumulative series.
Adding a convolution works because you're giving each data point information about its neighbours: the representation becomes order-dependent.
It might be interesting to try a Gaussian convolution or other kernels, as in the sketch below.
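For concreteness, here is a minimal sketch of that idea in NumPy/SciPy (the toy series and the sigma value are my own illustrative choices):

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

# Two toy series: the same event happening earlier vs. later.
a = np.array([0., 0., 1., 1., 0., 0., 0., 0.])
b = np.array([0., 0., 0., 0., 0., 0., 1., 1.])

# A ones kernel as long as the series reproduces the cumulative sum:
# the first len(a) outputs of the 'full' convolution are exactly cumsum(a).
kernel = np.ones(len(a))
cum_a = np.convolve(a, kernel)[:len(a)]
cum_b = np.convolve(b, kernel)[:len(b)]
assert np.allclose(cum_a, np.cumsum(a))

# A Gaussian kernel is a softer alternative: each point is blended
# with nearby neighbours instead of with the entire past.
smooth_a = gaussian_filter1d(a, sigma=1.0)
smooth_b = gaussian_filter1d(b, sigma=1.0)

# Any standard metric applied after the convolution is order-aware,
# because each transformed point now encodes its neighbourhood.
print(np.linalg.norm(cum_a - cum_b))
```

The width of the kernel controls how far neighbourhood information propagates: a full-length ones kernel carries the whole past, a narrow Gaussian only the local context.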
Another approach would be to use DTW (dynamic time warping), an algorithm for computing the similarity between two temporal sequences. Full disclosure: I wrote a Python package for this purpose called trendypy; you can install it via pip (pip install trendypy). Here is a demo on how to use the package. You're basically computing the total minimum distance for different combinations to set the cluster centers.
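For intuition, here is a minimal sketch of the DTW recurrence in plain NumPy; note this is the textbook algorithm, not trendypy's API, and all names are illustrative:

```python
import numpy as np

def dtw_distance(x, y):
    """Textbook dynamic-programming DTW distance between two 1-D sequences."""
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)  # cost[i, j]: best alignment of x[:i], y[:j]
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # stretch x
                                 cost[i, j - 1],      # stretch y
                                 cost[i - 1, j - 1])  # step both
    return cost[n, m]

# DTW tolerates the time shift that Euclidean distance punishes.
a = np.array([0., 0., 1., 2., 1., 0.])
b = np.array([0., 1., 2., 1., 0., 0.])  # same shape, one step earlier
print(dtw_distance(a, b))       # 0.0: a perfect warped alignment exists
print(np.linalg.norm(a - b))    # 2.0: Euclidean sees a mismatch
```

For clustering, you would feed such pairwise distances into something like k-medoids, since DTW distances don't live in a vector space with well-defined means.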
Nice question! Using any standard distance on R^n (Euclidean, Manhattan, or generically Minkowski) over those time series cannot achieve the result you want, since those metrics are invariant under permutations of the coordinates of R^n (while time is strictly ordered, and that ordering is exactly the phenomenon you want to capture).
A simple trick that does what you ask is to use the cumulated version of each time series (sum the values over time as time increases) and then apply a standard metric. With the Manhattan metric, the distance between two time series is the area between their cumulated versions.
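A quick NumPy illustration of the trick (the toy series are my own):

```python
import numpy as np

# The same event occurring at the start, slightly later, and much later.
a = np.array([1., 1., 0., 0., 0., 0., 0., 0.])
b = np.array([0., 0., 1., 1., 0., 0., 0., 0.])
c = np.array([0., 0., 0., 0., 0., 0., 1., 1.])

manhattan = lambda x, y: np.abs(x - y).sum()

# On the raw series the metric can't tell "slightly later" from "much later".
print(manhattan(a, b), manhattan(a, c))  # 4.0 4.0

# On the cumulated series the distance is the area between the
# cumulative curves, so it grows with the time shift.
print(manhattan(np.cumsum(a), np.cumsum(b)))  # 4.0
print(manhattan(np.cumsum(a), np.cumsum(c)))  # 12.0
```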
What about using the standard Pearson correlation coefficient? Then you can assign the new point to the cluster with the highest coefficient.
import scipy.stats
correlation, _ = scipy.stats.pearsonr(new_series, centroid)  # pearsonr returns (r, p-value)
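A small sketch of the full assignment step (the centroids and the new series are illustrative placeholders):

```python
import numpy as np
from scipy.stats import pearsonr

# Illustrative cluster centroids and a new series to assign.
centroids = [np.array([0., 1., 2., 3.]),
             np.array([3., 2., 1., 0.]),
             np.array([0., 2., 0., 2.])]
new_series = np.array([0.1, 1.2, 1.9, 3.1])

# Correlate the new series with each centroid; pick the highest r.
scores = [pearsonr(new_series, c)[0] for c in centroids]
best_cluster = int(np.argmax(scores))
print(scores, best_cluster)  # the rising centroid (index 0) wins
```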