How to get the correlation between two timeseries using Pandas

前端 未结 2 2027
别跟我提以往
别跟我提以往 2021-02-01 23:48

I have two sets of temperature date, which have readings at regular (but different) time intervals. I\'m trying to get the correlation between these two sets of data.

I\

相关标签:
2条回答
  • 2021-02-02 00:11

    By shifting your timestamps you might be losing some accuracy. You can just perform an outer join on your time series filling NaN values with 0 and then you will have the whole timestamps (either it is a shared one or belongs to only one of the datasets). Then you may want to do the correlation function for the columns of your new dataset that will give you the result you are looking for without losing accuracy. This is my code once I was working with time series:

    t12 = t1.join(t2, lsuffix='_t1', rsuffix='_t2', how ='outer').fillna(0)
    
    t12.corr()
    

    This way you will have all timestamps.

    0 讨论(0)
  • 2021-02-02 00:24

    You have a number of options using pandas, but you have to make a decision about how it makes sense to align the data given that they don't occur at the same instants.

    Use the values "as of" the times in one of the time series, here's an example:

        In [15]: ts
        Out[15]: 
        2000-01-03 00:00:00    -0.722808451504
        2000-01-04 00:00:00    0.0125041039477
        2000-01-05 00:00:00    0.777515530539
        2000-01-06 00:00:00    -0.35714026263
        2000-01-07 00:00:00    -1.55213541118
        2000-01-10 00:00:00    -0.508166334892
        2000-01-11 00:00:00    0.58016097981
        2000-01-12 00:00:00    1.50766289013
        2000-01-13 00:00:00    -1.11114968643
        2000-01-14 00:00:00    0.259320239297
    
    
    
        In [16]: ts2
        Out[16]: 
        2000-01-03 00:00:30    1.05595278907
        2000-01-04 00:00:30    -0.568961755792
        2000-01-05 00:00:30    0.660511172645
        2000-01-06 00:00:30    -0.0327384421979
        2000-01-07 00:00:30    0.158094407533
        2000-01-10 00:00:30    -0.321679671377
        2000-01-11 00:00:30    0.977286027619
        2000-01-12 00:00:30    -0.603541295894
        2000-01-13 00:00:30    1.15993249209
        2000-01-14 00:00:30    -0.229379534767
    

    you can see these are off by 30 seconds. The reindex function enables you to align data while filling forward values (getting the "as of" value):

        In [17]: ts.reindex(ts2.index, method='pad')
        Out[17]: 
        2000-01-03 00:00:30    -0.722808451504
        2000-01-04 00:00:30    0.0125041039477
        2000-01-05 00:00:30    0.777515530539
        2000-01-06 00:00:30    -0.35714026263
        2000-01-07 00:00:30    -1.55213541118
        2000-01-10 00:00:30    -0.508166334892
        2000-01-11 00:00:30    0.58016097981
        2000-01-12 00:00:30    1.50766289013
        2000-01-13 00:00:30    -1.11114968643
        2000-01-14 00:00:30    0.259320239297
    
        In [18]: ts2.corr(ts.reindex(ts2.index, method='pad'))
        Out[18]: -0.31004148593302283
    

    note that 'pad' is also aliased by 'ffill' (but only in the very latest version of pandas on GitHub as of this time!).

    Strip seconds out of all your datetimes. The best way to do this is to use rename

        In [25]: ts2.rename(lambda date: date.replace(second=0))
        Out[25]: 
        2000-01-03 00:00:00    1.05595278907
        2000-01-04 00:00:00    -0.568961755792
        2000-01-05 00:00:00    0.660511172645
        2000-01-06 00:00:00    -0.0327384421979
        2000-01-07 00:00:00    0.158094407533
        2000-01-10 00:00:00    -0.321679671377
        2000-01-11 00:00:00    0.977286027619
        2000-01-12 00:00:00    -0.603541295894
        2000-01-13 00:00:00    1.15993249209
        2000-01-14 00:00:00    -0.229379534767
    

    Note that if rename causes there to be duplicate dates an Exception will be thrown.

    For something a little more advanced, suppose you wanted to correlate the mean value for each minute (where you have multiple observations per second):

        In [31]: ts_mean = ts.groupby(lambda date: date.replace(second=0)).mean()
    
        In [32]: ts2_mean = ts2.groupby(lambda date: date.replace(second=0)).mean()
    
        In [33]: ts_mean.corr(ts2_mean)
        Out[33]: -0.31004148593302283
    

    These last code snippets may not work if you don't have the latest code from https://github.com/wesm/pandas. If .mean() doesn't work on a GroupBy object per above try .agg(np.mean)

    Hope this helps!

    0 讨论(0)
提交回复
热议问题