Speed improvement on large pandas read_csv with datetime index

臣服心动 · asked 2021-01-31 23:47

I have enormous files that look like this:

05/31/2012,15:30:00.029,1306.25,1,E,0,,1306.25
05/31/2012,15:30:00.029,1306.25,8,E,0,,1306.25

I can easily read them with pandas.read_csv, but parsing the first two columns into a datetime index dominates the load time.
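
A minimal sketch of the straightforward way to load such a file (the column names and filename are hypothetical, since the file has no header row; the list-of-lists parse_dates form was the idiom at the time, though recent pandas versions deprecate it):

import pandas as pd

# Hypothetical column names for the headerless tick file.
names = ['date', 'time', 'price', 'size', 'flag', 'corr', 'cond', 'bid']
df = pd.read_csv('ticks.csv', header=None, names=names,
                 parse_dates=[['date', 'time']])  # one datetime parse per row
df = df.set_index('date_time')  # pandas names the combined column 'date_time'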

3 Answers
  •  予麋鹿 (OP)
     2021-02-01 00:02

    The cardinality of the datetime strings is not huge: the number of distinct time strings in %H:%M:%S format is at most 24 * 60 * 60 = 86400. If your dataset has far more rows than that, or contains many duplicate timestamps, adding a cache to the parsing step can speed things up substantially.

    For those who do not have Cython available, here is an alternative solution in pure Python:

    import numpy as np
    import pandas as pd
    from datetime import datetime
    
    
    def parse_datetime(dt_array, cache=None):
        """Parse an (n, 2) array of (date, time) string pairs, memoizing both parts."""
        if cache is None:
            cache = {}
        date_time = np.empty(dt_array.shape[0], dtype=object)
        for i, (d_str, t_str) in enumerate(dt_array):
            # Date part: split 'YYYY-MM-DD' only once per distinct string.
            try:
                year, month, day = cache[d_str]
            except KeyError:
                year, month, day = [int(item) for item in d_str[:10].split('-')]
                cache[d_str] = year, month, day
            # Time part: split 'HH:MM:SS' only once per distinct string.
            try:
                hour, minute, sec = cache[t_str]
            except KeyError:
                hour, minute, sec = [int(item) for item in t_str.split(':')]
                cache[t_str] = hour, minute, sec
            date_time[i] = datetime(year, month, day, hour, minute, sec)
        return pd.to_datetime(date_time)
    
    
    def read_csv(filename, cache=None):
        # Read normally, then build the index with the cached parser above.
        df = pd.read_csv(filename)
        df['date_time'] = parse_datetime(df.loc[:, ['date', 'time']].values, cache=cache)
        return df.set_index('date_time')
    
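    Passing an explicit dict lets the cache survive across files, which helps when many files share the same dates and times (the filenames below are hypothetical):

    cache = {}
    frames = [read_csv(name, cache=cache) for name in ('day1.csv', 'day2.csv')]
    df = pd.concat(frames)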

    With the following particular data set, the speedup is 150x+:

    $ ls -lh test.csv
    -rw-r--r--  1 blurrcat  blurrcat   1.2M Apr  8 12:06 test.csv
    $ head -n 4 data/test.csv
    user_id,provider,date,time,steps
    5480312b6684e015fc2b12bc,fitbit,2014-11-02 00:00:00,17:47:00,25
    5480312b6684e015fc2b12bc,fitbit,2014-11-02 00:00:00,17:09:00,4
    5480312b6684e015fc2b12bc,fitbit,2014-11-02 00:00:00,19:10:00,67
    

    In IPython:

    In [1]: %timeit pd.read_csv('test.csv', parse_dates=[['date', 'time']])
    1 loops, best of 3: 10.3 s per loop
    In [2]: %timeit read_csv('test.csv', cache={})
    1 loops, best of 3: 62.6 ms per loop
    

    To limit memory usage, simply replace the dict cache with something like an LRU cache, as sketched below.
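
    A minimal sketch of that idea using functools.lru_cache (an illustration, not part of the original answer; the maxsize bounds are assumptions to tune for your data):

    from datetime import datetime
    from functools import lru_cache
    
    import numpy as np
    import pandas as pd
    
    
    @lru_cache(maxsize=86400)  # at most 24 * 60 * 60 distinct %H:%M:%S strings
    def _parse_time(t_str):
        return tuple(int(item) for item in t_str.split(':'))
    
    
    @lru_cache(maxsize=4096)  # assumed bound: a few years of distinct dates
    def _parse_date(d_str):
        return tuple(int(item) for item in d_str[:10].split('-'))
    
    
    def parse_datetime_lru(dt_array):
        # Same loop as parse_datetime above, but with bounded, self-evicting caches.
        date_time = np.empty(dt_array.shape[0], dtype=object)
        for i, (d_str, t_str) in enumerate(dt_array):
            date_time[i] = datetime(*_parse_date(d_str), *_parse_time(t_str))
        return pd.to_datetime(date_time)

    Once a cache reaches its bound, lru_cache evicts the least-recently-used entry, so memory stays capped no matter how many distinct timestamps the files contain.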
