Speed-improvement on large pandas read_csv with datetime index

臣服心动 2021-01-31 23:47

I have enormous files that look like this:

05/31/2012,15:30:00.029,1306.25,1,E,0,,1306.25

05/31/2012,15:30:00.029,1306.25,8,E,0,,1306.25

I can easily rea
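
For reference, a minimal sketch of the straightforward pandas approach that the answers below speed up (the header-less layout and the column names are assumptions based on the sample rows, not taken from the question):

import pandas as pd

# Hypothetical file and column names; the sample rows suggest eight unnamed columns.
df = pd.read_csv(
    "ticks.csv",
    header=None,
    names=["date", "time", "price", "size", "flag", "f5", "f6", "f7"],
)
# Combine the first two columns into a DatetimeIndex; %f absorbs the fractional seconds.
df.index = pd.to_datetime(df["date"] + " " + df["time"], format="%m/%d/%Y %H:%M:%S.%f")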

3 Answers
  • 2021-01-31 23:50

    An improvement on the previous solution by Michael WS:

    • the conversion to pandas.Timestamp is better performed outside the Cython code
    • atoi and native C string handling are slightly faster than the Python string functions
    • the number of datetime calls is reduced from two per row to one (plus an occasional extra one for the date)
    • microseconds are also parsed

    NB! The date order in this code is day/month/year.

    All in all, the code appears to be roughly 10 times faster than the original convert_date_cython. However, if it is called after read_csv, then on an SSD the difference in total time is only a few percent because of the reading overhead; on a regular HDD the difference would presumably be even smaller.

    cimport numpy as np
    import datetime
    import numpy as np
    import pandas as pd
    from libc.stdlib cimport atoi, malloc, free 
    from libc.string cimport strcpy
    
    ### Modified code from Michael WS:
    ### https://stackoverflow.com/a/15812787/2447082
    
    def convert_date_fast(np.ndarray date_vec, np.ndarray time_vec):
        cdef int i, d_year, d_month, d_day, t_hour, t_min, t_sec, t_ms
        cdef int N = len(date_vec)
        cdef np.ndarray out_ar = np.empty(N, dtype=object)  # np.object is deprecated; plain object works
        cdef bytes prev_date = b'xx/xx/xxxx'                 # sentinel that never matches a real date
        cdef char *date_str = <char *> malloc(20)
        cdef char *time_str = <char *> malloc(20)
    
        for i in range(N):
            if date_vec[i] != prev_date:
                prev_date = date_vec[i] 
                strcpy(date_str, prev_date) ### xx/xx/xxxx
                date_str[2] = 0 
                date_str[5] = 0 
                d_year = atoi(date_str+6)
                d_month = atoi(date_str+3)
                d_day = atoi(date_str)
    
            strcpy(time_str, time_vec[i])   ### xx:xx:xx.xxxxxx
            time_str[2] = 0
            time_str[5] = 0
            time_str[8] = 0
            t_hour = atoi(time_str)
            t_min = atoi(time_str+3)
            t_sec = atoi(time_str+6)
            t_ms = atoi(time_str+9)
    
            out_ar[i] = datetime.datetime(d_year, d_month, d_day, t_hour, t_min, t_sec, t_ms)
        free(date_str)
        free(time_str)
        return pd.to_datetime(out_ar)
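
    A sketch of one way to compile and call this, assuming pyximport and a module named fast_dates.pyx (neither is specified in the answer); the code expects byte strings, so on Python 3 the string columns are encoded first:

    import numpy as np
    import pandas as pd
    import pyximport

    # Build fast_dates.pyx on import; the numpy headers are needed for `cimport numpy`.
    pyximport.install(setup_args={"include_dirs": np.get_include()})
    from fast_dates import convert_date_fast  # hypothetical module name

    # File and column names are hypothetical, as in the sketch under the question.
    df = pd.read_csv("ticks.csv", header=None,
                     names=["date", "time", "price", "size", "flag", "f5", "f6", "f7"])
    df.index = convert_date_fast(df["date"].str.encode("ascii").values,
                                 df["time"].str.encode("ascii").values)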
    
  • 2021-01-31 23:59

    I got an incredible speedup (50x) with the following Cython code:

    Call from Python: timestamps = convert_date_cython(df["date"].values, df["time"].values)

    cimport numpy as np
    import pandas as pd
    import datetime
    import numpy as np
    def convert_date_cython(np.ndarray date_vec, np.ndarray time_vec):
        cdef int i
        cdef int N = len(date_vec)
        cdef out_ar = np.empty(N, dtype=object)
        date = None
        for i in range(N):
            if date is None or date_vec[i] != date_vec[i - 1]:
                dt_ar = list(map(int, date_vec[i].split("/")))   # list() needed on Python 3, where map() is lazy
                date = datetime.date(dt_ar[2], dt_ar[0], dt_ar[1])
            time_ar = list(map(int, time_vec[i].split(".")[0].split(":")))
            time = datetime.time(time_ar[0], time_ar[1], time_ar[2])
            out_ar[i] = pd.Timestamp(datetime.datetime.combine(date, time))
        return out_ar
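
    The result is an object array of pd.Timestamp; a sketch of using it as the index (assuming df has the "date" and "time" columns from the call above):

    timestamps = convert_date_cython(df["date"].values, df["time"].values)
    df.index = pd.DatetimeIndex(timestamps)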
    
  • 2021-02-01 00:02

    The cardinality of the datetime strings is not huge: for example, the number of distinct time strings in the format %H:%M:%S is 24 * 60 * 60 = 86400. If your dataset has many more rows than that, or contains many duplicate timestamps, adding a cache to the parsing process can speed things up substantially.

    For those who do not have Cython available, here is an alternative solution in pure Python:

    import numpy as np
    import pandas as pd
    from datetime import datetime
    
    
    def parse_datetime(dt_array, cache=None):
        if cache is None:
            cache = {}
        date_time = np.empty(dt_array.shape[0], dtype=object)
        for i, (d_str, t_str) in enumerate(dt_array):
            try:
                year, month, day = cache[d_str]
            except KeyError:
                year, month, day = [int(item) for item in d_str[:10].split('-')]  # keep only the date part of e.g. "2014-11-02 00:00:00"
                cache[d_str] = year, month, day
            try:
                hour, minute, sec = cache[t_str]
            except KeyError:
                hour, minute, sec = [int(item) for item in t_str.split(':')]
                cache[t_str] = hour, minute, sec
            date_time[i] = datetime(year, month, day, hour, minute, sec)
        return pd.to_datetime(date_time)
    
    
    def read_csv(filename, cache=None):
        df = pd.read_csv(filename)
        df['date_time'] = parse_datetime(df.loc[:, ['date', 'time']].values, cache=cache)
        return df.set_index('date_time')
    

    With the following particular data set, the speedup is 150x+:

    $ ls -lh test.csv
    -rw-r--r--  1 blurrcat  blurrcat   1.2M Apr  8 12:06 test.csv
    $ head -n 4 data/test.csv
    user_id,provider,date,time,steps
    5480312b6684e015fc2b12bc,fitbit,2014-11-02 00:00:00,17:47:00,25
    5480312b6684e015fc2b12bc,fitbit,2014-11-02 00:00:00,17:09:00,4
    5480312b6684e015fc2b12bc,fitbit,2014-11-02 00:00:00,19:10:00,67
    

    In ipython:

    In [1]: %timeit pd.read_csv('test.csv', parse_dates=[['date', 'time']])
    1 loops, best of 3: 10.3 s per loop
    In [2]: %timeit read_csv('test.csv', cache={})
    1 loops, best of 3: 62.6 ms per loop
    

    To limit memory usage, simply replace the dict cache with something like an LRU cache, as sketched below.
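
    A minimal sketch of that idea, using functools.lru_cache in place of the unbounded dict (the function names and maxsize values are arbitrary choices, not from the answer):

    from datetime import datetime
    from functools import lru_cache

    import numpy as np
    import pandas as pd


    @lru_cache(maxsize=4096)       # dates repeat heavily, so a small bound suffices
    def _parse_date(d_str):
        year, month, day = (int(item) for item in d_str[:10].split('-'))
        return year, month, day


    @lru_cache(maxsize=86400)      # at most 86400 distinct %H:%M:%S strings
    def _parse_time(t_str):
        hour, minute, sec = (int(item) for item in t_str.split(':'))
        return hour, minute, sec


    def parse_datetime_lru(dt_array):
        # Same loop as parse_datetime above, but with bounded, self-managing caches.
        date_time = np.empty(dt_array.shape[0], dtype=object)
        for i, (d_str, t_str) in enumerate(dt_array):
            date_time[i] = datetime(*_parse_date(d_str), *_parse_time(t_str))
        return pd.to_datetime(date_time)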
