Speed-improvement on large pandas read_csv with datetime index

臣服心动 2021-01-31 23:47

I have enormous files that look like this:

05/31/2012,15:30:00.029,1306.25,1,E,0,,1306.25

05/31/2012,15:30:00.029,1306.25,8,E,0,,1306.25

I can easily rea

3 Answers
  •  盖世英雄少女心
    2021-01-31 23:50

    An improvement on Michael WS's earlier solution:

    • Conversion to pandas.Timestamp is better done outside the Cython code.
    • Using atoi on native C strings is slightly faster than the Python functions.
    • The number of datetime library calls is reduced from two (plus an occasional one for the date) to one.
    • Microseconds are also parsed.

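    Note: the date order in this code is day/month/year. The sample rows in the question are month/day/year, so for that data the offsets used for d_day and d_month would need to be swapped.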

    All in all, the code seems to be approximately 10 times faster than the original convert_date_cython. However, if it is called after read_csv, then on an SSD the difference in total time is only a few percent, because reading the file dominates. I would guess that on a regular HDD the difference would be even smaller.
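
The gap between "10 times faster parsing" and "a few percent overall" follows from Amdahl's law. A minimal sketch of the arithmetic, where the function name `overall_speedup` and the 5% parsing share are illustrative assumptions and not measurements from this answer:

```python
def overall_speedup(parse_fraction, parse_speedup):
    """Amdahl's law: whole-job speedup when only the parsing share gets faster.

    parse_fraction: share of total run time spent parsing dates (0..1).
    parse_speedup:  how many times faster the parsing step becomes.
    """
    return 1.0 / ((1.0 - parse_fraction) + parse_fraction / parse_speedup)


# Illustrative numbers only: assume parsing is 5% of the run and gets 10x faster.
gain = overall_speedup(parse_fraction=0.05, parse_speedup=10.0)
print(f"overall speedup: {gain:.3f}x")  # prints "overall speedup: 1.047x"
```

If reading the file takes most of the time, even a very large parsing speedup barely moves the total, which matches the observation above.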

    cimport numpy as np
    import datetime
    import numpy as np
    import pandas as pd
    from libc.stdlib cimport atoi, malloc, free 
    from libc.string cimport strcpy
    
    ### Modified code from Michael WS:
    ### https://stackoverflow.com/a/15812787/2447082
    
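    def convert_date_fast(np.ndarray date_vec, np.ndarray time_vec):
        # Both arrays are expected to hold bytes objects (not str),
        # because each element is copied into a C buffer with strcpy.
        cdef int i, d_year, d_month, d_day, t_hour, t_min, t_sec, t_ms
        cdef int N = len(date_vec)
        cdef np.ndarray out_ar = np.empty(N, dtype=object)  # np.object was removed in NumPy 1.24
        cdef bytes prev_date = b'xx/xx/xxxx'
        cdef char *date_str = <char *> malloc(20)
        cdef char *time_str = <char *> malloc(20)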
    
        for i in range(N):
            if date_vec[i] != prev_date:
                prev_date = date_vec[i] 
                strcpy(date_str, prev_date) ### xx/xx/xxxx
                date_str[2] = 0 
                date_str[5] = 0 
                d_year = atoi(date_str+6)
                d_month = atoi(date_str+3)
                d_day = atoi(date_str)
    
            strcpy(time_str, time_vec[i])   ### xx:xx:xx.xxxxxx
            time_str[2] = 0
            time_str[5] = 0
            time_str[8] = 0
            t_hour = atoi(time_str)
            t_min = atoi(time_str+3)
            t_sec = atoi(time_str+6)
            # The fraction is passed on as microseconds, so it must have six digits;
            # a three-digit field such as .029 would need scaling by 1000.
            t_ms = atoi(time_str+9)
    
            out_ar[i] = datetime.datetime(d_year, d_month, d_day, t_hour, t_min, t_sec, t_ms)
        free(date_str)
        free(time_str)
        return pd.to_datetime(out_ar)
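
To sanity-check the compiled function on a few rows, it can help to have a slow pure-Python mirror of the same fixed-offset logic. The sketch below is not part of the original answer: the name `convert_date_reference` is hypothetical, it takes plain str values instead of bytes, and it assumes `dd/mm/yyyy` dates and `HH:MM:SS.ffffff` times with six fractional digits.

```python
import datetime


def convert_date_reference(date_vec, time_vec):
    """Pure-Python mirror of the fixed-offset parsing, for checking results."""
    out = []
    prev_date = None
    d_year = d_month = d_day = 0
    for date_str, time_str in zip(date_vec, time_vec):
        # Re-parse the date only when it changes between consecutive rows.
        if date_str != prev_date:
            prev_date = date_str
            d_day = int(date_str[0:2])
            d_month = int(date_str[3:5])
            d_year = int(date_str[6:10])
        # A single datetime constructor call per row, as in the Cython version.
        out.append(datetime.datetime(
            d_year, d_month, d_day,
            int(time_str[0:2]),   # hour
            int(time_str[3:5]),   # minute
            int(time_str[6:8]),   # second
            int(time_str[9:15]),  # microsecond (six digits assumed)
        ))
    return out


stamps = convert_date_reference(
    ["31/05/2012", "31/05/2012"],
    ["15:30:00.000029", "15:30:01.500000"],
)
print(stamps[0].isoformat())  # prints "2012-05-31T15:30:00.000029"
```

Comparing a handful of values from this function against the Cython output should reveal a wrong date order or fraction width before running on a full file.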
    
