Numpy loading csv TOO slow compared to Matlab

無奈伤痛 2020-12-01 05:08

I posted this question because I was wondering whether I did something terribly wrong to get this result.

I have a medium-size csv file and I tried to use numpy to load it, but it was far slower than loading the same file in Matlab.

5 Answers
  • 2020-12-01 05:53

    FWIW the built-in csv module works great and really is not that verbose.

    csv module:

    %%timeit
    # assumes `import csv` and `import numpy as np` have been run beforehand
    with open('test.csv', 'r') as f:
        # csv.reader yields rows as lists of strings, so this builds a string array
        np.array([l for l in csv.reader(f)])
    
    
    1 loop, best of 3: 1.62 s per loop
    

    np.loadtxt:

    %timeit np.loadtxt('test.csv', delimiter=',')
    
    1 loop, best of 3: 16.6 s per loop
    

    pd.read_csv:

    %timeit pd.read_csv('test.csv', header=None).values
    
    1 loop, best of 3: 663 ms per loop
    

    Personally I like using pandas read_csv but the csv module is nice when I'm using pure numpy.
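
    One caveat: csv.reader returns every field as a string, so for numeric work you need an explicit conversion, for example:

    import csv
    import numpy as np

    with open('test.csv', 'r') as f:
        # convert the string rows to floats; assumes every field is numeric
        data = np.array([row for row in csv.reader(f)], dtype=float)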

  • 2020-12-01 05:54

    Yeah, reading csv files into numpy is pretty slow. There's a lot of pure Python along the code path. These days, even when I'm using pure numpy I still use pandas for IO:

    >>> import numpy as np, pandas as pd
    >>> %time d = np.genfromtxt("./test.csv", delimiter=",")
    CPU times: user 14.5 s, sys: 396 ms, total: 14.9 s
    Wall time: 14.9 s
    >>> %time d = np.loadtxt("./test.csv", delimiter=",")
    CPU times: user 25.7 s, sys: 28 ms, total: 25.8 s
    Wall time: 25.8 s
    >>> %time d = pd.read_csv("./test.csv", delimiter=",").values
    CPU times: user 740 ms, sys: 36 ms, total: 776 ms
    Wall time: 780 ms
    

    Alternatively, in a simple enough case like this one, you could use something like what Joe Kington wrote here:

    >>> %time data = iter_loadtxt("test.csv")
    CPU times: user 2.84 s, sys: 24 ms, total: 2.86 s
    Wall time: 2.86 s
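
    For reference, a minimal sketch of that iter_loadtxt approach (a reconstruction, assuming purely numeric fields and a single-character delimiter):

    import numpy as np

    def iter_loadtxt(filename, delimiter=',', skiprows=0, dtype=float):
        # stream fields one at a time into np.fromiter, skipping the
        # per-row type dispatch that makes loadtxt/genfromtxt slow
        def iter_func():
            with open(filename, 'r') as infile:
                for _ in range(skiprows):
                    next(infile)
                for line in infile:
                    fields = line.rstrip().split(delimiter)
                    for item in fields:
                        yield dtype(item)
            iter_loadtxt.rowlength = len(fields)

        data = np.fromiter(iter_func(), dtype=dtype)
        # reshape the flat stream of fields back into rows
        return data.reshape((-1, iter_loadtxt.rowlength))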
    

    There's also Warren Weckesser's textreader library, in case pandas is too heavy a dependency:

    >>> import textreader
    >>> %time d = textreader.readrows("test.csv", float, ",")
    readrows: numrows = 1500000
    CPU times: user 1.3 s, sys: 40 ms, total: 1.34 s
    Wall time: 1.34 s
    
  • 2020-12-01 05:59

    If you just want to save and reload a numpy array, it's much better to save it as a binary or compressed binary file, depending on size:

    import timeit
    import numpy as np

    my_data = np.random.rand(1500000, 3) * 10
    np.savetxt('./test.csv', my_data, delimiter=',', fmt='%.2f')
    np.save('./testy', my_data)    # uncompressed binary .npy
    np.savez('./testz', my_data)   # .npz archive (uncompressed by default)
    del my_data
    
    setup_stmt = 'import numpy as np'
    stmt1 = """\
    my_data = np.genfromtxt('./test.csv', delimiter=',')
    """
    stmt2 = """\
    my_data = np.load('./testy.npy')
    """
    stmt3 = """\
    my_data = np.load('./testz.npz')['arr_0']
    """
    
    t1 = timeit.timeit(stmt=stmt1, setup=setup_stmt, number=3)
    t2 = timeit.timeit(stmt=stmt2, setup=setup_stmt, number=3)
    t3 = timeit.timeit(stmt=stmt3, setup=setup_stmt, number=3)
    
    print('genfromtxt', t1)
    print('save', t2)
    print('savez', t3)
    
    genfromtxt 39.717250824
    save 0.0667860507965
    savez 0.268463134766
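
    Note that np.savez writes an uncompressed .npz archive; the compressed variant is np.savez_compressed, which trades some load time for a smaller file. A quick sketch:

    import numpy as np

    my_data = np.random.rand(1500000, 3) * 10
    np.savez_compressed('./testz_c', my_data)    # zlib-compressed .npz
    loaded = np.load('./testz_c.npz')['arr_0']   # unnamed arrays are stored as arr_0, arr_1, ...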
    
  • 2020-12-01 06:12

    Perhaps it's better to write a simple C program that converts the data to binary and have numpy read the binary file. I have a 20 GB CSV file to read, with the data being a mixture of int, double, and str. Reading it into a numpy array of structs takes more than an hour, while dumping it to binary took about 2 minutes and loading it into numpy takes less than 2 seconds!

    My specific code, for example, is available here.
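
    As a rough illustration of the numpy side of that approach (the field names and layout here are hypothetical; the dtype must mirror exactly what the C converter writes):

    import numpy as np

    # hypothetical record layout; must match the C struct byte-for-byte,
    # including byte order and any padding
    record = np.dtype([
        ('id',    '<i4'),    # 32-bit little-endian int
        ('value', '<f8'),    # 64-bit little-endian double
        ('name',  'S16'),    # fixed-width 16-byte string
    ])

    data = np.fromfile('data.bin', dtype=record)  # a single bulk read, no text parsing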

  • 2020-12-01 06:12

    I've performance-tested the suggested solutions with perfplot (a small project of mine) and found that

    pandas.read_csv(filename)
    

    is indeed the fastest solution (once more than about 2000 entries are read; below that, everything is in the millisecond range). It outperforms numpy's variants by a factor of about 10. (numpy.fromfile is included here just for comparison; it cannot read actual csv files.)

    Code to reproduce the plot:

    import numpy
    import pandas
    import perfplot
    
    numpy.random.seed(0)
    filename = "a.txt"
    
    
    def setup(n):
        a = numpy.random.rand(n)
        numpy.savetxt(filename, a)
        return None
    
    
    def numpy_genfromtxt(data):
        return numpy.genfromtxt(filename)
    
    
    def numpy_loadtxt(data):
        return numpy.loadtxt(filename)
    
    
    def numpy_fromfile(data):
        out = numpy.fromfile(filename, sep=" ")
        return out
    
    
    def pandas_readcsv(data):
        return pandas.read_csv(filename, header=None).values.flatten()
    
    
    def kington(data):
        delimiter = " "
        skiprows = 0
        dtype = float
    
        def iter_func():
            with open(filename, 'r') as infile:
                for _ in range(skiprows):
                    next(infile)
                for line in infile:
                    line = line.rstrip().split(delimiter)
                    for item in line:
                        yield dtype(item)
            kington.rowlength = len(line)
    
        data = numpy.fromiter(iter_func(), dtype=dtype).flatten()
        return data
    
    
    perfplot.show(
        setup=setup,
        kernels=[numpy_genfromtxt, numpy_loadtxt, numpy_fromfile, pandas_readcsv, kington],
        n_range=[2 ** k for k in range(20)],
        logx=True,
        logy=True,
    )
    