Reading CSV in Julia is slow compared to Python

Asked 2020-12-16 11:10 by 北荒

Reading large text/CSV files in Julia takes a long time compared to Python. Here are the times to read a file that is 486.6 MB in size, with 153,895 rows and 644 columns.

7 Answers
  • 2020-12-16 11:25

    Note that the "n bytes allocated" output from @time is the total size of all objects allocated during the call, regardless of how many of them have since been freed. This number is often much higher than the final size of the live objects in memory. I don't know if this is what your memory size estimate is based on, but I wanted to point it out.
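    As a minimal sketch of the difference (a toy computation, unrelated to CSV parsing itself):

    # Toy example: ~100 temporary 8 MB arrays are allocated during the call,
    # but the final result is a single Float64 (8 bytes live).
    f() = sum(sum(rand(10^6)) for _ in 1:100)

    y = @time f()          # the @time report shows roughly 800 MiB allocated
    Base.summarysize(y)    # returns 8 -- the live size of the result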

  • 2020-12-16 11:27

    Let us first create the file you are talking about, to make things reproducible:

    open("myFile.txt", "w") do io
        foreach(i -> println(io, join(i+1:i+644, '|')), 1:153895)
    end
    

    Now I read this file in using Julia 1.4.2 and CSV.jl 0.7.1.

    Single threaded:

    julia> @time CSV.File("myFile.txt", delim='|', header=false);
      4.747160 seconds (1.55 M allocations: 1.281 GiB, 4.29% gc time)
    
    julia> @time CSV.File("myFile.txt", delim='|', header=false);
      2.780213 seconds (13.72 k allocations: 1.206 GiB, 5.80% gc time)
    

    and using e.g. 4 threads:

    julia> @time CSV.File("myFile.txt", delim='|', header=false);
      4.546945 seconds (6.02 M allocations: 1.499 GiB, 5.05% gc time)
    
    julia> @time CSV.File("myFile.txt", delim='|', header=false);
      0.812742 seconds (47.28 k allocations: 1.208 GiB)
    
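    For reference, on Julia 1.4 the number of threads is set via the JULIA_NUM_THREADS environment variable before starting Julia (newer versions also accept the -t command line flag), and can be checked in the session:

    julia> Threads.nthreads()
    4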

    In R it is:

    > system.time(myData<-read.delim("myFile.txt",sep="|",header=F,
    +                                stringsAsFactors=F,na.strings=""))
       user  system elapsed 
     28.615   0.436  29.048 
    

    In Python (Pandas) it is:

    >>> import pandas as pd
    >>> import time
    >>> start=time.time()
    >>> myData=pd.read_csv("myFile.txt",sep="|",header=None,low_memory=False)
    >>> print(time.time()-start)
    25.95710587501526
    

    Now if we test fread from R's data.table package (which is known to be fast) we get:

    > system.time(fread("myFile.txt", sep="|", header=F,
                        stringsAsFactors=F, na.strings="", nThread=1))
       user  system elapsed 
      1.043   0.036   1.082 
    > system.time(fread("myFile.txt", sep="|", header=F,
                        stringsAsFactors=F, na.strings="", nThread=4))
       user  system elapsed 
      1.361   0.028   0.416 
    

    So in this case the summary is:

    • despite the compilation cost of CSV.File in Julia on the first run, it is significantly faster than base R or Python (Pandas)
    • it is comparable in speed to fread in R (slightly slower in this case, but other benchmarks show cases where it is faster)

    EDIT: Following a request, I have added a benchmark for a small file (10 columns, 100,000 rows), Julia vs. Pandas.

    Data preparation step:

    open("myFile.txt", "w") do io
        foreach(i -> println(io, join(i+1:i+10, '|')), 1:100_000)
    end
    

    CSV.jl, single threaded:

    julia> @time CSV.File("myFile.txt", delim='|', header=false);
      1.898649 seconds (1.54 M allocations: 93.848 MiB, 1.48% gc time)
    
    julia> @time CSV.File("myFile.txt", delim='|', header=false);
      0.029965 seconds (248 allocations: 17.037 MiB)
    

    Pandas:

    >>> import pandas as pd
    >>> import time
    >>> start=time.time()
    >>> myData=pd.read_csv("myFile.txt",sep="|",header=None,low_memory=False)
    >>> print(time.time()-start)
    0.07587623596191406
    

    Conclusions:

    • the compilation cost is a one-time cost that has to be paid, and it is roughly constant (it does not depend on the size of the file you want to read)
    • for small files CSV.jl is faster than Pandas (if we exclude the compilation cost)

    Now, if you would like to avoid paying the compilation cost in every fresh Julia session, this is doable with https://github.com/JuliaLang/PackageCompiler.jl.
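    A minimal sketch of that approach (the image name and the precompile script name are my own placeholders; check the PackageCompiler.jl documentation for the current API):

    using PackageCompiler

    # Build a custom system image with CSV baked in. The (hypothetical) script
    # "precompile_csv.jl" should exercise CSV.File so that the relevant
    # methods get compiled into the image.
    create_sysimage([:CSV];
        sysimage_path = "csv_sysimage.so",
        precompile_execution_file = "precompile_csv.jl")

    # Afterwards start Julia with:  julia --sysimage csv_sysimage.so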

    From my experience, if you are doing data science work where you, e.g., read in thousands of CSV files, I have no problem waiting 2 seconds for compilation if it later saves me hours. It takes more than 2 seconds just to write the code that reads in the files.

    Of course, if you write a script that does little work and terminates when done, it is a different use case, as compilation time would then be the majority of the total cost. In that case, using PackageCompiler.jl is the strategy I use.

  • 2020-12-16 11:28

    There is a relatively new Julia package called CSV.jl by Jacob Quinn that provides a much faster CSV parser, in many cases on par with Pandas: https://github.com/JuliaData/CSV.jl
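    A minimal usage sketch (the file follows the example elsewhere in this thread; materializing into a table requires DataFrames.jl):

    using CSV, DataFrames

    # Parse the file and materialize it as a DataFrame.
    df = CSV.File("myFile.txt"; delim='|', header=false) |> DataFrame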

  • 2020-12-16 11:28

    I've found a few things that can partially help with this situation.

    1. Using the readdlm() function in Julia seems to work considerably faster (e.g. 3x in a recent trial) than readtable(). Of course, if you want the DataFrame object type, you'll then need to convert to it, which may eat up most or all of the speed improvement (a conversion sketch appears after this list).

    2. Specifying dimensions of your file can make a BIG difference, both in speed and in memory allocations. I ran this trial reading in a file that is 258.7 MB on disk:

      julia> @time Data = readdlm("MyFile.txt", '\t', Float32, skipstart = 1);
      19.072266 seconds (221.60 M allocations: 6.573 GB, 3.34% gc time)
      
      julia> @time Data = readdlm("MyFile.txt", '\t', Float32, skipstart = 1, dims = (File_Lengths[1], 62));
      10.309866 seconds (87 allocations: 528.331 MB, 0.03% gc time)
      
    3. The type specification for your object matters a lot. For instance, if your data has strings in it, the array you read in will have element type Any, which is expensive memory-wise. If memory is really an issue, you may want to consider preprocessing your data by first converting the strings to integers, doing your computations, and then converting back. Also, if you don't need a ton of precision, using the Float32 type instead of Float64 can save a LOT of space. You can specify this when reading the file in, e.g.:

      Data = readdlm("file.csv", ',', Float32)

    4. Regarding memory usage, I've found in particular that the PooledDataArray type (from the DataArrays package) can be helpful in cutting down memory usage if your data has a lot of repeated values. The time to convert to this type is relatively large, so this isn't a time saver per se, but at least helps reduce the memory usage somewhat. E.g. when loading a data set with 19 million rows and 36 columns, 8 of which represented categorical variables for statistical analysis, this reduced the memory allocation of the object from 5x its size on disk to 4x its size. If there are even more repeated values, the memory reduction can be even more significant (I've had situations where the PooledDataArray cuts memory allocation in half).

    5. It can also sometimes help to run the gc() (garbage collector) function after loading and formatting data to clear out any unneeded RAM allocation, though generally Julia does this automatically pretty well.
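    As mentioned in point 1, here is a hedged sketch of the readdlm-then-convert route using current APIs (readdlm now lives in the DelimitedFiles standard library, and DataFrame(m, :auto) is the recent DataFrames.jl constructor for a matrix):

    using DelimitedFiles, DataFrames

    # Read the file as a Float32 matrix, then wrap it in a DataFrame;
    # :auto generates column names x1, x2, ...
    data = readdlm("file.csv", ',', Float32)
    df = DataFrame(data, :auto)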

    Still, despite all of this, I look forward to further developments in Julia that enable faster loading and more efficient memory usage for large data sets.

  • 2020-12-16 11:30

    In my experience, the best way to deal with larger text files is not to load them into Julia all at once, but rather to stream them. This method has some additional fixed costs, but generally runs extremely quickly. Some pseudocode is this:

    function streamdat()
        mycsv = open("/path/to/text.csv", "r")  # <-- open your text file

        total = 0.0                   # <-- accumulate a sum here
        while !eof(mycsv)             # <-- loop through each line of the file
            row = readline(mycsv)
            fields = split(row, "|")  # <-- split each line by |
            total += parse(Float64, fields[1])  # <-- e.g. sum the first column
        end
        close(mycsv)                  # <-- close the file when done
        return total
    end

    streamdat()
    

    The code above just sums the first column, but the same logic can be expanded to more complex problems.
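    For instance, a sketch extending the same pattern to per-column sums (the delimiter and column count follow the example file in this thread and are assumptions about your data):

    # Stream per-column sums without holding the file in memory.
    function columnsums(path::AbstractString; delim = "|", ncols::Int = 644)
        sums = zeros(Float64, ncols)
        open(path, "r") do io
            for row in eachline(io)
                for (j, field) in enumerate(split(row, delim))
                    sums[j] += parse(Float64, field)
                end
            end
        end
        return sums
    end

    columnsums("/path/to/text.csv")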

  • 2020-12-16 11:33
    using CSV
    @time df=CSV.read("C:/Users/hafez/personal/r/tutorial for students/Book2.csv")
    

    I recently tried this in Julia 1.4.2. I got a different response and at first did not understand Julia's behavior, so I posted the same question on the Julia discussion forum. There I learned that the first run of this code mostly measures compilation time. You can find the benchmark discussion there.
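    A sketch of how to separate compilation from runtime when timing this (with a recent CSV.jl, CSV.read takes a sink argument such as DataFrame; BenchmarkTools.jl is assumed to be installed):

    using CSV, DataFrames, BenchmarkTools

    @time CSV.read("Book2.csv", DataFrame);  # first call: includes compilation
    @time CSV.read("Book2.csv", DataFrame);  # second call: runtime only

    # Or let BenchmarkTools run it repeatedly and report a stable estimate:
    @btime CSV.read("Book2.csv", DataFrame);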
