Read CSV files faster in Julia

Posted by 不羁岁月 on 2021-01-27 05:40:50

Question


I have noticed that loading a CSV file using CSV.read is quite slow. For reference, here is one timing example:

using CSV, DataFrames
file = download("https://github.com/foursquare/twofishes")
@time CSV.read(file, DataFrame)

Output: 
9.450861 seconds (22.77 M allocations: 960.541 MiB, 5.48% gc time)
297 rows × 2 columns

This is a random dataset, and the equivalent operation in Python completes in a fraction of the time it takes in Julia. Since Julia is supposed to be faster than Python, why does this operation take so long? Moreover, is there a faster alternative that reduces the compile time?


Answer 1:


You are measuring compile time together with run time.

One correct way to measure the time would be:

@time CSV.read(file, DataFrame)
@time CSV.read(file, DataFrame)

On the first run the function is compiled, so that timing includes compilation; the second run reuses the compiled code and measures only the run time.
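
The same effect can be seen without any packages. The following is a minimal sketch using only Base Julia; `sum_of_squares` is a hypothetical stand-in for `CSV.read`, chosen only because it has not been compiled yet when the script starts:

```julia
# Hypothetical stand-in for CSV.read: a function that has not yet been
# compiled for Vector{Float64} arguments when the script starts.
sum_of_squares(xs) = sum(x -> x^2, xs)

data = collect(1.0:1000.0)

# First call: the measured time includes JIT compilation of sum_of_squares.
t_first = @elapsed sum_of_squares(data)

# Second call: the compiled method is reused, so only run time is measured.
t_second = @elapsed sum_of_squares(data)

println("first call:  ", t_first, " s")
println("second call: ", t_second, " s")
```

On a typical machine the first number should be noticeably larger than the second, although the exact values depend on the Julia version and hardware.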

Another option is using BenchmarkTools:

using BenchmarkTools
# Interpolate the global variable `file` with `$` so that only the call itself is benchmarked
@btime CSV.read($file, DataFrame)

Normally, one uses Julia to work with huge datasets, so the one-time compilation cost is not important. However, it is possible to compile CSV and DataFrames into Julia's system image and get fast execution from the first run; for instructions, see: Why julia takes long time to import a package? (This is more advanced, however, and usually not needed.)
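
As a rough sketch of the system image route, assuming the PackageCompiler.jl package is installed and that CSV and DataFrames are in the active project: the file names `sys_csv.so` and `precompile_csv.jl` below are hypothetical, and the precompile script is assumed to be a short file that calls `CSV.read` on a small sample so those methods get compiled into the image.

```julia
# build_sysimage.jl: hypothetical build script; run it once (it can take several minutes).
using PackageCompiler

create_sysimage(
    [:CSV, :DataFrames];
    # Assumed output name; the usual extension is .dll on Windows and .dylib on macOS.
    sysimage_path = "sys_csv.so",
    # Assumed script that exercises CSV.read on a small sample file.
    precompile_execution_file = "precompile_csv.jl",
)
```

Julia can then be started with the custom image, for example `julia --sysimage sys_csv.so my_code.jl`. The image has to be rebuilt whenever the packages are updated, which is part of why this approach is usually reserved for cases where startup latency really matters.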

Yet another option is to reduce the compiler's optimization level. This suits scenarios where your workload is small and restarted frequently, and you do not want the complexity that comes with building a system image. In this case you would run Julia as:

julia --optimize=0 my_code.jl

Finally, as mentioned by @Oscar Smith, compile times will be slightly shorter in the forthcoming Julia 1.6.



Source: https://stackoverflow.com/questions/65660180/read-csv-files-faster-in-julia
