Efficient way to read 15M-line CSV files in Python

花落未央 2021-02-01 07:39

For my application, I need to read multiple files with 15M lines each, store them in a DataFrame, and save the DataFrame in HDF5 format.

I've already tried different …
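(For reference, a minimal sketch of the pipeline described above: read several large CSVs, concatenate them into one DataFrame, and write it to HDF5. It assumes pandas with PyTables installed; the paths and options are placeholders, not the asker's actual code.)

    import glob

    import pandas as pd

    # Placeholder path pattern; the real file locations are not shown in the question.
    csv_files = glob.glob("data/*.csv")

    # Read each large CSV and concatenate everything into a single DataFrame.
    frames = [pd.read_csv(path) for path in csv_files]
    df = pd.concat(frames, ignore_index=True)

    # Save to HDF5 (requires the `tables` package, i.e. PyTables).
    df.to_hdf("combined.h5", key="data", mode="w")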

2 Answers
  •  南笙 (OP)
     2021-02-01 07:58

    Well, my findings are not so much related to pandas as to some common pitfalls.

    Your code, timed as a baseline:
    (genel_deneme) ➜  derp time python a.py
    python a.py  38.62s user 0.69s system 100% cpu 39.008 total
    
    1. Precompile your regex.
    Replace re.sub(r"[^\d.]", "", x) with a precompiled pattern and use it in your lambdas.
    Result:
    (genel_deneme) ➜  derp time python a.py 
    python a.py  26.42s user 0.69s system 100% cpu 26.843 total
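    For illustration, a minimal sketch of the precompiled version (the column name and data are placeholders, since the original code is not shown):

    import re

    import pandas as pd

    # Compile the pattern once, outside the per-row lambda.
    NON_NUMERIC = re.compile(r"[^\d.]")

    # Placeholder data and column name.
    df = pd.DataFrame({"price": ["$1,234.50", "€99.90"]})
    df["price"] = df["price"].apply(lambda x: NON_NUMERIC.sub("", x))
    print(df["price"].tolist())  # ['1234.50', '99.90']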
    
    2. Try to find a better approach than calling np.float32 directly on each value, since it is 6-10 times slower than you probably expect. The following is not what you ultimately want, but it shows the issue:
    replace np.float32 with the built-in float and run your code.
    My result:
    (genel_deneme) ➜  derp time python a.py
    python a.py  14.79s user 0.60s system 102% cpu 15.066 total
    

    Find another way to end up with float32 values. More on this issue: https://stackoverflow.com/a/6053175/37491
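    One possible way (a sketch, with a placeholder column) to keep float32 in the final DataFrame without calling np.float32 per value: parse with the built-in float, then downcast the whole column in one vectorised step:

    import numpy as np
    import pandas as pd

    # Placeholder column of string values.
    df = pd.DataFrame({"value": ["1.5", "2.25", "3.0"]})

    # Parse with the cheap built-in float per value ...
    df["value"] = df["value"].apply(float)
    # ... then convert the whole column to float32 at once.
    df["value"] = df["value"].astype(np.float32)
    print(df["value"].dtype)  # float32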

    3. Divide the file and the work across subprocesses if you can. You already work on separate chunks of constant size, so you can split the file and handle each chunk in its own process using multiprocessing (or threads).
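
    A rough sketch of that idea using pandas' chunked reading and a worker pool (the file name, chunk size, and process_chunk function are placeholders, not the asker's actual code):

    from multiprocessing import Pool

    import pandas as pd

    def process_chunk(chunk):
        # Placeholder per-chunk work; the real cleaning logic would go here.
        return len(chunk)

    if __name__ == "__main__":
        # Read the CSV in fixed-size chunks and hand each chunk to a worker process.
        chunks = pd.read_csv("big_file.csv", chunksize=1_000_000)
        with Pool(processes=4) as pool:
            results = pool.map(process_chunk, chunks)
        print(sum(results))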
