For my application, I need to read multiple files with 15 million lines each, store them in a DataFrame, and save the DataFrame in HDF5 format.
I've already tried different
Well, my findings have little to do with pandas itself; they are mostly common Python pitfalls.
Your code:
(genel_deneme) ➜ derp time python a.py
python a.py 38.62s user 0.69s system 100% cpu 39.008 total
First, replace re.sub(r"[^\d.]", "", x) with a precompiled pattern and use that in your lambdas.
Result:
(genel_deneme) ➜ derp time python a.py
python a.py 26.42s user 0.69s system 100% cpu 26.843 total
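A minimal sketch of the precompiled version, assuming your lambdas currently call re.sub with the raw pattern string on every value. The names NON_NUMERIC and clean_number are made up for illustration and do not come from your code:

```python
import re

# Compile the pattern once at import time instead of handing the raw
# pattern string to re.sub for every single value.
NON_NUMERIC = re.compile(r"[^\d.]")


def clean_number(text):
    """Drop every character that is not a digit or a dot (hypothetical helper)."""
    return NON_NUMERIC.sub("", text)


# Can stand in for the re.sub call inside a lambda,
# e.g. lambda x: float(clean_number(x))
print(clean_number(" 1,234.50 USD"))  # prints 1234.50
```

Note that this keeps the behaviour of the original pattern exactly, so a leading minus sign is still stripped along with everything else.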
Next, replace np.float32 with the built-in float and run your code again.
Result:
(genel_deneme) ➜ derp time python a.py
python a.py 14.79s user 0.60s system 102% cpu 15.066 total
If you really need 32-bit floats, find another way to get them than calling np.float32 on every single value. More on this issue: https://stackoverflow.com/a/6053175/37491
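One possible way, as a sketch only: parse each value with the built-in float and cast the whole column to 32-bit in a single call afterwards. This assumes your values arrive as strings and that NumPy is available. parse_column and NON_NUMERIC are hypothetical names, not something from your code:

```python
import re

import numpy as np

NON_NUMERIC = re.compile(r"[^\d.]")


def parse_column(raw_values):
    """Clean and parse strings, then cast once to float32 (hypothetical helper)."""
    # Per-value work uses the cheap built-in float...
    parsed = [float(NON_NUMERIC.sub("", value)) for value in raw_values]
    # ...and the float32 conversion happens once for the whole column.
    return np.asarray(parsed, dtype=np.float32)


column = parse_column(["1.5", " 2.25kg", "$3"])
print(column.dtype)  # prints float32
```

With a DataFrame, the same idea would be to keep float in the converters and call astype(np.float32) on the columns once after loading. I have not timed this variant against your data.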