I've profiled some legacy code I've inherited with cProfile. There were a bunch of changes I've already made that have helped (like using simplejson's C extensions!).
Actually, your problem is not that file.write() takes 20% of your time. It's that 80% of the time you aren't in file.write()!
Writing to disk is slow. It simply takes a very large amount of time to write data out to disk, and there is almost nothing you can do to speed that up.
What you want is for that I/O time to be the biggest part of the program, so that your speed is limited by the speed of the hard disk, not by your processing time. The ideal is for file.write() to have 100% usage!
You can use mmap in Python, which might help. But I suspect you made a mistake while profiling, because 7k * 1500 in 20 seconds is about 0.5 MB/s. Do a test in which you write random lines of the same length, and you will see it's much faster than that.
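The test suggested above could look something like the following. This is only a rough sketch: it assumes "7k * 1500" means 7,000 lines of 1,500 bytes each, and the function name time_line_writes and its parameters are made up for illustration.

```python
import os
import tempfile
import time


def time_line_writes(num_lines=7000, line_length=1500):
    """Write num_lines lines of line_length ASCII characters each.

    Returns (elapsed_seconds, bytes_written). The defaults are an
    assumption about what "7k * 1500" refers to.
    """
    # Build the payload up front so that only the writes are timed.
    # bytes.hex() yields two characters per byte, so slice to the exact
    # length and leave one character for the newline.
    lines = [
        os.urandom(line_length // 2 + 1).hex()[:line_length - 1] + "\n"
        for _ in range(num_lines)
    ]
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        start = time.perf_counter()
        # newline="" disables newline translation so the byte count is exact.
        with open(path, "w", encoding="ascii", newline="") as f:
            for line in lines:
                f.write(line)
        elapsed = time.perf_counter() - start
        size = os.path.getsize(path)
    finally:
        os.remove(path)
    return elapsed, size


if __name__ == "__main__":
    elapsed, size = time_line_writes()
    print(f"wrote {size} bytes in {elapsed:.3f} s")
    if elapsed > 0:
        print(f"throughput: {size / elapsed / 1e6:.2f} MB/s")
```

If this reports far more than 0.5 MB/s on the same machine, the time is probably going somewhere other than the raw disk write.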
Batching the writes into groups of 500 did indeed speed up the writes significantly. For this test case, writing rows individually took 21.051 seconds in I/O, while writing in batches of 117 took 5.685 seconds to write the same number of rows. Batches of 500 took a total of only 0.266 seconds.
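For anyone who wants to try the same thing, here is one way the batching might be done. It is a minimal sketch, not the exact code used for the timings above: write_rows_batched is a hypothetical helper, and it assumes each row is already a string ending in a newline.

```python
import io


def write_rows_batched(f, rows, batch_size=500):
    """Write rows to f, joining up to batch_size rows per write() call.

    Returns the number of write() calls made. Assumes every row is a str
    that already ends with a newline.
    """
    batch = []
    calls = 0
    for row in rows:
        batch.append(row)
        if len(batch) >= batch_size:
            f.write("".join(batch))
            calls += 1
            batch.clear()
    # Flush whatever is left over after the last full batch.
    if batch:
        f.write("".join(batch))
        calls += 1
    return calls


if __name__ == "__main__":
    # In-memory demo; with a real file, pass the object returned by open().
    demo_rows = [f"row {i}\n" for i in range(1201)]
    buf = io.StringIO()
    print(write_rows_batched(buf, demo_rows, batch_size=500), "write calls")
```

The batch size is a tuning knob. How much it helps will depend on what each individual write was costing in the original code, so it is worth measuring a few values.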