I have a number of very large text files which I need to process, the largest being about 60 GB.
Each line has 54 characters in seven fields, and I want to remove the last three characters from each of the first three fields.
I'll add this answer to explain why buffering makes sense, and to offer one more solution.
You are getting breathtakingly bad performance. The article "Is it possible to speed-up python IO?" shows that a 10 GB read should take in the neighborhood of 3 minutes; sequential write is the same speed. So you're missing a factor of 30, and your performance target is still 10 times slower than what ought to be possible.
Almost certainly this kind of disparity lies in the number of head seeks the disk is doing. A head seek takes milliseconds; a single seek costs as much as several megabytes of sequential read/write, which makes it enormously expensive. Copy operations on the same disk require seeking back and forth between input and output. As has been stated, one way to reduce seeks is to buffer in such a way that many megabytes are read before anything is written to disk, and vice versa. If you can convince the Python I/O system to do this, great. Otherwise you can read and process lines into a list of strings, and write them out once perhaps 50 MB of output are ready. At that size, a seek induces a <10% performance hit relative to the data transfer itself.
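For example, the manual-buffering idea might look like this sketch (the function name and the 50 MB threshold are illustrative, not the asker's code):

```python
def copy_with_buffer(input_path, output_path, flush_bytes=50 * 1024 * 1024):
    """Read lines, transform them, and write in large batches so the
    disk head seeks between input and output far less often."""
    buffered = []
    buffered_size = 0
    with open(input_path) as src, open(output_path, "w") as dst:
        for line in src:
            out = line  # apply your per-line transformation here
            buffered.append(out)
            buffered_size += len(out)
            if buffered_size >= flush_bytes:  # ~50 MB accumulated
                dst.write(''.join(buffered))
                buffered = []
                buffered_size = 0
        dst.write(''.join(buffered))  # flush the remainder
```

Accumulating into a list and joining once per flush avoids quadratic string concatenation; only the flush threshold needs tuning.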
The other, very simple way to eliminate seeks between input and output files altogether is to use a machine with two physical disks and fully separate I/O channels for each: input from one, output to the other. If you do lots of big-file transformations, it's good to have a machine with this feature.
Your code is rather un-idiomatic and makes far more function calls than needed. A simpler version is:
```python
def ProcessLargeTextFile():
    with open("filepath") as r, open("output", "w") as w:
        for line in r:
            fields = line.split(' ')
            fields[0:3] = [fields[0][:-3],
                           fields[1][:-3],
                           fields[2][:-3]]
            w.write(' '.join(fields))
```
Also, I don't know of a modern filesystem that is slower than Windows. Since it appears you are using these huge data files as databases, have you considered using a real database?
Finally, if you are just interested in reducing file size, have you considered compressing / zipping the files?
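As an illustration of the compression idea, here is a sketch using Python's standard gzip module (the function name is my own); a gzipped file can still be streamed line by line later:

```python
import gzip

def compress_file(input_path, output_path):
    """Copy input_path into a gzip-compressed output_path, 1 MB at a time."""
    with open(input_path, "rb") as src, gzip.open(output_path, "wb") as dst:
        for chunk in iter(lambda: src.read(1 << 20), b""):
            dst.write(chunk)

# Reading it back later is just as easy:
# with gzip.open("data.txt.gz", "rt") as f:
#     for line in f:
#         ...
```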
As you don't seem to be limited by CPU, but rather by I/O, have you tried varying the third parameter of open()?
Indeed, this third parameter can be used to give the buffer size to be used for file operations!
Simply writing open("filepath", "r", 16777216) will use a 16 MB buffer when reading from the file. It should help.
Use the same for the output file, and measure/compare against the same file, keeping everything else identical.
Note: this is the same kind of optimization suggested by others, but here you get it for free, without changing your code and without having to buffer yourself.
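A sketch of what that looks like with both files buffered (the function name is illustrative, and the 16 MB figure is just a starting point to tune):

```python
BUF = 16 * 1024 * 1024  # 16 MB buffers for both files

def copy_with_big_buffers(src_path, dst_path, buf=BUF):
    # The third (buffering) argument to open() sets the buffer size,
    # so reads and writes hit the disk in large sequential runs.
    with open(src_path, "r", buf) as r, open(dst_path, "w", buf) as w:
        for line in r:
            w.write(line)
```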
Here's code for loading text files of any size without causing memory issues. It supports gigabyte-sized files and will run smoothly on any kind of machine; you just need to configure CHUNK_SIZE based on your system's RAM. The larger the CHUNK_SIZE, the more data is read at a time.
https://gist.github.com/iyvinjose/e6c1cb2821abd5f01fd1b9065cbc759d
Download the file data_loading_utils.py and import it into your code.
Usage:

```python
import data_loading_utils

file_name = 'file_name.ext'
CHUNK_SIZE = 1000000

def process_lines(line, eof, file_name):
    if not eof:
        # process data; line is one single line of the file
        pass
    else:
        # end of file reached
        pass

data_loading_utils.read_lines_from_file_as_data_chunks(file_name, chunk_size=CHUNK_SIZE, callback=process_lines)
```
The process_lines method is the callback function. It is called for every line, with the parameter line holding one single line of the file at a time.
You can configure the variable CHUNK_SIZE depending on your machine's hardware configuration.
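If you'd rather not depend on the gist, a minimal approximation of such a chunked reader might look like this (the function name mirrors the gist's, but the body is my own sketch, not the gist's actual code):

```python
def read_lines_from_file_as_data_chunks(file_name, chunk_size, callback):
    """Read the file in chunk_size-character blocks, re-assembling lines
    that straddle chunk boundaries, and hand each complete line to callback."""
    leftover = ''
    with open(file_name) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            lines = (leftover + chunk).split('\n')
            leftover = lines.pop()  # possibly incomplete last line
            for line in lines:
                callback(line, False, file_name)
    callback(leftover, True, file_name)  # final call signals end of file
```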
Those seem like very large files... Why are they so large? What processing are you doing per line? Why not use a database with some map-reduce calls (if appropriate), or simple operations on the data? The point of a database is to abstract away the handling and management of large amounts of data that can't all fit in memory.
You can start to play with the idea with sqlite3, which just uses flat files as databases. If you find the idea useful, then upgrade to something a little more robust and versatile, like PostgreSQL.
Create a database:

```python
import sqlite3

conn = sqlite3.connect('pts.db')
c = conn.cursor()
```

Create a table:

```python
c.execute('''CREATE TABLE ptsdata (filename, lineNumber, x, y, z)''')
```
Then use one of the algorithms above to insert all the lines and points into the database by calling

```python
c.execute("INSERT INTO ptsdata VALUES (?, ?, ?, ?, ?)", (filename, line_number, x, y, z))
```
Now how you use it depends on what you want to do. For example, to work with all the points in a file, run a query:

```python
c.execute("SELECT lineNumber, x, y, z FROM ptsdata WHERE filename=? ORDER BY lineNumber ASC", ('file.txt',))
```
And get n lines at a time from this query with:

```python
c.fetchmany(size=n)
```
I'm sure there is a better wrapper for the SQL statements somewhere, but you get the idea.
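Putting the pieces together with parameterized queries (the table layout, file names, and sample values are illustrative):

```python
import sqlite3

conn = sqlite3.connect('pts.db')
c = conn.cursor()
c.execute('CREATE TABLE IF NOT EXISTS ptsdata (filename, lineNumber, x, y, z)')

# Use ? placeholders so values are escaped safely; never interpolate strings.
points = [('file.txt', 1, 1.0, 2.0, 3.0),
          ('file.txt', 2, 4.0, 5.0, 6.0)]
c.executemany('INSERT INTO ptsdata VALUES (?, ?, ?, ?, ?)', points)
conn.commit()

c.execute('SELECT lineNumber, x, y, z FROM ptsdata '
          'WHERE filename=? ORDER BY lineNumber ASC', ('file.txt',))
while True:
    batch = c.fetchmany(size=1000)
    if not batch:
        break
    for line_number, x, y, z in batch:
        pass  # process each point here
```

fetchmany keeps memory bounded: only one batch of rows is materialized at a time, no matter how large the table grows.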