Split large files using Python


I'm having trouble splitting large files (around 10 GB, say). The basic idea is simply to read the lines and group every 40000 lines or so into one file. But there are tw


5 answers

走了就别回头了

2020-12-28 21:33

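NUM_OF_LINES = 40000
filename = 'myinput.txt'
with open(filename) as fin:
    # Text mode ("w") matches the str lines read from fin; "wb" fails on Python 3.
    fout = open("output0.txt", "w")
    for i, line in enumerate(fin):
        fout.write(line)
        if (i + 1) % NUM_OF_LINES == 0:
            # Chunk is full: close it and open the next numbered file.
            # Integer division (//) keeps the file index an int on Python 3.
            fout.close()
            fout = open("output%d.txt" % (i // NUM_OF_LINES + 1), "w")

    # Note: if the line count is an exact multiple of NUM_OF_LINES,
    # the last file opened above ends up empty.
    fout.close()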


小蘑菇

2020-12-28 21:49
If there's nothing special about having a specific number of file lines in each file, the readlines() function also accepts a size 'hint' parameter that behaves like this:

If given an optional parameter sizehint, it reads that many bytes from the file and enough more to complete a line, and returns the lines from that. This is often used to allow efficient reading of a large file by lines, but without having to load the entire file in memory. Only complete lines will be returned.

...so you could write that code something like this:
```
# Assume that an average line is about 80 characters long and that we want
# about 40K lines in each file.

SIZE_HINT = 80 * 40000

fileNumber = 0
with open("inputFile.txt", "rt") as f:
    while True:
        # Returns a list of complete lines totalling roughly SIZE_HINT in size.
        buf = f.readlines(SIZE_HINT)
        if not buf:
            # We've read the entire file in, so we're done.
            break
        # buf is a list of strings, so use writelines() rather than write().
        with open("outFile%d.txt" % fileNumber, "wt") as outFile:
            outFile.writelines(buf)
        fileNumber += 1
```

青春惊慌失措

2020-12-28 21:54

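import fileinput

chunk_size = 40000
filename = 'myinput.txt'  # assumed input path; not defined in the original snippet
fout = None
for (i, line) in enumerate(fileinput.FileInput(filename)):
    if i % chunk_size == 0:
        # Close the previous chunk, if any, and start the next one.
        if fout: fout.close()
        fout = open('output%d.txt' % (i // chunk_size), 'w')
    fout.write(line)
if fout: fout.close()  # fout is still None if the input was empty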


有刺的猬

2020-12-28 21:55
For a 10GB file, the second approach is clearly the way to go. Here is an outline of what you need to do:
1. Open the input file.
2. Open the first output file.
3. Read one line from the input file and write it to the output file.
4. Maintain a count of how many lines you've written to the current output file; as soon as it reaches 40000, close the output file, and open the next one.
5. Repeat steps 3-4 until you've reached the end of the input file.
6. Close both files.
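
The steps above can be sketched roughly as follows. This is only a minimal sketch under a few assumptions: the function name `split_by_lines`, the `out_dir` parameter, and the `part%d.txt` naming are made up for illustration and are not part of the question. Each output file is opened lazily, just before the first line is written to it, so no empty file is left behind when the input has an exact multiple of 40000 lines.

```python
import os


def split_by_lines(src_path, out_dir, lines_per_file=40000):
    """Copy src_path into numbered files of at most lines_per_file lines.

    Returns the list of output paths, in order. All names here are
    hypothetical and chosen only for this example.
    """
    out_paths = []
    out = None
    count = 0  # lines written to the current output file
    with open(src_path, "r") as src:            # step 1: open the input
        for line in src:                         # step 3: one line at a time
            if out is None:                      # step 2: open an output file
                path = os.path.join(out_dir, "part%d.txt" % len(out_paths))
                out = open(path, "w")
                out_paths.append(path)
            out.write(line)
            count += 1
            if count == lines_per_file:          # step 4: chunk is full
                out.close()
                out = None
                count = 0
    if out is not None:                          # step 6: close the last file
        out.close()
    return out_paths
```

Only one line is held in memory at a time, so memory use should not depend on the size of the input file.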
不思量自难忘°

2020-12-28 21:55

Obviously, as you are doing work on the file, you will need to iterate over the file's contents in some way -- whether you do that manually or you let a part of the Python API do it for you (e.g. the readlines() method) is not important. In big O analysis, this means you will spend O(n) time (n being the size of the file).

But reading the whole file into memory also requires O(n) space. Although we sometimes do need to read a 10 GB file into memory, your particular problem does not require it. We can iterate over the file object directly. The file object itself does take some space, of course, but there is no reason to hold the contents of the file twice in two different forms.

Therefore, I would go with your second solution.
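
To illustrate iterating over the file object directly, here is one possible sketch that uses `itertools.islice` to pull a bounded number of lines at a time. The helper names `iter_chunks` and `split_file` and the `output%d.txt` pattern are assumptions made for this example. Peak memory is bounded by one chunk of lines rather than by the size of the whole file.

```python
from itertools import islice


def iter_chunks(lines, chunk_size):
    """Yield lists of up to chunk_size items from any iterable of lines."""
    it = iter(lines)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return  # input exhausted
        yield chunk


def split_file(src_path, out_pattern="output%d.txt", chunk_size=40000):
    """Write each chunk of src_path to out_pattern % n (hypothetical naming)."""
    written = 0
    with open(src_path) as src:
        # The file object is itself a lazy iterator over lines.
        for n, chunk in enumerate(iter_chunks(src, chunk_size)):
            with open(out_pattern % n, "w") as out:
                out.writelines(chunk)
            written += 1
    return written  # number of output files created
```

This trades a little memory (one chunk of lines) for simpler bookkeeping than counting lines by hand.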
