I have several large (30+ million lines) text databases which I am cleaning up with the following code. I need to split each file into parts of 1 million lines or less and retain the header in every part.
You can do something like this:
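with open('file') as file:
    lines = file.readlines()

# The first line is the header; everything after it is data.
headers = lines[0:1]
rest = lines[1:]
chunk_size = 4

def chunks(lst, chunk_size):
    # Yield successive slices of at most chunk_size items.
    # range() works on both Python 2 and 3 (the original used xrange).
    for i in range(0, len(lst), chunk_size):
        yield lst[i:i + chunk_size]

def write_rows(rows, file):
    for row in rows:
        file.write('%s' % row)

part = 1
for chunk in chunks(rest, chunk_size):
    # Every part file starts with the header, followed by one chunk.
    with open('part%d' % part, 'w') as file:
        write_rows(headers, file)
        write_rows(chunk, file)
    part += 1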
Here's a test run:
$ cat file && python mkt.py && for p in part*; do echo ---- $p; cat $p; done
header
1
2
3
4
5
6
7
8
9
10
11
12
13
14
---- part1
header
1
2
3
4
---- part2
header
5
6
7
8
---- part3
header
9
10
11
12
---- part4
header
13
14
Obviously, change the value of chunk_size, and how you fetch the headers, depending on how many header lines there are.
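To make that adjustment concrete, here is a minimal sketch of the same in-memory idea with both values as parameters. The function name `split_lines` and its `header_count` parameter are illustrative assumptions, not part of the script above; it works on a list of lines and writes no files.

```python
def split_lines(lines, header_count=1, chunk_size=4):
    """Return a list of parts; each part is the header lines plus one chunk."""
    headers = lines[:header_count]
    rest = lines[header_count:]
    # One part per chunk_size-sized slice of the data lines.
    return [headers + rest[i:i + chunk_size]
            for i in range(0, len(rest), chunk_size)]


# Same data as the test run above: one header line followed by 1..14.
lines = ['header\n'] + ['%d\n' % n for n in range(1, 15)]
parts = split_lines(lines, header_count=1, chunk_size=4)
print(len(parts))   # 4
print(parts[-1])    # ['header\n', '13\n', '14\n']
```

With 14 data lines and a chunk size of 4 this gives four parts, the last one holding only the two leftover lines, which matches the test run.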
Edit - to do this line-by-line to avoid memory issues, you can do something like this:
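from itertools import islice

headers_count = 5
chunk_size = 250000

with open('file') as fin:
    # Read the header lines once; they are repeated in every part.
    headers = list(islice(fin, headers_count))

    part = 1
    while True:
        # Lazily take the next chunk_size lines from the input.
        line_iter = islice(fin, chunk_size)
        try:
            # Peek at the first line so no empty part file is created.
            # next() works on both Python 2.6+ and 3.
            first_line = next(line_iter)
        except StopIteration:
            break
        with open('part%d' % part, 'w') as fout:
            for line in headers:
                fout.write(line)
            fout.write(first_line)
            for line in line_iter:
                fout.write(line)
        part += 1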
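If you need this for several files, the same loop can be wrapped in a function. The following is only a sketch under assumptions: the name `split_file`, its parameters, and the default of 1,000,000 lines per part (taken from the question) are mine, not part of the script above.

```python
from itertools import islice


def split_file(path, lines_per_part=1000000, header_count=1, prefix='part'):
    """Split path into files named prefix1, prefix2, ...; return their names."""
    names = []
    with open(path) as fin:
        # Header lines are read once and repeated in every part.
        headers = list(islice(fin, header_count))
        part = 1
        while True:
            chunk = islice(fin, lines_per_part)
            # Peek at one line so that no empty part file is created.
            first_line = next(chunk, None)
            if first_line is None:
                break
            name = '%s%d' % (prefix, part)
            with open(name, 'w') as fout:
                fout.writelines(headers)
                fout.write(first_line)
                fout.writelines(chunk)
            names.append(name)
            part += 1
    return names
```

Calling `split_file('file', lines_per_part=250000, header_count=5)` should then write the same part1, part2, ... files as the script above, holding only one line in memory at a time apart from the headers.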
Test case (put the above in a file called mkt2.py):
Make a file containing a 5-line header followed by 1234567 data lines:
with open('file', 'w') as fout:
    for i in range(5):
        fout.write(10 * ('header %d ' % i) + '\n')
    for i in range(1234567):
        fout.write(10 * ('line %d ' % i) + '\n')
Shell script to test (put it in a file called rt.sh):
rm part*
echo ---- file
head -n7 file
tail -n2 file
python mkt2.py
for i in part*; do
    echo ---- $i
    head -n7 $i
    tail -n2 $i
done
Sample output:
$ sh rt.sh
---- file
header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0
header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1
header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2
header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3
header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4
line 0 line 0 line 0 line 0 line 0 line 0 line 0 line 0 line 0 line 0
line 1 line 1 line 1 line 1 line 1 line 1 line 1 line 1 line 1 line 1
line 1234565 line 1234565 line 1234565 line 1234565 line 1234565 line 1234565 line 1234565 line 1234565 line 1234565 line 1234565
line 1234566 line 1234566 line 1234566 line 1234566 line 1234566 line 1234566 line 1234566 line 1234566 line 1234566 line 1234566
---- part1
header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0
header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1
header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2
header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3
header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4
line 0 line 0 line 0 line 0 line 0 line 0 line 0 line 0 line 0 line 0
line 1 line 1 line 1 line 1 line 1 line 1 line 1 line 1 line 1 line 1
line 249998 line 249998 line 249998 line 249998 line 249998 line 249998 line 249998 line 249998 line 249998 line 249998
line 249999 line 249999 line 249999 line 249999 line 249999 line 249999 line 249999 line 249999 line 249999 line 249999
---- part2
header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0
header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1
header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2
header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3
header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4
line 250000 line 250000 line 250000 line 250000 line 250000 line 250000 line 250000 line 250000 line 250000 line 250000
line 250001 line 250001 line 250001 line 250001 line 250001 line 250001 line 250001 line 250001 line 250001 line 250001
line 499998 line 499998 line 499998 line 499998 line 499998 line 499998 line 499998 line 499998 line 499998 line 499998
line 499999 line 499999 line 499999 line 499999 line 499999 line 499999 line 499999 line 499999 line 499999 line 499999
---- part3
header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0
header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1
header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2
header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3
header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4
line 500000 line 500000 line 500000 line 500000 line 500000 line 500000 line 500000 line 500000 line 500000 line 500000
line 500001 line 500001 line 500001 line 500001 line 500001 line 500001 line 500001 line 500001 line 500001 line 500001
line 749998 line 749998 line 749998 line 749998 line 749998 line 749998 line 749998 line 749998 line 749998 line 749998
line 749999 line 749999 line 749999 line 749999 line 749999 line 749999 line 749999 line 749999 line 749999 line 749999
---- part4
header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0
header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1
header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2
header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3
header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4
line 750000 line 750000 line 750000 line 750000 line 750000 line 750000 line 750000 line 750000 line 750000 line 750000
line 750001 line 750001 line 750001 line 750001 line 750001 line 750001 line 750001 line 750001 line 750001 line 750001
line 999998 line 999998 line 999998 line 999998 line 999998 line 999998 line 999998 line 999998 line 999998 line 999998
line 999999 line 999999 line 999999 line 999999 line 999999 line 999999 line 999999 line 999999 line 999999 line 999999
---- part5
header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0
header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1
header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2
header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3
header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4
line 1000000 line 1000000 line 1000000 line 1000000 line 1000000 line 1000000 line 1000000 line 1000000 line 1000000 line 1000000
line 1000001 line 1000001 line 1000001 line 1000001 line 1000001 line 1000001 line 1000001 line 1000001 line 1000001 line 1000001
line 1234565 line 1234565 line 1234565 line 1234565 line 1234565 line 1234565 line 1234565 line 1234565 line 1234565 line 1234565
line 1234566 line 1234566 line 1234566 line 1234566 line 1234566 line 1234566 line 1234566 line 1234566 line 1234566 line 1234566
Timing of the above:
real 0m0.935s
user 0m0.708s
sys 0m0.200s
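The shell script only shows the first and last lines of each part. If you want a stricter check, something along these lines might do; it is a rough sketch, and the name `check_parts` and its parameters are assumptions of mine. It expects the part names in the order they were written (a plain `part*` glob would sort part10 before part2).

```python
from itertools import chain, islice


def check_parts(original, part_names, header_count=1):
    """Return True if the parts, minus their headers, reproduce the original."""
    def body(name, expected_headers):
        # Yield the data lines of one part after checking its header block.
        with open(name) as f:
            if list(islice(f, header_count)) != expected_headers:
                raise ValueError('header mismatch in %s' % name)
            for line in f:
                yield line

    with open(original) as fin:
        headers = list(islice(fin, header_count))
        # Lazily chain the data lines of all parts, in the given order.
        rebuilt = chain.from_iterable(body(n, headers) for n in part_names)
        sentinel = object()
        for line in fin:
            if next(rebuilt, sentinel) != line:
                return False
        # The parts must not contain any extra lines either.
        return next(rebuilt, sentinel) is sentinel
```

For the test case above that would be `check_parts('file', ['part%d' % i for i in range(1, 6)], header_count=5)`. It reads everything line by line, so it should be fine on the 30+ million line files as well.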
Hope this helps.