Chunk a text database into N equal blocks and retain the header

暖寄归人 2021-01-21 22:02

I have several large (30+ million lines) text databases that I am cleaning up with the following code. I need to split each file into pieces of 1 million lines or fewer and retain the head

1 answer
  • 2021-01-21 22:24

    You can do something like this:

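    # Read the whole file into memory (fine for small files only).
    with open('file') as fin:
        lines = fin.readlines()

    headers = lines[0:1]  # the first line is the header
    rest = lines[1:]
    chunk_size = 4

    def chunks(lst, chunk_size):
        """Yield successive slices of lst, each at most chunk_size long."""
        for i in range(0, len(lst), chunk_size):
            yield lst[i:i + chunk_size]

    def write_rows(rows, fout):
        for row in rows:
            fout.write(row)

    # Write part1, part2, ..., each starting with the header.
    for part, chunk in enumerate(chunks(rest, chunk_size), start=1):
        with open('part%d' % part, 'w') as fout:
            write_rows(headers, fout)
            write_rows(chunk, fout)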
    

    Here's a test run:

    $ cat file && python mkt.py && for p in part*; do echo ---- $p; cat $p; done
    header
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    ---- part1
    header
    1
    2
    3
    4
    ---- part2
    header
    5
    6
    7
    8
    ---- part3
    header
    9
    10
    11
    12
    ---- part4
    header
    13
    14
    

    Adjust chunk_size to the number of data lines you want per part (for example 1000000), and change the lines[0:1] and lines[1:] slices if your header spans more than one line.

    Credits:

    • https://stackoverflow.com/a/312464/438544
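
    Before running this on a large file, it can help to know how many part files to expect. The helper below is only a sketch; count_parts and its parameter names are not part of the answer above and are made up for illustration. It applies ceiling division to the number of data lines, because the last part holds whatever is left over.

```python
def count_parts(total_lines, headers_count, chunk_size):
    """Number of part files for a file with total_lines lines in total.

    Hypothetical helper: header lines are not counted as data, and the
    final part may contain fewer than chunk_size data lines.
    """
    data_lines = max(total_lines - headers_count, 0)
    # Integer ceiling division: round up so the remainder gets its own part.
    return (data_lines + chunk_size - 1) // chunk_size


# The small test run above: 1 header line + 14 data lines, 4 per part.
print(count_parts(15, 1, 4))  # 4
```

    For a file with one header line and 30 million data lines, a chunk_size of 1000000 gives 30 parts.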

    Edit: to process the file line by line and avoid loading it all into memory, you can do something like this:

    from itertools import islice

    headers_count = 5
    chunk_size = 250000

    with open('file') as fin:
        # Read the header lines once; they are repeated in every part.
        headers = list(islice(fin, headers_count))

        part = 1
        while True:
            # Lazily take up to chunk_size lines from the input.
            line_iter = islice(fin, chunk_size)
            try:
                first_line = next(line_iter)
            except StopIteration:
                break  # input exhausted, so no empty part file is created
            with open('part%d' % part, 'w') as fout:
                for line in headers:
                    fout.write(line)
                fout.write(first_line)
                for line in line_iter:
                    fout.write(line)
            part += 1
    

    Credits:

    • Python how to read N number of lines at a time
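
    If you need to run this over several databases, the same line-by-line loop can be wrapped in a function. This is a sketch rather than a definitive implementation: the function name split_with_header, its parameters, and the default of 1000000 lines per part (taken from the question) are assumptions.

```python
from itertools import islice


def split_with_header(src_path, headers_count=1, chunk_size=1000000,
                      prefix='part'):
    """Split src_path into files holding at most chunk_size data lines.

    The first headers_count lines are repeated at the top of every part.
    Returns the list of file names written, in order.
    """
    written = []
    with open(src_path) as fin:
        headers = list(islice(fin, headers_count))
        part = 1
        while True:
            chunk = islice(fin, chunk_size)
            # Peek at one line so that no empty part file is created.
            first_line = next(chunk, None)
            if first_line is None:
                break
            name = '%s%d' % (prefix, part)
            with open(name, 'w') as fout:
                fout.writelines(headers)
                fout.write(first_line)
                fout.writelines(chunk)
            written.append(name)
            part += 1
    return written
```

    Calling split_with_header('file', headers_count=5, chunk_size=250000) should produce the same part1, part2, ... files as the script above. Only one line is held in memory at a time, apart from the headers.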

    Test case (save the code above in a file called mkt2.py):

    Create a file with a 5-line header followed by 1234567 data lines:

    with open('file', 'w') as fout:
      for i in range(5):
        fout.write(10 * ('header %d ' % i) + '\n')
      for i in range(1234567):
        fout.write(10 * ('line %d ' % i) + '\n')
    

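    Shell script to run the test (save it as rt.sh):

    # -f: do not fail if there are no old part files yet
    rm -f part*
    echo ---- file
    head -n7 file
    tail -n2 file
    
    python mkt2.py
    
    # Show the first 7 and last 2 lines of every part
    for i in part*; do
      echo ---- "$i"
      head -n7 "$i"
      tail -n2 "$i"
    done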
    

    Sample output:

    $ sh rt.sh 
    ---- file
    header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 
    header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 
    header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 
    header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 
    header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 
    line 0 line 0 line 0 line 0 line 0 line 0 line 0 line 0 line 0 line 0 
    line 1 line 1 line 1 line 1 line 1 line 1 line 1 line 1 line 1 line 1 
    line 1234565 line 1234565 line 1234565 line 1234565 line 1234565 line 1234565 line 1234565 line 1234565 line 1234565 line 1234565 
    line 1234566 line 1234566 line 1234566 line 1234566 line 1234566 line 1234566 line 1234566 line 1234566 line 1234566 line 1234566 
    ---- part1
    header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 
    header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 
    header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 
    header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 
    header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 
    line 0 line 0 line 0 line 0 line 0 line 0 line 0 line 0 line 0 line 0 
    line 1 line 1 line 1 line 1 line 1 line 1 line 1 line 1 line 1 line 1 
    line 249998 line 249998 line 249998 line 249998 line 249998 line 249998 line 249998 line 249998 line 249998 line 249998 
    line 249999 line 249999 line 249999 line 249999 line 249999 line 249999 line 249999 line 249999 line 249999 line 249999 
    ---- part2
    header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 
    header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 
    header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 
    header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 
    header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 
    line 250000 line 250000 line 250000 line 250000 line 250000 line 250000 line 250000 line 250000 line 250000 line 250000 
    line 250001 line 250001 line 250001 line 250001 line 250001 line 250001 line 250001 line 250001 line 250001 line 250001 
    line 499998 line 499998 line 499998 line 499998 line 499998 line 499998 line 499998 line 499998 line 499998 line 499998 
    line 499999 line 499999 line 499999 line 499999 line 499999 line 499999 line 499999 line 499999 line 499999 line 499999 
    ---- part3
    header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 
    header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 
    header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 
    header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 
    header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 
    line 500000 line 500000 line 500000 line 500000 line 500000 line 500000 line 500000 line 500000 line 500000 line 500000 
    line 500001 line 500001 line 500001 line 500001 line 500001 line 500001 line 500001 line 500001 line 500001 line 500001 
    line 749998 line 749998 line 749998 line 749998 line 749998 line 749998 line 749998 line 749998 line 749998 line 749998 
    line 749999 line 749999 line 749999 line 749999 line 749999 line 749999 line 749999 line 749999 line 749999 line 749999 
    ---- part4
    header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 
    header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 
    header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 
    header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 
    header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 
    line 750000 line 750000 line 750000 line 750000 line 750000 line 750000 line 750000 line 750000 line 750000 line 750000 
    line 750001 line 750001 line 750001 line 750001 line 750001 line 750001 line 750001 line 750001 line 750001 line 750001 
    line 999998 line 999998 line 999998 line 999998 line 999998 line 999998 line 999998 line 999998 line 999998 line 999998 
    line 999999 line 999999 line 999999 line 999999 line 999999 line 999999 line 999999 line 999999 line 999999 line 999999 
    ---- part5
    header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 
    header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 
    header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 
    header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 
    header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 
    line 1000000 line 1000000 line 1000000 line 1000000 line 1000000 line 1000000 line 1000000 line 1000000 line 1000000 line 1000000 
    line 1000001 line 1000001 line 1000001 line 1000001 line 1000001 line 1000001 line 1000001 line 1000001 line 1000001 line 1000001 
    line 1234565 line 1234565 line 1234565 line 1234565 line 1234565 line 1234565 line 1234565 line 1234565 line 1234565 line 1234565 
    line 1234566 line 1234566 line 1234566 line 1234566 line 1234566 line 1234566 line 1234566 line 1234566 line 1234566 line 1234566 
    

    Timing of the above:

    real    0m0.935s
    user    0m0.708s
    sys     0m0.200s
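
    The shell script only spot-checks the first and last lines of each part. For a stricter check, something like the following sketch could compare the parts against the original file line by line. The name check_parts and its arguments are hypothetical, and the caller is assumed to pass the part files in numeric order (part1, part2, ..., part10), which is not the order a plain sorted glob returns.

```python
from itertools import islice


def check_parts(src_path, part_paths, headers_count):
    """Return True if the parts reproduce the source file exactly.

    Every part must start with the source's header lines, and the data
    lines of all parts, read in the given order, must equal the source's
    data lines with nothing missing and nothing extra.
    """
    with open(src_path) as src:
        headers = list(islice(src, headers_count))
        for path in part_paths:
            with open(path) as part:
                if list(islice(part, headers_count)) != headers:
                    return False
                for line in part:
                    # Each data line must match the next unread source line.
                    if line != next(src, None):
                        return False
        # The source must be exhausted once all parts are consumed.
        return next(src, None) is None
```

    For the test case above, check_parts('file', ['part1', 'part2', 'part3', 'part4', 'part5'], 5) would be expected to return True.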
    

    Hope this helps.
