Extract specific text lines?

庸人自扰 2021-02-03 15:09

I have a large text file of several hundred thousand lines. I have to extract 30,000 specific lines that are scattered at random spots throughout the file. This is the program I have to extr

10 answers
  •  闹比i (OP)
    2021-02-03 15:20

    1. Try to read the whole file

    One speed-up is to read the whole file into memory if that is possible; otherwise read it in chunks. You said 'several hundred thousand lines'; let's say it is 1 million lines of 100 characters each, i.e. around 100 MB. If you have that much free memory (I assume you do), just do this:

    # Read every line of the big file into memory at once
    big_file = open('C:\\gbigfile.txt', 'r')
    big_file_lines = big_file.readlines()
    big_file.close()
    # Copy only the lines that contain the marker
    small_file3 = open('C:\\small_file3.txt', 'w')
    for line in big_file_lines:
        if 'S0414' in line:
            small_file3.write(line)
    small_file3.close()
    

    Time this against the original version and see if it makes a difference; I think it will.

    But if your file is really big (several GB), you can read it in chunks, e.g. 100 MB at a time, split each chunk into lines and search those. Don't forget to join the partial lines at each chunk boundary (I can elaborate if this is your case).
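    The chunked approach can be sketched roughly as follows. This is only an illustration, not a tuned implementation: the function name matching_lines_chunked and the tiny demo input are made up for the example, and it assumes the file is opened in text mode with "\n" line endings.

```python
import io

def matching_lines_chunked(stream, needle, chunk_size=100 * 1024 * 1024):
    """Yield lines containing `needle`, reading `stream` in fixed-size chunks."""
    leftover = ""  # partial line carried over from the previous chunk
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        lines = (leftover + chunk).split("\n")
        # The last piece is an incomplete line (or "" if the chunk ended
        # exactly on a newline), so hold it back for the next round.
        leftover = lines.pop()
        for line in lines:
            if needle in line:
                yield line + "\n"
    # The file may end without a trailing newline.
    if leftover and needle in leftover:
        yield leftover

# Tiny in-memory stand-in for the big file; chunk_size=7 forces line splits.
demo = io.StringIO("a S0414 1\nskip me\nb S0414 2\nlast S0414")
chunked_result = list(matching_lines_chunked(demo, "S0414", chunk_size=7))
```

    With a real file you would pass the object returned by open() and keep the default chunk size, writing each yielded line to the output file.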

    file.readlines returns a list containing all the lines of data in the file. If given an optional parameter sizehint, it reads that many bytes from the file and enough more to complete a line, and returns the lines from that. This is often used to allow efficient reading of a large file by lines, but without having to load the entire file in memory. Only complete lines will be returned.
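    As a rough sketch of that batched use of readlines: the helper name matching_lines_batched and the 1 MB default hint are assumptions chosen for illustration, and an in-memory io.StringIO stands in for the real file.

```python
import io

def matching_lines_batched(stream, needle, hint=1024 * 1024):
    """Yield lines containing `needle`, pulling roughly `hint` bytes per batch."""
    while True:
        # readlines(hint) returns only complete lines, so no joining is needed.
        batch = stream.readlines(hint)
        if not batch:
            break
        for line in batch:
            if needle in line:
                yield line

# Tiny in-memory stand-in for the big file; a small hint forces several batches.
demo = io.StringIO("a S0414 1\nskip me\nb S0414 2\nlast S0414")
batched_result = list(matching_lines_batched(demo, "S0414", hint=8))
```

    Because every batch ends on a line boundary, this avoids the partial-line bookkeeping that manual chunking requires.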

    Also see following link for speed difference between line by line vs entire file reading. http://handyfloss.wordpress.com/2008/02/15/python-speed-vs-memory-tradeoff-reading-files/

    2. Try to write the whole file

    You can also collect the matching lines and write them all at once at the end, though I am not sure it will help much:

    # Read every line of the big file into memory at once
    big_file = open('C:\\gbigfile.txt', 'r')
    big_file_lines = big_file.readlines()
    big_file.close()
    # Collect the matching lines first
    small_file_lines = []
    for line in big_file_lines:
        if 'S0414' in line:
            small_file_lines.append(line)
    # Write them out with a single call
    small_file3 = open('C:\\small_file3.txt', 'w')
    small_file3.write("".join(small_file_lines))
    small_file3.close()
    

    3. Try filter

    You can also try filter instead of the loop and see if it makes a difference:

    # In Python 3, filter returns a lazy iterator; list() materializes it
    small_file_lines = list(filter(lambda line: line.find('S0414') >= 0, big_file_lines))
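    For completeness, here is a minimal sketch that feeds filter straight into writelines, so no intermediate list is built. The helper copy_matching_lines and the temporary file names are hypothetical, used only so the example runs on its own.

```python
import os
import tempfile

def copy_matching_lines(src_path, dst_path, needle):
    """Copy the lines of src_path that contain `needle` into dst_path."""
    with open(src_path, "r") as src, open(dst_path, "w") as dst:
        # Iterating a file object yields one line at a time; filter keeps
        # the matching ones and writelines consumes them as they arrive.
        dst.writelines(filter(lambda line: needle in line, src))

# Demo on a throwaway file instead of the real big file.
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "big.txt")
    dst_path = os.path.join(tmp, "small.txt")
    with open(src_path, "w") as f:
        f.write("x S0414\nno match\ny S0414\n")
    copy_matching_lines(src_path, dst_path, "S0414")
    with open(dst_path) as f:
        filter_result = f.read()
```

    Whether this beats the explicit loop depends on the data, so it is worth timing both.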
    
