How to grep lines between two patterns in a big file with Python

青春惊慌失措 2021-01-03 12:45

I have a very big file, like this:

[PATTERN1]
line1
line2
line3 
...
...
[END PATTERN]
[PATTERN2]
line1 
line2
...
...
[END PATTERN]

I need to extract

5 Answers
  • 2021-01-03 12:54

    I am kind of a new python programmer so I only barely understand your solution, but it seems like there is a lot of unnecessary iteration going on. First you read in the file, then you iterate through the file once for each item in name_list. Also, I don't know if you plan to iterate over newfile later to actually write it to a file.

    Here is how I would do it, though I realize it isn't the most pythonic looking solution. You'll only iterate over the file once though. (As a disclaimer, I didn't test this out.)

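    # Map each start marker to the end marker that closes it.
    patterns = {'startPattern1': "endPattern1", 'startPattern2': "endPattern2", 'startPattern3': "endPattern3"}

    targetEndPattern = None

    # The with statement closes both files even if an error occurs.
    with open(filenameIn, 'r') as fileIn, open(filenameOut, 'w') as fileOut:
        for line in fileIn:
            stripped = line.rstrip('\n')  # lines read from a file keep their newline
            if targetEndPattern is not None:
                if stripped == targetEndPattern:
                    targetEndPattern = None
                else:
                    fileOut.write(line)  # line already ends with a newline
            elif stripped in patterns:
                targetEndPattern = patterns[stripped]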
    

    EDIT: If you are expecting the patterns in a certain order, then this solution would have to be revised. I wrote this under the assumption that the order of the patterns doesn't matter but each start pattern matches a specific end pattern.
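
    To make the one-pass idea easier to try out, here is a minimal sketch of the same state machine written as a function over any iterable of lines. The function name extract_between and the sample markers are made up for illustration, and io.StringIO stands in for a real file.

```python
import io

def extract_between(lines, patterns):
    """Yield lines found between a start marker and its matching end marker.

    `patterns` maps each start marker to the end marker that closes it.
    Markers are compared against whole lines, ignoring the trailing newline.
    """
    target_end = None
    for line in lines:
        stripped = line.rstrip('\n')
        if target_end is not None:
            if stripped == target_end:
                target_end = None            # section closed
            else:
                yield line                   # inside a section: keep the line
        elif stripped in patterns:
            target_end = patterns[stripped]  # section opened

# Hypothetical sample input, shaped like the file in the question.
sample = io.StringIO(
    "[PATTERN1]\nline1\nline2\n[END PATTERN]\n"
    "noise\n"
    "[PATTERN2]\nline3\n[END PATTERN]\n"
)
patterns = {'[PATTERN1]': '[END PATTERN]', '[PATTERN2]': '[END PATTERN]'}
result = list(extract_between(sample, patterns))
```

    Because it is a generator, nothing is held in memory beyond the current line; with a real file you would pass the open file object instead of the StringIO.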

  • 2021-01-03 13:00

    I think this does the same thing your code does:

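    newfile = []
    pattern = None

    with open('myfile') as f:
        for line in f:
            stripped = line.strip()  # drop the trailing newline before testing for "]"
            if stripped.startswith("[") and stripped.endswith("]"):
                pattern = stripped[1:-1]
                if pattern == "END PATTERN":
                    pattern = None
                continue
            # name_list is the list of wanted section names from the question's code
            elif pattern is not None and pattern in name_list:
                newfile.append(line)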
    

    This way you go through all the lines only once, and fill your list as you go.

  • 2021-01-03 13:09

    Use something like

    import re

    START_PATTERN = '^START-PATTERN$'
    END_PATTERN = '^END-PATTERN$'

    # Open the output once; reopening it with 'w' for every section would truncate it.
    with open('myfile') as file, open('my_new_file.txt', 'w') as newfile:
        match = False

        for line in file:
            if re.match(START_PATTERN, line):
                match = True
                continue
            elif re.match(END_PATTERN, line):
                match = False
                continue
            elif match:
                newfile.write(line)  # line already ends with a newline
    

    This iterates over the file without reading it all into memory. It also writes directly to the new file rather than appending to a list in memory, which could itself become a problem if the source is large enough.

    Obviously there are numerous modifications you may need to make to this code; perhaps a regex pattern is not required to match a start/end line, in which case replace it with something like if 'xyz' in line.
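
    As a rough sketch of that simpler variant, the loop below swaps the regular expressions for plain substring tests. The function name copy_between and the sample text are invented for the example, and in-memory StringIO objects stand in for the two files.

```python
import io

def copy_between(src, dst, start='START-PATTERN', end='END-PATTERN'):
    """Copy the lines between a start and an end marker from src to dst.

    A line counts as a marker if it contains the marker text anywhere,
    which is looser than the anchored regular expressions.
    """
    inside = False
    for line in src:
        if start in line:
            inside = True
        elif end in line:
            inside = False
        elif inside:
            dst.write(line)  # line already ends with a newline

src = io.StringIO("skip\nSTART-PATTERN\nkeep 1\nkeep 2\nEND-PATTERN\nskip\n")
dst = io.StringIO()
copy_between(src, dst)
output = dst.getvalue()
```

    Keep in mind that a substring test also fires on data lines that happen to contain the marker text, so compare whole lines instead if that can occur in your file.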

  • 2021-01-03 13:13

    Consider:

    # hi
    # there
    # begin
    # need
    # this
    # stuff
    # end
    # skip
    # this
    
    with open(__file__) as fp:
        for line in iter(fp.readline, '# begin\n'):
            pass
        for line in iter(fp.readline, '# end\n'):
            print(line, end='')  # Python 3 print; the line already ends with a newline
    

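    This prints the three lines "# need", "# this" and "# stuff". Note that iter(fp.readline, sentinel) never stops if the sentinel line is missing, because readline keeps returning an empty string at end of file.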

    A more flexible approach (for example, to allow re pattern matching) is to use itertools.dropwhile and itertools.takewhile:

    import itertools

    with open(__file__) as fp:
        # Note: the result starts with the line that contains 'begin'.
        result = list(itertools.takewhile(lambda x: 'end' not in x,
            itertools.dropwhile(lambda x: 'begin' not in x, fp)))
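
    To illustrate the re pattern matching mentioned above, here is a small sketch that combines dropwhile and takewhile with compiled regular expressions, and uses itertools.islice to leave out the begin line itself. The helper name between and the two patterns are assumptions made for this example.

```python
import io
import itertools
import re

# Hypothetical marker patterns; adjust them to the real file.
BEGIN = re.compile(r'^#\s*begin\s*$')
END = re.compile(r'^#\s*end\s*$')

def between(lines, begin=BEGIN, end=END):
    """Return the lines after the first begin marker, up to the next end marker."""
    # Skip everything before the first line matching `begin` ...
    tail = itertools.dropwhile(lambda x: not begin.match(x), lines)
    # ... drop the begin line itself ...
    tail = itertools.islice(tail, 1, None)
    # ... and stop at the first line matching `end`.
    return list(itertools.takewhile(lambda x: not end.match(x), tail))

sample = io.StringIO("# hi\n# begin\n# need\n# this\n# stuff\n# end\n# skip\n")
result = between(sample)
```

    This only returns the first section. If the begin marker never appears, the result is an empty list rather than an endless loop.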
    
  • 2021-01-03 13:18

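    I would go with a generator-based solution:

    #!/usr/bin/env python3
    # The markers include the brackets because that is how they appear in the file.
    start_patterns = ('[PATTERN1]', '[PATTERN2]')
    end_patterns = ('[END PATTERN]',)  # the trailing comma makes this a tuple

    def section_with_bounds(gen):
        """Yield every line of each section, including its start and end markers."""
        section_in_play = False
        for line in gen:
            if line.startswith(start_patterns):
                section_in_play = True
            if section_in_play:
                yield line
            if line.startswith(end_patterns):
                section_in_play = False

    with open("text.t2") as f:
        gen = section_with_bounds(f)
        for line in gen:
            print(line, end='')  # each line already ends with a newline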
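
    If the marker lines themselves are not wanted, a variation on the generator can yield one list per section instead. This is only a sketch under the assumption that each section fits in memory; the name sections and the sample input are hypothetical.

```python
import io

def sections(lines,
             start_patterns=('[PATTERN1]', '[PATTERN2]'),
             end_pattern='[END PATTERN]'):
    """Yield each section as a list of its lines, without the marker lines."""
    current = None  # None means we are outside any section
    for line in lines:
        if line.startswith(start_patterns):
            current = []
        elif line.startswith(end_pattern):
            if current is not None:
                yield current
            current = None
        elif current is not None:
            current.append(line.rstrip('\n'))

sample = io.StringIO(
    "[PATTERN1]\nline1\nline2\n[END PATTERN]\n"
    "noise\n"
    "[PATTERN2]\nline3\n[END PATTERN]\n"
)
result = list(sections(sample))
```

    Only one section is buffered at a time, so the whole file still never has to be loaded at once.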
    