Repeatedly extract a line between two delimiters in a text file, Python

情话喂你 2020-12-06 11:07

I have a text file in the following format:

DELIMITER1
extract me
extract me
extract me
DELIMITER2

I'd like to extract every block of

4 answers
  • 2020-12-06 11:25

    If the delimiters are within a line:

    def get_sentences(filename):
        d1, d2 = '.', ','  # just example delimiters
        with open(filename) as file_contents:
            for line in file_contents:
                i1 = line.find(d1)
                # search for d2 only after d1, so an earlier d2 is ignored
                i2 = line.find(d2, i1 + len(d1)) if i1 != -1 else -1
                if i2 != -1:
                    yield line[i1 + len(d1):i2]
    
    
    sentences = list(get_sentences('path/to/my/file'))
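If a line may contain more than one delimited span, a regular expression can collect all of them in one pass. This is only a sketch: `find_spans` is a hypothetical helper name, and the delimiters are the same example values used above.

```python
import re

def find_spans(line, d1=".", d2=","):
    """Return every substring of `line` enclosed by d1 ... d2."""
    # re.escape lets the delimiters contain regex metacharacters such as '.'
    pattern = re.escape(d1) + r"(.*?)" + re.escape(d2)
    return re.findall(pattern, line)
```

For example, `find_spans("a.b,c.d,e")` returns both `'b'` and `'d'`, and a line with no complete pair yields an empty list.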
    

    If they are on their own lines:

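    def get_sentences(filename):
        d1, d2 = 'DELIMITER1', 'DELIMITER2'  # delimiters from the question
        with open(filename) as file_contents:
            results = None  # None means we are outside a block
            for line in file_contents:
                line = line.rstrip('\n')
                if line == d1:
                    results = []  # start a new block
                elif line == d2:
                    if results is not None:
                        yield results  # block complete
                    results = None
                elif results is not None:
                    results.append(line)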
    
    sentences = list(get_sentences('path/to/my/file'))
    
  • 2020-12-06 11:25

    This should do what you want:

    import re

    def GetTheSentences(file):
        # '$' keeps the start pattern from also matching 'DELIMITER2'
        start_rx = re.compile(r'DELIMITER1$')
        end_rx = re.compile(r'DELIMITER2$')

        start = False
        output = []
        with open(file) as datafile:  # text mode, so each line is a str
            for line in datafile:
                if start_rx.match(line):
                    start = True
                elif end_rx.match(line):
                    start = False
                elif start:
                    output.append(line)
        return output
    

    Your previous version looks like it's supposed to be an iterator function. Do you want your output returned one item at a time? That's slightly different.
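If one item at a time is what is wanted, a generator can yield each block as soon as its closing delimiter is seen. The following is a minimal sketch, assuming the delimiters sit alone on their lines; `iter_blocks` and its parameter names are hypothetical, not part of the question.

```python
def iter_blocks(lines, start="DELIMITER1", end="DELIMITER2"):
    """Yield each block of lines found between `start` and `end`."""
    block = None  # None means we are currently outside a block
    for line in lines:
        text = line.rstrip("\n")
        if text == start:
            block = []  # open a new block
        elif text == end:
            if block is not None:
                yield block  # hand back the finished block
            block = None
        elif block is not None:
            block.append(text)
```

Because it accepts any iterable of lines, it can be given an open file object directly, e.g. `for block in iter_blocks(open_file): ...`, without reading the whole file into memory.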

  • 2020-12-06 11:27

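    This is a good job for list comprehensions; no regex is required. The first comprehension strips the trailing \n from each line read from the text file. The second uses the in operator to filter out the delimiter lines. Note that this keeps every non-delimiter line, so it only gives the desired result when the file contains nothing outside the DELIMITER1/DELIMITER2 blocks.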

    def extract_lines(file):
        with open(file, 'r') as f:  # close the file when done
            scrubbed = [x.strip('\n') for x in f]
        return [x for x in scrubbed if x not in ('DELIMITER1', 'DELIMITER2')]
    
  • 2020-12-06 11:35

    You can simplify this to one regular expression using re.S, the DOTALL flag.

    import re

    def GetTheSentences(infile):
        with open(infile) as fp:
            for result in re.findall(r'DELIMITER1(.*?)DELIMITER2', fp.read(), re.S):
                # strip the newlines that follow DELIMITER1 and precede DELIMITER2
                print(result.strip())
    # extract me
    # extract me
    # extract me
    

    This also makes use of the non-greedy quantifier .*?, which stops at the first DELIMITER2 after each DELIMITER1, so multiple non-overlapping DELIMITER1-DELIMITER2 blocks are each found separately rather than merged into one match.
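To see what the non-greedy form buys you, the sketch below runs both variants on a small made-up string containing two blocks with an unrelated line between them. The sample text is an assumption for illustration only.

```python
import re

# Two delimited blocks separated by a line that should not be extracted
text = "DELIMITER1\na\nDELIMITER2\nskip\nDELIMITER1\nb\nDELIMITER2\n"

# Non-greedy: each match ends at the nearest DELIMITER2
lazy = re.findall(r"DELIMITER1(.*?)DELIMITER2", text, re.S)

# Greedy: a single match runs from the first DELIMITER1 to the last DELIMITER2
greedy = re.findall(r"DELIMITER1(.*)DELIMITER2", text, re.S)
```

Here `lazy` holds two separate captures, while `greedy` holds a single capture that swallows the inner delimiters and the `skip` line.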
