Question
I have a very large file that looks like this:
[PATTERN1]
line1
line2
line3
...
...
[END PATTERN]
[PATTERN2]
line1
line2
...
...
[END PATTERN]
I need to extract, into another file, the lines between a variable start pattern such as [PATTERN1] and a fixed end pattern [END PATTERN], but only for certain start patterns.
For example:
[PATTERN2]
line1
line2
...
...
[END PATTERN]
I already do the same thing, with a smaller file, using this code:
FILE=open('myfile').readlines()
newfile=[]
for n in name_list:
    A = FILE[[s for s,name in enumerate(FILE) if n in name][0]:]
    B = A[:[e+1 for e,end in enumerate(A) if 'END PATTERN' in end][0]]
    newfile.append(B)
Here name_list is a list of the specific start patterns that I need.
It works, but I suppose there is a better way to do this for big files, without using .readlines().
Can anyone help me?
Thanks a lot!
Answer 1:
Use something like this:
import re

START_PATTERN = r'^START-PATTERN$'
END_PATTERN = r'^END-PATTERN$'

# Open the output once, so a later section does not overwrite an earlier one.
with open('myfile') as infile, open('my_new_file.txt', 'w') as newfile:
    match = False
    for line in infile:
        if re.match(START_PATTERN, line):
            match = True   # start copying with the next line
            continue
        elif re.match(END_PATTERN, line):
            match = False  # stop copying
            continue
        elif match:
            newfile.write(line)  # line already ends with '\n'
This iterates over the file without reading it all into memory. It also writes directly to the new file rather than appending to a list in memory; if your source is large enough, that list could become a problem too.
Obviously you may need to modify this code in various ways. For example, if a regex is not needed to match a start or end line, replace it with something like if 'xyz' in line.
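As a rough sketch of the plain-string variant applied to the question's requirement (only certain start patterns), the loop can be wrapped in a generator. The names extract_sections, wanted_starts and end_marker are hypothetical, and the sketch assumes each marker sits alone on its own line. Unlike the code above, it keeps the marker lines in the output, as the question's code does.

```python
def extract_sections(lines, wanted_starts, end_marker="[END PATTERN]"):
    """Yield the lines of every section whose start marker is in wanted_starts.

    `lines` can be any iterable of strings, including an open file object,
    so the input is consumed one line at a time.
    """
    copying = False
    for line in lines:
        marker = line.strip()  # compare without the trailing newline
        if not copying and marker in wanted_starts:
            copying = True
        if copying:
            yield line  # start and end marker lines are included
            if marker == end_marker:
                copying = False


# Small in-memory demonstration (hypothetical data).
sample = [
    "[PATTERN1]\n", "a\n", "b\n", "[END PATTERN]\n",
    "[PATTERN2]\n", "c\n", "[END PATTERN]\n",
]
for line in extract_sections(sample, {"[PATTERN2]"}):
    print(line, end="")
```

With a real file, the generator could be passed straight to the output file's writelines method, which avoids building a list in memory.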
Answer 2:
Consider:
# hi
# there
# begin
# need
# this
# stuff
# end
# skip
# this
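with open(__file__) as fp:
    # Two-argument iter() calls fp.readline until it returns the sentinel.
    for line in iter(fp.readline, '# begin\n'):
        pass
    for line in iter(fp.readline, '# end\n'):
        print(line, end='')
This prints the three comment lines between "# begin" and "# end": "# need", "# this" and "# stuff". Note that if a sentinel line never appears, readline keeps returning an empty string at end of file and the loop never terminates.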
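A more flexible approach (for example, one that allows re pattern matching) is to use itertools.dropwhile and itertools.takewhile:
import itertools

with open(__file__) as fp:
    # result starts at the first line containing 'begin' (inclusive)
    # and stops before the next line containing 'end'.
    result = list(itertools.takewhile(lambda x: 'end' not in x,
                  itertools.dropwhile(lambda x: 'begin' not in x, fp)))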
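To illustrate the regex variant mentioned above, here is a minimal sketch. The helper name between and its parameters are hypothetical, and it only handles the first matching section; it also drops the start line, which the plain dropwhile/takewhile version keeps.

```python
import itertools
import re


def between(lines, start_re, end_re):
    """Return the lines strictly between the first start match and the next end match."""
    start = re.compile(start_re)
    end = re.compile(end_re)
    # Skip everything before the first line that matches the start pattern.
    after_start = itertools.dropwhile(lambda l: not start.search(l), iter(lines))
    next(after_start, None)  # discard the start line itself, if there is one
    # Collect lines until the first one that matches the end pattern.
    return list(itertools.takewhile(lambda l: not end.search(l), after_start))


# Hypothetical in-memory data; an open file object works the same way.
sample = ["hi\n", "# begin\n", "need\n", "this\n", "# end\n", "skip\n"]
print(between(sample, r"^# begin$", r"^# end$"))
```

Because both itertools functions are lazy, only the lines up to the end marker are read from the input.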
Answer 3:
I think this does the same thing your code does:
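newfile = []
pattern = None
# Assumes name_list holds the names without brackets, e.g. ['PATTERN2'].
with open('myfile') as infile:
    for line in infile:
        stripped = line.rstrip('\n')  # the trailing newline would hide the ']'
        if stripped.startswith('[') and stripped.endswith(']'):
            pattern = stripped[1:-1]
        if pattern == 'END PATTERN':
            pattern = None  # the [END PATTERN] line itself is not kept
            continue
        elif pattern is not None and pattern in name_list:
            newfile.append(line)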
This way you go through all the lines only once, and fill your list as you go.
Answer 4:
I am kind of a new Python programmer, so I only barely understand your solution, but it seems like there is a lot of unnecessary iteration going on. First you read in the file, then you iterate through the file once for each item in name_list. Also, I don't know if you plan to iterate over newfile later to actually write it to a file.
Here is how I would do it, though I realize it isn't the most pythonic looking solution. You'll only iterate over the file once though. (As a disclaimer, I didn't test this out.)
patterns = {'startPattern1': 'endPattern1', 'startPattern2': 'endPattern2', 'startPattern3': 'endPattern3'}

with open(filenameIn, 'r') as fileIn, open(filenameOut, 'w') as fileOut:
    targetEndPattern = None
    for line in fileIn:
        stripped = line.rstrip('\n')  # compare without the trailing newline
        if targetEndPattern is not None:
            if stripped == targetEndPattern:
                targetEndPattern = None
            else:
                fileOut.write(line)  # line already ends with '\n'
        elif stripped in patterns:
            targetEndPattern = patterns[stripped]
EDIT: If you are expecting the patterns in a certain order, then this solution would have to be revised. I wrote this under the assumption that the order of the patterns doesn't matter but each start pattern matches a specific end pattern.
Answer 5:
I would go with a generator-based solution:
#!/usr/bin/env python
# Brackets are included so the prefixes match the markers in the question's file.
start_patterns = ('[PATTERN1]', '[PATTERN2]')  # str.startswith accepts a tuple
end_patterns = ('[END PATTERN]',)  # the trailing comma makes this a tuple

def section_with_bounds(gen):
    section_in_play = False
    for line in gen:
        if line.startswith(start_patterns):
            section_in_play = True
        if section_in_play:
            yield line  # includes the start and end marker lines
        if line.startswith(end_patterns):
            section_in_play = False

with open("text.t2") as f:
    gen = section_with_bounds(f)
    for line in gen:
        print(line, end='')
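If each section needs to be handled separately (for instance, written to its own file), a variation on the generator idea is to yield one list per section instead of individual lines. This is only a sketch: iter_sections and its parameters are hypothetical names, and it assumes an unterminated final section can be dropped.

```python
def iter_sections(lines, start_markers, end_marker):
    """Yield each complete section as a list of lines, markers included.

    `start_markers` is a tuple of prefixes, as accepted by str.startswith.
    Only one section is held in memory at a time.
    """
    section = None
    for line in lines:
        if section is None:
            if line.startswith(start_markers):
                section = [line]  # a wanted section begins here
        else:
            section.append(line)
            if line.startswith(end_marker):
                yield section
                section = None  # wait for the next start marker
    # A section with no end marker before EOF is silently dropped.


# Hypothetical in-memory data; an open file object works the same way.
sample = [
    "[PATTERN1]\n", "a\n", "[END PATTERN]\n",
    "junk\n",
    "[PATTERN2]\n", "b\n", "c\n", "[END PATTERN]\n",
]
for number, section in enumerate(iter_sections(sample, ("[PATTERN2]",), "[END PATTERN]")):
    print(number, len(section))
```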
Source: https://stackoverflow.com/questions/11156259/how-to-grep-lines-between-two-patterns-in-a-big-file-with-python