I'm working on a statistical machine translation project in which I need to find the lines of a POS-tagged text file that match a regular expression (any non-separated phrasal verb with the particle 'out') and write their line numbers to a file, using Python.
I have this regular expression: '\w*_VB.?\sout_RP' and my POS-tagged text file: 'Corpus.txt'. I would like an output file containing the numbers of the lines that match this regular expression, with one line number per line and no empty lines, e.g.:
2
5
44
So far all I have in my script is the following:
OutputLineNumbers = open('OutputLineNumbers', 'w')
with open('Corpus.txt', 'r') as textfile:
    phrase='\w*_VB.?\sout_RP'
    for phrase in textfile:
OutputLineNumbers.close()
Any idea how to solve this problem?
Thanks in advance for your help!
This should solve your problem, assuming the variable 'phrase' holds the correct regex:
import re
# the regex from the question, as a raw string so the backslashes reach re unchanged
phrase = r'\w*_VB.?\sout_RP'
# compile regex
regex = re.compile(phrase)
# open the files
with open('Corpus.txt', 'r') as inputFile:
    with open('OutputLineNumbers', 'w') as outputLineNumbers:
        # loop through each line in corpus, numbering from 1
        for line_i, line in enumerate(inputFile, 1):
            # check if we have a regex match
            if regex.search(line):
                # if so, write the line number to the output file
                outputLineNumbers.write("%d\n" % line_i)
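To check the pattern itself before running it over the whole corpus, the matching step can be sketched as a small function and tried on a few lines. The name matching_line_numbers and the sample sentences are made up for illustration; they are not taken from the real Corpus.txt.

```python
import re

# The pattern from the question: a verb tag (_VB, _VBD, ...) directly
# followed by the particle "out" tagged as _RP.
PHRASAL_OUT = re.compile(r'\w*_VB.?\sout_RP')

def matching_line_numbers(lines, pattern=PHRASAL_OUT):
    """Return the 1-based numbers of the lines in which pattern matches."""
    return [number for number, line in enumerate(lines, 1)
            if pattern.search(line)]

# Hypothetical POS-tagged sample lines (assumption, not real corpus data).
sample = [
    "He_PRP went_VBD home_NN ._.",
    "She_PRP found_VBD out_RP the_DT truth_NN ._.",
    "They_PRP figured_VBD it_PRP out_RP ._.",   # separated, should not match
    "Watch_VB out_RP !_.",
]

numbers = matching_line_numbers(sample)
# One line number per line, no empty lines.
print("\n".join(str(n) for n in numbers))
```

Because an open file yields its lines when iterated, the same function should also accept the file object returned by open('Corpus.txt') in place of the list.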
You can do this directly in the shell with grep, provided your regular expression is grep-friendly. The "-n" option shows the line numbers.
For example:
grep -n "[1-9][0-9]" tags.txt
will output the matching lines, each prefixed with its line number:
2569:vote2012
2570:30
2574:118
2576:7248
2578:2293
2580:9594
2582:577
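The grep output above still carries the matched text after the colon, while the question asks for the line numbers alone. A minimal sketch of stripping it down in Python, assuming the output has been captured as a string; line_numbers_from_grep is a hypothetical helper name, not part of grep or of any library:

```python
def line_numbers_from_grep(grep_output):
    """Extract the leading line number from each 'N:text' line of grep -n output."""
    numbers = []
    for entry in grep_output.splitlines():
        # Split on the first colon only; the matched text may contain colons too.
        prefix, sep, _ = entry.partition(":")
        if sep and prefix.isdigit():
            numbers.append(int(prefix))
    return numbers

# First three lines of the example output shown above.
sample_output = "2569:vote2012\n2570:30\n2574:118\n"
print(line_numbers_from_grep(sample_output))
```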
Source: https://stackoverflow.com/questions/17076635/how-to-extract-lines-numbers-that-match-a-regular-expression-in-a-text-file