Question
I am currently working on a project that builds basic corpus databases and tokenizes texts, but I am stuck on one point. Assume we have the following:
import os, re
texts = []
for i in os.listdir(somedir):  # somedir contains text files which contain very large plain texts.
    # os.listdir returns bare file names, so join them with the directory
    with open(os.path.join(somedir, i), 'r') as f:
        texts.append(f.read())
Now I want to find the word before and after a token.
myToken = 'blue'
found = []
for i in texts:
    fnd = re.findall('[a-zA-Z0-9]+ %s [a-zA-Z0-9]+|\. %s [a-zA-Z0-9]+|[a-zA-Z0-9]+ %s\.' %(myToken, myToken, myToken), i, re.IGNORECASE|re.UNICODE)
    found.extend(fnd)
print myToken
for i in found:
    print '\t\t%s' %(i)
I thought there were three possibilities: the token might start a sentence, end a sentence, or appear somewhere in the middle of one, so I used the regex above. When I run it, I get the following:
blue
My blue car # What I exactly want.
he blue jac # That's not what I want. That must be "the blue jacket."
eir blue phone # Wrong! > their
a blue ali # Wrong! > alien
. Blue is # Okay.
is blue. # Okay.
...
I also tried patterns using \b\w\b or \b\W\b, but those returned no results at all rather than wrong ones. I tried:
'\b\w\b%s\b[a-zA-Z0-9]+|\.\b%s\b\w\b|\b\w\b%s\.'
'\b\W\b%s\b[a-zA-Z0-9]+|\.\b%s\b\W\b|\b\W\b%s\.'
I hope the question is not too vague.
Answer 1:
I think what you want is:
- (Optionally) a word and a space;
- (Always) 'blue';
- (Optionally) a space and a word.
Therefore one appropriate regex would be:
r'(?i)((?:\w+\s)?blue(?:\s\w+)?)'
For example:
>>> import re
>>> text = """My blue car
the blue jacket
their blue phone
a blue alien
End sentence. Blue is
is blue."""
>>> re.findall(r'(?i)((?:\w+\s)?{0}(?:\s\w+)?)'.format('blue'), text)
['My blue car', 'the blue jacket', 'their blue phone', 'a blue alien', 'Blue is', 'is blue']
See demo and token-by-token explanation here.
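As a minimal sketch of how the pattern above could be wrapped for an arbitrary token: the function name `find_context` is hypothetical, and two details are additions not present in the answer, namely `re.escape` (so a token containing regex metacharacters is matched literally) and `\b` word boundaries (so that 'blue' is not matched inside 'bluebird'). It is written for Python 3.

```python
import re

def find_context(token, text):
    """Return each whole-word occurrence of token with the word before and after it, if any."""
    pattern = re.compile(
        # optional preceding word, the token as a whole word, optional following word
        r'(?:\w+\s)?\b{0}\b(?:\s\w+)?'.format(re.escape(token)),
        re.IGNORECASE,
    )
    # The pattern has no capturing groups, so findall returns the full matches.
    return pattern.findall(text)

text = "My blue car\nthe blue jacket\nEnd sentence. Blue is\nis blue.\nA bluebird sang"
print(find_context('blue', text))
# ['My blue car', 'the blue jacket', 'Blue is', 'is blue']
```

Because `\s` also matches a newline, a neighbouring word can be picked up from the previous or next line; splitting the text into sentences first would avoid that.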
Answer 2:
Let's say the token is test.
(?=^test\s+.*|.*?\s+test\s+.*?|.*?\s+test$).*
You can use a lookahead. It does not consume any characters, yet it still validates the match.
http://regex101.com/r/wK1nZ1/2
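A rough sketch of how the lookahead pattern behaves when applied line by line in Python 3. The helper name `lines_with_token` and the sample lines are made up for illustration. The pattern returns whole lines that contain the token as a separate word; it does not extract the neighbouring words by itself.

```python
import re

def lines_with_token(token, lines):
    """Keep only the lines in which token appears as a whitespace-delimited word."""
    tok = re.escape(token)
    pattern = re.compile(
        # lookahead: token at the start, in the middle, or at the end of the line
        r'(?=^{0}\s+.*|.*?\s+{0}\s+.*?|.*?\s+{0}$).*'.format(tok)
    )
    return [line for line in lines if pattern.match(line)]

sample = [
    "test comes first",
    "this is a test line",
    "ends with test",
    "contest entry",
    "testing only",
]
print(lines_with_token("test", sample))
# ['test comes first', 'this is a test line', 'ends with test']
```

Note that a line consisting of the token alone is not matched, since every alternative requires whitespace before or after it.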
Answer 3:
Regex can sometimes be slow (if not implemented correctly), and moreover the accepted answer did not work for me in several cases.
So I went for a brute-force solution (not saying it is the best one), where the keyword can consist of several words:
@staticmethod
def find_neighbours(word, sentence):
    prepost_map = []
    if word not in sentence:
        return prepost_map
    split_sentence = sentence.split(word)
    for i in range(0, len(split_sentence) - 1):
        prefix = ""
        postfix = ""
        prefix_list = split_sentence[i].split()
        postfix_list = split_sentence[i + 1].split()
        if len(prefix_list) > 0:
            prefix = prefix_list[-1]
        if len(postfix_list) > 0:
            postfix = postfix_list[0]
        prepost_map.append([prefix, word, postfix])
    return prepost_map
An empty string before or after the keyword indicates that the keyword was the first or the last word in the sentence, respectively.
Source: https://stackoverflow.com/questions/25199812/how-can-i-get-words-after-and-before-a-specific-token
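One caveat of splitting on the keyword is that `str.split` and `in` work on characters, so 'blue' would also be found inside 'bluebird'. A possible variation, sketched here for Python 3 under the assumption that the keyword is a single word, compares whole whitespace-separated tokens instead. The name `find_neighbours_by_tokens` and the punctuation set that is stripped are illustrative choices, not part of the answer.

```python
def find_neighbours_by_tokens(word, sentence):
    """Return [prefix, token, postfix] for each whole-word, case-insensitive match of word."""
    tokens = sentence.split()
    results = []
    for i, tok in enumerate(tokens):
        # ignore surrounding punctuation and case when comparing
        if tok.strip('.,;:!?"\'').lower() != word.lower():
            continue
        prefix = tokens[i - 1] if i > 0 else ""
        postfix = tokens[i + 1] if i + 1 < len(tokens) else ""
        results.append([prefix, tok, postfix])
    return results

print(find_neighbours_by_tokens("blue", "A bluebird sat on my blue car"))
# [['my', 'blue', 'car']]
print(find_neighbours_by_tokens("blue", "The sky is blue."))
# [['is', 'blue.', '']]
```

As in the original function, an empty string marks a keyword at the start or end of the sentence; unlike it, this variation does not support multi-word keywords.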