Question
I have to process a plain-text document, looking for each word in a word list and returning a window of text around each word found. I'm using NLTK.
I found posts on Stack Overflow that use regular expressions to find words, but they only print the matches without getting their indices. I don't think regular expressions are the right tool, because I have to find specific words.
Answer 1:
You can use either str.index or str.find. Both return the index of the first occurrence; str.index raises ValueError if the word is missing, while str.find returns -1.
Contents of file:
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Morbi sollicitudin tortor et velit venenatis molestie. Morbi non nibh magna, quis tempor metus.
Vivamus vehicula velit sit amet neque posuere id hendrerit sem venenatis. Nam vitae felis sem. Mauris ultricies congue mi, eu ornare massa convallis nec.
Donec volutpat molestie velit, scelerisque porttitor dui suscipit vel. Etiam feugiat feugiat nisl, vitae commodo ligula tristique nec. Fusce bibendum fermentum rutrum.
>>> a = open("file.txt").read()
>>> print(a.index("vitae"))
232
>>> print(a.find("vitae"))
232
--Edit--
OK, if the same word occurs at multiple indices, try using a generator:
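def all_occurrences(text, word):
    # Yield the start index of each non-overlapping occurrence of word in text.
    initial = 0
    while True:
        initial = text.find(word, initial)
        if initial == -1:
            return
        yield initial
        initial += len(word)  # resume searching just past this match
>>> print(list(all_occurrences(open("file.txt").read(), "vitae")))
[232, 408]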
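One caveat with str.find is that it also matches inside longer words (searching for "vita" would hit "vitae"). Since the question asks for a window of text around each word in a list, here is a minimal sketch that swaps in re.finditer with \b word boundaries. The function name find_with_context, the width parameter, and the sample string are my own illustrative choices, not part of the answer above.

```python
import re

def find_with_context(text, words, width=30):
    """Yield (word, index, window) for every whole-word match of any word in words."""
    # re.escape keeps each word literal; \b restricts matches to whole words.
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b")
    for match in pattern.finditer(text):
        # Clamp the window to the bounds of the text.
        start = max(0, match.start() - width)
        end = min(len(text), match.end() + width)
        yield match.group(), match.start(), text[start:end]

# Hypothetical sample text, shortened from the file contents shown earlier.
sample = "Nam vitae felis sem. Mauris ultricies congue mi. Etiam feugiat nisl, vitae commodo."
for word, index, window in find_with_context(sample, ["vitae", "congue"], width=10):
    print(word, index, repr(window))
```

The window here is measured in characters; if you want it in words instead, a token-based index is probably the easier route.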
Answer 2:
If I understand correctly, what you want is a positional index:
from collections import defaultdict
text = "your text goes here"
pos_index = defaultdict(list)
for pos, term in enumerate(text.split()):
    pos_index[term].append(pos)
Now you have an index of each word's positions (counted in whitespace-separated tokens, not characters). Just query it by term.
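To make the "query it by term" step concrete, here is a rough sketch that looks a term up in the positional index and slices a window of neighbouring tokens around each hit. The helper names build_pos_index and windows, the size parameter, and the sample sentence are assumptions of mine for illustration only.

```python
from collections import defaultdict

def build_pos_index(tokens):
    """Map each token to the list of positions where it occurs."""
    pos_index = defaultdict(list)
    for pos, term in enumerate(tokens):
        pos_index[term].append(pos)
    return pos_index

def windows(tokens, pos_index, term, size=2):
    """Return up to size tokens of context on each side of every occurrence of term."""
    # .get avoids inserting an empty list into the defaultdict for unknown terms.
    return [tokens[max(0, pos - size):pos + size + 1]
            for pos in pos_index.get(term, [])]

# Hypothetical sample text with punctuation already removed.
tokens = "Nam vitae felis sem Etiam feugiat nisl vitae commodo ligula".split()
pos_index = build_pos_index(tokens)
print(pos_index["vitae"])
print(windows(tokens, pos_index, "vitae"))
```

Building the index costs one pass over the text, after which each lookup is a dictionary access, so this pays off when you search for many words in the same document.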
Answer 3:
Try this, where log is the text as a list of words and word_search is the term you want to locate in log:
[i for i, item in enumerate(log) if item == word_search]
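Note that an exact == comparison will miss tokens such as "sem." or "Vitae" when log comes from a plain split(). A possible variant, assuming you want to ignore case and punctuation at the edges of each token (token_positions is a name I made up for this sketch):

```python
import string

def token_positions(log, word_search):
    """Indices of tokens equal to word_search, ignoring case and edge punctuation."""
    target = word_search.lower()
    return [i for i, item in enumerate(log)
            if item.strip(string.punctuation).lower() == target]

# Hypothetical sample text, split on whitespace so punctuation stays attached.
log = "Nam vitae felis sem. Etiam feugiat nisl, Vitae commodo ligula.".split()
print(token_positions(log, "vitae"))
print(token_positions(log, "sem"))
```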
Answer 4:
I know it's been a while since you asked the question, but since you're already using NLTK, I would suggest using its word_tokenize tool:
import nltk

text = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit.'
# word_tokenize splits punctuation into separate tokens
for index, word in enumerate(nltk.word_tokenize(text)):
    print(index, word)
The result would be:
0 Lorem
1 ipsum
2 dolor
3 sit
4 amet
5 ,
6 consectetur
7 adipiscing
8 elit
9 .
Hope it helps :)
Source: https://stackoverflow.com/questions/14307313/python-find-a-list-of-words-in-a-text-and-return-its-index