Question
I have a corpus and a word. For each occurrence of the word in the corpus I want to get a list containing the k words before and the k words after it. I can do this algorithmically (see below), but I wondered whether NLTK provides some functionality for this that I missed?
def sized_context(word_index, window_radius, corpus):
    """Returns a list containing the window_radius words to the left
    and to the right of word_index, not including the word at word_index.
    """
    max_length = len(corpus)
    # Clamp the window so it never runs past either end of the corpus.
    left_border = max(0, word_index - window_radius)
    right_border = min(max_length, word_index + 1 + window_radius)
    return corpus[left_border:word_index] + corpus[word_index + 1:right_border]
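For reference, a minimal driver that applies this helper to every occurrence of a word; the toy corpus and target word here are illustrative only:

corpus = "the quick brown fox jumps over the lazy dog".split()
contexts = [sized_context(i, 2, corpus)
            for i, token in enumerate(corpus) if token == "the"]
# [['quick', 'brown'], ['jumps', 'over', 'lazy', 'dog']]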
Answer 1:
The simplest, nltk-ish way to do this is with nltk.ngrams().
import nltk

words = nltk.corpus.brown.words()
k = 5
for ngram in nltk.ngrams(words, 2*k+1, pad_left=True, pad_right=True,
                         left_pad_symbol=" ", right_pad_symbol=" "):
    if ngram[k].lower() == "settle":  # the center of a 2*k+1 window is index k
        print(" ".join(ngram))
pad_left and pad_right ensure that all words get looked at. This is important if you don't let your concordances span across sentences (hence: lots of boundary cases).
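If you do process the corpus sentence by sentence, the same pattern applies within each sentence; a sketch of that variant, assuming the Brown corpus's sentence view nltk.corpus.brown.sents():

import nltk

k = 5
for sent in nltk.corpus.brown.sents():
    for ngram in nltk.ngrams(sent, 2*k+1, pad_left=True, pad_right=True,
                             left_pad_symbol=" ", right_pad_symbol=" "):
        if ngram[k].lower() == "settle":
            print(" ".join(ngram))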
If you want to ignore punctuation in the window size, you can strip it before scanning:
import re
words = (w for w in nltk.corpus.brown.words() if re.search(r"\w", w))
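To get exactly what the question asks for (a plain list of the k words before and after, without the keyword itself), you can drop the padding and the center word from each matching window. A sketch under the same assumptions, reusing words and k from above:

contexts = []
for ngram in nltk.ngrams(words, 2*k+1, pad_left=True, pad_right=True,
                         left_pad_symbol=None, right_pad_symbol=None):
    if ngram[k] is not None and ngram[k].lower() == "settle":
        # keep everything except the None padding and the keyword itself
        contexts.append([w for j, w in enumerate(ngram) if j != k and w is not None])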
Answer 2:
If you want to use nltk's functionality, you can use nltk's ConcordanceIndex. In order to base the width of the display on the number of words instead of the number of characters (the latter being the default for ConcordanceIndex.print_concordance), you can merely create a subclass of ConcordanceIndex with something like this:
from nltk import ConcordanceIndex

class ConcordanceIndex2(ConcordanceIndex):
    def create_concordance(self, word, token_width=13):
        "Returns a list of contexts for @word with a context <= @token_width"
        half_width = token_width // 2
        contexts = []
        for i, token in enumerate(self._tokens):
            if token == word:
                start = i - half_width if i >= half_width else 0
                context = self._tokens[start:i + half_width + 1]
                contexts.append(context)
        return contexts
Then you can obtain your results like this:
>>> from nltk.tokenize import wordpunct_tokenize
>>> my_corpus = 'The gerenuk fled frantically across the vast valley, whereas the giraffe merely turned indignantly and clumsily loped away from the valley into the nearby ravine.' # my corpus
>>> tokens = wordpunct_tokenize(my_corpus)
>>> c = ConcordanceIndex2(tokens)
>>> c.create_concordance('valley') # returns a list of lists, since words may occur more than once in a corpus
[['gerenuk', 'fled', 'frantically', 'across', 'the', 'vast', 'valley', ',', 'whereas', 'the', 'giraffe', 'merely', 'turned'], ['and', 'clumsily', 'loped', 'away', 'from', 'the', 'valley', 'into', 'the', 'nearby', 'ravine', '.']]
The create_concordance method I created above is based upon nltk's ConcordanceIndex.print_concordance method, which works like this:
>>> c = ConcordanceIndex(tokens)
>>> c.print_concordance('valley')
Displaying 2 of 2 matches:
valley , whereas the giraffe merely turn
and clumsily loped away from the valley into the nearby ravine .
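ConcordanceIndex also provides an offsets() method that returns the token position of each match, so those indices can be fed straight into the question's sized_context function. For the toy corpus above, the tokenization implies output along these lines:

>>> c.offsets('valley')
[7, 21]
>>> sized_context(7, 2, tokens)
['the', 'vast', ',', 'whereas']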
Source: https://stackoverflow.com/questions/22118136/nltk-find-contexts-of-size-2k-for-a-word