Question
I have a set of unique ngrams (a list called ngramlist) and ngram-tokenized text (a list called ngrams). I want to construct a new vector, freqlist, where each element of freqlist is the fraction of ngrams equal to the corresponding element of ngramlist. I wrote the following code, which gives the correct output, but I wonder if there is a way to optimize it:
freqlist = [
    sum(int(ngram == ngram_candidate)
        for ngram_candidate in ngrams) / len(ngrams)
    for ngram in ngramlist
]
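For concreteness, here is a toy run of that comprehension on made-up lists (not my real data), showing the intended output:

# Toy illustration with made-up data:
# 'a b c' appears 2 out of 3 times, 'b c d' once.
ngrams = ['a b c', 'b c d', 'a b c']
ngramlist = ['a b c', 'b c d']
freqlist = [
    sum(int(ngram == ngram_candidate)
        for ngram_candidate in ngrams) / len(ngrams)
    for ngram in ngramlist
]
print(freqlist)  # [0.6666666666666666, 0.3333333333333333]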
I imagine there is a function in nltk or elsewhere that does this faster, but I am not sure which one.
Thanks!
Edit: for what it's worth, the ngrams are produced as the joined output of nltk.util.ngrams, and ngramlist is just a list made from the set of all found ngrams.
Edit2:
Here is reproducible code to test the freqlist line (the rest of the code is not really what I care about):
from nltk.util import ngrams
import wikipedia
import nltk
import pandas as pd
articles = ['New York City','Moscow','Beijing']
tokenizer = nltk.tokenize.TreebankWordTokenizer()
data = {'article': [], 'treebank_tokenizer': []}
for article in articles:
    data['article'].append(wikipedia.page(article).content)
    data['treebank_tokenizer'].append(tokenizer.tokenize(data['article'][-1]))
df = pd.DataFrame(data)
df['ngrams-3'] = df['treebank_tokenizer'].map(
    lambda x: [' '.join(t) for t in ngrams(x, 3)])
ngramlist = list(set([trigram for sublist in df['ngrams-3'].tolist() for trigram in sublist]))
df['freqlist'] = df['ngrams-3'].map(
    lambda ngrams_: [sum(int(ngram == ngram_candidate)
                         for ngram_candidate in ngrams_) / len(ngrams_)
                     for ngram in ngramlist])
Answer 1:
You can probably optimize this a bit by pre-computing some quantities and using a Counter. This will be especially useful if most of the elements in ngramlist are contained in ngrams.
freqlist = [
    sum(int(ngram == ngram_candidate)
        for ngram_candidate in ngrams) / len(ngrams)
    for ngram in ngramlist
]
You certainly don't need to iterate over ngrams every single time you check an ngram. One pass over ngrams will make this algorithm O(n) instead of the O(n²) one you have now. Remember, shorter code is not necessarily better or more efficient code:
from collections import Counter
...
counter = Counter(ngrams)
size = len(ngrams)
freqlist = [counter.get(ngram, 0) / size for ngram in ngramlist]
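As a quick sanity check, here is the Counter version on toy lists (made-up data, not from the question); the fractions match what the nested comprehension would give:

from collections import Counter

# Made-up example data.
ngrams = ['a b', 'b c', 'a b', 'c d']
ngramlist = ['a b', 'b c', 'x y']

counter = Counter(ngrams)   # one pass over ngrams
size = len(ngrams)
freqlist = [counter.get(ngram, 0) / size for ngram in ngramlist]
print(freqlist)  # [0.5, 0.25, 0.0]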
To use this function properly, you would have to write a def function instead of a lambda:
def count_ngrams(ngrams):
    counter = Counter(ngrams)
    size = len(ngrams)
    freqlist = [counter.get(ngram, 0) / size for ngram in ngramlist]
    return freqlist
df['freqlist'] = df['ngrams-3'].map(count_ngrams)
Answer 2:
Firstly, don't pollute your imported functions by overriding them and using them as variables; keep the ngrams name for the function, and use something else as the variable name.
import time
from functools import partial
from itertools import chain
from collections import Counter
import wikipedia
import pandas as pd
from nltk import word_tokenize
from nltk.util import ngrams
Next, the steps before the line you're asking about in the original question might be a little inefficient. You can clean them up, make them easier to read, and measure them as such:
# Downloading the articles.
titles = ['New York City','Moscow','Beijing']
start = time.time()
df = pd.DataFrame({'article':[wikipedia.page(title).content for title in titles]})
end = time.time()
print('Downloading wikipedia articles took', end-start, 'seconds')
And then:
# Tokenizing the articles
start = time.time()
df['tokens'] = df['article'].apply(word_tokenize)
end = time.time()
print('Tokenizing articles took', end-start, 'seconds')
Then:
# Extracting trigrams.
trigrams = partial(ngrams, n=3)
start = time.time()
# There's no need to flatten them to strings, you could just use list()
df['trigrams'] = df['tokens'].apply(lambda x: list(trigrams(x)))
end = time.time()
print('Extracting trigrams took', end-start, 'seconds')
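To see what the partial application produces, here is a toy run on made-up tokens (a sketch, not part of the original answer):

from functools import partial
from nltk.util import ngrams

# trigrams(tokens) is equivalent to ngrams(tokens, n=3).
trigrams = partial(ngrams, n=3)
tokens = ['New', 'York', 'City', 'is', 'big']
print(list(trigrams(tokens)))
# [('New', 'York', 'City'), ('York', 'City', 'is'), ('City', 'is', 'big')]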
Finally, on to the last line:
# Instead of a set, we use a Counter here because
# we can use an intersection between Counter objects later.
# see https://stackoverflow.com/questions/44012479/intersection-of-two-counters
all_trigrams = Counter(chain(*df['trigrams']))
# More often than not, you don't need to keep all the
# zeros in the vectors (aka dense vector),
# you could actually get the non-zero sparse vectors
# as a dict as such
df['trigrams_count'] = df['trigrams'].apply(lambda x: Counter(x) & all_trigrams)
# Now to normalize the count, simply do:
def featurize(list_of_ngrams):
    nonzero_features = Counter(list_of_ngrams) & all_trigrams
    total = len(list_of_ngrams)
    return {ng: count / total for ng, count in nonzero_features.items()}
df['trigrams_count_normalize'] = df['trigrams'].apply(featurize)
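On made-up data, the sparse output of featurize looks like this (a toy sketch; doc1 and doc2 are hypothetical stand-ins for rows of df['trigrams']):

from collections import Counter

# Hypothetical stand-ins for two rows of df['trigrams'].
doc1 = [('a', 'b', 'c'), ('b', 'c', 'd'), ('a', 'b', 'c')]
doc2 = [('b', 'c', 'd'), ('c', 'd', 'e')]
all_trigrams = Counter(doc1 + doc2)

def featurize(list_of_ngrams):
    nonzero_features = Counter(list_of_ngrams) & all_trigrams
    total = len(list_of_ngrams)
    return {ng: count / total for ng, count in nonzero_features.items()}

print(featurize(doc1))
# roughly {('a', 'b', 'c'): 0.667, ('b', 'c', 'd'): 0.333}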
Source: https://stackoverflow.com/questions/49620764/frequency-of-ngrams-strings-in-tokenized-text