Frequency of ngrams (strings) in tokenized text

Submitted on 2019-12-20 03:45:06

Question


I have a set of unique ngrams (a list called ngramlist) and an ngram-tokenized text (a list called ngrams). I want to construct a new vector, freqlist, where each element of freqlist is the fraction of the elements of ngrams that are equal to the corresponding element of ngramlist. I wrote the following code, which gives the correct output, but I wonder if there is a way to optimize it:

freqlist = [
    sum(int(ngram == ngram_candidate)
        for ngram_candidate in ngrams) / len(ngrams)
    for ngram in ngramlist
]

I imagine there is a function in nltk or elsewhere that does this faster, but I am not sure which one.

Thanks!

Edit: for what it's worth, the ngrams are produced as the joined output of nltk.util.ngrams, and ngramlist is just a list made from the set of all found ngrams.
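
For reference, a minimal sketch of that joining step on a toy token list (nltk.util.ngrams yields tuples, which are then joined into space-separated strings; the tokens here are illustrative):

from nltk.util import ngrams

tokens = ['New', 'York', 'City', 'is', 'big']
joined = [' '.join(t) for t in ngrams(tokens, 3)]
print(joined)  # ['New York City', 'York City is', 'City is big']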

Edit2:

Here is reproducible code to test the freqlist line (the rest of the code is not really what I care about):

from nltk.util import ngrams
import wikipedia
import nltk
import pandas as pd

articles = ['New York City', 'Moscow', 'Beijing']
tokenizer = nltk.tokenize.TreebankWordTokenizer()

data = {'article': [], 'treebank_tokenizer': []}
for article in articles:
    data['article'].append(wikipedia.page(article).content)
    data['treebank_tokenizer'].append(tokenizer.tokenize(data['article'][-1]))

df = pd.DataFrame(data)

df['ngrams-3'] = df['treebank_tokenizer'].map(
    lambda x: [' '.join(t) for t in ngrams(x, 3)])

ngramlist = list(set(trigram for sublist in df['ngrams-3'].tolist() for trigram in sublist))

df['freqlist'] = df['ngrams-3'].map(
    lambda ngrams_: [sum(int(ngram == ngram_candidate)
                         for ngram_candidate in ngrams_) / len(ngrams_)
                     for ngram in ngramlist])

Answer 1:


You can probably optimize this a bit by pre-computing some quantities and using a Counter. This will be especially useful if most of the elements in ngramlist are contained in ngrams.

freqlist = [
    sum(int(ngram == ngram_candidate)
        for ngram_candidate in ngrams) / len(ngrams)
    for ngram in ngramlist
]

You certainly don't need to iterate over ngrams every single time you check an ngram. One pass over ngrams makes this algorithm O(n) instead of the O(n²) one you have now. Remember, shorter code is not necessarily better or more efficient code:

from collections import Counter
...

counter = Counter(ngrams)  # one pass over the tokenized text
size = len(ngrams)
freqlist = [counter.get(ngram, 0) / size for ngram in ngramlist]

To use this properly inside .map, write a named function with def instead of a lambda:

def count_ngrams(ngrams):
    counter = Counter(ngrams)
    size = len(ngrams)
    freqlist = [counter.get(ngram, 0) / size for ngram in ngramlist]
    return freqlist

df['freqlist'] = df['ngrams-3'].map(count_ngrams)
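
A quick way to check both correctness and the speedup on your own machine (a sketch using the standard timeit module; the corpus size and token strings below are made up):

import timeit
from collections import Counter

# Hypothetical corpus: 10,000 ngram tokens, 1,000 unique candidates.
ngrams_ = [f'tok {i % 1000}' for i in range(10000)]
ngramlist = list(set(ngrams_))

def slow():
    return [sum(int(ng == cand) for cand in ngrams_) / len(ngrams_)
            for ng in ngramlist]

def fast():
    counter = Counter(ngrams_)
    size = len(ngrams_)
    return [counter.get(ng, 0) / size for ng in ngramlist]

assert slow() == fast()  # identical output
print('quadratic:', timeit.timeit(slow, number=1), 'seconds')
print('linear:   ', timeit.timeit(fast, number=1), 'seconds')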



Answer 2:


Firstly, don't pollute your imported functions by shadowing them with variables: keep the name ngrams for the function, and use something else for the variable.

import time
from functools import partial
from itertools import chain
from collections import Counter

import wikipedia

import pandas as pd

from nltk import word_tokenize
from nltk.util import ngrams

Next, the steps before the line you're asking about in the original question might be a little inefficient. You can clean them up, make them easier to read, and time each one, like this:

# Downloading the articles.
titles = ['New York City', 'Moscow', 'Beijing']
start = time.time()
df = pd.DataFrame({'article': [wikipedia.page(title).content for title in titles]})
end = time.time()
print('Downloading wikipedia articles took', end - start, 'seconds')

And then:

# Tokenizing the articles
start = time.time()
df['tokens'] = df['article'].apply(word_tokenize)
end = time.time()
print('Tokenizing articles took', end-start, 'seconds')

Then:

# Extracting trigrams.
trigrams = partial(ngrams, n=3)
start = time.time()
# There's no need to flatten them to strings, you could just use list()
df['trigrams'] = df['tokens'].apply(lambda x: list(trigrams(x)))
end = time.time()
print('Extracting trigrams took', end-start, 'seconds')

Finally, for the line you asked about:

# Instead of a set, we use a Counter here because 
# we can use an intersection between Counter objects later.
# see https://stackoverflow.com/questions/44012479/intersection-of-two-counters
all_trigrams = Counter(chain(*df['trigrams']))
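# (Counter intersection keeps the minimum of the two counts; a toy
# illustration, not part of the original answer:
#     Counter('aab') & Counter('abbb')  ->  Counter({'a': 1, 'b': 1}))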

# More often than not, you don't need to keep all the
# zeros in the vector (i.e. a dense vector); you can
# get the non-zero entries as a sparse vector (a dict),
# like this:
df['trigrams_count'] = df['trigrams'].apply(lambda x: Counter(x) & all_trigrams)

# Now to normalize the count, simply do:
def featurize(list_of_ngrams):
    nonzero_features = Counter(list_of_ngrams) & all_trigrams
    total = len(list_of_ngrams)
    return {ng:count/total for ng, count in nonzero_features.items()}

df['trigrams_count_normalize'] = df['trigrams'].apply(featurize)
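
If you still want the dense vector from the original question (one fraction per entry of a fixed vocabulary), you can expand the sparse dicts afterwards. A minimal sketch, assuming the vocabulary is the set of keys of all_trigrams:

# Fix one ordering of the vocabulary so every row's dense
# vector lines up with the same trigram positions.
vocab = sorted(all_trigrams)

df['freqlist'] = df['trigrams_count_normalize'].apply(
    lambda sparse: [sparse.get(ng, 0.0) for ng in vocab])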


Source: https://stackoverflow.com/questions/49620764/frequency-of-ngrams-strings-in-tokenized-text
