How can I create a TF-IDF for Text Classification using Spark?


To do this myself (using pyspark), I first started by creating two data structures out of the corpus. The first is a key-value structure of

document_id, [token_ids]

The second is an inverted index like

token_id, [document_ids]

I'll call those corpus and inv_index respectively.
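
For concreteness, here is a minimal sketch of how those two RDDs could be built, assuming tokenized is an RDD of (document_id, [token_ids]) pairs (the name tokenized is mine, not part of the original setup):

corpus = tokenized  # already keyed by document_id

# Emit one (token_id, document_id) pair per distinct token in each document,
# then group the document_ids under each token.
inv_index = (tokenized
             .flatMap(lambda kv: [(t, kv[0]) for t in set(kv[1])])
             .groupByKey()
             .mapValues(list))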

To get tf, we need to count the number of occurrences of each token in each document:

from collections import Counter
def wc_per_row(row):
    cnt = Counter()
    for word in row:
        cnt[word] += 1
    return cnt.items() 

tf = corpus.map(lambda kv: (kv[0], wc_per_row(kv[1])))

The df is simply the length of each term's entry in the inverted index. From that we can calculate the idf.

df = inv_index.map(lambda kv: (kv[0], len(kv[1])))
num_documents = tf.count()

# At this step you can also apply some filters to make sure to keep
# only terms within a 'good' range of df.
from math import log10
idf = df.map(lambda kv: (kv[0], 1. + log10(num_documents / kv[1]))).collect()

Now we just have to do a join on the term_id:

def calc_tfidf(tf_tuples, idf_tuples):
    # Pair each term's tf with its idf by matching on term_id.
    return [(k1, v1 * v2) for (k1, v1) in tf_tuples
            for (k2, v2) in idf_tuples if k1 == k2]

tfidf = tf.map(lambda kv: (kv[0], calc_tfidf(kv[1], idf)))

This isn't a particularly performant solution, though. Calling collect to bring idf into the driver program so that it's available for the join seems like the wrong thing to do.
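
One way to avoid that (a sketch of my own, assuming sc is the SparkContext) is to ship the collected idf pairs as a broadcast variable, so each executor receives the lookup table once instead of having it serialized into every task closure:

idf_bc = sc.broadcast(dict(idf))

# Multiply each term's count by its broadcast idf value.
tfidf = tf.map(lambda kv: (kv[0],
                           [(t, c * idf_bc.value[t]) for (t, c) in kv[1]
                            if t in idf_bc.value]))

Another option is to not collect idf at all and express the combination as an RDD join on term_id, at the cost of an extra shuffle.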

And of course, it requires first tokenizing the documents and creating a mapping from each unique token in the vocabulary to some token_id.
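
That preprocessing might look roughly like this, where raw_docs is assumed to be an RDD of (document_id, text) pairs and the whitespace split stands in for whatever tokenizer you actually use:

# Tokenize each document (placeholder tokenizer).
tokenized_text = raw_docs.mapValues(lambda text: text.lower().split())

# Assign each unique token an integer token_id.
vocab = (tokenized_text.flatMap(lambda kv: kv[1])
                       .distinct()
                       .zipWithIndex()
                       .collectAsMap())

# Replace each document's words with their token_ids.
tokenized = tokenized_text.mapValues(lambda words: [vocab[w] for w in words])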

If anyone can improve on this, I'm very interested.
