pyspark RDD word calculate


Question


I have a dataframe with text and category. I want to count the words that are common across these categories. I am using NLTK to remove the stop words and tokenize, but I am not able to include the category in the process. Below is a sample of my code for the problem.

from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession, Row
import nltk

# NLTK data used below (first run only):
# nltk.download('punkt'); nltk.download('stopwords')

spark_conf = SparkConf().setAppName("test")
spark = SparkSession.builder.config(conf=spark_conf).getOrCreate()
sc = spark.sparkContext

def wordTokenize(x):
    # flatten a list of sentences into a flat list of whitespace-split tokens
    words = [word for line in x for word in line.split()]
    return words

def rmstop(x):
    # remove English stop words from a list of tokens
    from nltk.corpus import stopwords
    stop_words = set(stopwords.words('english'))
    return [w for w in x if w not in stop_words]

# in the actual problem I read a file into a DataFrame,
# so first build one from sample data

data = [('Happy', 'I am so happy today'),
        ('Happy', 'its my birthday'),
        ('Happy', 'lets have fun'),
        ('Sad', 'I am going to die today'),
        ('Neutral', 'I am going to office today'),
        ('Neutral', 'This is my house')]
rdd = sc.parallelize(data)
rdd_data = rdd.map(lambda x: Row(Category=x[0], text=x[1]))
df_data = spark.createDataFrame(rdd_data)  # was sqlContext, which is undefined


# convert to an RDD for the NLTK processing
df_data_rdd = df_data.select('text').rdd.flatMap(lambda x: x)

# lower-case and sentence-tokenize
df_data_rdd1 = df_data_rdd.map(lambda x: x.lower()) \
    .map(lambda x: nltk.sent_tokenize(x))

# word tokenize
data_rdd1_words = df_data_rdd1.map(wordTokenize)

# remove stop words, flatten, and de-duplicate
data_rdd1_words_clean = data_rdd1_words.map(rmstop) \
    .flatMap(lambda x: x) \
    .distinct()

data_rdd1_words_clean.collect()

Output: ['today', 'birthday', 'lets', 'die', 'house', 'happy', 'fun', 'going', 'office']

I want to count word frequencies (after the preprocessing above) with respect to categories. For example, "today": 3, since it appears in all three categories.
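For illustration, one way to express this per-category count is the following sketch. It assumes the df_data DataFrame above and NLTK's English stop-word list, and tokenizes with a plain split() instead of sent_tokenize for brevity, so punctuation is not handled:

from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

# (category, word) pairs, lower-cased and stop-word filtered
cat_word = df_data.rdd.flatMap(
    lambda row: [(row.Category, w) for w in row.text.lower().split()
                 if w not in stop_words])

# keep each word at most once per category, then count categories per word
word_cat_count = (cat_word.distinct()
                  .map(lambda cw: (cw[1], 1))
                  .reduceByKey(lambda a, b: a + b))

word_cat_count.collect()
# e.g. [('today', 3), ('going', 2), ('happy', 1), ...]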


Answer 1:


Here, extractphraseRDD is an RDD whose elements are your phrases (lists of tokens). The code below counts word occurrences and displays them in descending order of frequency.

freqDistRDD = (extractphraseRDD
               .flatMap(lambda x: nltk.FreqDist(x).most_common())
               .reduceByKey(lambda x, y: x + y)
               .sortBy(lambda x: x[1], ascending=False))
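For example, such an RDD can be built from the question's own pipeline; this is a sketch that reuses data_rdd1_words and rmstop from above:

extractphraseRDD = data_rdd1_words.map(rmstop)  # RDD of per-sentence token lists

# ... then define freqDistRDD as above and inspect it:
freqDistRDD.collect()
# e.g. [('today', 3), ('going', 2), ('happy', 1), ...]
# note: these are total occurrence counts across all rows,
# not the per-category counts asked about in the question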


Source: https://stackoverflow.com/questions/61853908/pyspark-rdd-word-calculate
