Question
I have a dataframe with a text column and a category column. I want to count the words that are common across these categories. I am using nltk to remove stop words and tokenize, but I am not able to include the category in that process. Below is sample code for the problem.
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession,Row
import nltk
spark_conf = SparkConf().setAppName("test")
spark = SparkSession.builder.config(conf=spark_conf).getOrCreate()
sc = spark.sparkContext
def wordTokenize(x):
    words = [word for line in x for word in line.split()]
    return words
def rmstop(x):
    from nltk.corpus import stopwords
    stop_words = set(stopwords.words('english'))
    words = [w for w in x if w not in stop_words]
    return words
# in the actual problem I read a file into a dataframe,
# so creating a sample dataframe here first
df = [('Happy', 'I am so happy today'),
      ('Happy', 'its my birthday'),
      ('Happy', 'lets have fun'),
      ('Sad', 'I am going to die today'),
      ('Neutral', 'I am going to office today'),
      ('Neutral', 'This is my house')]
rdd = sc.parallelize(df)
rdd_data = rdd.map(lambda x: Row(Category=x[0], text=x[1]))
df_data = spark.createDataFrame(rdd_data)
#convert to rdd for nltk process
df_data_rdd = df_data.select('text').rdd.flatMap(lambda x: x)
#make it lower and sentence tokenize
df_data_rdd1 = df_data_rdd.map(lambda x: x.lower())\
    .map(lambda x: nltk.sent_tokenize(x))
#word tokenize
data_rdd1_words = df_data_rdd1.map(wordTokenize)
#stop word and distinct
data_rdd1_words_clean = data_rdd1_words.map(rmstop)\
    .flatMap(lambda x: x)\
    .distinct()
data_rdd1_words_clean.collect()
Output: ['today', 'birthday', 'lets', 'die', 'house', 'happy', 'fun', 'going', 'office']
I want to count word frequencies (after the preprocessing) with respect to categories. For example, "today": 3, since it is present in all three categories.
Answer 1:
Here, extractphraseRDD is an RDD that contains your phrases. The code below counts the words and lists them in descending order of frequency.
freqDistRDD = extractphraseRDD.flatMap(lambda x: nltk.FreqDist(x).most_common())\
    .reduceByKey(lambda x, y: x + y)\
    .sortBy(lambda x: x[1], ascending=False)
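This gives overall word frequencies, but not the per-category count asked for (how many categories each word appears in). A minimal sketch of that part, assuming the df_data DataFrame and the wordTokenize/rmstop helpers defined in the question, keeps the category next to each word, de-duplicates the (word, category) pairs, and then counts categories per word:

# Assumes df_data, wordTokenize and rmstop from the question are already defined.
# Carry the category alongside the text so it survives the preprocessing steps.
cat_word_rdd = df_data.rdd\
    .map(lambda row: (row.Category, nltk.sent_tokenize(row.text.lower())))\
    .map(lambda cw: (cw[0], rmstop(wordTokenize(cw[1]))))\
    .flatMap(lambda cw: [(word, cw[0]) for word in cw[1]])\
    .distinct()  # keep each (word, category) pair only once

# Count in how many distinct categories each word occurs, e.g. ('today', 3)
word_category_counts = cat_word_rdd\
    .map(lambda wc: (wc[0], 1))\
    .reduceByKey(lambda a, b: a + b)\
    .sortBy(lambda x: x[1], ascending=False)
print(word_category_counts.collect())

Punctuation is not stripped here, matching the preprocessing in the question; drop the .distinct() step if you want raw occurrence counts per word instead of category counts.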
Source: https://stackoverflow.com/questions/61853908/pyspark-rdd-word-calculate