I am trying to delete stop words via Spark; the code is as follows:
from nltk.corpus import stopwords
from pyspark.context import SparkContext
fro
It's to do with the loading of the stopwords module on the executors. As a workaround, import the stopwords library within the function itself; please see the similar issue linked below. I had the same issue and this workaround fixed the problem.
def stopwords_delete(word_list):
    from nltk.corpus import stopwords  # imported inside the function as the workaround
    filtered_words = []
    print(word_list)
    for word in word_list:
        if word not in stopwords.words('english'):
            filtered_words.append(word)
    return filtered_words
Similar Issue
I would recommend from pyspark.ml.feature import StopWordsRemover
as a permanent fix.
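For example, a minimal sketch, assuming your words sit in an array-of-strings column (the names df, words, and filtered here are placeholders, not from your code):

from pyspark.sql import SparkSession
from pyspark.ml.feature import StopWordsRemover

spark = SparkSession.builder.getOrCreate()

# placeholder DataFrame with an array<string> column named "words"
df = spark.createDataFrame([(["shan't", "she'd", "the", "world"],)], ["words"])

# removes English stop words by default
remover = StopWordsRemover(inputCol="words", outputCol="filtered")
remover.transform(df).show(truncate=False)

StopWordsRemover uses Spark's built-in stop word list and runs on the executors, so it avoids shipping the NLTK module around entirely.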
Probably it's just because you are defining stopwords.words('english') every time on the executor. Define it outside the function and this will work.
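For example, a minimal sketch, assuming wordlist is the RDD from your code:

from nltk.corpus import stopwords

# build the stopword set once on the driver; Spark serializes it into the
# closure that is shipped to the executors
english_stopwords = set(stopwords.words('english'))

filtered = wordlist.flatMap(lambda x: [w for w in x if w not in english_stopwords]).collect()

Using a set also makes each membership test O(1) instead of scanning a list for every word.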
You are using map over an RDD that has only one row, with each word as a column. So the entire row of the RDD is passed to the stopwords_delete function, and the for loop inside it tries to match the whole row against the stopwords, which fails. Try it like this:
filtered_words = stopwords_delete(wordlist.flatMap(lambda x: x).collect())
print(filtered_words)
I got this output for filtered_words:
["shan't", "she'd", 'fuck', 'world', "who's"]
Also, include a return statement in your function, as in the version shown above.
Alternatively, you could use a list comprehension to replace the stopwords_delete function:
filtered_words = wordlist.flatMap(lambda x: [i for i in x if i not in stopwords.words('english')]).collect()