pickle.PicklingError: args[0] from __newobj__ args has the wrong class with hadoop python

死守一世寂寞 2021-01-13 08:44

I am trying to delete stop words via Spark; the code is as follows:

from nltk.corpus import stopwords
from pyspark.context import SparkContext
fro         


        
3 Answers
  • 2021-01-13 09:18

    It's to do with how the stop words module is shipped to the executors. As a workaround, import the stopwords library within the function itself. Please see the similar issue linked below. I had the same issue and this workaround fixed the problem.

        def stopwords_delete(word_list):
            # Import inside the function so each executor loads the
            # corpus locally instead of unpickling it from the driver
            from nltk.corpus import stopwords
            filtered_words = [w for w in word_list
                              if w not in stopwords.words('english')]
            return filtered_words
    

    Similar Issue

    I would recommend from pyspark.ml.feature import StopWordsRemover as a permanent fix.
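
    A minimal sketch of that approach (the SparkSession setup and the one-row DataFrame with the words/filtered column names are illustrative assumptions, not from the question):

        from pyspark.sql import SparkSession
        from pyspark.ml.feature import StopWordsRemover

        spark = SparkSession.builder.getOrCreate()

        # Hypothetical one-row DataFrame; the column names are just examples
        df = spark.createDataFrame([(["this", "is", "the", "world"],)], ["words"])

        # StopWordsRemover filters against an English stop-word list by default
        remover = StopWordsRemover(inputCol="words", outputCol="filtered")
        remover.transform(df).show(truncate=False)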

  • 2021-01-13 09:25

    Probably it's just because you are evaluating stopwords.words('english') every time on the executor. Define it once outside, on the driver, and this would work.
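
    A minimal sketch of that idea (it reuses the stopwords_delete name from the question; the body is an assumption about the original code):

        from nltk.corpus import stopwords

        # Built once on the driver; only this plain Python set is
        # pickled and shipped to the executors, not the NLTK corpus reader
        stop_words = set(stopwords.words('english'))

        def stopwords_delete(word_list):
            return [w for w in word_list if w not in stop_words]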

  • 2021-01-13 09:26

    You are using map over an RDD which has only one row, with each word as a column. So the entire row of the RDD is passed to the stopwords_delete function, and the for loop within it tries to match that whole row against the stopwords, which fails. Try it like this:

        filtered_words = stopwords_delete(wordlist.flatMap(lambda x: x).collect())
        print(filtered_words)
    

    I got this output as filtered_words:

    ["shan't", "she'd", 'fuck', 'world', "who's"]
    

    Also, include a return in your function.

    Another way: you could use a list comprehension to replace the stopwords_delete function:

        filtered_words = wordlist.flatMap(lambda x: [i for i in x if i not in stopwords.words('english')]).collect()
    