pickle.PicklingError: args[0] from __newobj__ args has the wrong class with hadoop python

死守一世寂寞 2021-01-13 08:44

I am trying to delete stop words via Spark; the code is as follows:

from nltk.corpus import stopwords
from pyspark.context import SparkContext
fro         


        
3 Answers
  • 2021-01-13 09:18

    It's to do with how the stop words module is shipped to the executors. As a workaround, import the stopwords library within the function itself. Please see the similar issue linked below. I had the same issue and this workaround fixed the problem.

        def stopwords_delete(word_list):
            # Import inside the function so each executor loads the
            # corpus locally instead of unpickling it from the driver
            from nltk.corpus import stopwords
            filtered_words = [w for w in word_list
                              if w not in stopwords.words('english')]
            return filtered_words
    

    Similar Issue

    I would recommend from pyspark.ml.feature import StopWordsRemover as a permanent fix.
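
    A minimal sketch of that approach (the SparkSession setup and the one-row DataFrame with the words/filtered column names are illustrative assumptions, not from the question):

        from pyspark.sql import SparkSession
        from pyspark.ml.feature import StopWordsRemover

        spark = SparkSession.builder.getOrCreate()

        # Hypothetical one-row DataFrame; the column names are just examples
        df = spark.createDataFrame([(["this", "is", "the", "world"],)], ["words"])

        # StopWordsRemover filters against an English stop-word list by default
        remover = StopWordsRemover(inputCol="words", outputCol="filtered")
        remover.transform(df).show(truncate=False)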

  • 2021-01-13 09:25

    Probably it's just because you are evaluating stopwords.words('english') every time on the executor. Define it once outside, on the driver, and this would work.
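
    A minimal sketch of that idea (it reuses the stopwords_delete name from the question; the body is an assumption about the original code):

        from nltk.corpus import stopwords

        # Built once on the driver; only this plain Python set is
        # pickled and shipped to the executors, not the NLTK corpus reader
        stop_words = set(stopwords.words('english'))

        def stopwords_delete(word_list):
            return [w for w in word_list if w not in stop_words]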

  • 2021-01-13 09:26

    You are using map over an RDD which has only one row, with each word as a column. So the entire row of the RDD is passed to the stopwords_delete function, and the for loop within it tries to match that whole row against the stopwords, which fails. Try it like this:

        filtered_words = stopwords_delete(wordlist.flatMap(lambda x: x).collect())
        print(filtered_words)
    

    I got this output as filtered_words:

    ["shan't", "she'd", 'fuck', 'world', "who's"]
    

    Also, include a return in your function.

    Another way: you could use a list comprehension to replace the stopwords_delete function:

        filtered_words = wordlist.flatMap(lambda x: [i for i in x if i not in stopwords.words('english')]).collect()
    