What does Exception: Randomness of hash of string should be disabled via PYTHONHASHSEED mean in pyspark?


I am trying to create a dictionary from a list in pyspark. I have the following list of lists:

rawPositions

Gives

[[100979


                      
2 Answers

生来不讨喜
2021-01-12 01:36

See the Runtime Environment section of the Spark Configuration documentation: https://spark.apache.org/docs/latest/configuration.html#loading-default-configurations

When running:

$SPARK_HOME/bin/spark-submit


Add:

--conf spark.executorEnv.PYTHONHASHSEED=321
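
If the job is launched from a Python script rather than typed by hand, the same flag can be assembled programmatically. This is only a minimal sketch: the helper name `build_submit_command`, the application file `my_job.py`, and the `/opt/spark` fallback path are hypothetical, and the seed value 321 is arbitrary.

```python
import os


def build_submit_command(app_path, hash_seed=321):
    """Build a spark-submit argument list that fixes PYTHONHASHSEED on executors."""
    # Assumption: SPARK_HOME is set; "/opt/spark" is only a placeholder fallback.
    spark_home = os.environ.get("SPARK_HOME", "/opt/spark")
    return [
        os.path.join(spark_home, "bin", "spark-submit"),
        "--conf",
        # spark.executorEnv.<NAME> adds an environment variable to executor processes.
        f"spark.executorEnv.PYTHONHASHSEED={hash_seed}",
        app_path,  # hypothetical application script
    ]


command = build_submit_command("my_job.py")
print(" ".join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` to start the job.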

                                                                        
                                                        
            
            
              
                
花落未央
2021-01-12 01:41

Hash randomization was introduced in Python 3.2.3 and has been enabled by default since Python 3.3: the hashes of str, bytes and datetime objects are salted with a random value to prevent certain kinds of denial-of-service attacks. As a result, hash values are consistent within a single interpreter session but differ from session to session. Setting PYTHONHASHSEED to an integer fixes the seed used for the salt, so hash values stay the same between sessions.
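
To see which of these settings a given interpreter is actually running with, something like the following can help. It is a rough diagnostic sketch, not part of PySpark; the function name `hash_settings` is made up for illustration.

```python
import os
import sys


def hash_settings():
    """Report the hash-related settings of the current interpreter."""
    return {
        # None means the variable is not set in this process's environment.
        "PYTHONHASHSEED": os.environ.get("PYTHONHASHSEED"),
        # Non-zero when string hashes are salted in this interpreter.
        "hash_randomization": sys.flags.hash_randomization,
        "python_version": sys.version.split()[0],
    }


print(hash_settings())
```

Running the same function on the driver and inside a task is one way to check whether the variable actually reached the workers.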

You can easily check this in your shell. If PYTHONHASHSEED is not set, you'll get a different value on each run:

unset PYTHONHASHSEED
for i in `seq 1 3`;
  do
    python3 -c "print(hash('foo'))";
  done

## -7298483006336914254
## -6081529125171670673
## -3642265530762908581


But when it is set, you'll get the same value on every run:

export PYTHONHASHSEED=323
for i in `seq 1 3`;
  do
    python3 -c "print(hash('foo'))";
  done

## 8902216175227028661
## 8902216175227028661
## 8902216175227028661


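Since groupBy and other operations that depend on the default partitioner use hashing, you need the same value of PYTHONHASHSEED on all machines in the cluster to get consistent results. PySpark raises this exception when the variable is not set, because otherwise the same key could hash to different partitions on different workers.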
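
The effect on partitioning can be illustrated without a cluster. The sketch below is a simplified model, not Spark's actual partitioner code: it assumes a key is assigned to partition `hash(key) % num_partitions` and uses fresh interpreter processes to stand in for separate workers. The helper name `partition_of` and the seed values are arbitrary.

```python
import os
import subprocess
import sys


def partition_of(key, num_partitions, seed):
    """Compute hash(key) % num_partitions in a fresh interpreter with the given seed."""
    # Each subprocess plays the role of a worker with its own PYTHONHASHSEED.
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    code = f"print(hash({key!r}) % {num_partitions})"
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env, capture_output=True, text=True, check=True,
    )
    return int(result.stdout)


# Two "workers" sharing a seed always agree on where the key belongs.
print(partition_of("foo", 8, seed=323), partition_of("foo", 8, seed=323))
# With different seeds they may disagree, so equal keys could end up apart.
print(partition_of("foo", 8, seed=1), partition_of("foo", 8, seed=2))
```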

See also:


Python Setup and Usage » Command line and environment
oCERT 2011-003 multiple implementations denial-of-service via hash algorithm collision

                                                                        
                                                        
            
            
              
                