What does Exception: Randomness of hash of string should be disabled via PYTHONHASHSEED mean in pyspark?

长发绾君心 2021-01-12 01:11

I am trying to create a dictionary from a list in pyspark. I have the following list of lists:

rawPositions

Gives

[[100979

2 Answers
  • 2021-01-12 01:36

    Check the Runtime Environment section of the Spark configuration documentation: https://spark.apache.org/docs/latest/configuration.html#loading-default-configurations

    When running:

    $SPARK_HOME/bin/spark-submit
    

    Add:

    --conf spark.executorEnv.PYTHONHASHSEED=321
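
    If you build the session from Python instead of passing the flag to spark-submit, the same key can be set programmatically. A minimal sketch, assuming SparkSession (Spark 2.x+) and the same seed value 321; the app name is just a placeholder:

    from pyspark.sql import SparkSession

    # Propagate a fixed hash seed to the Python workers on every executor.
    # Note: the driver's own interpreter still needs PYTHONHASHSEED exported
    # before it starts (e.g. in the shell or in conf/spark-env.sh); setting
    # the executor env here does not change the already-running driver.
    spark = (
        SparkSession.builder
        .appName("pythonhashseed-example")   # placeholder app name
        .config("spark.executorEnv.PYTHONHASHSEED", "321")
        .getOrCreate()
    )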
    
  • 2021-01-12 01:41

    Since Python 3.2.3, the hash of str, bytes and datetime objects is salted with a random value to prevent certain kinds of denial-of-service attacks. This means that hash values are consistent within a single interpreter session but differ from session to session. PYTHONHASHSEED seeds the hash randomization so that the value is consistent across sessions.

    You can easily check this in your shell. If PYTHONHASHSEED is not set, you'll get random values:

    unset PYTHONHASHSEED
    for i in `seq 1 3`;
      do
        python3 -c "print(hash('foo'))";
      done
    
    ## -7298483006336914254
    ## -6081529125171670673
    ## -3642265530762908581
    

    but when it is set, you'll get the same value on each execution:

    export PYTHONHASHSEED=323
    for i in `seq 1 3`;
      do
        python3 -c "print(hash('foo'))";
      done
    
    ## 8902216175227028661
    ## 8902216175227028661
    ## 8902216175227028661
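
    On a cluster the variable also has to reach the executors' Python workers. As a quick sanity check from pyspark itself, here is a minimal sketch, assuming an active SparkContext named sc (e.g. a pyspark shell), that reads the variable inside the workers:

    import os

    # Collect PYTHONHASHSEED as seen by the executor-side Python workers;
    # every task should report the same single value.
    seeds = (
        sc.parallelize(range(20), 4)
          .map(lambda _: os.environ.get("PYTHONHASHSEED"))
          .collect()
    )
    print(set(seeds))   # e.g. {'321'}; {None} means the seed is not set

    collect() is used here instead of distinct() so the check itself does not need a shuffle, and therefore cannot raise the very exception from the question when the seed is missing.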
    

    Since groupBy and other operations that depend on the default partitioner use hashing, you need the same value of PYTHONHASHSEED on all machines in the cluster to get consistent results.
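
    To see why, here is a small plain-Python sketch (no Spark needed) of the placement rule a hash partitioner effectively applies, roughly hash(key) % numPartitions, which is only reproducible across machines when every interpreter hashes the key the same way:

    # Hypothetical stand-in for a hash partitioner's placement rule.
    num_partitions = 4

    def partition_for(key):
        # With hash randomization enabled, hash('foo') differs between
        # interpreter sessions, so the same key could be routed to
        # different partitions on different nodes.
        return hash(key) % num_partitions

    print(partition_for("foo"))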

    See also:

    • Python Setup and Usage » Command line and environment
    • oCERT 2011-003 multiple implementations denial-of-service via hash algorithm collision