I am trying to create a dictionary from a list in pyspark. I have the following list of lists:
rawPositions
Gives
[[100979
See the Runtime Environment part of the Spark Configuration documentation: https://spark.apache.org/docs/latest/configuration.html#loading-default-configurations
When running $SPARK_HOME/bin/spark-submit, add:
--conf spark.executorEnv.PYTHONHASHSEED=321
Hash randomization was introduced in Python 3.2.3 and has been enabled by default since Python 3.3: the hash of str, bytes and datetime objects is salted with a random value to prevent certain kinds of denial-of-service attacks. This means that hash values are consistent within a single interpreter session but differ from session to session. PYTHONHASHSEED sets the RNG seed so that hash values are consistent between sessions.
You can easily check this in your shell. If PYTHONHASHSEED is not set, you'll get random values:
unset PYTHONHASHSEED
for i in `seq 1 3`;
do
python3 -c "print(hash('foo'))";
done
## -7298483006336914254
## -6081529125171670673
## -3642265530762908581
When it is set, you'll get the same value on each execution:
export PYTHONHASHSEED=323
for i in `seq 1 3`;
do
python3 -c "print(hash('foo'))";
done
## 8902216175227028661
## 8902216175227028661
## 8902216175227028661
Since groupBy and other operations that depend on the default partitioner use hashing, you need the same value of PYTHONHASHSEED on all machines in the cluster to get consistent results.
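To see why this matters for partitioning, here is a minimal sketch that runs without Spark. It assumes each "worker" is just a fresh Python interpreter, and it uses hash(key) % NUM_PARTITIONS as a simplified stand-in for a hash partitioner. This is not Spark's actual implementation, and the names hash_and_partition and NUM_PARTITIONS are made up for illustration.

```python
import os
import subprocess
import sys

NUM_PARTITIONS = 4  # hypothetical partition count, chosen only for illustration


def hash_and_partition(key, seed):
    """Start a fresh interpreter with PYTHONHASHSEED=seed.

    Returns (hash(key), hash(key) % NUM_PARTITIONS) as computed there.
    """
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    code = f"h = hash({key!r}); print(h, h % {NUM_PARTITIONS})"
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env, capture_output=True, text=True, check=True,
    )
    h, partition = result.stdout.split()
    return int(h), int(partition)


# Three "workers" sharing one seed agree on the hash and the partition.
same_seed = {hash_and_partition("foo", 321) for _ in range(3)}

# Three "workers" with different seeds compute different hashes,
# so they are not guaranteed to agree on the partition.
mixed_seeds = {hash_and_partition("foo", seed) for seed in (1, 2, 3)}

print(len(same_seed), len(mixed_seeds))
```

With a shared seed every interpreter produces the same (hash, partition) pair, so the set collapses to a single element. With different seeds the hashes differ, which is the situation you end up in when executors do not share PYTHONHASHSEED.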
See also: