Question
I am developing a Python prediction script using Spark (PySpark) Streaming and Keras. The prediction happens on the executor, where I call model.predict().
The modules I import are:
from keras.layers.core import Dense, Activation, Dropout
from keras.layers.recurrent import LSTM
from keras.models import Sequential
I have checked, and these imports take 2.5 seconds to load on the Spark driver (2 cores + 2 GB). What surprises me is that each time an executor gets a job, it does these imports again. I am sure the imports happen for every job submitted to the executor because I see the statements below in the executor logs once per job, and they only appear when the above modules are imported.
/opt/conda/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Using TensorFlow backend.
My target is to make the prediction within 1 second, but the imports alone take 2.5 seconds (each time they run on a Spark executor). Is this the intended behavior? Is there anything I can do to reduce this time, ideally to milliseconds?
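For context, the setup described above looks roughly like the sketch below. This is only an illustration: the socket source, batch interval, model path, and the predict_partition helper are assumptions, not the actual code.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="keras-streaming-prediction")
ssc = StreamingContext(sc, batchDuration=1)  # 1-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)  # illustrative input source

def predict_partition(rows):
    # These imports run inside the executor's Python worker, so the
    # 2.5-second Keras/TensorFlow initialization cost is paid here.
    from keras.models import load_model
    import numpy as np
    model = load_model("/path/to/model.h5")  # illustrative path
    for row in rows:
        features = np.array([[float(x) for x in row.split(",")]])
        yield model.predict(features).tolist()

lines.mapPartitions(predict_partition).pprint()
ssc.start()
ssc.awaitTermination()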
Update 1
Over the past few days I analyzed this and found two main issues.
- I found a way to wrap the Keras model, pickle it on the driver, and unpickle it on the executor. That improves the time by about 1 second.
- But when mapPartitions() runs for each batch, the whole Keras and TensorFlow initialization happens again on the executor (for every job), which takes 2.5 seconds. Is there a way to run these imports once per executor rather than once per job? Maybe there is some file in PySpark where I can put these imports (assuming such a file is executed once when the executor comes up). One possible pattern along these lines is sketched below.
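One pattern that is often suggested for this (a sketch under assumptions, not a verified fix for this exact job): broadcast a pickle-friendly form of the model from the driver, and hide the heavy imports and model reconstruction behind a lazily initialized singleton in a small module shipped to the executors (for example with spark-submit --py-files). Python caches imported modules in sys.modules, so as long as the Python worker process is reused across tasks (spark.python.worker.reuse defaults to true), the 2.5-second initialization should be paid once per worker rather than once per job. The names model_cache, get_model, and predict_partition below are illustrative.

model_cache.py, shipped to every executor (e.g. spark-submit --py-files model_cache.py):

_model = None

def get_model(arch_json, weights):
    # Module-level cache: the heavy imports and model construction run only on
    # the first call inside a given executor Python worker process.
    global _model
    if _model is None:
        from keras.models import model_from_json
        _model = model_from_json(arch_json)
        _model.set_weights(weights)
    return _model

Driver side, broadcasting the model as architecture JSON plus weight arrays (both pickle cleanly):

payload = (model.to_json(), model.get_weights())
bc = sc.broadcast(payload)

def predict_partition(rows):
    import numpy as np
    import model_cache  # resolved from the copy shipped via --py-files
    arch_json, weights = bc.value
    m = model_cache.get_model(arch_json, weights)
    for row in rows:
        features = np.array([[float(x) for x in row.split(",")]])
        yield m.predict(features).tolist()

lines.mapPartitions(predict_partition).pprint()

Whether this actually brings the per-batch overhead down to milliseconds depends on worker reuse staying enabled and on the one-off first initialization per worker being acceptable.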
Source: https://stackoverflow.com/questions/52715477/import-statements-taking-time-on-spark-executors-pyspark-executors