Question
I've got a Spark Streaming job whose goal is to:
- read a batch of messages
- predict a variable Y given these messages using a pre-trained ML pipeline
The problem is, I'd like to be able to update the model used by the executors without restarting the application.
Simply put, here's what it looks like:
model = #model initialization

def preprocess(keyValueList):
    #do some preprocessing

def predict(preprocessedRDD):
    if not preprocessedRDD.isEmpty():
        df = #create df from rdd
        df = model.transform(df)
        #more things to do

stream = KafkaUtils.createDirectStream(ssc, [kafkaTopic], kafkaParams)
stream.mapPartitions(preprocess).foreachRDD(predict)
In this case, the model is simply used, never updated.
I've thought about several possibilities, but I've crossed them all out:
- broadcasting the model every time it changes (a broadcast variable is read-only, so it cannot be updated)
- reading the model from HDFS on the executors (loading it needs the SparkContext, which isn't available on executors, so this isn't possible)
Any ideas?
Thanks a lot!
Answer 1:
I've solved this issue before in two different ways:
- a TTL on the model
- rereading the model on each batch
Both solutions assume an additional job that regularly (e.g. once a day) retrains the model on the data you've accumulated. A sketch of the TTL approach follows below.
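Here is a minimal sketch of the TTL idea, under the assumption that the separate training job saves a Spark ML PipelineModel to a fixed HDFS path; the path and the TTL value are hypothetical placeholders. The reload happens on the driver inside the function handed to foreachRDD, so nothing needs to be broadcast or updated on the executors:

import time
from pyspark.ml import PipelineModel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

MODEL_PATH = "hdfs:///models/latest"  # hypothetical: where the training job saves the pipeline
MODEL_TTL_SECONDS = 3600              # hypothetical: reload at most once per hour

_model = None
_loaded_at = 0.0

def get_model():
    # Return the cached pipeline, reloading it on the driver once the TTL has expired
    global _model, _loaded_at
    if _model is None or time.time() - _loaded_at > MODEL_TTL_SECONDS:
        _model = PipelineModel.load(MODEL_PATH)
        _loaded_at = time.time()
    return _model

def predict(preprocessedRDD):
    if not preprocessedRDD.isEmpty():
        df = spark.createDataFrame(preprocessedRDD)  # assumes the RDD rows are convertible to a DataFrame
        df = get_model().transform(df)
        # ...rest of the prediction logic

This predict function can be plugged into the same stream.mapPartitions(preprocess).foreachRDD(predict) call shown in the question.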
Answer 2:
The function you pass to foreachRDD is executed by the driver; only the RDD operations themselves are performed by the executors. As such, you don't need to serialize the model, assuming you are using a Spark ML pipeline that operates on RDDs, which as far as I know they all do. Spark handles distributing the training/prediction for you; you don't need to do it manually.
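To illustrate the point, here is a sketch (not from the answer itself): because the function passed to foreachRDD runs on the driver, it can simply re-read the pipeline at the start of every micro-batch; the HDFS path is hypothetical, and only the distributed transform work reaches the executors.

from pyspark.ml import PipelineModel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def predict(preprocessedRDD):
    if not preprocessedRDD.isEmpty():
        # Runs on the driver at the start of each micro-batch
        model = PipelineModel.load("hdfs:///models/latest")  # hypothetical path written by the training job
        df = spark.createDataFrame(preprocessedRDD)          # assumes rows convertible to a DataFrame
        predictions = model.transform(df)                    # the actual prediction work is distributed by Spark
        # ...rest of the prediction logic

stream.mapPartitions(preprocess).foreachRDD(predict)  # stream and preprocess as defined in the question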
Source: https://stackoverflow.com/questions/43387114/how-to-update-a-ml-model-during-a-spark-streaming-job-without-restarting-the-app