How to update an ML model during a Spark Streaming job without restarting the application?

Submitted by 走远了吗 on 2019-12-24 10:27:37

Question


I've got a Spark Streaming job whose goal is to :

  • read a batch of messages
  • predict a variable Y given these messages using a pre-trained ML pipeline

The problem is, I'd like to be able to update the model used by the executors without restarting the application.

Simply put, here's what it looks like:

from pyspark.streaming.kafka import KafkaUtils

model = ...  # model initialization (pre-trained ML pipeline)

def preprocess(keyValueList):
    ...  # do some preprocessing

def predict(preprocessedRDD):
    if not preprocessedRDD.isEmpty():
        df = ...  # create a DataFrame from the RDD
        df = model.transform(df)
        # more things to do

stream = KafkaUtils.createDirectStream(ssc, [kafkaTopic], kafkaParams)

stream.mapPartitions(preprocess).foreachRDD(predict)

In this case, the model is simply used. Not updated.

I've thought about several possibilities, but I've crossed them all out:

  • broadcasting the model every time it changes (broadcast variables are read-only, so the model cannot be updated)
  • reading the model from HDFS on the executors (loading it requires the SparkContext, which isn't available on executors, so this isn't possible)

Any ideas?

Thanks a lot!


Answer 1:


I've solved this issue before in two different ways:

  • a TTL on the model
  • rereading the model on each batch

Both of these solutions assume an additional job that regularly retrains on the data you've accumulated (e.g. once a day); a sketch of the reload logic follows below.
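
For the reload-based approaches, here is a minimal PySpark sketch, assuming the separate training job writes the pipeline to a fixed HDFS path; the path, the TTL value, and the SparkSession variable spark are illustrative assumptions, not part of the original answer:

import time
from pyspark.ml import PipelineModel

MODEL_PATH = "hdfs:///models/latest"   # hypothetical path written by the separate training job
MODEL_TTL_SECONDS = 3600               # illustrative TTL: reload at most once per hour

model = PipelineModel.load(MODEL_PATH)
model_loaded_at = time.time()

def get_model():
    # Reload the pipeline from HDFS once the TTL has expired; dropping the
    # TTL check gives the "reread on every batch" variant instead.
    global model, model_loaded_at
    if time.time() - model_loaded_at > MODEL_TTL_SECONDS:
        model = PipelineModel.load(MODEL_PATH)
        model_loaded_at = time.time()
    return model

def predict(preprocessedRDD):
    if not preprocessedRDD.isEmpty():
        df = spark.createDataFrame(preprocessedRDD)  # assumes a SparkSession named spark and Row-like records
        df = get_model().transform(df)
        # more things to do

Because predict is passed to foreachRDD, get_model runs on the driver for every micro-batch, so the model is refreshed without restarting the streaming application (see Answer 2 below).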




Answer 2:


The function you pass to foreachRDD is executed on the driver; only the RDD operations themselves are performed by the executors. As such, you don't need to serialize the model, assuming you are using a Spark ML pipeline that operates on RDDs, which as far as I know they all do. Spark handles the training/prediction for you; you don't need to distribute it manually.
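
To make the driver/executor split concrete, here is a minimal sketch reusing the names from the question; maybe_reload_model is a hypothetical helper (e.g. the TTL logic from Answer 1), and spark is an assumed SparkSession:

def predict(preprocessedRDD):
    # This function body runs on the driver for every micro-batch, so the
    # module-level model reference can safely be swapped here.
    global model
    model = maybe_reload_model(model)  # hypothetical helper that rereads the model when needed
    if not preprocessedRDD.isEmpty():
        df = spark.createDataFrame(preprocessedRDD)  # assumes Row-like records
        df = model.transform(df)  # the actual scoring runs distributed on the executors

stream.mapPartitions(preprocess).foreachRDD(predict)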



Source: https://stackoverflow.com/questions/43387114/how-to-update-a-ml-model-during-a-spark-streaming-job-without-restarting-the-app
