Spark add new fitted stage to a exitsting PipelineModel without fitting again

问题

I have a saved PipelineModel:

pipe_model = pipe.fit(df_train)
pipe_model.write().overwrite().save("/user/pipe_text_2")

And now I want to add to this Pipe a new already fited PipelineModel:

pipe_model = PipelineModel.load("/user/pipe_text_2")
df2 = pipe_model.transform(df1)

kmeans = KMeans(k=20)
pipe2 = Pipeline(stages=[kmeans])
pipe_model2 = pipe2.fit(df2)

Is that possible without fitting it again? In order to obtain a new PipelineModel but not a new Pipeline. The ideal thing would be the following:

pipe_model_new = pipe_model + pipe_model2
TypeError: unsupported operand type(s) for +: 'PipelineModel' and 'PipelineModel'

I've found Join two Spark mllib pipelines together but with this solution you need to fit the whole Pipe again. That is what I'm trying to avoid.

回答1:

Since PipelineModels are valid stages for a PipelieModel class, you should be able to use this which does not require fiting again:

pipe_model_new = PipelineModel(stages = [pipe_model , pipe_model2])
final_df = pipe_model_new.transform(df1)

来源：https://stackoverflow.com/questions/49337830/spark-add-new-fitted-stage-to-a-exitsting-pipelinemodel-without-fitting-again

标签

apache-spark

pyspark

apache-spark-mllib

pipeline

易学教程内所有资源均来自网络或用户发布的内容，如有违反法律规定的内容欢迎反馈！
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!