Online (incremental) logistic regression in Spark [duplicate]

Submitted by 笑着哭i on 2020-01-15 08:16:10

Question


In Spark MLlib (RDD-based API) there is StreamingLogisticRegressionWithSGD for incremental training of a logistic regression model. However, this class has been deprecated and offers little functionality (e.g. no access to model coefficients or output probabilities).

In Spark ML (DataFrame-based API) I can only find the class LogisticRegression, which has only the fit method for batch training. This doesn't allow for a pattern of saving a model, reloading it and training it incrementally.

Needless to say, some applications benefit greatly from incremental learning. Is there any solution available in Spark?


Answer 1:


In Spark ML, calling LogisticRegression.fit() returns a LogisticRegressionModel. You can also put the LogisticRegression estimator in a Pipeline, fit the pipeline, and save/load the fitted pipeline as the basis for the save-and-reload pattern you describe.

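import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression

val lr = new LogisticRegression()
val pipeline = new Pipeline().setStages(Array(lr))
// `data` is assumed to be a DataFrame with "label" and "features" columns
val model = pipeline.fit(data)
model.write.overwrite().save("/tmp/saved_model")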
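
As a minimal sketch of the reload half of this pattern, the snippet below fits and saves a one-stage pipeline to /tmp/saved_model, loads it back with PipelineModel.load, and reads the coefficients and output probabilities that the question asks about. The local SparkSession, the toy training data, and the names `reloaded`, `lrModel` and `scored` are assumptions made for illustration, not part of the original answer.

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

// Local session for illustration only; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("lr-save-reload")
  .getOrCreate()

// Toy training data (assumption): a label column and a feature vector column.
val data = spark.createDataFrame(Seq(
  (0.0, Vectors.dense(0.0, 1.1)),
  (1.0, Vectors.dense(2.0, 1.0)),
  (0.0, Vectors.dense(0.3, 1.3)),
  (1.0, Vectors.dense(1.8, 0.7))
)).toDF("label", "features")

// Fit the pipeline and save the fitted PipelineModel.
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)
val pipeline = new Pipeline().setStages(Array(lr))
pipeline.fit(data).write.overwrite().save("/tmp/saved_model")

// Load the fitted pipeline back; its only stage is the logistic regression model.
val reloaded = PipelineModel.load("/tmp/saved_model")
val lrModel = reloaded.stages.head.asInstanceOf[LogisticRegressionModel]
println(s"coefficients = ${lrModel.coefficients}, intercept = ${lrModel.intercept}")

// transform() adds "rawPrediction", "probability" and "prediction" columns.
val scored = reloaded.transform(data)
scored.select("features", "probability", "prediction").show(false)
```

Note that the reloaded object is a fitted PipelineModel, i.e. a transformer used for scoring; loading it does not by itself continue training from the saved coefficients.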

If you want to train the model on streaming data or apply it to streaming data, you can define a Structured Streaming DataFrame and pass it to the pipeline.

For example (taken from the Spark docs):

import org.apache.spark.sql.types.StructType

// Read all the CSV files written atomically in a directory.
// `spark` is the active SparkSession.
val userSchema = new StructType().add("name", "string").add("age", "integer")
val csvDF = spark
  .readStream
  .option("sep", ";")
  .schema(userSchema)           // Specify the schema of the CSV files
  .csv("/path/to/directory")    // Equivalent to format("csv").load("/path/to/directory")
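
A rough sketch of the "apply it to streaming data" case, under several assumptions: the pipeline is trained on a small made-up batch DataFrame with a hypothetical label derived from "age", the input directory is a temporary one created just for the example, and results go to an in-memory table named `scored`. It only covers scoring a stream with an already fitted pipeline; as far as I know, calling fit directly on a streaming DataFrame is rejected by Spark, so training here is still a batch step.

```scala
import java.nio.file.Files
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType

// Local session for illustration only; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("lr-streaming-score")
  .getOrCreate()

// Hypothetical batch training data: a made-up label predicted from "age".
val training = spark.createDataFrame(Seq(
  (20, 0.0), (25, 0.0), (40, 1.0), (45, 1.0)
)).toDF("age", "label")

// Assemble "age" into a feature vector, then fit logistic regression (batch).
val assembler = new VectorAssembler().setInputCols(Array("age")).setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)
val model = new Pipeline().setStages(Array(assembler, lr)).fit(training)

// Temporary input directory holding one ";"-separated CSV file (assumption).
val inputDir = Files.createTempDirectory("lr-stream-input")
Files.write(inputDir.resolve("people.csv"), "alice;22\nbob;43\n".getBytes("UTF-8"))

// Streaming source with the same schema as in the snippet above.
val userSchema = new StructType().add("name", "string").add("age", "integer")
val csvDF = spark.readStream
  .option("sep", ";")
  .schema(userSchema)
  .csv(inputDir.toString)

// Score the stream with the fitted pipeline and collect rows in a memory sink.
val query = model.transform(csvDF)
  .select("name", "age", "prediction")
  .writeStream
  .format("memory")
  .queryName("scored")
  .outputMode("append")
  .start()

query.processAllAvailable()   // block until the file already present is processed
val results = spark.sql("SELECT * FROM scored")
results.show()
query.stop()
```

The memory sink is used only to keep the example small; a real job would write to a durable sink and keep the query running.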


Source: https://stackoverflow.com/questions/50026823/online-incremental-logistic-regression-in-spark
