Real-time data standardization / normalization with Spark structured streaming

Question

Standardizing or normalizing data is an essential step when implementing machine learning algorithms. Doing so in real time with Spark Structured Streaming is a problem I've been trying to tackle for the past couple of weeks.

Using the StandardScaler estimator, (value(i) - mean) / standard deviation, on historical data worked well, and in my use case it gives the most reasonable clustering results. However, I'm not sure how to fit a StandardScaler model on real-time data; Structured Streaming does not allow it. Any advice would be highly appreciated!

In other words, how do I fit models in Spark Structured Streaming?

Answer 1:

I got an answer for this. At the moment it is not possible to do real-time machine learning with Spark Structured Streaming, including normalization; however, for some algorithms, making real-time predictions is possible if a model was built/fitted offline.
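To make the "fit offline, apply in real time" idea concrete, here is a minimal sketch in plain Python. It is not Spark code and does not use any Spark API: `fit_scaler`, `scale_stream`, and the sample data are hypothetical names and values chosen for illustration. It assumes the scaling statistics (per-feature mean and sample standard deviation, dividing by n - 1) are computed once from historical data and then reused unchanged for every incoming record.

```python
import math
from typing import Iterable, Iterator, List, Tuple


def fit_scaler(history: List[List[float]]) -> Tuple[List[float], List[float]]:
    """Offline step: compute per-feature mean and sample standard deviation."""
    n = len(history)
    if n < 2:
        raise ValueError("need at least two historical rows")
    dims = len(history[0])
    means = [sum(row[j] for row in history) / n for j in range(dims)]
    stds = [
        math.sqrt(sum((row[j] - means[j]) ** 2 for row in history) / (n - 1))
        for j in range(dims)
    ]
    return means, stds


def scale_stream(
    records: Iterable[List[float]], means: List[float], stds: List[float]
) -> Iterator[List[float]]:
    """Online step: apply the frozen statistics to each record as it arrives."""
    for row in records:
        # (value - mean) / std; a constant feature (std == 0) is mapped to 0.0
        yield [
            (x - m) / s if s > 0 else 0.0
            for x, m, s in zip(row, means, stds)
        ]


# Hypothetical historical data used for the offline fit.
history = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
means, stds = fit_scaler(history)

# Hypothetical incoming records; an iterator stands in for the stream.
incoming = iter([[2.0, 20.0], [4.0, 40.0]])
scaled = list(scale_stream(incoming, means, stds))
print(scaled)  # [[0.0, 0.0], [2.0, 2.0]]
```

The trade-off of this design is that the statistics go stale if the live data drifts away from the historical data, so the offline fit would need to be repeated periodically.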

Check:

JIRA - Add support for Structured Streaming to the ML Pipeline API

Google DOC - Machine Learning on Structured Streaming

Source: https://stackoverflow.com/questions/44074903/real-time-data-standardization-normalization-with-spark-structured-streaming

Tags

apache-spark