Real-time data standardization / normalization with Spark structured streaming

Submitted by 大憨熊 on 2020-01-13 19:00:46

Question


Standardizing or normalizing data is an essential, often crucial, step when implementing machine learning algorithms. Doing so in real time with Spark Structured Streaming is a problem I have been trying to tackle for the past couple of weeks.

Using the StandardScaler estimator, (value(i) - mean) / standard deviation, on historical data worked well, and in my use case it gives the most reasonable clustering results. However, I'm not sure how to fit a StandardScaler model on real-time data, because Structured Streaming does not allow it. Any advice would be highly appreciated!
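For reference, the formula above can be sketched in plain Python. This is only an illustration of the arithmetic, not Spark code: the `standardize` helper and the sample values are made up for this example. It uses the sample standard deviation (n - 1 denominator), which, as far as I know, is also what Spark's StandardScaler uses; in Spark the mean-centering part has to be switched on explicitly with `withMean=True`.

```python
import statistics


def standardize(values):
    """Return (value - mean) / standard deviation for every value."""
    mean = statistics.mean(values)
    # Sample standard deviation, i.e. with an (n - 1) denominator.
    std = statistics.stdev(values)
    return [(v - mean) / std for v in values]


# Hypothetical historical feature values.
historical = [2.0, 4.0, 6.0]
print(standardize(historical))  # [-1.0, 0.0, 1.0]
```

Note that the sketch divides by the standard deviation without a guard, so a constant column (standard deviation of zero) would need special handling.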

In other words, how can models be fitted in Spark Structured Streaming?


Answer 1:


I found an answer to this. At the moment it is not possible to do real-time machine learning with Spark Structured Streaming, including normalization. However, for some algorithms, making real-time predictions is possible if a model was built/fitted offline.
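The "fit offline, apply in real time" idea can be sketched without Spark. In the minimal example below, the scaling statistics are computed once from historical data and then reused, unchanged, on every incoming micro-batch. `FrozenScaler`, `scale_stream`, and the sample numbers are hypothetical names and values chosen for illustration; they are not part of any Spark API.

```python
import statistics
from typing import Iterable, Iterator, List


class FrozenScaler:
    """Hypothetical helper: statistics are fitted once, offline, then reused."""

    def __init__(self, historical: List[float]) -> None:
        # Fitting happens here, on static historical data only.
        self.mean = statistics.mean(historical)
        self.std = statistics.stdev(historical)

    def transform(self, value: float) -> float:
        # Applying the frozen statistics needs no further fitting.
        return (value - self.mean) / self.std


def scale_stream(
    scaler: FrozenScaler, batches: Iterable[List[float]]
) -> Iterator[List[float]]:
    """Scale each incoming micro-batch with the pre-fitted statistics."""
    for batch in batches:
        yield [scaler.transform(v) for v in batch]


scaler = FrozenScaler([10.0, 20.0, 30.0])  # offline fit: mean 20, std 10
incoming = [[20.0, 40.0], [0.0]]           # stand-in for streaming micro-batches
for scaled in scale_stream(scaler, incoming):
    print(scaled)
# [0.0, 2.0]
# [-2.0]
```

The trade-off is that the statistics never adapt: if the live data drifts away from the historical distribution, the scaler has to be refitted offline and swapped in.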

Check:

JIRA - Add support for Structured Streaming to the ML Pipeline API

Google DOC - Machine Learning on Structured Streaming



Source: https://stackoverflow.com/questions/44074903/real-time-data-standardization-normalization-with-spark-structured-streaming