Handling continuous data in Spark NaiveBayes

Question

As per official documentation of Spark NaiveBayes:

It supports Multinomial NB (see here) which can handle finitely supported discrete data.

How can I handle continuous data (for example, the percentage of some in some document) in Spark NaiveBayes?

Answer 1:

The current implementation can process only binary features, so for good results you'll have to discretize and encode your data. For discretization you can use either Bucketizer or QuantileDiscretizer. The former is less expensive, because it applies split points you supply instead of computing them from the data, and it might be a better fit when you want to use some domain-specific knowledge.

For encoding you can use dummy encoding with OneHotEncoder, with the dropLast Param adjusted.
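To make the two steps concrete, here is a minimal pure-Python sketch of what discretization followed by dummy encoding does to a single value. It is not Spark code: `bucketize` and `one_hot` are hypothetical helpers written only to imitate the documented behaviour of Bucketizer (half-open buckets, with the last bucket also including its upper bound) and of OneHotEncoder's dropLast Param, and the split points are made up.

```python
from bisect import bisect_right


def bucketize(value, splits):
    """Return i such that splits[i] <= value < splits[i + 1].

    The last bucket also includes its upper bound.
    """
    if not splits[0] <= value <= splits[-1]:
        raise ValueError(f"{value} is outside [{splits[0]}, {splits[-1]}]")
    if value == splits[-1]:
        return len(splits) - 2
    return bisect_right(splits, value) - 1


def one_hot(index, size, drop_last=False):
    """Encode a bucket index as a 0/1 vector.

    With drop_last=True the vector has size - 1 slots and the last
    category is represented by all zeros.
    """
    width = size - 1 if drop_last else size
    vector = [0.0] * width
    if index < width:
        vector[index] = 1.0
    return vector


# Hypothetical split points for a percentage feature: four buckets.
splits = [0.0, 25.0, 50.0, 75.0, 100.0]

bucket = bucketize(37.5, splits)             # 1
encoded = one_hot(bucket, len(splits) - 1)   # [0.0, 1.0, 0.0, 0.0]
```

With `drop_last=True` the last bucket would be encoded as an all-zero vector. Keeping every indicator is one plausible reading of "adjusted" here, though the answer does not say which value to use.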

So overall you'll need:

QuantileDiscretizer or Bucketizer -> OneHotEncoder for each continuous feature.
StringIndexer* -> OneHotEncoder for each discrete feature.
VectorAssembler to combine all of the above.

* Or predefined column metadata.

Source: https://stackoverflow.com/questions/45626754/handling-continuous-data-in-spark-naivebayes

Tags

apache-spark

apache-spark-mllib

naivebayes
