Is there a limit on the number of classes in mllib NaiveBayes? Error calling model.save()

Posted by 夏天 on 2019-12-02 01:30:16

Hard limitations:

Number of features * Number of classes has to be lower than Integer.MAX_VALUE (2^31 - 1). You are nowhere near this value.
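As a quick sanity check, the hard limit can be written out as plain arithmetic. This is only a sketch: the function names are made up for illustration, and 262144 is used as an example feature count.

```python
# Integer.MAX_VALUE on the JVM is 2^31 - 1.
INT_MAX = 2**31 - 1


def fits_hard_limit(num_features: int, num_classes: int) -> bool:
    """True if features * classes stays strictly below Integer.MAX_VALUE."""
    return num_features * num_classes < INT_MAX


def max_classes(num_features: int) -> int:
    """Largest class count that still satisfies the hard limit."""
    return (INT_MAX - 1) // num_features


if __name__ == "__main__":
    # Hypothetical example: 262144 features and 1000 classes.
    print(fits_hard_limit(262144, 1000))
    print(max_classes(262144))
```

With 262144 features the limit is only reached at several thousand classes, so memory and serialization are likely to become a problem well before that.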

Soft limitations:

The theta matrix (conditional probabilities) is of size Number of features * Number of classes. Theta is stored locally on the driver (as a part of the model) and is also serialized and sent to the workers. This means that every machine needs at least enough memory to serialize or deserialize the matrix and store the result.

Since you use the default setting for HashingTF.numFeatures (2^18 = 262144 in pyspark.ml), each additional class adds 262144 values to theta. That is not much per class, but it adds up quickly. Based on the partial traceback you've posted, the failing component appears to be the Kryo serializer. The same traceback also suggests the solution: increase spark.kryoserializer.buffer.max.
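To get a feel for how large the buffer has to be, here is a rough estimate of the theta size. It assumes theta is held as dense 8-byte doubles and ignores any object overhead; the 1.5 headroom factor and the helper names are arbitrary choices for illustration, not something Spark prescribes. The 64m default and the requirement that the value stays below 2048m come from the Spark configuration documentation.

```python
import math

BYTES_PER_DOUBLE = 8
KRYO_DEFAULT_MIB = 64   # default value of spark.kryoserializer.buffer.max
KRYO_LIMIT_MIB = 2047   # the setting must stay below 2048m


def theta_mib(num_features: int, num_classes: int) -> float:
    """Approximate size of a dense theta matrix in MiB (assumption: 8-byte doubles)."""
    return num_features * num_classes * BYTES_PER_DOUBLE / 2**20


def suggest_kryo_buffer_max(num_features: int, num_classes: int,
                            headroom: float = 1.5) -> str:
    """Return a spark-defaults.conf style line with a buffer large enough for theta."""
    needed = math.ceil(theta_mib(num_features, num_classes) * headroom)
    if needed > KRYO_LIMIT_MIB:
        raise ValueError(
            f"about {needed} MiB needed, which exceeds what Kryo's buffer allows; "
            "reduce the number of features instead"
        )
    return f"spark.kryoserializer.buffer.max {max(needed, KRYO_DEFAULT_MIB)}m"


if __name__ == "__main__":
    # Hypothetical example: 262144 features and 100 classes.
    print(theta_mib(262144, 100))
    print(suggest_kryo_buffer_max(262144, 100))
```

Under these assumptions each class costs about 2 MiB, so a few dozen classes are already enough to outgrow the default buffer.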

You can also try using standard Java serialization by setting:

 spark.serializer org.apache.spark.serializer.JavaSerializer 

Since you use PySpark with pyspark.ml and pyspark.sql, this might be acceptable without a significant performance loss.

Configuration aside, I would focus on the feature engineering component. Using a binary CountVectorizer (see the note about HashingTF below) with ChiSqSelector might provide one way to both increase interpretability and effectively reduce the number of features. You may also consider more sophisticated approaches (determining feature importances and applying Naive Bayes only on a subset of the data, more advanced text processing like lemmatization / stemming, or using some variant of autoencoder to get a more compact vector representation).
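A minimal sketch of the CountVectorizer plus ChiSqSelector idea in pyspark.ml could look like the following. The column names ("tokens", "label"), the function name, and the default of 5000 selected features are placeholders chosen for illustration, not values from your job. The imports sit inside the function because constructing these stages needs pyspark and an active SparkSession.

```python
def build_pipeline(vocab_size=1 << 18, num_top_features=5000):
    """Build a CountVectorizer -> ChiSqSelector -> NaiveBayes pipeline.

    Assumes an active SparkSession and an input DataFrame with a
    "tokens" column (array of strings) and a numeric "label" column.
    """
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import NaiveBayes
    from pyspark.ml.feature import ChiSqSelector, CountVectorizer

    # Binary term presence instead of hashed term frequencies.
    vectorizer = CountVectorizer(
        inputCol="tokens",
        outputCol="raw_features",
        vocabSize=vocab_size,
        binary=True,
    )
    # Keep only the features most strongly associated with the label.
    selector = ChiSqSelector(
        numTopFeatures=num_top_features,
        featuresCol="raw_features",
        labelCol="label",
        outputCol="features",
    )
    classifier = NaiveBayes(featuresCol="features", labelCol="label")
    return Pipeline(stages=[vectorizer, selector, classifier])
```

Usage would be along the lines of model = build_pipeline().fit(train_df), where train_df is your own DataFrame. With the selector in place, theta has at most num_top_features columns per class, and the fitted CountVectorizerModel exposes a vocabulary, so selected indices can be mapped back to terms.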

Notes:

  • Please keep in mind that multinomial Naive Bayes considers only binary features. NaiveBayes will handle this internally, but I would still recommend using setBinary for clarity.
  • Arguably HashingTF is rather useless here. Hash collisions aside, the highly sparse and essentially meaningless features it produces make it a poor choice as a preprocessing step for NaiveBayes.
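To illustrate the point about hash collisions and meaningless features, here is a toy version of the hashing trick. It uses zlib.crc32 as a stand-in hash function (HashingTF uses a different one) and a deliberately tiny bucket count, so the numbers say nothing about real collision rates; it only shows the mechanism.

```python
import zlib
from collections import defaultdict


def bucket_terms(terms, num_buckets):
    """Map each term to a feature index as hash(term) mod num_buckets."""
    buckets = defaultdict(list)
    for term in terms:
        index = zlib.crc32(term.encode("utf-8")) % num_buckets
        buckets[index].append(term)
    return dict(buckets)


def collisions(buckets):
    """Return only the indices that are shared by more than one term."""
    return {index: terms for index, terms in buckets.items() if len(terms) > 1}


if __name__ == "__main__":
    # 20 made-up terms squeezed into 16 buckets: some indices must be shared.
    terms = [f"term{i}" for i in range(20)]
    print(collisions(bucket_terms(terms, 16)))
```

Once two terms share an index, the model cannot tell them apart, and a feature index cannot be mapped back to a single term without keeping a separate lookup table.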