Is there a limit on the number of classes in mllib NaiveBayes? Error calling model.save()

花落未央 2021-01-22 00:22

I am trying to train a model to predict the category of text input data. I am running into what seems to be numerical instability using the pyspark.ml.classification.Naive

1 answer
  • 2021-01-22 00:58

    Hard limitations:

    Number of features * Number of classes has to be lower than Integer.MAX_VALUE (2^31 - 1). You are nowhere near this value.
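As a quick sanity check of the hard limit, here is a minimal sketch in plain Python. The helper names and the class count of 1000 are hypothetical, chosen only for illustration; 2**18 (262144) is the feature count discussed in this answer.

```python
# Largest value of a signed 32-bit Java int (Integer.MAX_VALUE).
INT_MAX = 2**31 - 1


def within_hard_limit(num_features: int, num_classes: int) -> bool:
    """True if num_features * num_classes stays below Integer.MAX_VALUE."""
    return num_features * num_classes < INT_MAX


def max_classes(num_features: int) -> int:
    """Largest class count that keeps the product strictly below INT_MAX."""
    return (INT_MAX - 1) // num_features


# 1000 classes is an illustrative number, not taken from the question.
print(within_hard_limit(2**18, 1000))  # True
print(max_classes(2**18))              # 8191
```

With 262144 features, the limit is only reached at several thousand classes, which is why the soft limitations tend to matter first.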

    Soft limitations:

    The theta matrix (conditional probabilities) is of size Number of features * Number of classes. Theta is stored locally on the driver (as a part of the model) and is also serialized and sent to the workers. This means that all machines require at least enough memory to serialize or deserialize it and store the result.
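To get a feel for the memory involved, the raw size of theta can be estimated as below. This is a rough sketch that assumes a dense matrix of 8-byte doubles and ignores serialization overhead and temporary copies, so treat the result as a lower bound. The function name and the class count of 500 are hypothetical.

```python
BYTES_PER_DOUBLE = 8  # assumption: theta is a dense matrix of 64-bit floats


def theta_size_mib(num_features: int, num_classes: int) -> float:
    """Raw size of a dense num_classes x num_features matrix, in MiB."""
    return num_features * num_classes * BYTES_PER_DOUBLE / 2**20


print(theta_size_mib(2**18, 1))    # 2.0 MiB per class with 262144 features
print(theta_size_mib(2**18, 500))  # 1000.0 MiB for a hypothetical 500 classes
```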

    Since you use the default HashingTF.numFeatures (2^18 = 262144 in pyspark.ml), each additional class adds 262144 values to theta. That is not much on its own, but it quickly adds up. Based on the partial traceback you've posted, it looks like the failing component is the Kryo serializer. The same traceback also suggests the solution, which is increasing spark.kryoserializer.buffer.max.
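One way to raise the Kryo buffer limit from PySpark is sketched below. The helper names and the 512 MiB value are illustrative assumptions, not values from the question. The setting has to be in place before the session is created, and Spark requires it to be below 2048m.

```python
def kryo_conf(buffer_max_mib: int) -> dict:
    """Build the config entry that raises the Kryo buffer ceiling."""
    if not 0 < buffer_max_mib < 2048:
        raise ValueError("spark.kryoserializer.buffer.max must be below 2048m")
    return {"spark.kryoserializer.buffer.max": f"{buffer_max_mib}m"}


def build_session(conf: dict):
    """Create a SparkSession with the given settings (requires pyspark)."""
    # Imported lazily so the helper above can be used without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# 512 MiB is an arbitrary example; size it to your serialized model.
print(kryo_conf(512))  # {'spark.kryoserializer.buffer.max': '512m'}
```

Calling build_session(kryo_conf(512)) in a fresh process would then start the session with the larger buffer. If a session already exists, getOrCreate returns it and the new value will likely not take effect.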

    You can also try using standard Java serialization by setting:

     spark.serializer org.apache.spark.serializer.JavaSerializer 
    

    Since you use PySpark with pyspark.ml and pyspark.sql, switching to Java serialization might be acceptable without significant performance loss.

    Configuration aside, I would focus on the feature engineering component. Using a binary CountVectorizer (see the note about HashingTF below) with ChiSqSelector might provide one way to both increase interpretability and effectively reduce the number of features. You may also consider more sophisticated approaches (determining feature importances and applying Naive Bayes only on a subset of data, more advanced text processing like lemmatization / stemming, or using some variant of autoencoder to get a more compact vector representation).
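The CountVectorizer plus ChiSqSelector suggestion could look roughly like the following. This is a sketch under stated assumptions, not a tuned pipeline: the column names "text" and "label", the function name, and the default of 5000 selected features are all placeholders, and "label" is assumed to already hold numeric class indices.

```python
def build_pipeline(num_top_features: int = 5000):
    """Tokenizer -> binary CountVectorizer -> ChiSqSelector -> NaiveBayes.

    Requires pyspark and an active SparkSession; column names and the
    default num_top_features are placeholders for illustration.
    """
    # Imported lazily so that defining this function does not need pyspark.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import NaiveBayes
    from pyspark.ml.feature import ChiSqSelector, CountVectorizer, Tokenizer

    # Split raw text into lowercase, whitespace-separated tokens.
    tokenizer = Tokenizer(inputCol="text", outputCol="tokens")
    # binary=True records term presence (0/1) instead of term counts.
    vectorizer = CountVectorizer(
        inputCol="tokens", outputCol="raw_features", binary=True
    )
    # Keep only the features most associated with the label (chi-squared test).
    selector = ChiSqSelector(
        numTopFeatures=num_top_features,
        featuresCol="raw_features",
        outputCol="features",
        labelCol="label",
    )
    nb = NaiveBayes(featuresCol="features", labelCol="label")
    return Pipeline(stages=[tokenizer, vectorizer, selector, nb])
```

With a DataFrame train_df that has those two columns, build_pipeline().fit(train_df) would train the whole chain. Theta then has num_top_features columns instead of 262144, which shrinks the model that has to be serialized.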

    Notes:

    • Please keep in mind that multinomial Naive Bayes considers only binary features. NaiveBayes will handle this internally, but I would still recommend using setBinary for clarity.
    • Arguably, HashingTF is rather useless here. Hash collisions aside, its highly sparse and essentially meaningless features make it a poor choice as a preprocessing step for NaiveBayes.