I am trying to train a model to predict the category of text input data. I am running into what seems to be numerical instability using the `pyspark.ml.classification.NaiveBayes` classifier.
Hard limitations:
Number of features * Number of classes has to be lower than `Integer.MAX_VALUE` (2^31 - 1). You are nowhere near this value.
Soft limitations:
The theta matrix (conditional probabilities) is of size Number of features * Number of classes. Theta is stored both locally on the driver (as part of the model) and serialized and sent to the workers. This means that every machine requires at least enough memory to serialize or deserialize and store the result.
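To put the soft limitation in numbers, here is a quick back-of-the-envelope estimate; the class count is an arbitrary example, not taken from your data:

```python
# Theta is a dense matrix of doubles (8 bytes each) that lives on the driver
# and is shipped to every executor.
num_features = 262144  # default pyspark.ml HashingTF.numFeatures (2^18)
num_classes = 200      # hypothetical number of categories

theta_bytes = num_features * num_classes * 8
print(f"theta alone: ~{theta_bytes / 1024 ** 2:.0f} MiB")  # ~400 MiB
```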
Since you use the default setting for `HashingTF.numFeatures` (2^18), each additional class adds 262144 values - it is not that much, but it quickly adds up. Based on the partial traceback you've posted, it looks like the failing component is the Kryo serializer. The same traceback also suggests the solution, which is increasing `spark.kryoserializer.buffer.max`.
You can also try using standard Java serialization by setting:
spark.serializer org.apache.spark.serializer.JavaSerializer
Since you use PySpark with `pyspark.ml` and `pyspark.sql`, it might be acceptable without a significant performance loss.
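Either option is plain Spark configuration. A minimal sketch, assuming you build the session from Python yourself (the app name and the 512m value are placeholders, not recommendations):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("nb-text-categories")  # placeholder name
    # Option 1: keep Kryo but enlarge its buffer; 512m is only an example value.
    .config("spark.kryoserializer.buffer.max", "512m")
    # Option 2: switch to plain Java serialization instead (uncomment, drop option 1).
    # .config("spark.serializer", "org.apache.spark.serializer.JavaSerializer")
    .getOrCreate()
)
```

Both settings have to be in place before the SparkContext is created, so pass them to the builder (or via spark-submit --conf) rather than changing them on a running session.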
Configuration aside, I would focus on the feature engineering component. Using a binary `CountVectorizer` (see the note about `HashingTF` below) with `ChiSqSelector` might provide one way to both increase interpretability and effectively reduce the number of features. You may also consider more sophisticated approaches (determining feature importances and applying Naive Bayes only to a subset of the data, more advanced text processing like lemmatization / stemming, or using some variant of an autoencoder to get a more compact vector representation).
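A minimal sketch of the CountVectorizer + ChiSqSelector route; the column names, the StringIndexer stage, and the numTopFeatures value are assumptions for illustration, not taken from your code:

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.feature import ChiSqSelector, CountVectorizer, StringIndexer, Tokenizer

# Assumed input: a DataFrame with a text column "text" and a string label column "category".
tokenizer = Tokenizer(inputCol="text", outputCol="tokens")
vectorizer = CountVectorizer(
    inputCol="tokens", outputCol="raw_features",
    binary=True,  # 0/1 term presence instead of raw counts
)
label_indexer = StringIndexer(inputCol="category", outputCol="label")
selector = ChiSqSelector(
    featuresCol="raw_features", labelCol="label", outputCol="features",
    numTopFeatures=5000,  # arbitrary example value - tune for your data
)
nb = NaiveBayes(featuresCol="features", labelCol="label", modelType="multinomial")

pipeline = Pipeline(stages=[tokenizer, vectorizer, label_indexer, selector, nb])
# model = pipeline.fit(train_df)  # train_df being your labelled training DataFrame
```

Shrinking the feature vector this way attacks the theta size directly, instead of only raising serializer limits.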
Notes:
- `NaiveBayes` will handle this internally, but I would still recommend using `setBinary` for clarity.
- `HashingTF` is rather useless here. Hash collisions aside, its highly sparse and essentially meaningless features make it a poor choice as a preprocessing step for `NaiveBayes`.