Is there a limit on the number of classes in mllib NaiveBayes? Error calling model.save()

I am trying to train a model to predict the category of text input data. I am running into what seems to be numerical instability using the pyspark.ml.classification.NaiveBayes classifier on a bag-of-words when the number of classes is above a certain amount.

In my real world project, I have on the order of ~1bn records and ~50 classes. I am able to train my model and make predictions but I get an error when I try to save it using model.save(). Operationally, this is annoying since I have to retrain my model each time from scratch.

In trying to debug, I scaled my data down to about 10k rows and had the same issue when trying to save. However, saving works fine if I reduce the number of class labels.

This leads me to believe that there is a limit to the number of labels. I am not able to reproduce my exact issue, but the code below is related. If I set num_labels to anything greater than 31, nb.fit() throws an error.

My questions:

Is there a limit to the number of classes in the mllib implementation of NaiveBayes?
What could be some reasons that I am not able to save my model if I can successfully use it to make predictions?
If there is indeed a limit, would it be possible to split my data into groups of smaller classes, train separate models, and combine?

Full Working Example

Create some dummy data.

I'm going to use nltk.corpus.comparative_sentences and nltk.corpus.sentence_polarity. Keep in mind that this is just an illustrative example with nonsense data - I'm not concerned with the performance of the fitted model.

import pandas as pd
from pyspark.sql.types import StringType

# create some dummy data
from nltk.corpus import comparative_sentences as cs, sentence_polarity as sp
df = pd.DataFrame(
    {
        'sentence': [" ".join(s) for s in cs.sents() + sp.sents()]
    }
)

# assign a 'category' to each row
num_labels = 31  # seems to be the upper limit
df['category'] = (df.index%num_labels).astype(str)

# make it into a spark dataframe
spark_df = sqlCtx.createDataFrame(df)

Data Preparation Pipeline

from pyspark.ml.feature import NGram, Tokenizer, StopWordsRemover
from pyspark.ml.feature import HashingTF, IDF, StringIndexer, VectorAssembler
from pyspark.ml import Pipeline
from pyspark.ml.linalg import Vector

indexer = StringIndexer(inputCol='category', outputCol='label')
tokenizer = Tokenizer(inputCol="sentence", outputCol="sentence_tokens")
remove_stop_words = StopWordsRemover(inputCol="sentence_tokens", outputCol="filtered")
unigrammer = NGram(n=1, inputCol="filtered", outputCol="tokens") 
hashingTF = HashingTF(inputCol="tokens", outputCol="hashed_tokens")
idf = IDF(inputCol="hashed_tokens", outputCol="tf_idf_tokens")

clean_up = VectorAssembler(inputCols=['tf_idf_tokens'], outputCol='features')

data_prep_pipe = Pipeline(
    stages=[indexer, tokenizer, remove_stop_words, unigrammer, hashingTF, idf, clean_up]
)
transformed = data_prep_pipe.fit(spark_df).transform(spark_df)
clean_data = transformed.select(['label','features'])

Train the model

from pyspark.ml.classification import NaiveBayes
nb = NaiveBayes()
(training,testing) = clean_data.randomSplit([0.7,0.3], seed=12345)
model = nb.fit(training)
test_results = model.transform(testing)

Evaluate Model

from pyspark.ml.evaluation import MulticlassClassificationEvaluator
acc_eval = MulticlassClassificationEvaluator()
acc = acc_eval.evaluate(test_results)
print("Accuracy of model at predicting label was: {}".format(acc))

On my machine, this prints:

Accuracy of model at predicting label was: 0.0305764788269

Error Message

If I change num_labels to 32 or higher, this is the error I get when I call nb.fit():

Py4JJavaError: An error occurred while calling o1336.fit. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 86.0 failed 4 times, most recent failure: Lost task 0.3 in stage 86.0 (TID 1984, someserver.somecompany.net, executor 22): org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow. Available: 7, required: 8 Serialization trace: values (org.apache.spark.ml.linalg.DenseVector). To avoid this, increase spark.kryoserializer.buffer.max value. ... ... blah blah blah more java stuff that goes on forever

Notes

In this example, if I add a feature for bigrams, the error happens if num_labels > 15. I wonder if it is coincidence that this is also 1 less than a power of 2.
In my real-world project, I also get an error when trying to call model.theta. (I don't think the errors themselves are meaningful - they are just the exceptions passed back from the java/scala methods.)

Hard limitations:

Number of features * Number of classes has to be lower than Integer.MAX_VALUE (2³¹ - 1). You are nowhere near this value.
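
As a quick sanity check of the hard limit, the arithmetic can be sketched in plain Python. The feature count of 2¹⁸ is assumed to be the pyspark.ml HashingTF default, the class count of 50 comes from the question, and the helper names are made up for illustration.

```python
INT_MAX = 2**31 - 1  # Java's Integer.MAX_VALUE


def within_hard_limit(num_classes, num_features):
    """True if the theta matrix has fewer than Integer.MAX_VALUE entries."""
    return num_classes * num_features < INT_MAX


def max_classes(num_features):
    """Largest class count whose theta matrix stays under the limit."""
    return (INT_MAX - 1) // num_features


# 50 classes (from the question) with an assumed 2**18 hashed features
print(within_hard_limit(50, 1 << 18))
print(max_classes(1 << 18))
```

With 2¹⁸ features, the hard limit only comes into play beyond 8191 classes, so it is very unlikely to be the cause here.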

Soft limitations:

The theta matrix (conditional probabilities) is of size Number of features * Number of classes. Theta is stored locally on the driver (as part of the model) and is also serialized and sent to the workers. This means that every machine requires at least enough memory to serialize or deserialize it and to store the result.

Since you use the default setting for HashingTF.numFeatures (2¹⁸ in pyspark.ml), each additional class adds 262144 values to theta. That is not much on its own, but it quickly adds up. Based on the partial traceback you've posted, the failing component appears to be the Kryo serializer. The same traceback also suggests the solution, which is to increase spark.kryoserializer.buffer.max.
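
The soft limit can be made concrete with a back-of-the-envelope estimate. The sketch below assumes 8-byte doubles, 2¹⁸ hashed features per class, and a Kryo buffer ceiling of 64 MiB (which I believe is the default for spark.kryoserializer.buffer.max; check your own configuration). The function names are hypothetical, and the estimate ignores serialization overhead.

```python
BYTES_PER_DOUBLE = 8
DEFAULT_NUM_FEATURES = 1 << 18            # assumed HashingTF default in pyspark.ml
ASSUMED_KRYO_BUFFER_MAX = 64 * 1024 ** 2  # assumed default of 64m, in bytes


def dense_values_bytes(num_classes, num_features=DEFAULT_NUM_FEATURES):
    """Raw size of num_classes dense vectors of doubles, without overhead."""
    return num_classes * num_features * BYTES_PER_DOUBLE


def fits_in_buffer(num_classes, num_features=DEFAULT_NUM_FEATURES,
                   buffer_max=ASSUMED_KRYO_BUFFER_MAX):
    """Strict comparison: any serialization overhead needs at least one spare byte."""
    return dense_values_bytes(num_classes, num_features) < buffer_max


for k in (31, 32, 50):
    size_mib = dense_values_bytes(k) / 1024 ** 2
    print(k, "classes:", size_mib, "MiB, fits:", fits_in_buffer(k))
```

Under these assumptions 31 classes fit and 32 do not, which is consistent with the threshold reported in the question. If adding bigrams doubles the feature count to 2¹⁹ (a guess about that setup), the same estimate puts the threshold between 15 and 16 classes. Treat this as a plausible explanation rather than a confirmed diagnosis.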

You can also try using standard Java serialization by setting:

 spark.serializer org.apache.spark.serializer.JavaSerializer

Since you use PySpark with pyspark.ml and pyspark.sql, this might be acceptable without a significant performance loss.
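
Either option could be applied when the session is created, roughly as sketched below. The helper names, the application name, and the 512m value are placeholders rather than recommendations. Note that serializer settings are read when the SparkContext starts, so they have to be set before the session exists (or passed via spark-submit --conf); Spark also requires the Kryo buffer maximum to stay below 2048m.

```python
def serializer_conf(buffer_max="512m", use_java=False):
    """Return the Spark config entries for one of the two options above."""
    if use_java:
        return {"spark.serializer": "org.apache.spark.serializer.JavaSerializer"}
    return {"spark.kryoserializer.buffer.max": buffer_max}


def build_session(conf, app_name="nb-save-debug"):
    """Create a SparkSession with the given settings (requires pyspark)."""
    from pyspark.sql import SparkSession  # imported lazily so the helper above stays standalone

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Example: raise the Kryo buffer ceiling (value is a placeholder)
print(serializer_conf(buffer_max="512m"))
```

Calling build_session(serializer_conf()) in a fresh Python process should pick up the new value; in a notebook with a running session, the session would need to be stopped first.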

Configuration aside, I would focus on the feature engineering component. Using a binary CountVectorizer (see the note about HashingTF below) with ChiSqSelector might provide one way to both increase interpretability and effectively reduce the number of features. You may also consider more sophisticated approaches (determining feature importances and applying Naive Bayes only to a subset of the data, more advanced text processing like lemmatization / stemming, or using some variant of autoencoder to get a more compact vector representation).
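
A minimal sketch of that feature engineering idea follows. It assumes the "filtered" and "label" columns from the question's pipeline; the vocabulary size and the number of selected features are arbitrary illustrative values, and the function names are hypothetical. The stages would replace NGram, HashingTF, IDF and VectorAssembler, and need an active Spark session to be constructed.

```python
def build_feature_stages(vocab_size=50000, num_top_features=5000):
    """Binary bag-of-words followed by chi-squared feature selection."""
    from pyspark.ml.feature import CountVectorizer, ChiSqSelector  # requires pyspark

    vectorizer = CountVectorizer(
        inputCol="filtered", outputCol="counts",
        vocabSize=vocab_size, binary=True,  # 0/1 presence instead of term counts
    )
    selector = ChiSqSelector(
        numTopFeatures=num_top_features,
        featuresCol="counts", labelCol="label", outputCol="features",
    )
    return [vectorizer, selector]


def theta_entries(num_classes, num_features):
    """Number of conditional probabilities the model has to store."""
    return num_classes * num_features


# Effect on model size for 50 classes: hashed features vs. selected features
print(theta_entries(50, 1 << 18), "->", theta_entries(50, 5000))
```

In the question's pipeline this would look like Pipeline(stages=[indexer, tokenizer, remove_stop_words] + build_feature_stages()), with the indexer kept first because ChiSqSelector needs the label column.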

Notes:

Please keep in mind that multinomial Naive Bayes considers only binary features. NaiveBayes will handle this internally, but I would still recommend using setBinary for clarity.
Arguably, HashingTF is rather useless here. Hash collisions aside, the highly sparse and essentially meaningless features it produces make it a poor choice as a preprocessing step for NaiveBayes.

Source: https://stackoverflow.com/questions/48234474/is-there-a-limit-on-the-number-of-classes-in-mllib-naivebayes-error-calling-mod
