naivebayes

Spark: How to get probabilities and AUC for Bernoulli Naive Bayes?

ε祈祈猫儿з submitted on 2019-12-02 05:14:57
Question: I'm running a Bernoulli Naive Bayes using this code: val splits = MyData.randomSplit(Array(0.75, 0.25), seed = 2L) val training = splits(0).cache() val test = splits(1) val model = NaiveBayes.train(training, lambda = 3.0, modelType = "bernoulli") My question is: how can I get the probability of membership in class 0 (or 1) and compute the AUC? I want to get a result similar to LogisticRegressionWithSGD or SVMWithSGD, where I was using this code: val numIterations = 100 val model = SVMWithSGD.train(training,

Why did NLTK NaiveBayes classifier misclassify one record?

半腔热情 submitted on 2019-12-02 04:34:23
This is the first time I am building a sentiment analysis machine learning model using the nltk NaiveBayesClassifier in Python. I know it is too simple a model, but it is just a first step for me and I will try tokenized sentences next time. The real issue I have with my current model is this: I have clearly labeled the word 'bad' as negative in the training data set (as you can see from the 'negative_vocab' variable). However, when I ran the NaiveBayesClassifier on each sentence (lower case) in the list ['awesome movie', ' i like it', ' it is so bad'], the classifier mistakenly labeled 'it is

dimension mismatch error in CountVectorizer MultinomialNB

有些话、适合烂在心里 submitted on 2019-12-02 04:17:23
Question: Before I post this question, I have to say I've thoroughly read more than 15 similar topics on this board, each with somewhat different recommendations, but none of them solved my problem. Ok, so I split my 'spam email' text data (originally in csv format) into training and test sets, using CountVectorizer and its 'fit_transform' function to fit the vocabulary of the corpus and extract word count features from the text. Then I applied MultinomialNB() to learn from the training set and

SPARK ML, Naive Bayes classifier: high probability prediction for one class

孤街浪徒 submitted on 2019-12-02 03:49:28
Question: I am using Spark ML to optimise a Naive Bayes multi-class classifier. I have about 300 categories and I am classifying text documents. The training set is balanced enough and there are about 300 training examples for each category. All looks good and the classifier is working with acceptable precision on unseen documents. But what I am noticing is that when classifying a new document, very often, the classifier assigns a high probability to one of the categories (the prediction probability is

Is there a limit on the number of classes in mllib NaiveBayes? Error calling model.save()

假如想象 submitted on 2019-12-02 02:46:23
Question: I am trying to train a model to predict the category of text input data. I am running into what seems to be numerical instability using the pyspark.ml.classification.NaiveBayes classifier on a bag-of-words when the number of classes is above a certain number. In my real world project, I have on the order of ~1bn records and ~50 classes. I am able to train my model and make predictions, but I get an error when I try to save it using model.save(). Operationally, this is annoying since I have to

Is there a limit on the number of classes in mllib NaiveBayes? Error calling model.save()

a 夏天 submitted on 2019-12-02 01:30:16
I am trying to train a model to predict the category of text input data. I am running into what seems to be numerical instability using the pyspark.ml.classification.NaiveBayes classifier on a bag-of-words when the number of classes is above a certain number. In my real world project, I have on the order of ~1bn records and ~50 classes. I am able to train my model and make predictions, but I get an error when I try to save it using model.save(). Operationally, this is annoying since I have to retrain my model each time from scratch. In trying to debug, I scaled my data down to around 10k rows

SPARK ML, Naive Bayes classifier: high probability prediction for one class

浪尽此生 submitted on 2019-12-02 00:52:23
I am using Spark ML to optimise a Naive Bayes multi-class classifier. I have about 300 categories and I am classifying text documents. The training set is balanced enough and there are about 300 training examples for each category. All looks good and the classifier is working with acceptable precision on unseen documents. But what I am noticing is that when classifying a new document, very often, the classifier assigns a high probability to one of the categories (the prediction probability is almost equal to 1), while the other categories receive very low probabilities (close to zero). What are
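One general property of Naive Bayes may be relevant to the behaviour described above: the model multiplies per-word likelihoods as if the words were independent, so small per-word preferences compound quickly over a long document. The sketch below is a toy illustration in plain Python, not Spark code. The function name and the likelihood ratio of 1.2 are assumptions made up for the example, and it considers only two equally likely classes.

```python
import math

def posterior_for_favoured_class(n_words, likelihood_ratio=1.2):
    """Posterior of class A versus class B with equal priors.

    Hypothetical setup: every one of the n_words words is
    `likelihood_ratio` times more probable under class A than under B.
    """
    # Under the independence assumption the log-odds simply add up per word.
    log_odds = n_words * math.log(likelihood_ratio)
    # Convert log-odds to a probability with the logistic function.
    return 1.0 / (1.0 + math.exp(-log_odds))

# The posterior moves toward 1 as the document gets longer.
for n in (1, 10, 100):
    print(n, round(posterior_for_favoured_class(n), 6))
```

With a mild per-word preference the posterior for a single word stays close to 0.5, while for a document of a hundred such words it is practically 1. Probabilities from Naive Bayes are therefore often better read as a ranking than as calibrated confidence.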

dimension mismatch error in CountVectorizer MultinomialNB

对着背影说爱祢 submitted on 2019-12-01 22:53:46
Before I post this question, I have to say I've thoroughly read more than 15 similar topics on this board, each with somewhat different recommendations, but none of them solved my problem. Ok, so I split my 'spam email' text data (originally in csv format) into training and test sets, using CountVectorizer and its 'fit_transform' function to fit the vocabulary of the corpus and extract word count features from the text. Then I applied MultinomialNB() to learn from the training set and predict on the test set. Here is my code (simplified): from sklearn.feature_extraction.text import
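The code in this snippet is cut off, so the actual cause of the error is not shown. One frequent source of a dimension mismatch in this kind of pipeline is fitting a second vocabulary on the test text, which gives the test matrix a different number of columns from the training matrix. The sketch below illustrates the idea in plain Python with hypothetical helpers `fit_vocabulary` and `transform`. In scikit-learn the corresponding pattern is to call `fit_transform` on the training text and only `transform` on the test text.

```python
def fit_vocabulary(docs):
    """Build a word -> column index mapping from the training documents only."""
    words = sorted({w for doc in docs for w in doc.lower().split()})
    return {w: i for i, w in enumerate(words)}

def transform(docs, vocab):
    """Count words using an existing vocabulary; unseen words are ignored."""
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for w in doc.lower().split():
            if w in vocab:
                row[vocab[w]] += 1
        rows.append(row)
    return rows

# Toy data, made up for the example.
train_docs = ["free money now", "meeting at noon"]
test_docs = ["free lunch at noon"]

vocab = fit_vocabulary(train_docs)      # fitted once, on training text
X_train = transform(train_docs, vocab)
X_test = transform(test_docs, vocab)    # same columns as X_train
```

Because both matrices share one vocabulary, a classifier trained on `X_train` sees the same number of features at prediction time. The word "lunch" never appeared in training, so it is simply dropped.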

Spark: How to get probabilities and AUC for Bernoulli Naive Bayes?

拟墨画扇 submitted on 2019-12-01 22:46:27
I'm running a Bernoulli Naive Bayes using this code: val splits = MyData.randomSplit(Array(0.75, 0.25), seed = 2L) val training = splits(0).cache() val test = splits(1) val model = NaiveBayes.train(training, lambda = 3.0, modelType = "bernoulli") My question is: how can I get the probability of membership in class 0 (or 1) and compute the AUC? I want to get a result similar to LogisticRegressionWithSGD or SVMWithSGD, where I was using this code: val numIterations = 100 val model = SVMWithSGD.train(training, numIterations) model.clearThreshold() // Compute raw scores on the test set. val labelAndPreds = test
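The two quantities asked about here can be illustrated independently of Spark. The following is a minimal plain-Python sketch, not the Spark API: `bernoulli_nb_posterior` and `auc` are hypothetical helper names, and `priors` and `theta` stand in for the parameters of an already trained Bernoulli model. The posterior is computed in log space and normalised, and the AUC is computed as the fraction of positive/negative pairs that the score ranks correctly, with ties counted as half.

```python
import math

def bernoulli_nb_posterior(x, priors, theta):
    """Return [P(class c | x)] for a binary feature vector x.

    priors[c] is P(class c); theta[c][j] is P(feature j = 1 | class c).
    Both are assumed inputs and must lie strictly between 0 and 1.
    """
    log_scores = []
    for prior, probs in zip(priors, theta):
        s = math.log(prior)
        for xj, p in zip(x, probs):
            s += math.log(p) if xj else math.log(1.0 - p)
        log_scores.append(s)
    # Normalise with the log-sum-exp trick to avoid underflow.
    m = max(log_scores)
    exps = [math.exp(s - m) for s in log_scores]
    total = sum(exps)
    return [e / total for e in exps]

def auc(labels, scores):
    """Area under the ROC curve from 0/1 labels and class-1 scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

To use it, score every test example with the class-1 entry of `bernoulli_nb_posterior` and pass those scores together with the true labels to `auc`. The pairwise loop is quadratic in the number of examples, so it is meant for illustration rather than for large test sets.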

How to get feature Importance in naive bayes?

时间秒杀一切 submitted on 2019-11-30 20:13:53
I have a dataset of reviews with a class label of positive/negative. I am applying Naive Bayes to that reviews dataset. First, I convert the text into a bag of words. Here sorted_data['Text'] is the reviews and final_counts is a sparse matrix: count_vect = CountVectorizer() final_counts = count_vect.fit_transform(sorted_data['Text'].values) I then split the data into train and test datasets: X_1, X_test, y_1, y_test = cross_validation.train_test_split(final_counts, labels, test_size=0.3, random_state=0) I apply the naive bayes algorithm as follows: optimal_alpha = 1 NB_optimal = BernoulliNB
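Naive Bayes has no built-in importance score, but one common heuristic is to rank words by how much more likely they are under one class than under the other. The sketch below shows that idea in plain Python on toy data. `rank_features` is a hypothetical helper, the documents are made up, and the smoothing mirrors the usual Bernoulli estimate with `alpha` added to the counts. With a fitted scikit-learn model, the per-class log-probabilities are available in its `feature_log_prob_` attribute and can be compared in the same way.

```python
import math

def rank_features(pos_docs, neg_docs, alpha=1.0):
    """Rank words by log P(word | positive) - log P(word | negative).

    Each document is a set of words (Bernoulli view: present or absent).
    """
    vocab = sorted(set().union(*pos_docs, *neg_docs))

    def theta(docs, word):
        # Smoothed fraction of documents in the class that contain the word.
        count = sum(1 for d in docs if word in d)
        return (count + alpha) / (len(docs) + 2.0 * alpha)

    scored = [
        (w, math.log(theta(pos_docs, w)) - math.log(theta(neg_docs, w)))
        for w in vocab
    ]
    # Most positive-leaning words first, most negative-leaning last.
    return sorted(scored, key=lambda t: t[1], reverse=True)

# Toy reviews, made up for the example.
pos = [{"good", "movie"}, {"good", "fun"}]
neg = [{"bad", "movie"}, {"bad", "boring"}]
ranked = rank_features(pos, neg)
```

Words at the top of `ranked` push a review toward the positive class and words at the bottom push it toward the negative class. Words near zero, such as one that appears equally often in both classes, carry little information.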