Implementing Bag-of-Words Naive-Bayes classifier in NLTK

无人共我 2020-12-02 07:44

I basically have the same question as this guy. The example in the NLTK book for the Naive Bayes classifier considers only whether a word occurs in a document as a feature.

3 Answers
  • 2020-12-02 08:07
    • Put the string you are looking at into a list, broken into words.
    • For each item in the list, ask: is this item a feature I have in my feature list?
    • If it is, add the log prob as normal; if not, ignore it.

    If your sentence has the same word multiple times, it will just add the probs multiple times. If the word appears multiple times in the same class, your training data should reflect that in the word count.
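
    The steps above can be sketched as a small hand-written classifier. This is only a minimal illustration, not a tuned implementation: the names `train` and `classify`, the toy training data, and the add-one smoothing are all assumptions that are not part of the answer itself.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Count words per class. labeled_docs: iterable of (list_of_words, label)."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    doc_counts = Counter()              # label -> number of documents
    for words, label in labeled_docs:
        doc_counts[label] += 1
        word_counts[label].update(words)  # repeated words are counted each time
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab


def classify(words, model):
    """Return the label with the highest summed log probability."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in doc_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)  # log prior
        for w in words:
            if w not in vocab:
                continue  # not a known feature: ignore it
            # Add-one smoothing (an assumption); added once per occurrence
            score += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical toy training data
model = train([
    ("great great fun".split(), "pos"),
    ("boring awful plot".split(), "neg"),
])
print(classify("great fun but unknownword".split(), model))  # prints: pos
```

    Because the loop runs over every token, a word that occurs twice in the input contributes its log probability twice, and words that were never seen in training are skipped.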

    For added accuracy, count all bigrams, trigrams, etc. as separate features.
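
    As a rough sketch of what that could look like, the helper below turns a word list into unigram, bigram, and trigram features. The function names and the choice to join each n-gram into a single string are assumptions made for illustration; NLTK also ships `nltk.util.ngrams`, which could be used instead of the hand-written `ngrams` here.

```python
def ngrams(words, n):
    """Return the list of n-grams (as tuples) in a list of words."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def ngram_features(words, max_n=3):
    """Unigrams, bigrams, ... up to max_n, each joined into one feature string."""
    features = []
    for n in range(1, max_n + 1):
        features.extend(" ".join(gram) for gram in ngrams(words, n))
    return features


print(ngram_features("not very good".split()))
# prints: ['not', 'very', 'good', 'not very', 'very good', 'not very good']
```

    The resulting strings can be counted during training exactly like single words.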

    It helps to manually write your own classifiers so that you understand exactly what is happening and what you need to do to improve accuracy. If you use a pre-packaged solution and it doesn't work well enough, there is not much you can do about it.

  • 2020-12-02 08:08

    The features in the NLTK bayes classifier are "nominal", not numeric. This means they can take a finite number of discrete values (labels), but they can't be treated as frequencies.

    So with the Bayes classifier, you cannot directly use word frequency as a feature. You could do something like use the 50 most frequent words from each text as your feature set, but that's quite a different thing.
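
    To make the idea concrete, here is a small sketch of two ways to build nominal feature dictionaries from a word list. The first follows the "most frequent words" suggestion; the second, binning counts into coarse labels, is an additional assumption and not something this answer proposes. The function names, feature-name formats, and bin boundaries are all hypothetical. Dictionaries like these can be paired with labels and passed to `nltk.NaiveBayesClassifier.train`.

```python
from collections import Counter


def top_words_features(words, k=50):
    """Nominal features: which words are among the k most frequent in this text."""
    return {f"top({w})": True for w, _ in Counter(words).most_common(k)}


def binned_count_features(words, vocabulary):
    """Nominal features: each vocabulary word's count mapped to a coarse label."""
    counts = Counter(words)

    def bin_label(c):
        if c == 0:
            return "none"
        return "once" if c == 1 else "many"

    return {f"count({w})": bin_label(counts[w]) for w in vocabulary}


words = "dull dull dull plot plot acting".split()
print(top_words_features(words, k=2))
# prints: {'top(dull)': True, 'top(plot)': True}
print(binned_count_features(words, ["dull", "acting", "great"]))
# prints: {'count(dull)': 'many', 'count(acting)': 'once', 'count(great)': 'none'}
```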

    But maybe there are other classifiers in the NLTK that depend on frequency. I wouldn't know, but have you looked? I'd say it's worth checking out.

  • 2020-12-02 08:19

    scikit-learn has an implementation of multinomial naive Bayes, which is the right variant of naive Bayes in this situation. A support vector machine (SVM) would probably work better, though.
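
    For a sense of how little code the scikit-learn route needs, here is a minimal sketch that feeds raw word counts from `CountVectorizer` into `MultinomialNB`. The toy texts and labels are made up for illustration, and the example assumes scikit-learn is installed. Swapping `MultinomialNB()` for an SVM such as `sklearn.svm.LinearSVC()` would be one way to try the alternative mentioned above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data; a real experiment would use a proper corpus
train_texts = ["great great fun", "good fun film",
               "boring awful plot", "awful awful acting"]
train_labels = ["pos", "pos", "neg", "neg"]

# CountVectorizer produces word counts, so repeated words carry more weight
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["great fun", "awful plot"]))
```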

    As Ken pointed out in the comments, NLTK has a nice wrapper for scikit-learn classifiers. Modified from the docs, here's a somewhat complicated one that does TF-IDF weighting, chooses the 1000 best features based on a chi2 statistic, and then passes that into a multinomial naive Bayes classifier. (I bet this is somewhat clumsy, as I'm not super familiar with either NLTK or scikit-learn.)

    import numpy as np
    from nltk.probability import FreqDist
    from nltk.classify import SklearnClassifier
    from nltk.corpus import movie_reviews  # needs nltk.download('movie_reviews')
    from sklearn.feature_extraction.text import TfidfTransformer
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import Pipeline
    
    # TF-IDF weighting -> keep the 1000 best features by chi2 -> multinomial NB
    pipeline = Pipeline([('tfidf', TfidfTransformer()),
                         ('chi2', SelectKBest(chi2, k=1000)),
                         ('nb', MultinomialNB())])
    classif = SklearnClassifier(pipeline)
    
    # One word-frequency distribution per review
    pos = [FreqDist(movie_reviews.words(i)) for i in movie_reviews.fileids('pos')]
    neg = [FreqDist(movie_reviews.words(i)) for i in movie_reviews.fileids('neg')]
    
    def add_label(lst, lab):
        """Pair every feature set in lst with the label lab."""
        return [(x, lab) for x in lst]
    
    # Train on the first 100 reviews of each class
    classif.train(add_label(pos[:100], 'pos') + add_label(neg[:100], 'neg'))
    
    # Classify the remaining reviews of each class
    l_pos = np.array(classif.classify_many(pos[100:]))
    l_neg = np.array(classif.classify_many(neg[100:]))
    # Rows: true class (pos, neg); columns: predicted class (pos, neg)
    print("Confusion matrix:\n%d\t%d\n%d\t%d" % (
              (l_pos == 'pos').sum(), (l_pos == 'neg').sum(),
              (l_neg == 'pos').sum(), (l_neg == 'neg').sum()))
    

    This printed for me:

    Confusion matrix:
    524     376
    202     698
    

    Not perfect, but decent, considering it's not a super easy problem and it's only trained on 100/100.
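
    To put a number on "decent", the usual summary metrics can be worked out directly from the matrix above. This is just arithmetic on the printed figures, reading rows as the true class and columns as the predicted class; the variable names are illustrative.

```python
# Rows: true class (pos, neg); columns: predicted class (pos, neg).
confusion = [[524, 376],
             [202, 698]]

total = sum(sum(row) for row in confusion)
accuracy = (confusion[0][0] + confusion[1][1]) / total
recall_pos = confusion[0][0] / sum(confusion[0])
recall_neg = confusion[1][1] / sum(confusion[1])
precision_pos = confusion[0][0] / (confusion[0][0] + confusion[1][0])

print(f"accuracy={accuracy:.3f} recall_pos={recall_pos:.3f} "
      f"recall_neg={recall_neg:.3f} precision_pos={precision_pos:.3f}")
# prints: accuracy=0.679 recall_pos=0.582 recall_neg=0.776 precision_pos=0.722
```

    So roughly 68% of the 1800 held-out reviews are classified correctly, with negative reviews recognized more reliably than positive ones.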
