Using my own corpus instead of movie_reviews corpus for Classification in NLTK

眼角桃花 2020-12-01 10:15

I am using the following code, which I got from Classification using movie review corpus in NLTK/Python:

import string
from itertools import chain
from nltk.corpus import movie_reviews, stopwords
...
1 Answer
  • 2020-12-01 10:50

    If your data has exactly the same structure as the movie_reviews corpus in NLTK, there are two ways to "hack" your way through:

    1. Put your corpus directory where your nltk_data is saved

    First, check where your nltk_data is saved:

    >>> import nltk
    >>> nltk.data.find('corpora/movie_reviews')
    FileSystemPathPointer('/home/alvas/nltk_data/corpora/movie_reviews')
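
    If nltk.data.find() raises a LookupError instead, nltk.data.path holds every directory NLTK searches, so you can inspect the candidates:

    >>> import nltk
    >>> for p in nltk.data.path:  # every directory NLTK searches for data
    ...     print(p)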
    

    Then move your directory into that nltk_data/corpora location:

    # Let's make a test corpus like `nltk.corpus.movie_reviews`
    ~$ mkdir my_movie_reviews
    ~$ mkdir my_movie_reviews/pos
    ~$ mkdir my_movie_reviews/neg
    ~$ echo "This is a great restaurant." > my_movie_reviews/pos/1.txt
    ~$ echo "Had a great time at chez jerome." > my_movie_reviews/pos/2.txt
    ~$ echo "Food fit for the ****" > my_movie_reviews/neg/1.txt
    ~$ echo "Slow service." > my_movie_reviews/neg/2.txt
    ~$ echo "README please" > my_movie_reviews/README
    # Move it to `nltk_data/corpora/`
    ~$ mv my_movie_reviews/ nltk_data/corpora/
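
    If you're not at a Unix shell, here is a minimal Python sketch that builds the same toy corpus; the base path below is an assumption, so point it at the nltk_data/corpora directory found above:

    import os

    # Assumed location; adjust to your own nltk_data/corpora directory.
    base = os.path.expanduser('~/nltk_data/corpora/my_movie_reviews')
    docs = {
        'pos/1.txt': 'This is a great restaurant.',
        'pos/2.txt': 'Had a great time at chez jerome.',
        'neg/1.txt': 'Food fit for the ****',
        'neg/2.txt': 'Slow service.',
    }
    for relpath, text in docs.items():
        path = os.path.join(base, relpath)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, 'w') as fout:
            fout.write(text + '\n')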
    

    Then, in your Python code:

    >>> import string
    >>> from nltk.corpus import LazyCorpusLoader, CategorizedPlaintextCorpusReader
    >>> from nltk.corpus import stopwords
    >>> my_movie_reviews = LazyCorpusLoader('my_movie_reviews', CategorizedPlaintextCorpusReader, r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*', encoding='ascii')
    >>> mr = my_movie_reviews
    >>>
    >>> stop = stopwords.words('english')
    >>> documents = [([w for w in mr.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]
    >>> for i in documents:
    ...     print(i)
    ... 
    (['Food', 'fit', '****'], 'neg')
    (['Slow', 'service'], 'neg')
    (['great', 'restaurant'], 'pos')
    (['great', 'time', 'chez', 'jerome'], 'pos')
    

    (For more details, see https://github.com/nltk/nltk/blob/develop/nltk/corpus/util.py#L21 and https://github.com/nltk/nltk/blob/develop/nltk/corpus/__init__.py#L144.)

    2. Create your own CategorizedPlaintextCorpusReader

    If you don't have access to the nltk_data directory and you want to use your own corpus from anywhere on disk, try this:

    # Let's say that your corpus is saved on `/home/alvas/my_movie_reviews/`
    
    >>> import string; from nltk.corpus import stopwords
    >>> from nltk.corpus import CategorizedPlaintextCorpusReader
    >>> mr = CategorizedPlaintextCorpusReader('/home/alvas/my_movie_reviews', r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*', encoding='ascii')
    >>> stop = stopwords.words('english')
    >>> documents = [([w for w in mr.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]
    >>> 
    >>> for doc in documents:
    ...     print(doc)
    ... 
    (['Food', 'fit', '****'], 'neg')
    (['Slow', 'service'], 'neg')
    (['great', 'restaurant'], 'pos')
    (['great', 'time', 'chez', 'jerome'], 'pos')
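
    The reader also exposes the category metadata directly, so you don't have to split file IDs by hand; categories() and the categories= keyword are standard CategorizedPlaintextCorpusReader methods (output shown for the toy corpus above):

    >>> mr.categories()
    ['neg', 'pos']
    >>> mr.fileids(categories='pos')
    ['pos/1.txt', 'pos/2.txt']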
    

    Similar questions have been asked in Creating a custom categorized corpus in NLTK and Python and Using my own corpus for category classification in Python NLTK.


    Here's the full code that will work:

    import string
    from itertools import chain
    
    from nltk.corpus import stopwords
    from nltk.probability import FreqDist
    from nltk.classify import NaiveBayesClassifier as nbc
    from nltk.corpus import CategorizedPlaintextCorpusReader
    import nltk
    
    mydir = '/home/alvas/my_movie_reviews'
    
    # Read the custom corpus: .txt files under the pos/ and neg/ subdirectories.
    mr = CategorizedPlaintextCorpusReader(mydir, r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*', encoding='ascii')
    stop = stopwords.words('english')
    # One (tokens, label) pair per file; the label is the subdirectory name.
    documents = [([w for w in mr.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]
    
    # Use the 100 most frequent words as features. (In NLTK 3, FreqDist.keys()
    # is no longer sorted by frequency, so use most_common() instead.)
    word_features = FreqDist(chain(*[i for i, j in documents]))
    word_features = [word for word, count in word_features.most_common(100)]
    
    # 90/10 train/test split over boolean bag-of-words features.
    numtrain = int(len(documents) * 90 / 100)
    train_set = [({i: (i in tokens) for i in word_features}, tag) for tokens, tag in documents[:numtrain]]
    test_set = [({i: (i in tokens) for i in word_features}, tag) for tokens, tag in documents[numtrain:]]
    
    classifier = nbc.train(train_set)
    print(nltk.classify.accuracy(classifier, test_set))
    classifier.show_most_informative_features(5)
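
    One caveat: mr.fileids() returns files grouped by category, so the last 10% used as the test set can end up all one class. A small optional tweak is to shuffle the documents before splitting (the seed is arbitrary):

    import random

    random.seed(0)             # arbitrary seed, for reproducibility
    random.shuffle(documents)  # interleave 'neg' and 'pos' before splitting
    numtrain = int(len(documents) * 90 / 100)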
    