Is there a corpus of English words in nltk?

不思量自难忘° 2020-12-30 05:59

Is there any way to get a list of English words from Python's NLTK library? I looked for one, but the only thing I found is wordnet from nltk.corpus.

2 answers
  • 2020-12-30 06:14

    Besides nltk.corpus.words, which @salvadordali has highlighted, NLTK ships other English corpora. First, the wordlist itself:

    >>> from nltk.corpus import words
    >>> print(words.readme())
    Wordlists
    
    en: English, http://en.wikipedia.org/wiki/Words_(Unix)
    en-basic: 850 English words: C.K. Ogden in The ABC of Basic English (1932)
    >>> print(words.words()[:10])
    ['A', 'a', 'aa', 'aal', 'aalii', 'aam', 'Aani', 'aardvark', 'aardwolf', 'Aaron']
    

    Do note that nltk.corpus.words is a list of words without frequencies, so it is not exactly a corpus of natural text.

    The nltk.corpus package contains various corpora, some of which are English; see http://www.nltk.org/nltk_data/. One example is nltk.corpus.brown:

    >>> from nltk.corpus import brown
    >>> brown.words()[:10]
    ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of']
    

    To get a word list from a natural text corpus:

    >>> wordlist = set(brown.words())
    >>> print(len(wordlist))
    56057
    >>> wordlist_lowercased = set(i.lower() for i in brown.words())
    >>> print(len(wordlist_lowercased))
    49815
    

    Note that brown.words() keeps the original casing of the text, so the same word can appear in both upper and lower case. That is why the lowercased set is smaller.

    In most cases a list of words is not very useful without frequencies, so you can use FreqDist:

    >>> from nltk import FreqDist
    >>> from nltk.corpus import brown
    >>> frequency_list = FreqDist(i.lower() for i in brown.words())
    >>> frequency_list.most_common(10)
    [('the', 69971), (',', 58334), ('.', 49346), ('of', 36412), ('and', 28853), ('to', 26158), ('a', 23195), ('in', 21337), ('that', 10594), ('is', 10109)]
    

    For more on how to access and process corpora in NLTK, see http://www.nltk.org/book/ch01.html.
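
The frequency list above counts punctuation tokens such as ',' and '.' alongside words. If you only want word frequencies, one option is to filter tokens before counting. The following is a minimal sketch, not the answer's original method: it uses collections.Counter (which FreqDist subclasses in NLTK 3) so it runs without any corpus download, and word_frequencies and the sample token list are hypothetical names and data made up for illustration.

```python
from collections import Counter


def word_frequencies(tokens):
    """Count lowercased tokens, keeping only purely alphabetic ones.

    Assumption: str.isalpha() is a good enough filter. It drops
    punctuation and numbers, but also hyphenated words like 'well-known'.
    """
    return Counter(t.lower() for t in tokens if t.isalpha())


# Hypothetical sample standing in for brown.words()
sample = ["The", "jury", "said", ",", "the", "jury", ".", "The"]
freq = word_frequencies(sample)
print(freq.most_common(2))  # [('the', 3), ('jury', 2)]
```

With NLTK installed and the Brown corpus downloaded, passing brown.words() in place of sample should work the same way, since the function accepts any iterable of strings.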

  • 2020-12-30 06:33

    Yes: from nltk.corpus import words

    Then check whether a word is in the list using:

    >>> "fine" in words.words()
    True
    

    Reference: Section 4.1 (Wordlist Corpora), chapter 2 of Natural Language Processing with Python.
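
Because words.words() returns a list, each `in` check scans it linearly, which gets slow if you test many words. A possible workaround, sketched here under the assumption that a case-insensitive match is acceptable, is to build a set once and query that. The helper names build_vocabulary, is_known_word and load_nltk_vocabulary are hypothetical, and the last one assumes nltk is installed and the 'words' corpus has been downloaded.

```python
def build_vocabulary(tokens):
    """Return a set of lowercased tokens for fast membership tests."""
    return {t.lower() for t in tokens}


def is_known_word(word, vocabulary):
    """Case-insensitive lookup against a set built by build_vocabulary."""
    return word.lower() in vocabulary


def load_nltk_vocabulary():
    """Build the vocabulary from NLTK's wordlist.

    Assumption: nltk is installed and nltk.download('words') has been run.
    The import is local so the helpers above work without NLTK.
    """
    from nltk.corpus import words
    return build_vocabulary(words.words())


# Hypothetical tiny wordlist standing in for words.words()
vocab = build_vocabulary(["Fine", "aardvark", "Aaron"])
print(is_known_word("fine", vocab))   # True
print(is_known_word("xyzzy", vocab))  # False
```

Lowercasing means proper nouns such as 'Aaron' also match 'aaron'; drop the .lower() calls if you need case-sensitive checks.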
