Is there any way to get the list of English words in Python's nltk library?
I tried to find it, but the only thing I have found is wordnet from nltk.corpus.
Other than the nltk.corpus.words that @salvadordali has highlighted:
>>> from nltk.corpus import words
>>> print(words.readme())
Wordlists
en: English, http://en.wikipedia.org/wiki/Words_(Unix)
en-basic: 850 English words: C.K. Ogden in The ABC of Basic English (1932)
>>> print(words.words()[:10])
['A', 'a', 'aa', 'aal', 'aalii', 'aam', 'Aani', 'aardvark', 'aardwolf', 'Aaron']
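The readme above also lists an en-basic fileid; if you only want Ogden's 850-word Basic English list, you can pass that fileid to words.words(). A minimal sketch, assuming the fileid name shown in the readme:
>>> # Load only the 850-word Basic English list via its fileid
>>> basic = words.words('en-basic')
>>> len(basic)
850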
Do note that nltk.corpus.words is a list of words without frequencies, so it's not exactly a corpus of natural text.
The corpus package contains various corpora, some of which are English corpora; see http://www.nltk.org/nltk_data/. E.g. nltk.corpus.brown:
>>> from nltk.corpus import brown
>>> brown.words()[:10]
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of']
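If NLTK raises a LookupError because a corpus is missing, download it first; this is a one-time step (a quick sketch; the resource names match the corpora used in this answer):
>>> import nltk
>>> nltk.download('words')  # one-time download of the wordlist corpus
>>> nltk.download('brown')  # one-time download of the Brown corpus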
To get a word list from a natural text corpus:
>>> wordlist = set(brown.words())
>>> print(len(wordlist))
56057
>>> wordlist_lowercased = set(i.lower() for i in brown.words())
>>> print(len(wordlist_lowercased))
49815
Note that brown.words() contains words in both lower and upper case, like natural text.
In most cases, a list of words is not very useful without frequencies, so you can use FreqDist:
>>> from nltk import FreqDist
>>> from nltk.corpus import brown
>>> frequency_list = FreqDist(i.lower() for i in brown.words())
>>> frequency_list.most_common()[:10]
[('the', 69971), (',', 58334), ('.', 49346), ('of', 36412), ('and', 28853), ('to', 26158), ('a', 23195), ('in', 21337), ('that', 10594), ('is', 10109)]
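Since natural text also yields punctuation tokens (note ',' and '.' in the counts above), you may want to keep only alphabetic tokens before counting. A small sketch of one possible filter, using str.isalpha():
>>> # Count only purely alphabetic tokens, lowercased
>>> alpha_freq = FreqDist(w.lower() for w in brown.words() if w.isalpha())
>>> alpha_freq.most_common()[:5]
[('the', 69971), ('of', 36412), ('and', 28853), ('to', 26158), ('a', 23195)]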
For more on how to access and process corpora in NLTK, see http://www.nltk.org/book/ch01.html.
Yes:
>>> from nltk.corpus import words
And check using:
>>> "fine" in words.words()
True
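One caveat: words.words() returns a plain list, so each in check is a linear scan. If you are testing many words, build a set once so that membership checks are constant time on average (a minimal sketch):
>>> english_vocab = set(words.words())  # build once; lookups are then O(1) on average
>>> "fine" in english_vocab
True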
Reference: Section 4.1 (Wordlist Corpora), chapter 2 of Natural Language Processing with Python.