Base word stemming instead of root word stemming in R

Is there any way to get the base word instead of the root word when stemming with NLP tools in R?

Code:

# Loading libraries
library(tm)
library(slam)


                      
4 Answers

爱一瞬间的悲伤, 2021-02-05 18:29

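stemCompletion from the tm package could be used here. It is not the best option, but it is workable.

# Txt is presumably the stemmed corpus and Txtt the unstemmed corpus used as the
# completion dictionary (neither is defined in the snippet shown in the question)
Stemm <- tm_map(Txt, stemCompletion, dictionary = Txtt)
inspect(Stemm)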

A corpus with 2 text documents

The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
  create_date creator 
Available variables in the data frame are:
  MetaID 

[[1]]
happyness happies happies

[[2]]
sky sky
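
To see what stemCompletion does on its own, outside tm_map, a minimal sketch might look like the following. It assumes the tm package is installed; the stems and the dictionary words are made up for illustration, and type = "shortest" is only one of the heuristics the function offers.

```r
library(tm)

# Illustrative stems, as a Porter-style stemmer might produce them (assumed input).
stems <- c("happi", "ski")

# Illustrative dictionary of full words to complete against (assumed input).
dictionary <- c("happiness", "happily", "skies")

# Each stem is completed to the shortest dictionary word that starts with it.
completed <- stemCompletion(stems, dictionary = dictionary, type = "shortest")
print(completed)
```

The completion is a prefix match against the dictionary, so the result depends entirely on which words the dictionary contains and on the chosen type.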

                                                                        
                                                        
            
            
              
                
难免孤独, 2021-02-05 18:35

When I needed to do something similar, I wrote my list of words to a text file, fed it to the English Lexicon Project's web query tool, and parsed the result back into R. It is a little clunky, but a lot of good data is available from ELP.
For your use case, check out ELP's MorphSP field. For happiness, it gives {happy}>ness>

http://elexicon.wustl.edu/query14/query14.asp
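
Once the ELP results are back in R, pulling the base word out of a MorphSP value could be sketched as below. This is only an illustration in base R: extract_base is a hypothetical helper, and it assumes the base word is always the part wrapped in curly braces, as in the single example quoted above.

```r
# Hypothetical helper: return the text inside the first {...} of each value,
# or NA when a value contains no braces at all.
extract_base <- function(morph) {
  has_base <- grepl("\\{[^}]+\\}", morph)
  base <- sub("^.*?\\{([^}]+)\\}.*$", "\\1", morph, perl = TRUE)
  base[!has_base] <- NA_character_
  unname(base)
}

# The one MorphSP value quoted in the answer.
print(extract_base("{happy}>ness>"))
```

Entries with a different layout would need their own handling, so it is worth checking a sample of the downloaded data before relying on this pattern.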
                                                                        
                                                        
            
            
              
                
孤街浪徒, 2021-02-05 18:40

You're probably looking for a stemmer.
Here are some stemmers from the CRAN Task View: Natural Language Processing:

RWeka is an interface to Weka, a collection of machine learning algorithms for data mining tasks written in Java. Its tokenization and stemming functionality is especially useful for natural language processing.
Snowball provides the Snowball stemmers, which include the Porter stemmer and several other stemmers for different languages. See the Snowball webpage for details.
Rstem is an alternative interface to a C version of Porter's word stemming algorithm.

                                                                        
                                                        
            
            
              
                
情书的邮戳, 2021-02-05 18:44

Unless you have a good knowledge of English morphology, you are better off using an existing library than writing your own stemmer.

English is full of morphological surprises that affect both probabilistic and rule-based models. Some examples are:

Affixes that cannot be stripped independently, such as the in- prefix and the -able suffix in inhabitable.
A change of the word's category, as in the noun bicycle resulting from stemming the verb bicycling (this can affect rules based on categories).
Words with negative meanings cannot take negative prefixes (you can have unpretty, but not unugly).
Two words forming a compound, as in "truck driver" (you would treat them as one word when you stem).

English also has an issue with I-umlaut, where words like men, geese, feet, best, and a host of other words (all with an 'e'-like sound) cannot be easily stemmed. Stemming foreign, borrowed words, like automaton, may also be an issue.

Superlative forms are a good example of such exceptions:

best -> good

eldest -> old

A lemmatizer would account for such exceptions, but would be slower. You can look at the Porter stemmer rules to get an idea of what you need, or you can simply use the SnowballC R package, which implements that stemmer.
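
The exception handling described above can be sketched in base R as a lookup table that is consulted before any rule-based step. This is only an illustration: lemmatize and its fallback argument are hypothetical names, and the table holds just the two irregular forms from this answer.

```r
# Irregular forms from the examples above; a real lemmatizer uses a far larger table.
exceptions <- c(best = "good", eldest = "old")

# Hypothetical helper: look each word up in the exception table first,
# otherwise keep whatever `fallback` returns (identity by default).
lemmatize <- function(words, fallback = identity) {
  key <- tolower(words)
  hit <- key %in% names(exceptions)
  out <- fallback(words)
  out[hit] <- unname(exceptions[key[hit]])
  out
}

print(lemmatize(c("best", "eldest", "sky")))
```

Passing a stemming function as fallback, for example SnowballC::wordStem if that package is installed, would stem the regular words while the table handles the irregular ones.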
                                                                        
                                                        
            
            
              
                