R tm package invalid input in 'utf8towcs'


I'm trying to use the tm package in R to perform some text analysis. I tried the following:

require(tm)
dataSet <- Corpus(DirSource('tmp/'))
dataSet <


                      
14 answers

别那么骄傲 (2020-11-29 02:20)

The fix suggested in the official FAQ did not work in my situation:

tm_map(yourCorpus, function(x) iconv(x, to='UTF-8-MAC', sub='byte'))


I finally got it working with a for loop and the Encoding function:

# Declare each document as UTF-8 before applying tolower
for (i in seq_along(dataSet))
{
  Encoding(dataSet[[i]]) <- "UTF-8"
}
corpus <- tm_map(dataSet, tolower)
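
As a possible alternative to the loop above, invalid bytes can be dropped or escaped with base R's iconv before lowercasing. The sketch below works on a plain character vector rather than a tm corpus. The sample strings and the names raw_text, cleaned and lowered are assumptions made for illustration, and whether discarding bytes is acceptable depends on your data.

```r
# Sample input (hypothetical): the second string contains a lone 0xE9 byte,
# which is not valid UTF-8.
raw_text <- c("Hello World", "caf\xe9 TEST")

# validUTF8() reports which elements are well-formed UTF-8.
print(validUTF8(raw_text))

# Re-encode UTF-8 to UTF-8; sub = "" drops bytes that cannot be converted.
# Use sub = "byte" instead to keep them as <xx> hex escapes.
cleaned <- iconv(raw_text, from = "UTF-8", to = "UTF-8", sub = "")

# tolower() is now applied to well-formed input only.
lowered <- tolower(cleaned)
print(lowered)
```

With a tm corpus, the same iconv call could be passed to tm_map as in the FAQ line quoted earlier; recent tm versions may require wrapping a custom function in content_transformer().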

                                                                        
                                                        
            
            
              
                
温柔的废话 (2020-11-29 02:23)

If it is acceptable to ignore invalid inputs, you can use R's error handling, e.g.:

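  dataSet <- Corpus(DirSource('tmp/'))
  dataSet <- tm_map(dataSet, function(data) {
     # ERROR HANDLING: capture the error object instead of stopping
     possibleError <- tryCatch(
         tolower(data),
         error=function(e) e
     )

     # If tolower failed, return the input unchanged so the
     # document is not replaced by an error object.
     if (inherits(possibleError, "error")) return(data)

     # REAL WORK. Could do more work on your data here,
     # because you know the input is valid.
     # useful(data); fun(data); good(data);
     possibleError
  })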


There is an additional example here: http://gastonsanchez.wordpress.com/2012/05/29/catching-errors-when-using-tolower/
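
The pattern in this answer can be reduced to a small helper, sketched here in base R without tm. safe_tolower is a hypothetical name, the sample strings are made up, and returning the input unchanged on failure is only one possible choice (returning NA is another).

```r
# Hypothetical helper: lowercase a string, falling back to the original
# value if tolower() signals an error (for example on invalid multibyte input).
safe_tolower <- function(x) {
  tryCatch(tolower(x), error = function(e) x)
}

# Sample input (made up): the second string holds a byte that is invalid in UTF-8.
docs <- c("Hello World", "caf\xe9 TEST")

# Apply element by element so one bad string does not abort the whole vector.
result <- vapply(docs, safe_tolower, character(1), USE.NAMES = FALSE)
print(result[1])
```

Whether tolower actually fails on the second string depends on the R version and locale; either way the helper returns a character value instead of stopping.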
                                                                        
                                                        
            
            
              
                