Remove unicode <+f0b7> from Corpus text

后端未结

关注

 1  1058

I\'m having a pretty stubborn issue... I can\'t seem to remove the <+f0b7> and <+f0a0> string from Corpora that have loaded from


                      
              相关标签:


      
      
        
          1条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  执笔经年        
                
              
                            
                2021-01-16 22:09
              
            
            
                                                                       
Ok. The problem is that your data has an unusual unicode character in it. In R, we typically escape this character as "\uf0b7". But when inspect() prints it's data, it encodes it as "". Observe

sample<-c("Crazy \uf0b7 Character")
cp<-Corpus(VectorSource(sample))
inspect(DocumentTermMatrix(cp))

# A document-term matrix (1 documents, 3 terms)
# 
# Non-/sparse entries: 3/0
# Sparsity           : 0%
# Maximal term length: 9 
# Weighting          : term frequency (tf)
# 
#     Terms
# Docs <U+F0B7> character crazy
#    1        1         1     1


(actually i had to create this output on a Windows machine running R 3.0.2 - it worked fine on my Mac running R 3.1.0).

Unfortunately you will not be able to remove this with remove words because the regular expression used in that function required that word boundaries appear on both sides of the "word" and since this doesn't seem to be a recognized character for a boundary. See

gsub("\uf0b7","",sample)
# [1] "Crazy  Character"
gsub("\\b\uf0b7\\b","",sample)
#[1] "Crazy  Character"


So we can write our own function we can use with tm_map. Consider

removeCharacters <-function (x, characters)  {
gsub(sprintf("(*UCP)(%s)", paste(characters, collapse = "|")), "", x, perl = TRUE)
}


which is basically the removeWords function just without the boundary conditions. Then we can run

cp2 <- tm_map(cp, removeCharacters, c("\uf0b7","\uf0a0"))
inspect(DocumentTermMatrix(cp2))

# A document-term matrix (1 documents, 2 terms)
# 
# Non-/sparse entries: 2/0
# Sparsity           : 0%
# Maximal term length: 9 
# Weighting          : term frequency (tf)
# 
#     Terms
# Docs character crazy
#    1         1     1


and we see those unicode characters are no longer there.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复