How to remove unicode <U+00A6> from string?

I have a string like:

q <- "<U+00A6>  1000-66329"

I want to remove <U+00A6> and get only 1000 66329


                      
4 answers

广开言路, 2020-11-27 23:57

If <U+00A6> is always the first character, you can try:

substring("\U00A6  1000-66329", 2)


If the string holds the literal eight-character text "<U+00A6>" rather than the single Unicode character ¦ (R then prints it as <U+00A6>  1000-66329 instead of ¦  1000-66329), drop the first eight characters instead:

substring("<U+00A6>  1000-66329",9)


Both ways the result is:

[1] "  1000-66329"
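
The two cases above can also be told apart programmatically. This is only a minimal sketch, assuming the marker is always U+00A6 and only appears at the start of the string; `strip_marker` is a hypothetical helper name, not a base R function:

```r
# Hypothetical helper: drop a leading U+00A6, whether it is stored as the
# real character or as the literal eight-character text "<U+00A6>".
strip_marker <- function(x) {
  literal <- "<U+00A6>"
  ifelse(startsWith(x, literal),
         substring(x, nchar(literal) + 1),   # skip the 8 literal characters
         ifelse(startsWith(x, "\u00A6"),
                substring(x, 2),             # skip the single Unicode character
                x))                          # nothing to strip
}

strip_marker("<U+00A6>  1000-66329")   # literal text case
strip_marker("\u00A6  1000-66329")     # real Unicode character case
```

Both calls return the string with its two leading spaces intact, matching the result shown above; wrap the call in trimws() if the spaces should go as well.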

                                                                        
                                                        
            
            
              
                
孤街浪徒, 2020-11-28 00:14

Instead of removing it, you should convert it to the appropriate encoding. Set your locale to UTF-8 like so:

Sys.setlocale("LC_CTYPE", "en_US.UTF-8")


You may see the following warning:

Warning message:
In Sys.setlocale("LC_CTYPE", "en_US.UTF-8") :
  OS reports request to set locale to "en_US.UTF-8" cannot be honored


In this case you can use stringi::stri_trans_general(x, "zh") instead.

Here "zh" means Chinese. You need to know which language to convert to.
                                                                        
                                                        
            
            
              
                
故里飘歌, 2020-11-28 00:16


  I just want to remove unicode <U+00A6> which is at the beginning of string. 


Then you do not need gsub; a single sub with the pattern "^\\s*<U\\+\\w+>\\s*" is enough:

q <- "<U+00A6>  1000-66329"
sub("^\\s*<U\\+\\w+>\\s*", "", q)
#[1] "1000-66329"


Pattern details:


^ - start of string
\\s* - zero or more whitespace characters
<U\\+ - a literal char sequence <U+
\\w+ - 1 or more letters, digits or underscores
> - a literal >
\\s* - zero or more whitespace characters.
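
To see how the pieces of the pattern behave together, here is a small sketch that applies the same sub call to a few inputs. The vector `samples` is made up for illustration; only its first element comes from the question:

```r
pattern <- "^\\s*<U\\+\\w+>\\s*"
samples <- c("<U+00A6>  1000-66329",    # the string from the question
             "  <U+00A6>1000-66329",    # leading spaces, no space after the marker
             "1000-66329 <U+00A6>")     # marker not at the start: left untouched
cleaned <- sub(pattern, "", samples)
print(cleaned)
#[1] "1000-66329"          "1000-66329"          "1000-66329 <U+00A6>"
```

Because of the ^ anchor, a marker anywhere other than the start of the string is not removed.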


If you also need to replace the - with a space, add a |- alternative and use gsub, since there are now several matches to replace and the replacement must be a space (the same approach as in akrun's answer):

trimws(gsub("^\\s*<U\\+\\w+>|-", " ", q))


佛祖请我去吃肉, 2020-11-28 00:16

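We can also remove the first run of non-whitespace characters together with the whitespace after it, and turn the - into a space: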

trimws(gsub("\\S+\\s+|-", " ", q))
#[1] "1000 66329"
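
One thing to keep in mind is that \\S+\\s+ is not specific to <U+00A6>: it matches any run of non-whitespace characters followed by whitespace. The sketch below compares it with the anchored pattern from the previous answer; `q2` is a made-up input with a trailing token, and `generic` and `anchored` are hypothetical wrapper names:

```r
q  <- "<U+00A6>  1000-66329"
q2 <- "<U+00A6>  1000-66329 extra"   # hypothetical input with a trailing token

generic  <- function(x) trimws(gsub("\\S+\\s+|-", " ", x))
anchored <- function(x) trimws(gsub("^\\s*<U\\+\\w+>|-", " ", x))

generic(q)     # "1000 66329"
anchored(q)    # "1000 66329"
generic(q2)    # "extra": the digits are followed by a space, so they match too
anchored(q2)   # "1000 66329 extra"
```

For strings shaped exactly like the one in the question the two give the same result; the anchored version is the safer choice if anything may follow the number.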

                                                                        
                                                        
            
            
              
                