Python regex replacing \u2022


死守一世寂寞 2021-01-26 09:53

This is my string:

raw_list = u\'Software Engineer with a huge passion for new and innovative products. Experienced gained from working in both big and fast-grow


      
      
        
4 answers

梦毁少年i (original poster) 2021-01-26 10:20

The key is to put the Unicode prefix u in front of the string literal that holds the character you are trying to find, in this case \u2022, which is the Unicode code point for a bullet. In Python 2, text that contains Unicode characters is a unicode object rather than a plain (byte) string; you can confirm this by checking its type or by looking for the leading u in its repr. The example below searches for a Unicode bullet with regular expressions (RegEx) in both a plain string and Unicode text:

Import the regular expressions package:

import re


Unicode text:

my_unicode = (
    u"\u2022 Here's a string of data.\n<br/>"
    u"\u2022 There are new <br/>"
    u"line characters \n, HTML line break tags <br/>"
    u", and bullets \u2022 together in <br/>"
    u"a sequence.\n<br/>"
    u"\u2022 Our goal is to use RegEx to identify the sequences."
)

type(my_unicode)  # unicode (Python 2)


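Plain string (no u prefix, so in Python 2 each \u2022 stays as the literal characters backslash, u, 2, 0, 2, 2):

my_string = (
    "\u2022 Here's a string of data. \n<br/>"
    "\u2022 There are new <br/>"
    "line characters \n, HTML line break tags <br/>"
    ", and bullets \u2022 together in <br/>"
    "a sequence.\n<br/>"
    "\u2022 Our goal is to use RegEx to identify the sequences."
)

type(my_string)  # str (Python 2)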


We successfully find the first piece of text we are looking for, which does not yet contain the Unicode character:

re.findall('\n<br/>', my_unicode)

re.findall('\n<br/>', my_string)


With the addition of the Unicode character, neither substring can be found:

re.findall('\n<br/>\u2022', my_unicode)

re.findall('\n<br/>\u2022', my_string)


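Adding four backslashes works for the plain string, because the pattern then matches a literal backslash followed by u, but it does not work for the Unicode text, which holds a real bullet character instead:

re.findall('\n<br/>\\\\u', my_unicode)

re.findall('\n<br/>\\\\u', my_string)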
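
To see why the four-backslash pattern only matches one of the two texts, it helps to compare text holding a real bullet character with text holding the six characters backslash, u, 2, 0, 2, 2. This is only a minimal sketch, assuming Python 3 (where every str is Unicode, so the escaped form has to be written with a doubled backslash); the variable names are made up for illustration:

```python
import re

# Assumption: Python 3. Both values are str, but their contents differ.
real_bullet = "\n\u2022 item"     # newline, the actual bullet character U+2022, " item"
escaped_text = "\n\\u2022 item"   # newline, then the literal characters \ u 2 0 2 2, " item"

# A pattern containing the real character matches only the real bullet.
found_real = re.findall("\u2022", real_bullet)
missed_real = re.findall("\u2022", escaped_text)

# A pattern for a literal backslash followed by u2022 matches only the escaped text.
found_escaped = re.findall(r"\\u2022", escaped_text)
missed_escaped = re.findall(r"\\u2022", real_bullet)

print(len(real_bullet), len(escaped_text))    # the two texts have different lengths
print(len(found_real), len(missed_real))
print(len(found_escaped), len(missed_escaped))
```

The pattern has to describe what is actually stored in the text: a single code point in one case, an escape sequence spelled out as characters in the other.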


Solution: put the Unicode prefix u in front of the literal that contains the Unicode character, so the pattern holds a real bullet:

re.findall('\n<br/>' u'\u2022', my_unicode)
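
For reference, here is a rough sketch of the same search on Python 3, where all string literals are already Unicode and the u prefix is optional. The sample text, the `<br/>` spelling of the line break tag, and the hyphen used as a replacement are assumptions for illustration, not part of the original question:

```python
import re

# Assumption: Python 3; "<br/>" stands in for the HTML line break tag.
text = (
    "\u2022 Here's a string of data.\n<br/>"
    "\u2022 There are new line characters \n, HTML line break tags <br/>, "
    "and bullets \u2022 together in a sequence.\n<br/>"
    "\u2022 Our goal is to use RegEx to identify the sequences."
)

# Find every newline + <br/> + bullet sequence.
matches = re.findall("\n<br/>\u2022", text)

# Replace each sequence with a newline and a hyphen (hypothetical target format).
cleaned = re.sub("\n<br/>\u2022", "\n-", text)

print(len(matches))
print(cleaned)
```

Only the bullets that directly follow a newline and a line break tag are replaced; the bullet at the very start of the text and the one in the middle of the sentence are left alone.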

    
             
                                                        
            
            
              
                