Escaping unicode strings in python

前端未结


In Python, these three commands print the same emoji:

print "\xF0\x9F\x8C\x80"
print u"\U0001F300"
print u"\ud83c\udf00"


                      


      
      
        
4 Answers

        
                         				            
            
           
            
                              
                
              
              
                
                  清酒与你        
                
              
                            
                2021-01-03 04:56
              
            
            
                                                                       
See "Unicode Literals in Python Source Code" in the Python 2 Unicode HOWTO:


  In Python source code, Unicode literals are written as strings prefixed with the ‘u’ or ‘U’ character: u'abcdefghijk'. Specific code points can be written using the \u escape sequence, which is followed by four hex digits giving the code point. The \U escape sequence is similar, but expects 8 hex digits, not 4.


In [1]: "\xF0\x9F\x8C\x80".decode('utf-8')
Out[1]: u'\U0001f300'

In [2]: u'\U0001F300'.encode('utf-8')
Out[2]: '\xf0\x9f\x8c\x80'

In [3]: u'\ud83c\udf00'.encode('utf-8')
Out[3]: '\xf0\x9f\x8c\x80'




\uhhhh     --> Unicode character with 16-bit hex value  
\Uhhhhhhhh --> Unicode character with 32-bit hex value



  In Unicode escapes, the first form gives four hex digits to encode a 2-byte (16-bit) character code point, and the second gives eight hex digits for a 4-byte (32-bit) code point. Byte strings support only hex escapes for encoded text and other forms of byte-based data.
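
A quick way to see the two escape forms side by side is to compare them in an interpreter. The sketch below assumes Python 3, where every string literal is already Unicode, so the u prefix is left out; the variable names are only illustrative.

```python
# Assumes Python 3: plain string literals are Unicode strings.
e_acute_short = "\u00e9"      # \u takes exactly four hex digits
e_acute_long = "\U000000e9"   # \U takes exactly eight hex digits
cyclone = "\U0001F300"        # code points above U+FFFF need the \U form

print(e_acute_short == e_acute_long)  # same code point, two spellings
print(hex(ord(cyclone)))              # numeric value of the code point
print(len(cyclone))                   # length in code points
```

Both escapes name a code point by number; they differ only in how many hex digits they accept.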

                                                                        
                                                        
            
            
              
                
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  青春惊慌失措        
                
              
                            
                2021-01-03 05:00
              
            
            
                                                                       
The first one is a byte string holding the UTF-8 encoding of the character:

>>> "\xF0\x9F\x8C\x80".decode('utf8')
u'\U0001f300'


The u"\ud83c\udf00" one is the UTF-16 version: a surrogate pair written as two four-digit \u escapes.

The u"\U0001F300" one is the actual code point number, U+1F300.
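
The link between the code point and the two \u values is the UTF-16 surrogate-pair rule. As a minimal sketch (Python 3 assumed, and to_surrogate_pair is a hypothetical helper name, not a library function), the arithmetic looks like this:

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF use surrogate pairs")
    offset = code_point - 0x10000      # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits go into the low surrogate
    return high, low

high, low = to_surrogate_pair(0x1F300)
print(hex(high), hex(low))  # 0xd83c 0xdf00
```

These are exactly the two values in the u"\ud83c\udf00" literal.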



But how do the numbers relate?  This is the difficult question.  The relationship is defined by each encoding's bit-packing rules, so it is not obvious from the hex digits alone.  To give you an idea, here is an example of "manually" encoding the code point U+1F300 into UTF-8:

The cyclone character                                                                     
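
The manual UTF-8 encoding mentioned above can be sketched as follows. This assumes Python 3 and only covers the four-byte case that applies to U+1F300; utf8_encode_supplementary is a hypothetical name chosen for illustration.

```python
def utf8_encode_supplementary(code_point):
    """Encode a code point in U+10000..U+10FFFF as four UTF-8 bytes."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("this sketch only handles the four-byte case")
    return bytes([
        0xF0 | (code_point >> 18),           # 11110xxx: top 3 bits
        0x80 | ((code_point >> 12) & 0x3F),  # 10xxxxxx: next 6 bits
        0x80 | ((code_point >> 6) & 0x3F),   # 10xxxxxx: next 6 bits
        0x80 | (code_point & 0x3F),          # 10xxxxxx: last 6 bits
    ])

manual = utf8_encode_supplementary(0x1F300)
print(manual)                                  # b'\xf0\x9f\x8c\x80'
print(manual == "\U0001F300".encode("utf-8"))  # True
```

The 21 bits of the code point are spread over the payload bits of one lead byte and three continuation bytes, which is why the UTF-8 hex digits look unrelated to 1F300.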

                                                        
            
            
              
                
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  温柔的废话        
                
              
                            
                2021-01-03 05:00
              
            
            
                                                                       
Your first string is a byte string. The fact that it prints a single emoji character means that your console is configured to print UTF-8 encoded characters.

Your second string is a Unicode string with a single codepoint, U+1F300. The \U specifies that the next 8 hex digits should be interpreted as a codepoint.

The third string takes advantage of a quirk in the way Unicode strings are stored in Python 2. You've given two UTF-16 code units, a surrogate pair, which together form the single codepoint U+1F300, the same as the previous string. Each \u takes the 4 hex digits that follow it. Individually these values aren't valid Unicode characters, but because narrow builds of Python 2 store Unicode internally as UTF-16, it works out. In Python 3.3 and later the literal is still accepted, but it produces two lone surrogates that are not equal to U+1F300 and cannot be encoded to UTF-8.

When you print out a Unicode string, and your console encoding is known to be UTF-8, the Unicode strings are encoded to UTF-8 bytes. Thus the 3 strings end up producing the same byte sequence on the output, generating the same character.
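
For readers on Python 3, here is a small sketch of how the third form behaves there (Python 3.3 or later assumed). It also shows one way to recombine the two surrogates by round-tripping through UTF-16 with the surrogatepass error handler.

```python
cyclone = "\U0001F300"
pair = "\ud83c\udf00"   # in Python 3 this is two lone surrogates

print(len(cyclone), len(pair))
print(cyclone == pair)

try:
    pair.encode("utf-8")
except UnicodeEncodeError as exc:
    print("cannot encode:", exc.reason)

# Writing the surrogates out as UTF-16 code units and decoding them again
# lets the UTF-16 decoder combine the pair into a single code point.
repaired = pair.encode("utf-16", "surrogatepass").decode("utf-16")
print(repaired == cyclone)
```

So in Python 3 only the byte string (once decoded) and the \U literal name the same character directly.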
                                                                        
                                                        
            
            
              
                
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  陌清茗        
                
              
                            
                2021-01-03 05:09
              
            
            
                                                                       
The other answers describe how Unicode characters can be encoded or embedded as literals in Python 2.x. Let me answer your more meta question, "it's not obvious to me how \xF0\x9F and 0001 and d83c are the same number?"

The number assigned to each Unicode "code point"--roughly speaking, to each "character"--can be encoded in multiple ways. This is similar to how integers can be encoded in several ways:


0b1100100 (binary, base 2)
0144 (octal, base 8)
100 (decimal, base 10)
0x64 (hexadecimal, base 16)


Those are all the same value, decimal 100, with different encodings. The following is a true expression in Python 2 (Python 3 writes the octal literal as 0o144):

0b1100100 == 0144 == 100 == 0x64
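
For reference, a Python 3 version of the same comparison might look like the sketch below; the only change assumed here is the 0o prefix for the octal literal.

```python
# Python 3 spelling: the octal literal needs the 0o prefix.
values = [0b1100100, 0o144, 100, 0x64]
print(values)                         # [100, 100, 100, 100]
print(bin(100), oct(100), hex(100))   # 0b1100100 0o144 0x64
```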


Unicode's encodings are a bit more complex, but the principle is the same. Just because the values don't look the same doesn't mean they don't signify the same value. In Python 2:

u'\ud83c\udf00' == u'\U0001F300' == "\xF0\x9F\x8C\x80".decode("utf-8")


Python 3 changes the rules for string literals, but it's still true that:

u'\U0001F300' == b"\xF0\x9F\x8C\x80".decode("utf-8") 


Here the explicit b (bytes prefix) is required. The u (Unicode prefix) is optional, because all Python 3 strings are Unicode, and it is only permitted in 3.3 and later. The surrogate-pair form u'\ud83c\udf00' no longer compares equal in Python 3.3 and later, but it wasn't that pretty anyway, was it?

So you presented various encodings of the Unicode CYCLONE code point, and the other answers showed some ways to move between code points. See this for even more encodings of that one character.
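
To tie the three spellings from the question together, the sketch below derives all of them from a single character. It assumes Python 3, and escape_forms is a hypothetical helper written for this illustration rather than anything from the standard library.

```python
def escape_forms(char):
    """Return the UTF-8 byte escapes, UTF-16 unit escapes and code point escape."""
    # One \xNN escape per UTF-8 byte.
    utf8 = "".join("\\x{:02X}".format(b) for b in char.encode("utf-8"))
    # One \uNNNN escape per 16-bit UTF-16 code unit (big-endian, no BOM).
    raw16 = char.encode("utf-16-be")
    units = [int.from_bytes(raw16[i:i + 2], "big") for i in range(0, len(raw16), 2)]
    utf16 = "".join("\\u{:04x}".format(u) for u in units)
    # A single eight-digit \U escape for the code point itself.
    code_point = "\\U{:08X}".format(ord(char))
    return utf8, utf16, code_point

for form in escape_forms("\U0001F300"):
    print(form)
```

For the cyclone character this prints the byte-string escapes, the surrogate pair, and the \U escape shown in the question.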
                                                                        
                                                        
            
            
              
                
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         