How to unquote a urlencoded unicode string in python?

后端未结

关注

 5  1591

I have a unicode string like \"Tanım\" which is encoded as \"Tan%u0131m\" somehow. How can i convert this encoded string back to original unicode. Apparently urllib.unquote

相关标签:

5条回答

灰色年华

2020-11-29 00:41

def unquote(text):
    def unicode_unquoter(match):
        return unichr(int(match.group(1),16))
    return re.sub(r'%u([0-9a-fA-F]{4})',unicode_unquoter,text)

0 讨论(0)

臣服心动

2020-11-29 00:42

You have a URL using a non-standard encoding scheme, rejected by standards bodies but still being produced by some encoders. The Python urllib.parse.unquote() function can't handle these.

Creating your own decoder is not that hard, luckily. %uhhhh entries are meant to be UTF-16 codepoints here, so we need to take surrogate pairs into account. I've also seen %hh codepoints mixed in, for added confusion.

With that in mind, here is a decoder which works in both Python 2 and Python 3, provided you pass in a str object in Python 3 (Python 2 cares less):

try:
    # Python 3
    from urllib.parse import unquote
    unichr = chr
except ImportError:
    # Python 2
    from urllib import unquote

def unquote_unicode(string, _cache={}):
    string = unquote(string)  # handle two-digit %hh components first
    parts = string.split(u'%u')
    if len(parts) == 1:
        return parts
    r = [parts[0]]
    append = r.append
    for part in parts[1:]:
        try:
            digits = part[:4].lower()
            if len(digits) < 4:
                raise ValueError
            ch = _cache.get(digits)
            if ch is None:
                ch = _cache[digits] = unichr(int(digits, 16))
            if (
                not r[-1] and
                u'\uDC00' <= ch <= u'\uDFFF' and
                u'\uD800' <= r[-2] <= u'\uDBFF'
            ):
                # UTF-16 surrogate pair, replace with single non-BMP codepoint
                r[-2] = (r[-2] + ch).encode(
                    'utf-16', 'surrogatepass').decode('utf-16')
            else:
                append(ch)
            append(part[4:])
        except ValueError:
            append(u'%u')
            append(part)
    return u''.join(r)

The function is heavily inspired by the current standard-library implementation.

Demo:

>>> print(unquote_unicode('Tan%u0131m'))
Tanım
>>> print(unquote_unicode('%u05D0%u05D9%u05DA%20%u05DE%u05DE%u05D9%u05E8%u05D9%u05DD%20%u05D0%u05EA%20%u05D4%u05D8%u05E7%u05E1%u05D8%20%u05D4%u05D6%u05D4'))
איך ממירים את הטקסט הזה
>>> print(unquote_unicode('%ud83c%udfd6'))  # surrogate pair


          	          
            
           
            
                              
                
              
              
                
                  执念已碎        
                
              
                            
                2020-11-29 00:44
              
            
            
                                                                       
there is a bug in the above version where it freaks out sometimes when there are both ascii encoded and unicode encoded characters in the string. I think its specifically when there are characters from the upper 128 range like '\xab' in addition to unicode.

eg. "%5B%AB%u03E1%BB%5D" causes this error.

I found if you just did the unicode ones first, the problem went away:

def unquote_u(source):
  result = source
  if '%u' in result:
    result = result.replace('%u','\\u').decode('unicode_escape')
  result = unquote(result)
  return result

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  夕颜        
                
              
                            
                2020-11-29 00:53
              
            
            
                                                                       
%uXXXX is a non-standard encoding scheme that has been rejected by the w3c, despite the fact that an implementation continues to live on in JavaScript land.

The more common technique seems to be to UTF-8 encode the string and then % escape the resulting bytes using %XX. This scheme is supported by urllib.unquote:

>>> urllib2.unquote("%0a")
'\n'


Unfortunately, if you really need to support %uXXXX, you will probably have to roll your own decoder. Otherwise, it is likely to be far more preferable to simply UTF-8 encode your unicode and then % escape the resulting bytes.

A more complete example:

>>> u"Tanım"
u'Tan\u0131m'
>>> url = urllib.quote(u"Tanım".encode('utf8'))
>>> urllib.unquote(url).decode('utf8')
u'Tan\u0131m'

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  醉梦人生        
                
              
                            
                2020-11-29 01:00
              
            
            
                                                                       
This will do it if you absolutely have to have this (I really do agree with the cries of "non-standard"):

from urllib import unquote

def unquote_u(source):
    result = unquote(source)
    if '%u' in result:
        result = result.replace('%u','\\u').decode('unicode_escape')
    return result

print unquote_u('Tan%u0131m')

> Tanım

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...