Python to show special characters

北荒 2020-12-18 07:28

I know there are tons of threads regarding this issue but I have not managed to find one which solves my problem.

I am trying to print a string but when printed it d

3 answers
  • 2020-12-18 07:38

    The contents of the strings are not Unicode code points; they are UTF-8 encoded bytes.

    >>> print u'Von D\xc3\xbc'
    Von DÃ¼
    >>> print 'Von D\xc3\xbc'
    Von Dü
    
    >>> print unicode('Von D\xc3\xbc', 'utf-8')
    Von Dü
    >>> 
    

    Edit:

    >>> print '\xc3\x96berg' # no unicode identifier, works as expected because it's a UTF-8 encoded byte string
    Öberg
    >>> print u'\xc3\x96berg' # has unicode identifier, so \xc3 and \x96 are read as the code points U+00C3 and U+0096 rather than as UTF-8 bytes, outputs weird stuff
    Ãberg
    
    # Look at the differing object types:
    >>> type('\xc3\x96berg')
    <type 'str'>
    >>> type(u'\xc3\x96berg')
    <type 'unicode'>
    
    >>> '\xc3\x96berg'.decode('utf-8') # this command converts from UTF-8 to unicode, look at the unicode identifier in the output
    u'\xd6berg'
    >>> unicode('\xc3\x96berg', 'utf-8') # this does the same thing
    u'\xd6berg'
    >>> unicode(u'foo bar', 'utf-8') # trying to convert a unicode string to unicode will fail as expected
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: decoding Unicode is not supported
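
    For readers on Python 3, where every str is Unicode and byte strings need a b prefix, the same distinction shows up as bytes versus str. This is a minimal sketch assuming Python 3 (the session above is Python 2); the variable names are illustrative only.

```python
# Python 3 sketch: bytes literals hold UTF-8 bytes, str literals hold code points.
raw = b'\xc3\x96berg'           # six bytes: the UTF-8 encoding of U+00D6 + 'berg'
wrong = '\xc3\x96berg'          # six code points: U+00C3, U+0096, 'b', 'e', 'r', 'g'

decoded = raw.decode('utf-8')   # bytes -> str; str(raw, 'utf-8') does the same

print(type(raw).__name__, len(raw))
print(type(decoded).__name__, len(decoded))
print(decoded == '\xd6berg')    # the correctly decoded text
print(decoded == wrong)         # the str literal is a different string

# Decoding something that is already text still fails, as in Python 2.
decode_error = None
try:
    str(decoded, 'utf-8')
except TypeError as exc:
    decode_error = exc
print(type(decode_error).__name__)
```

    The comparisons use escape sequences rather than printing the characters, so the result does not depend on the terminal's encoding.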
    
  • 2020-12-18 07:58

    Prevention is better than cure. What you need is to find out how that rubbish is being created. Please edit your question to show the code that creates it, and then we can help you fix it. It looks like somebody has done:

    your_unicode_string =  original_utf8_encoded_bytestring.decode('latin1')
    

    The cure is simply to reverse the process: encode back to latin1 to recover the original bytes, then decode them as UTF-8.

    correct_unicode_string = your_unicode_string.encode('latin1').decode('utf8')
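
    The whole round trip can be reproduced end to end. This is a minimal sketch assuming Python 3, with 'Öberg' from the other answers as sample data; fix_mojibake is a hypothetical helper name, not a library function.

```python
def fix_mojibake(text):
    """Undo UTF-8 bytes that were wrongly decoded as latin1.

    Returns the input unchanged when it cannot be such a mis-decoding.
    """
    try:
        return text.encode('latin1').decode('utf8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text

# Reproduce the damage: UTF-8 bytes decoded with the wrong codec.
original_bytes = '\xd6berg'.encode('utf8')    # b'\xc3\x96berg'
garbled = original_bytes.decode('latin1')     # '\xc3\x96berg'

# Reverse it.
repaired = fix_mojibake(garbled)

print(repaired == '\xd6berg')
print(fix_mojibake('\xd6berg') == '\xd6berg')  # already-correct text is left alone
```

    The try/except is a design choice for this sketch: text that was never garbled usually fails one of the two steps, so it is returned as-is instead of raising.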
    

    Update: Based on the code that you supplied, the probable cause is that the website declares that it is encoded in ISO-8859-1 (aka latin1) but in reality it is encoded in UTF-8. Please update your question to show us the URL.

    If you can't show it, read the BeautifulSoup docs; it looks like you'll need to use:

    BeautifulSoup(web, from_encoding='utf8')
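
    The same idea applies when handling the raw response bytes yourself: try the encoding you believe is real before trusting the declared one. This is a minimal sketch assuming Python 3 and only the standard library; decode_page and its parameters are hypothetical names for illustration.

```python
def decode_page(raw, declared='iso-8859-1'):
    """Decode page bytes, preferring strict UTF-8 over the declared charset."""
    try:
        return raw.decode('utf-8')    # strict: raises on bytes that are not valid UTF-8
    except UnicodeDecodeError:
        return raw.decode(declared)   # fall back to what the site claims

utf8_body = '\xd6berg'.encode('utf-8')          # really UTF-8, but labelled latin1
latin1_body = '\xd6berg'.encode('iso-8859-1')   # really latin1

print(decode_page(utf8_body) == '\xd6berg')
print(decode_page(latin1_body) == '\xd6berg')
```

    This is a heuristic, not a guarantee: it relies on latin1 text with accented characters rarely being valid UTF-8 by accident.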
    
  • 2020-12-18 07:58

    Unicode support in many languages is confusing, so your error here is understandable. Those strings are UTF-8 bytes, which would work properly if you drop the u at the front:

    >>> err = u'\xc3\x96berg'
    >>> print err
    Ã?berg
    >>> x = '\xc3\x96berg'
    >>> print x
    Öberg
    >>> u = x.decode('utf-8')
    >>> u
    u'\xd6berg'
    >>> print u
    Öberg
    

    For lots more information:

    http://www.joelonsoftware.com/articles/Unicode.html

    http://docs.python.org/howto/unicode.html


    You should really really read those links and understand what is going on before proceeding. If, however, you absolutely need to have something that works today, you can use this horrible hack that I am embarrassed to post publicly:

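    def convert_fake_unicode_to_real_unicode(string):
        # Python 2 only: turn each code point (0-255) back into a single byte, then decode the bytes as UTF-8.
        return ''.join(map(chr, map(ord, string))).decode('utf-8')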
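
    The hack works because ord() on each character of the mis-decoded string recovers the original byte value, which is the same thing encoding to latin1 does. This is a sketch of a Python 3 equivalent, assuming every character is below U+0100; the function name is kept from the snippet above, the rest is illustrative.

```python
def convert_fake_unicode_to_real_unicode(string):
    # bytes() over the code points rebuilds the original UTF-8 byte sequence;
    # it raises ValueError if any code point is above 255.
    return bytes(map(ord, string)).decode('utf-8')

fake = '\xc3\x96berg'   # UTF-8 bytes that were stored as code points

print(convert_fake_unicode_to_real_unicode(fake) == '\xd6berg')
print(bytes(map(ord, fake)) == fake.encode('latin1'))   # same bytes either way
```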
    