UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)

余生分开走 2020-11-21 04:43

I'm having problems dealing with unicode characters in text fetched from web pages on different sites. I am using BeautifulSoup.

The problem is that

29 Answers
  • 2020-11-21 04:43

    In the general case of writing a string that triggers this error (say, data_that_causes_this_error) to a file (e.g. results.txt), encoding it as UTF-8 before writing works in Python 2:

    f = open("results.txt", "w")
    f.write(data_that_causes_this_error.encode('utf-8'))
    f.close()
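
On Python 3 (and on Python 2 through io.open), an alternative is to let the file object do the encoding. This is a minimal sketch, not the answerer's code: the sample string and the temporary path are assumptions made for illustration.

```python
import io
import os
import tempfile

# Hypothetical sample text containing a non-breaking space (U+00A0).
data = u"price:\u00a0100"

# Hypothetical output location; a temp directory keeps the sketch self-contained.
path = os.path.join(tempfile.mkdtemp(), "results.txt")

# io.open accepts an encoding argument on both Python 2 and 3;
# the file object encodes unicode text to UTF-8 on write.
with io.open(path, "w", encoding="utf-8") as f:
    f.write(data)

# Reading back with the same encoding returns the original text.
with io.open(path, "r", encoding="utf-8") as f:
    round_tripped = f.read()
```

Declaring the encoding once at open time avoids scattering .encode calls through the code that produces the text.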
    
  • 2020-11-21 04:45

    The solution below worked for me. I just added the u prefix, as in

    u"String"

    (which marks the literal as unicode), to my string.

    result_html = result.to_html(col_space=1, index=False, justify='right')
    
    text = u"""
    <html>
    <body>
    <p>
    Hello all, <br>
    <br>
    Here's weekly summary report.  Let me know if you have any questions. <br>
    <br>
    Data Summary <br>
    <br>
    <br>
    {0}
    </p>
    <p>Thanks,</p>
    <p>Data Team</p>
    </body></html>
    """.format(result_html)
    
  • 2020-11-21 04:46

    Here's a rehashing of some other so-called "cop out" answers. There are situations in which simply throwing away the troublesome characters/strings is a good solution, despite the protests voiced here.

    def safeStr(obj):
        try: return str(obj)
        except UnicodeEncodeError:
            return obj.encode('ascii', 'ignore').decode('ascii')
        except: return ""
    

    Testing it:

    if __name__ == '__main__': 
        print safeStr( 1 ) 
        print safeStr( "test" ) 
        print u'98\xb0'
        print safeStr( u'98\xb0' )
    

    Results:

    1
    test
    98°
    98
    

    UPDATE: My original answer was written for Python 2. For Python 3:

    def safeStr(obj):
        try: return str(obj).encode('ascii', 'ignore').decode('ascii')
        except: return ""
    

    Note: if you'd prefer to leave a ? indicator where the "unsafe" unicode characters are, specify replace instead of ignore in the call to encode for the error handler.

    Suggestion: you might want to name this function toAscii instead? That's a matter of preference...

    Finally, here's a more robust PY2/3 version using six, where I opted to use replace, and peppered in some character swaps to replace fancy unicode quotes and apostrophes which curl left or right with the simple vertical ones that are part of the ascii set. You might expand on such swaps yourself:

    from six import PY2, iteritems 
    
    CHAR_SWAP = { u'\u201c': u'"'
                , u'\u201D': u'"' 
                , u'\u2018': u"'" 
                , u'\u2019': u"'" 
    }
    
    def toAscii( text ) :    
        try:
            for k,v in iteritems( CHAR_SWAP ): 
                text = text.replace(k,v)
        except: pass     
        try: return str( text ) if PY2 else text.encode('ascii', 'replace').decode('ascii')
        except UnicodeEncodeError:
            return text.encode('ascii', 'replace').decode('ascii')
        except: return ""
    
    if __name__ == '__main__':     
        print( toAscii( u'testin\u2019' ) )
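
For Python 3 only, the same idea (swap known characters first, then replace whatever is left) can be sketched with str.translate. This is an illustrative variant, not the code above: the name to_ascii and the contents of the table are assumptions.

```python
# Hypothetical swap table: curly quotes mapped to their ASCII equivalents.
CHAR_SWAP = str.maketrans({
    "\u201c": '"',
    "\u201d": '"',
    "\u2018": "'",
    "\u2019": "'",
})


def to_ascii(text):
    """Return an ASCII-only version of text; unmapped characters become '?'."""
    swapped = str(text).translate(CHAR_SWAP)
    return swapped.encode("ascii", "replace").decode("ascii")


print(to_ascii("testin\u2019"))
print(to_ascii("98\u00b0"))
```

Doing the swaps in one translate call keeps the table as the single place to extend when more characters need a readable ASCII stand-in.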
    
  • 2020-11-21 04:48

    I've actually found that in most of my cases, just stripping out those characters is much simpler:

    # mystring is a Python 2 byte string here; for a unicode string use .encode('ascii', 'ignore')
    s = mystring.decode('ascii', 'ignore')
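
Stripping is only one of the standard error handlers that encode accepts. As a rough Python 3 comparison, assuming a made-up sample string, the sketch below shows what four of them do to the same input:

```python
# Hypothetical sample containing e-acute (U+00E9) and a non-breaking space (U+00A0).
sample = "caf\u00e9\u00a0au lait"

# Error handlers built into Python's codec machinery.
handlers = ["ignore", "replace", "xmlcharrefreplace", "backslashreplace"]

# Encode to ASCII with each handler, then decode back to str for display.
results = {h: sample.encode("ascii", h).decode("ascii") for h in handlers}

for name in handlers:
    print(name, "->", results[name])
```

Note that 'ignore' also drops the non-breaking space, so the two words run together; 'replace' or 'xmlcharrefreplace' may be safer when the text is later shown to people.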
    
  • 2020-11-21 04:49

    This is a classic Python unicode pain point! Consider the following:

    a = u'bats\u00E0'
    print a
     => batsà
    

    All good so far, but if we call str(a), let's see what happens:

    str(a)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 4: ordinal not in range(128)
    

    Oh dip, that's not gonna do anyone any good! To fix the error, encode the unicode string to bytes explicitly with .encode and tell Python which codec to use:

    a.encode('utf-8')
     => 'bats\xc3\xa0'
    print a.encode('utf-8')
     => batsà
    

    Voil\u00E0!

    The issue is that when you call str(), Python 2 uses the default character encoding (ASCII) to try to encode the unicode string you gave it, and that fails as soon as the string contains a non-ASCII character. To fix the problem, tell Python how to encode the string by using .encode('whatever_unicode'). Most of the time, you should be fine using utf-8.

    For an excellent exposition on this topic, see Ned Batchelder's PyCon talk here: http://nedbatchelder.com/text/unipain.html
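
The same distinction carries over to Python 3, where str is always unicode and encoding produces bytes. A minimal sketch, reusing the 'bats' string from above (the variable names are illustrative):

```python
text = "bats\u00e0"  # unicode text: 'bats' followed by a-grave

# Encoding with an explicit codec turns text into bytes.
utf8_bytes = text.encode("utf-8")

# ASCII cannot represent U+00E0, so this is the same failure as in the question.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Decoding with the same codec recovers the original text.
restored = utf8_bytes.decode("utf-8")
```

Keeping text as unicode inside the program and encoding only at the boundaries (files, sockets, terminals) is the approach the linked talk recommends.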

  • 2020-11-21 04:49

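    This will work on Python 3. NFD normalization splits à into a plain a plus a combining accent, and 'ignore' then drops only the accent:

     >>> import re, unicodedata
     >>> print(unicodedata.normalize('NFD', re.sub(r"[\(\[].*?[\)\]]", "", u"bats\u00e0")).encode('ascii', 'ignore'))
    

    Output:

    b'batsa'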
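
If the goal is only to remove accents while keeping other non-ASCII characters, one possible Python 3 variation is to filter out combining marks after normalization instead of encoding to ASCII. The helper name strip_accents is an assumption made for this sketch.

```python
import unicodedata


def strip_accents(text):
    """Drop combining marks (accents) but keep every other character."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("bats\u00e0"))
print(strip_accents("98\u00b0"))
```

Unlike encode('ascii', 'ignore'), this leaves symbols such as the degree sign untouched, so it does not by itself guarantee ASCII output.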
    