I'm having problems dealing with unicode characters from text fetched from different web pages (on different sites). I am using BeautifulSoup.
The problem is that
Here's a rehashing of some other so-called "cop out" answers. There are situations in which simply throwing away the troublesome characters/strings is a good solution, despite the protests voiced here.
    def safeStr(obj):
        try: return str(obj)
        except UnicodeEncodeError:
            return obj.encode('ascii', 'ignore').decode('ascii')
        except: return ""
Testing it:
    if __name__ == '__main__':
        print safeStr( 1 )
        print safeStr( "test" )
        print u'98\xb0'
        print safeStr( u'98\xb0' )
Results:
1
test
98°
98
UPDATE: My original answer was written for Python 2. For Python 3:
    def safeStr(obj):
        try: return str(obj).encode('ascii', 'ignore').decode('ascii')
        except: return ""
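The Python 2 test above can be repeated for the Python 3 version. This is a minimal sketch: it repeats the function so the snippet runs on its own, swaps the bare `except` for `except Exception` (my choice, not the original's), and collects results in a list named `results` purely for illustration.

```python
def safeStr(obj):
    # Convert to str first, then drop every character that is not ASCII.
    try:
        return str(obj).encode('ascii', 'ignore').decode('ascii')
    except Exception:
        return ""

# Same inputs as the Python 2 test: an int, a plain string, and "98" plus a degree sign.
samples = [1, "test", '98\xb0']
results = [safeStr(s) for s in samples]
print(results)  # ['1', 'test', '98']
```

As in the Python 2 run, the degree sign is silently dropped.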
Note: if you'd prefer to leave a `?` indicator where the "unsafe" unicode characters are, specify `replace` instead of `ignore` as the error handler in the call to `encode`.
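To make the difference between the error handlers concrete, here is a short Python 3 sketch run on the same `98°` string. The `backslashreplace` handler is not mentioned in this answer; it is included only as a further standard option, and the variable names are illustrative.

```python
text = '98\xb0'  # "98" followed by a degree sign

# 'ignore' drops characters that cannot be encoded.
dropped = text.encode('ascii', 'ignore').decode('ascii')
# 'replace' substitutes a question mark for each such character.
marked = text.encode('ascii', 'replace').decode('ascii')
# 'backslashreplace' keeps a readable escape sequence instead.
escaped = text.encode('ascii', 'backslashreplace').decode('ascii')

print(dropped)  # 98
print(marked)   # 98?
print(escaped)  # 98\xb0
```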
Suggestion: you might want to name this function `toAscii` instead? That's a matter of preference...
Finally, here's a more robust PY2/3 version using `six`, where I opted to use `replace`, and peppered in some character swaps to replace fancy unicode quotes and apostrophes which curl left or right with the simple vertical ones that are part of the ascii set. You might expand on such swaps yourself:
    from six import PY2, iteritems

    # Map curly quotes and apostrophes to their plain ASCII equivalents.
    CHAR_SWAP = { u'\u201c': u'"'
                , u'\u201D': u'"'
                , u'\u2018': u"'"
                , u'\u2019': u"'"
                }

    def toAscii( text ) :
        try:
            for k,v in iteritems( CHAR_SWAP ):
                text = text.replace(k,v)
        except: pass
        # bytes() takes the encoding before the error handler; passing only
        # 'replace' would be read as an encoding name and raise LookupError.
        try: return str( text ) if PY2 else bytes( text, 'ascii', 'replace' ).decode('ascii')
        except UnicodeEncodeError:
            return text.encode('ascii', 'replace').decode('ascii')
        except: return ""

    if __name__ == '__main__':
        print( toAscii( u'testin\u2019' ) )
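If you only need Python 3, the swap idea can be expressed without `six` by using `str.translate`. The following is a rough sketch rather than a drop-in replacement: the name `to_ascii`, the `SWAP_TABLE` variable, and the extra dash and ellipsis entries are my own illustrative additions to the swap table.

```python
# Hypothetical extension of the swap table; the dash and ellipsis entries
# are illustrative additions, not part of the original table.
CHAR_SWAP = {
    '\u201c': '"',    # left double quotation mark
    '\u201d': '"',    # right double quotation mark
    '\u2018': "'",    # left single quotation mark
    '\u2019': "'",    # right single quotation mark
    '\u2013': '-',    # en dash
    '\u2014': '-',    # em dash
    '\u2026': '...',  # horizontal ellipsis
}

# str.maketrans accepts single-character keys mapped to replacement strings.
SWAP_TABLE = str.maketrans(CHAR_SWAP)

def to_ascii(text):
    # Apply the swaps first, then mark anything still outside ASCII with '?'.
    swapped = str(text).translate(SWAP_TABLE)
    return swapped.encode('ascii', 'replace').decode('ascii')

print(to_ascii('testin\u2019'))                # testin'
print(to_ascii('\u201c98\xb0\u201d \u2026'))   # "98?" ...
```

Doing the swaps in a single `translate` call avoids looping over the table, and calling `str()` first means non-string inputs such as numbers are converted rather than discarded.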