Is there a comprehensive character replacement module for Python that finds all non-ASCII or non-Unicode characters in a string and replaces them with ASCII or Unicode equivalents? The complacency around the "ignore" argument during encoding or decoding is maddening, but so is a '?' in every place where an untranslated character used to be.
I'm looking for one module that finds irksome characters and conforms them to whatever standard is requested. I realize that the sheer number of existing alphabets and encodings makes this somewhat impossible, but surely someone has taken a stab at it? Even a rudimentary solution would be better than the status quo.
The simplification this would bring to data transfer is enormous.
I don't think what you want is really possible, but there is a decent option.
unicodedata has a normalize() function that can gracefully degrade text for you:
import unicodedata

def gracefully_degrade_to_ascii(text):
    # Decompose characters (NFKD), then drop anything outside ASCII.
    # On Python 3, encode() returns bytes, so decode back to get a str.
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')
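As a quick standalone illustration (a Python 3 sketch, with sample strings of my own choosing): NFKD decomposition splits accented letters into a base letter plus a combining mark, and the 'ignore' error handler then drops the marks during the ASCII encode. Compatibility decompositions also turn symbols like the trademark sign into their plain-letter forms.

```python
import unicodedata

text = 'café naïve façade ™'
# NFKD turns 'é' into 'e' + U+0301 (combining acute accent), and '™' into 'TM';
# encoding to ASCII with errors='ignore' then drops the leftover combining marks.
ascii_text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')
print(ascii_text)  # cafe naive facade TM
```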
Assuming the charset you're using is already mapped into Unicode, or at least can be mapped into Unicode, you should be able to degrade the Unicode version of that text down to ASCII or UTF-8 with this module (it's part of the standard library, too).
Full docs: http://docs.python.org/library/unicodedata.html
Looking at any individual character and guessing its encoding would be hard and probably not very accurate. However, you can use chardet to try to detect the encoding of an entire file. Then you can use the decode() and encode() methods to convert it to UTF-8.
http://pypi.python.org/pypi/chardet
And UTF-8 is backward compatible with ASCII, so that won't be a big deal.
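A minimal sketch of that workflow (assuming chardet is installed; the helper name `to_utf8` is mine, not part of any library):

```python
import chardet

def to_utf8(raw: bytes) -> bytes:
    # Guess the encoding of the whole byte string; detect() returns a dict
    # like {'encoding': 'ISO-8859-1', 'confidence': 0.73, ...}.
    guess = chardet.detect(raw)
    # Decode with the guessed encoding, then re-encode as UTF-8.
    return raw.decode(guess['encoding']).encode('utf-8')

# Pure-ASCII input passes through unchanged, since UTF-8 is a superset of ASCII.
converted = to_utf8(b'plain ascii text survives as-is')
```

For a file, you would read its bytes first (e.g. `open(path, 'rb').read()`) and pass them to the helper; detection is more reliable on larger samples than on short snippets.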
Source: https://stackoverflow.com/questions/12830047/comprehensive-character-replacement-module-in-python-for-non-unicode-and-non-asc