Question
I want to convert Unicode strings to iso-8859-15. These strings include the u"\u2019" character (RIGHT SINGLE QUOTATION MARK, see http://www.fileformat.info/info/unicode/char/2019/index.htm), which is not part of the iso-8859-15 character set.
In Python, how can I normalize the Unicode characters so that they can be encoded as iso-8859-15?
I have looked at the unicodedata module without success. I managed to do the job with
s.replace(u"\u2019", "'").encode('iso-8859-15')
but I would like to find a more general and cleaner way.
Thanks for your help
Answer 1:
Use the unicode version of the translate function, assuming s is a unicode string:
s.translate({ord(u"\u2019"):ord(u"'")})
The argument of the unicode version of translate is a dict mapping Unicode ordinals to Unicode ordinals. Add to this dict any other characters that cannot be encoded in your target encoding.
You can build the mapping table in a more readable form and create the mapping dict from it, for instance:
char_mappings = [(u"\u2019", u"'"),
(u"`", u"'")]
translate_mapping = {ord(k):ord(v) for k,v in char_mappings}
From the translate documentation:
For Unicode objects, the translate() method does not accept the optional deletechars argument. Instead, it returns a copy of the s where all characters have been mapped through the given translation table which must be a mapping of Unicode ordinals to Unicode ordinals, Unicode strings or None. Unmapped characters are left untouched. Characters mapped to None are deleted. Note, a more flexible approach is to create a custom character mapping codec using the codecs module (see encodings.cp1251 for an example).
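As the quoted documentation notes, the table values may also be Unicode strings or None, which the ord(v) comprehension above cannot express. The following is a minimal sketch of that variant; the sample string and the choice of replacements are illustrative assumptions, not part of the original answer.

```python
# Hypothetical translation table: keys are code points, values are
# replacement strings (possibly several characters long) or None.
translate_mapping = {
    ord(u"\u2019"): u"'",    # RIGHT SINGLE QUOTATION MARK -> apostrophe
    ord(u"\u2026"): u"...",  # HORIZONTAL ELLIPSIS -> three ASCII dots
    ord(u"\u200b"): None,    # ZERO WIDTH SPACE -> deleted
}

s = u"It\u2019s done\u2026\u200b"       # illustrative input
cleaned = s.translate(translate_mapping)
encoded = cleaned.encode("iso-8859-15")  # no unencodable characters remain
```

Mapping to strings is useful when a character has no single-character equivalent in the target encoding.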
Answer 2:
Unless you wish to create a translation rule (if you do, look at Boud's answer), you could choose one of the default error handlers that encode provides, or even register your own:
In [4]: u'\u2019 Hi'.encode('iso-8859-15', 'replace')
Out[4]: '? Hi'
In [5]: u'\u2019 Hi'.encode('iso-8859-15', 'ignore')
Out[5]: ' Hi'
In [6]: u'\u2019 Hi'.encode('iso-8859-15', 'xmlcharrefreplace')
Out[6]: '&#8217; Hi'
From the encode docstring:
S.encode([encoding[,errors]]) -> string or unicode
Encodes S using the codec registered for encoding. encoding defaults to the default encoding. errors may be given to set a different error handling scheme. Default is 'strict' meaning that encoding errors raise a UnicodeEncodeError. Other possible values are 'ignore', 'replace' and 'xmlcharrefreplace' as well as any other name registered with codecs.register_error that can handle UnicodeEncodeErrors.
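Registering your own handler with codecs.register_error could look roughly like the sketch below. The handler name "ascii_fallback" and the FALLBACKS table are hypothetical choices made for illustration. The handler receives the UnicodeEncodeError and returns the replacement text together with the position at which encoding should resume.

```python
import codecs

# Hypothetical table of substitutes for characters the codec rejects.
FALLBACKS = {
    u"\u2018": u"'",
    u"\u2019": u"'",
    u"\u201c": u'"',
    u"\u201d": u'"',
}

def ascii_fallback(error):
    # Only encoding errors are handled here; anything else is re-raised.
    if not isinstance(error, UnicodeEncodeError):
        raise error
    # The slice start:end covers the run of characters that failed to encode.
    bad = error.object[error.start:error.end]
    replacement = u"".join(FALLBACKS.get(ch, u"?") for ch in bad)
    # Return the substitute text and the index where encoding resumes.
    return replacement, error.end

codecs.register_error("ascii_fallback", ascii_fallback)

encoded = u"\u2019 Hi \u20ac \u2603".encode("iso-8859-15", "ascii_fallback")
```

Here the euro sign is encoded normally because iso-8859-15 contains it, the quotation mark is replaced via the table, and the snowman (U+2603), which is in neither, falls back to "?".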
Answer 3:
For info, my final solution:
iso885915_utf_map = {
u"\u2019": u"'",
u"\u2018": u"'",
u"\u201c": u'"',
u"\u201d": u'"',
}
utf_map = dict([(ord(k), ord(v)) for k,v in iso885915_utf_map.items()])
s.translate(utf_map).encode('iso-8859-15')
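The same idea can be wrapped in a small helper that also passes an error handler to encode, so that characters missing from the map do not raise. This is only a sketch: the function name to_iso885915 and the default of 'replace' are assumptions, not part of the solution above.

```python
# Same substitutions as in the map above, keyed by code point.
QUOTE_MAP = {
    ord(u"\u2019"): u"'",
    ord(u"\u2018"): u"'",
    ord(u"\u201c"): u'"',
    ord(u"\u201d"): u'"',
}

def to_iso885915(text, errors="replace"):
    """Translate known characters, then encode; unmapped characters
    are handled by the given error scheme ('replace' yields '?')."""
    return text.translate(QUOTE_MAP).encode("iso-8859-15", errors)

result = to_iso885915(u"\u201cHi\u201d \u2019 \u2603")
```

Passing errors="strict" instead restores the original behaviour of raising UnicodeEncodeError for anything the map does not cover.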
Thank you for your help
Source: https://stackoverflow.com/questions/10785231/how-to-normalize-unicode-encoding-for-iso-8859-15-conversion-in-python