Comprehensive character replacement module in Python for non-Unicode and non-ASCII characters in HTML

Submitted by 别来无恙 on 2019-12-06 21:38:29

I don't think what you want is really possible, but I think there is a decent option.

The unicodedata module has a 'normalize' function that can gracefully degrade text for you:

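import unicodedata

def gracefully_degrade_to_ascii(text):
    # NFKD splits accented characters into a base letter plus combining marks;
    # encoding with 'ignore' then drops every code point that is not ASCII.
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')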

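Assuming the charset you're using is already mapped into Unicode, or at least can be mapped into Unicode, you should be able to degrade the Unicode version of that text down to ASCII with this module (it's part of the standard library too). The result is also valid UTF-8, since ASCII is a subset of UTF-8. Note that characters with no ASCII-compatible decomposition are dropped rather than replaced.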

Full Docs - http://docs.python.org/library/unicodedata.html
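To make the behaviour concrete, here is a minimal sketch of what the degradation does to a few inputs. The function names `degrade_to_ascii` and `to_html_ascii` are made up for this illustration. The second function is not part of the approach described here; it is an alternative that may suit HTML output better, using the standard 'xmlcharrefreplace' error handler to keep non-ASCII characters as numeric character references instead of dropping them.

```python
import unicodedata


def degrade_to_ascii(text: str) -> str:
    """Strip accents via NFKD, then drop whatever is still not ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def to_html_ascii(text: str) -> str:
    """Alternative for HTML: keep non-ASCII as numeric character references."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


print(degrade_to_ascii("Café crème"))  # Cafe creme
print(degrade_to_ascii("Łódź"))        # odz  (Ł has no decomposition, so it is lost)
print(to_html_ascii("Caf\u00e9"))      # Caf&#233;
```

The second call shows the main limitation: a letter such as Ł has no decomposition into an ASCII base letter, so 'ignore' silently removes it.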

Looking at any individual character and guessing its encoding would be hard and probably not very accurate. However, you can use chardet to try to detect the encoding of an entire file. Then you can decode() the raw bytes with the detected encoding and encode() the resulting text as UTF-8.

http://pypi.python.org/pypi/chardet

UTF-8 is backward compatible with ASCII, so any part of the text that is already plain ASCII comes through the conversion unchanged.
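The detect-then-convert approach might look roughly like the sketch below. The helper name `to_utf8` and its `fallback` parameter are assumptions made for this example. chardet is a third-party package, so the sketch falls back to a fixed encoding when it is not installed, when detection returns nothing, or when Python has no codec for the detected name. Detection is a statistical guess and can be wrong, especially on short inputs.

```python
try:
    import chardet  # third-party: pip install chardet
except ImportError:
    chardet = None


def to_utf8(raw: bytes, fallback: str = "utf-8") -> bytes:
    """Guess the encoding of raw bytes and re-encode them as UTF-8."""
    guess = None
    if chardet is not None:
        # detect() returns a dict; its 'encoding' entry may be None.
        guess = chardet.detect(raw)["encoding"]
    try:
        text = raw.decode(guess or fallback, errors="replace")
    except LookupError:
        # The detected name is not a codec Python knows about.
        text = raw.decode(fallback, errors="replace")
    return text.encode("utf-8")
```

A typical use would be to read a file in binary mode, pass the bytes through `to_utf8`, and write the result back out in binary mode.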
