Is there a way I can add an alias to Python for an encoding? There are sites on the web that use the encoding 'windows-1251' but have their charset set to win-1251, s
Encoding aliases can be added by editing the aliases.py file.
    # euc_jp codec
    'eucjp' : 'euc_jp',
    'ujis' : 'euc_jp',
    'u_jis' : 'euc_jp',
    'euc_jp_linux' : 'euc_jp',
    'euc-jp-linux' : 'euc_jp',
Above I have added two aliases, euc_jp_linux and euc-jp-linux, for the encoding euc_jp.
On a 64-bit Linux system, the aliases.py file is generally located under /usr/lib64/python2.6/encodings/
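If you are not sure where aliases.py lives on your machine, you can ask Python instead of guessing the path. A minimal sketch, assuming Python 3; the helper name aliases_file is made up for illustration and is not part of the standard library. It also shows that, as far as I can tell, names are normalized (hyphens become underscores) before the aliases table is consulted, so a hyphenated key is probably redundant next to its underscore spelling.

```python
import encodings
import encodings.aliases


def aliases_file():
    """Return the path of the aliases module, or None if it has no file."""
    # Hypothetical helper: __file__ may be absent on some unusual builds.
    return getattr(encodings.aliases, '__file__', None)


# The encodings package normalizes a name before looking it up in the
# aliases table; punctuation such as '-' is turned into '_'.
normalized = encodings.normalize_encoding('euc-jp-linux')

print(aliases_file())
print(normalized)
```

Keep in mind that edits to that file are lost whenever the Python installation is upgraded or replaced.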
The encodings module is not well documented, so I'd instead use codecs, which is:
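    import codecs

    def encalias(oldname, newname):
        # Look up the existing codec and clone it under the new name.
        old = codecs.lookup(oldname)
        new = codecs.CodecInfo(old.encode, old.decode,
                               streamreader=old.streamreader,
                               streamwriter=old.streamwriter,
                               incrementalencoder=old.incrementalencoder,
                               incrementaldecoder=old.incrementaldecoder,
                               name=newname)
        # Search function: codecs.lookup() calls it with the requested name
        # in lowercase, so newname should be given in lowercase too.
        def searcher(aname):
            if aname == newname:
                return new
            else:
                return None
        codecs.register(searcher)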
This is Python 2.6 -- the interface is different in earlier versions.
If you don't mind relying on a specific version's undocumented internals, @Lennart's aliasing approach is OK, too, of course - and indeed simpler than this;-). But I suspect (as he appears to) that this one is more maintainable.
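For what it's worth, here is roughly how the same search-function idea might look on Python 3. This is a sketch under assumptions: register_alias and _normalize are hypothetical names, and it returns the target's own CodecInfo instead of cloning it under a new name. The name handed to a search function is normalized differently across versions (Python 3.9 and later also turn hyphens into underscores), so the sketch normalizes both sides before comparing.

```python
import codecs


def _normalize(name):
    # Hypothetical helper: fold case and treat '-', ' ' and '_' alike.
    return name.lower().replace('-', '_').replace(' ', '_')


def register_alias(alias, target):
    """Make `alias` resolve to the already-known codec `target`."""
    info = codecs.lookup(target)  # raises LookupError if target is unknown
    wanted = _normalize(alias)

    def searcher(name):
        # Called by the codec registry for names no earlier searcher resolved.
        if _normalize(name) == wanted:
            return info
        return None

    codecs.register(searcher)
    return searcher


register_alias('win-1251', 'cp1251')
text = b'\xcc\xce\xd1K\xc2\xc0'.decode('win-1251')
print(text)
```

Returning the searcher lets the caller keep a reference to it, which is handy if you later want to remove it again with codecs.unregister (available from Python 3.10).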
>>> import encodings
>>> encodings.aliases.aliases['win_1251'] = 'cp1251'
>>> print '\xcc\xce\xd1K\xc2\xc0'.decode('win-1251')
MOCKBA
I personally would consider this monkey-patching, though, and would use my own conversion table instead. But I can't give any good arguments for that position. :)
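A conversion table of that kind could be as small as the sketch below, assuming Python 3. CHARSET_FIXES, canonical_charset and decode_page are hypothetical names chosen for illustration, and the table has only the one entry from the question; the mislabeled charset is translated before it ever reaches the codec machinery, so nothing in Python's registry is touched.

```python
# Hypothetical table: charset labels seen in the wild -> names Python knows.
CHARSET_FIXES = {
    'win-1251': 'windows-1251',
}


def canonical_charset(label):
    """Map a possibly nonstandard charset label to a usable codec name."""
    key = label.strip().lower()
    # Unknown labels pass through unchanged and are left for Python to judge.
    return CHARSET_FIXES.get(key, key)


def decode_page(raw, label):
    """Decode raw bytes using the corrected charset label."""
    return raw.decode(canonical_charset(label))


page = decode_page(b'\xcc\xce\xd1K\xc2\xc0', 'win-1251')
print(page)
```

The trade-off is that only code paths that go through decode_page benefit, whereas the alias approaches change what every .decode() call in the process accepts.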