I have a long list of domain names which I need to generate some reports on. The list contains some IDN domains, and although I know how to convert them in python on the com
you need to know in which encoding you file was saved. This would be something like 'utf-8' (which is NOT Unicode) or 'iso-8859-1' or 'cp1252' or alike.
Then you can do (assuming 'utf-8'):
infile = open(sys.argv[1])
for line in infile:
print line,
domain = line.strip().decode('utf-8')
print type(domain)
print "IDN:", domain.encode("idna")
print
Convert encoded strings to unicode with decode
. Convert unicode to string with encode
. If you try to encode something which is already encoded, python tries to decode first, with the default codec 'ascii' which fails for non-ASCII-values.
Your first example is fine, except that:
domain = unicode(line.strip())
you have to specify a particular encoding here: unicode(line.strip(), 'utf-8')
. Otherwise you get the default encoding which for safety is 7-bit ASCII, hence the error. Alternatively you can spell it line.strip().decode('utf-8')
as in knitti's example; there is no difference in behaviour between the two syntaxes.
However judging by the error “can't decode byte 0xfc”, I think you haven't actually saved your test
file as UTF-8. Presumably this is why the second example, that also looks OK in principle, fails.
Instead it's ISO-8859-1 or the very similar Windows code page 1252. If it's come from a text editor on a Western Windows box it will certainly be the latter; Linux machines use UTF-8 by default instead nowadays. Either make sure to save your file as UTF-8, or read the file using the encoding 'cp1252'
instead.