Converting domain names to idn in python

前端未结

关注

 2  1375

I have a long list of domain names which I need to generate some reports on. The list contains some IDN domains, and although I know how to convert them in python on the com

相关标签:

2条回答

太阳男子

2021-01-02 10:14
you need to know in which encoding you file was saved. This would be something like 'utf-8' (which is NOT Unicode) or 'iso-8859-1' or 'cp1252' or alike.

Then you can do (assuming 'utf-8'):
```
infile = open(sys.argv[1])

for line in infile:
    print line,
    domain = line.strip().decode('utf-8')
    print type(domain)
    print "IDN:", domain.encode("idna")
    print
```
Convert encoded strings to unicode with decode. Convert unicode to string with encode. If you try to encode something which is already encoded, python tries to decode first, with the default codec 'ascii' which fails for non-ASCII-values.
0 讨论(0)
发布评论:

提交评论
- 加载中...
轮回少年

2021-01-02 10:26
Your first example is fine, except that:
```
domain = unicode(line.strip())
```
you have to specify a particular encoding here: unicode(line.strip(), 'utf-8'). Otherwise you get the default encoding which for safety is 7-bit ASCII, hence the error. Alternatively you can spell it line.strip().decode('utf-8') as in knitti's example; there is no difference in behaviour between the two syntaxes.

However judging by the error “can't decode byte 0xfc”, I think you haven't actually saved your test file as UTF-8. Presumably this is why the second example, that also looks OK in principle, fails.

Instead it's ISO-8859-1 or the very similar Windows code page 1252. If it's come from a text editor on a Western Windows box it will certainly be the latter; Linux machines use UTF-8 by default instead nowadays. Either make sure to save your file as UTF-8, or read the file using the encoding 'cp1252' instead.
0 讨论(0)
发布评论:

提交评论
- 加载中...