Converting domain names to IDN in Python

别那么骄傲 2021-01-02 09:33

I have a long list of domain names which I need to generate some reports on. The list contains some IDN domains, and although I know how to convert them in python on the com

2 answers
  • 2021-01-02 10:14

    You need to know which encoding your file was saved in. This would be something like 'utf-8' (which is an encoding of Unicode, not Unicode itself), 'iso-8859-1', 'cp1252' or similar.

    Then you can do (assuming 'utf-8'):

    
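    import sys

    # Python 2: open() yields byte strings (str), one per line
    infile = open(sys.argv[1])
    
    for line in infile:
        print line,
        # bytes -> unicode, using the encoding the file was saved in
        domain = line.strip().decode('utf-8')
        print type(domain)
        # unicode -> ASCII-compatible (Punycode) form
        print "IDN:", domain.encode("idna")
        print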
    

    Use decode to convert encoded byte strings to unicode, and encode to convert unicode back to a byte string. If you call encode on something that is already encoded, Python 2 first tries to decode it implicitly with the default codec, 'ascii', which fails for non-ASCII values.
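The decode-then-encode round trip described above can be sketched for Python 3 as well, where file contents are bytes and text is str. This is only an illustrative sketch, not the answerer's code: the helper names to_idna and decode_line and the sample domain are assumptions made up for the example.

```python
def to_idna(domain: str) -> str:
    """Return the ASCII-compatible (Punycode) form of a Unicode domain name."""
    # str.encode("idna") yields ASCII bytes; decode them to get a printable str
    return domain.encode("idna").decode("ascii")


def decode_line(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode one raw line from the file and drop surrounding whitespace."""
    return raw.decode(encoding).strip()


# Hypothetical sample: "\u00fc" is the letter u with umlaut
raw_line = "b\u00fccher.example\n".encode("utf-8")  # bytes as read from a UTF-8 file
domain = decode_line(raw_line)
print("IDN:", to_idna(domain))
```

Plain ASCII names pass through to_idna unchanged, so the same function can be applied to every line of a mixed list.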

  • 2021-01-02 10:26

    Your first example is fine, except that:

    domain = unicode(line.strip())
    

    You have to specify a particular encoding here: unicode(line.strip(), 'utf-8'). Otherwise you get the default encoding, which for safety is 7-bit ASCII, hence the error. Alternatively, you can spell it line.strip().decode('utf-8') as in knitti's example; the two spellings behave identically.
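A rough Python 3 analogue of the point above, assuming that str(raw, 'utf-8') plays the role of Python 2's unicode(raw, 'utf-8'): both spellings give the same text, while decoding the same bytes as ASCII fails. The sample domain is made up for illustration.

```python
raw = "m\u00fcnchen.example".encode("utf-8")  # UTF-8 bytes, as read from a file

via_constructor = str(raw, "utf-8")  # Python 3 counterpart of unicode(raw, 'utf-8')
via_method = raw.decode("utf-8")     # method spelling, same result
same = via_constructor == via_method

try:
    raw.decode("ascii")              # the codec Python 2 falls back to by default
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

print(same, ascii_ok)
```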

    However, judging by the error “can't decode byte 0xfc”, I think you haven't actually saved your test file as UTF-8. Presumably this is why the second example, which also looks OK in principle, fails.

    Instead, it's probably ISO-8859-1 or the very similar Windows code page 1252. If it came from a text editor on a Western Windows box, it will almost certainly be the latter; Linux machines default to UTF-8 nowadays. Either save your file as UTF-8, or read it using the encoding 'cp1252' instead.
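If you cannot be sure how the file was saved, one possible approach is to try UTF-8 first and fall back to cp1252. The Python 3 sketch below assumes that order is acceptable (text containing bytes such as 0xfc is not valid UTF-8, so the first attempt fails cleanly); read_domains and the temporary demo file are hypothetical names invented for the example.

```python
import os
import tempfile


def read_domains(path, encodings=("utf-8", "cp1252")):
    """Decode the whole file with the first encoding that works."""
    with open(path, "rb") as handle:
        raw = handle.read()
    for encoding in encodings:
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # wrong guess, try the next candidate
        return [line.strip() for line in text.splitlines() if line.strip()]
    raise ValueError("none of %r could decode %s" % (encodings, path))


# Demo with a throwaway file saved as cp1252, where u-umlaut is the byte 0xfc
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write("b\u00fccher.example\nexample.com\n".encode("cp1252"))
domains = read_domains(tmp.name)
os.remove(tmp.name)

idn_names = [d.encode("idna").decode("ascii") for d in domains]
print(idn_names)
```

This is a heuristic, not real encoding detection: a file in some other 8-bit encoding would still decode as cp1252 without error and give wrong characters.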
