How do I send email to addresses with non-ASCII characters in Python?

南方客 2021-01-07 09:16

Using the email and smtplib modules in Python 3.x, after a good amount of research, I can send emails with Unicode subjects, text bodies, and names

1 Answer
  • 2021-01-07 09:41

    are email addresses intended to become ASCII-only for the whole world?

    No; in fact, the exact opposite. Email addresses were ASCII-only. They're intended to become Unicode, and we're on the way there; it's just been a slow transition.


    In modern email, there are two parts to an email address:[1] the DNS hostname (the part after the @), and the mailbox on that host (the part before the @). They're governed by entirely different standards, because DNS has to work for HTTP and all kinds of other things besides just email.


    The core DNS specification dates back to 1987 (RFC 1035); it restricts hostnames to a subset of ASCII (letters, digits, and hyphens) and compares names case-insensitively.

    However, IDNA (Internationalized Domain Names for Applications), specified in RFC 5890, allows applications to optionally map a much larger part of the Unicode character set to DNS names for presentation to the user.

    So, you cannot have the domain name dómain.com. But you can have the domain name xn--dmain-0ta.com. And many applications will accept dómain.com from user input and translate it automatically, and accept xn--dmain-0ta.com from the network and display it as dómain.com.[2]

    In Python, some libraries for internet protocols will automatically IDNA-encode domain names for you; others will not. If they don't, you can do it manually, like this:

    >>> 'dómain.com'.encode('idna')
    b'xn--dmain-0ta.com'
    

    Notice that in 3.x, this is a bytes, not a str; if you need a str, you can always do this:

    >>> 'dómain.com'.encode('idna').decode('ascii')
    'xn--dmain-0ta.com'
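
    For a full address, only the part after the @ should go through IDNA. The following is a minimal sketch of that idea; the helper names domain_to_ascii and domain_to_unicode are made up for illustration and are not part of any library. Note that Python's built-in idna codec implements the older IDNA 2003 rules, so results may differ from an IDNA 2008 implementation for some names.

```python
def domain_to_ascii(address: str) -> str:
    """Return the address with only its domain part IDNA-encoded."""
    # Split on the LAST "@": a quoted local part may itself contain "@".
    local, at, domain = address.rpartition("@")
    if not at:
        raise ValueError(f"not an address: {address!r}")
    # The local part is left untouched; IDNA applies to DNS names only.
    return f"{local}@{domain.encode('idna').decode('ascii')}"


def domain_to_unicode(address: str) -> str:
    """Return the address with an xn-- domain decoded for display."""
    local, at, domain = address.rpartition("@")
    if not at:
        raise ValueError(f"not an address: {address!r}")
    return f"{local}@{domain.encode('ascii').decode('idna')}"


print(domain_to_ascii("shule@dómain.com"))         # shule@xn--dmain-0ta.com
print(domain_to_unicode("shule@xn--dmain-0ta.com"))  # shule@dómain.com
```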
    

    Mailbox names are defined by SMTP, most recently specified in RFC 5321 and RFC 5322, which make it clear that it's entirely up to the receiving host how to interpret the "local part" of an address. For example, most email servers treat names case-insensitively; many allow "plus-tagging" (so, e.g., shule@gmail.com and shule+so@gmail.com are the same mailbox); some (like Gmail) ignore all dots; etc.
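
    To make "up to the receiving host" concrete, here is a purely illustrative sketch of how a receiving server might map local parts to a mailbox. The function canonical_local_part and its flags are hypothetical; they model the behaviors listed above and are not any real server's rules. A sender generally cannot know which of them apply, so it should not rewrite other people's addresses this way.

```python
def canonical_local_part(local: str, *, fold_case: bool = True,
                         plus_tagging: bool = False,
                         ignore_dots: bool = False) -> str:
    """Map a local part to a mailbox name under host-specific rules."""
    if plus_tagging:
        # Drop everything from the first "+" onward (the "tag").
        local = local.split("+", 1)[0]
    if ignore_dots:
        local = local.replace(".", "")
    if fold_case:
        local = local.lower()
    return local


# Under plus-tagging rules, both names resolve to the same mailbox.
print(canonical_local_part("shule+so", plus_tagging=True))
print(canonical_local_part("Sh.Ule", ignore_dots=True))
```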

    The problem is that SMTP has never specified what character set is in use for the headers. Traditional SMTP servers were 7-bit ASCII only, so, practically, until recently, you could only use ASCII in the headers, and therefore in the mailbox names.

    EAI (Email Address Internationalization), as specified in RFC 6530 and related proposals, allows negotiating UTF-8 in SMTP sessions. In a UTF-8 session, the headers, and the addresses in those headers, are interpreted as UTF-8. (IDNA-encoding of the hostname is not required but still allowed.)
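
    On the Python side, the message itself also has to carry its headers as UTF-8 for such a session. A minimal sketch, assuming Python 3.5+ and the standard email package: build the message with the email.policy.SMTPUTF8 policy, which serializes headers as raw UTF-8. The function name build_utf8_message and the sample subject and body are placeholders.

```python
from email.message import EmailMessage
from email.policy import SMTPUTF8


def build_utf8_message(sender: str, recipient: str,
                       subject: str, body: str) -> EmailMessage:
    """Build a message whose headers serialize as raw UTF-8."""
    # With policy=SMTPUTF8, non-ASCII header text is written as UTF-8
    # when the message is serialized.
    msg = EmailMessage(policy=SMTPUTF8)
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


msg = build_utf8_message("shule@gmail.com", "úßerñame@dómain.com",
                         "Grüße", "Hello")
# Bytes suitable for a UTF-8 SMTP session.
wire_bytes = msg.as_bytes()
```

    Passing bytes matters because smtplib encodes a str message with the ASCII codec, which fails on non-ASCII headers.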

    That's great, but what if your client, your server, your recipient's server, or any relaying servers along the way don't speak SMTPUTF8? To handle that case, everyone who has a UTF-8 mailbox also has an ASCII name for that mailbox. Ideally that gets sent along with the message, and the last SMTPUTF8 program on the chain switches to the ASCII substitute when it meets the first non-SMTPUTF8 program. More commonly, it just gets an error message and propagates it back to the user to deal with manually.[3]

    The idea is that eventually, most hosts on the internet will speak SMTPUTF8, so you can be úßerñame@dómain.com—but meanwhile, your server on dómain.com has úßerñame and ussernyame as aliases to the same mailbox. Anyone who can't handle SMTPUTF8 will see you (and have to refer to you) as ussernyame. (Their mail client will, in fact, see you as ussernyame@xn--dmain-0ta.com, but it can fix that last part; there's nothing it can do about the first part if it was lost in transport.)

    As of mid-2018, most hosts don't speak SMTPUTF8, and neither do many client libraries.

    As of Python 3.5,[4] the standard library's smtplib supports SMTPUTF8. If you're using the high-level sendmail method, the documentation says:

    If SMTPUTF8 is included in mail_options, and the server supports it, from_addr and to_addrs may contain non-ASCII characters.

    So, what you do is something like this:

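    from smtplib import SMTPNotSupportedError

    try:
        # from_addr is a single string; to_addrs is a list of strings.
        server.sendmail(fromaddr, [toaddr], msg, mail_options=['SMTPUTF8'])
    except SMTPNotSupportedError:
        # The server did not advertise SMTPUTF8: retry with the ASCII aliases.
        server.sendmail(fromaddr_ascii, [toaddr_ascii], msg)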
    

    (In theory it's better to check the EHLO response with has_extn, but in practice, just trying it seems to work more smoothly. That may change with future improvements in the server ecosystem and/or smtplib.)
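
    For completeness, the has_extn variant might look roughly like the sketch below. The helper choose_envelope and its argument shapes are assumptions made for illustration; only ehlo_or_helo_if_needed and has_extn are actual smtplib.SMTP methods.

```python
import smtplib


def choose_envelope(server: smtplib.SMTP, utf8_pair, ascii_pair):
    """Return (from_addr, to_addr, mail_options) based on the EHLO response.

    utf8_pair and ascii_pair are (from_addr, to_addr) tuples.
    """
    # has_extn() only has data once a greeting has been exchanged;
    # this sends EHLO (or HELO) if that has not happened yet.
    server.ehlo_or_helo_if_needed()
    if server.has_extn("smtputf8"):
        return (*utf8_pair, ["SMTPUTF8"])
    # Server did not advertise SMTPUTF8: use the ASCII aliases instead.
    return (*ascii_pair, [])


# Usage (needs a connected server, so it is left commented out):
# from_addr, to_addr, opts = choose_envelope(
#     server, (fromaddr, toaddr), (fromaddr_ascii, toaddr_ascii))
# server.sendmail(from_addr, [to_addr], msg, mail_options=opts)
```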

    Where do you get that fromaddr_ascii and toaddr_ascii? That's up to your program. The DNS part, you just use IDNA, but for the mailbox part, there is no such rule; you have to know the mailbox's alternate ASCII mailbox name. Maybe you ask the user. Maybe you have a database that stores contacts with both EAI and traditional addresses. Maybe you're only worried about one specific domain and you know that it uses some rule that you can implement.
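
    As one illustration of the "database of contacts" option, here is a minimal sketch. The ASCII_ALIASES table and the ascii_fallback helper are hypothetical; in a real program the aliases would come from the user or from your own contact store. The domain is handled by IDNA, while the mailbox name has to be looked up.

```python
# Hypothetical contact store: UTF-8 address -> ASCII mailbox alias.
ASCII_ALIASES = {
    "úßerñame@dómain.com": "ussernyame",
}


def ascii_fallback(address: str, aliases=ASCII_ALIASES) -> str:
    """Return an all-ASCII form of address, or raise LookupError."""
    local, at, domain = address.rpartition("@")
    if not at:
        raise ValueError(f"not an address: {address!r}")
    try:
        local.encode("ascii")
    except UnicodeEncodeError:
        # No rule can derive this; the alias must already be known.
        if address not in aliases:
            raise LookupError(f"no ASCII alias known for {address!r}")
        local = aliases[address]
    # The DNS part can always be converted mechanically with IDNA.
    return f"{local}@{domain.encode('idna').decode('ascii')}"


print(ascii_fallback("úßerñame@dómain.com"))  # ussernyame@xn--dmain-0ta.com
```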


    1. Actually, there are two parts to an addr-spec; an address is an addr-spec plus optional display name and comment. But never mind that.

    2. There are a few exceptions. For example, if you type http://staсkoverflow.com, your browser might warn you that the Cyrillic lowercase Es in place of a Latin lowercase Cee might be a hijacking attempt. Or, if you try to navigate to http://dómain.com, the error page telling you that the domain doesn't exist will probably show you xn--dmain-0ta.com, because that's more useful for debugging.

    3. This is one of those things that will hopefully get better over time, but may well not get good enough until after it doesn't matter anymore anyway…

    4. What if you're using Python 3.4 or 2.7? Then you don't have SMTPUTF8 support. Upgrade, go find a third-party library instead of smtplib, or write your own SMTP code.
