preg_match verification of non-English email addresses (international domain names)

Question

We all know email address verification is a touchy subject. There are so many opinions on the best way to deal with it without coding for the entire RFC. But since 2009 it's become even more difficult, and I haven't really seen anyone address the issue of IDNs yet.

Here is what I've been using:

preg_match('/^[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,6}\z/i', $email) // $email: the address being checked

This works for most email addresses, but what if I need to match a non-Latin email address, e.g. bob@china.中國 or bob@russia.рф?

Look here for the complete list. (Notice all the non-Latin domain extensions at the bottom of the list.)

Information on this subject can be found here, and I think what they are saying is that these new characters will simply be read as '.xn--fiqz9s' and '.xn--p1ai' at the machine level, but I'm not 100% sure.

If so, does that mean the only change I need to consider making in my code is the following? (This is to allow for long domain extensions like .travelersinsurance and .sandvikcoromant.)

preg_match('/^[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,20}\z/i', $email) // $email: the address being checked

NOTICE: This is not related to the discussion found on the page "Using a regular expression to validate an email address".

Answer 1:

Consider: Every time you make up your own new regex without validating addresses according to the complete RFC spec, you're just making the situation for using "exotic" email addresses on the web worse. You're inventing some new ad hoc subset or superset of the official RFC spec. That means you will have false positives, false negatives, or both: you will either prevent people from using their actual addresses because your regex doesn't account for them correctly, or you will accept addresses which are actually invalid.

Add to that the fact that even if the address is syntactically valid, that still doesn't mean the address a) actually (still) exists, b) belongs to that user or c) can actually receive email. In the grand scheme of things, validating the syntax is an extremely minor concern.

If you're going to validate the syntax at all, either do a very rough general check which is sure to not reject any valid addresses (e.g. /.+@.+/), or validate according to all RFC rules; don't do some in-between half-assed sort-of-strict-but-not-really validation you just came up with.
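The "very rough general check" could look something like the following minimal sketch. The function name looks_like_email is hypothetical, not something from the answer; the pattern is the one quoted above.

```php
<?php
// Hypothetical helper: a deliberately permissive check that only asks for
// "something, an @, then something", so no valid address should be rejected.
function looks_like_email(string $input): bool
{
    // preg_match returns 1 on a match, 0 on no match, false on error.
    return preg_match('/.+@.+/', $input) === 1;
}

var_dump(looks_like_email('bob@china.中國')); // non-Latin domains pass too
var_dump(looks_like_email('bob'));           // no "@", so this is rejected
```

Anything stricter than this is left to the mail system itself, which is the only place an address can really be proven to work.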

Answer 2:

I'm gonna stick with the tried and true suggestion that you should send them a verification email. No need for a fancy regex that will need to be updated time and time again. Just assume they know their email address and let them enter it.

That's what I've always done when this situation comes up. If anything I would make them enter their email twice. It'll free you up to spend more time on the important parts of your site/project.
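As a rough illustration of the verification-email approach, one way to start is to generate a one-time token and put it in a link that gets mailed to the address. This is only a sketch: both function names and the example.com URL are placeholders I made up, and storing the token and actually sending the mail are left out.

```php
<?php
// Hypothetical helper: create a random one-time token for a confirmation link.
function make_verification_token(): string
{
    // 16 random bytes, hex-encoded to 32 characters.
    return bin2hex(random_bytes(16));
}

// Hypothetical helper: build the link the user has to click.
// The base URL is a placeholder, not a real endpoint.
function make_verification_link(string $token): string
{
    return 'https://example.com/verify?token=' . urlencode($token);
}

$token = make_verification_token();
echo make_verification_link($token), "\n";
```

The address counts as verified only once the link comes back with a token that matches the stored one, which proves the mailbox exists and belongs to the user.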

Answer 3:

Here is what I eventually came up with.

preg_match('/^[\pL\pM*+\pN._%+-]+@[\pL\pM*+\pN.-]+\.[\pL\pM*+]{2,20}\z/u', $email) // $email: the address being checked

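This uses Unicode property escapes like \pL, \pM*+ and \pN to help me deal with letters and numbers from any language.

\pL Any kind of letter from any language, upper or lower case.

\pM*+ Matches zero or more code points that are combining marks, i.e. characters intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.). Note that inside a bracketed character class, as used above, * and + are literal characters rather than quantifiers, so there \pM alone matches the marks and the class also accepts a literal * or +.

\pN Any kind of numeric character in any script.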

The expression above will work perfectly for normal email addresses like me@mydomain.com and cacophonous email addresses like a.s中3_yÄhমহাজোটেরoo文%网+d-fελληνικά@πyÄhooαράδειγμα.δοκιμή.
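A quick way to try the Unicode pattern on a few addresses is sketched below. The function name matches_unicode_email is hypothetical; the pattern is copied from above with quotes and delimiters added, and the source file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical helper wrapping the Unicode-aware pattern from this answer.
// The /u modifier makes PCRE treat pattern and subject as UTF-8.
function matches_unicode_email(string $email): bool
{
    $pattern = '/^[\pL\pM*+\pN._%+-]+@[\pL\pM*+\pN.-]+\.[\pL\pM*+]{2,20}\z/u';
    return preg_match($pattern, $email) === 1;
}

foreach (['me@mydomain.com', 'bob@china.中國', 'bob@russia.рф', 'no-at-sign.com'] as $address) {
    echo $address, ' => ', matches_unicode_email($address) ? 'match' : 'no match', "\n";
}
```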

It's not that I don't trust people to be able to type in their own email addresses, but people do make mistakes, and I may use this code in other situations. For example, I may need to double-check the integrity of an existing list of 10,000 email addresses. Besides, I was always taught to NOT trust user input and to ALWAYS filter.

UPDATE

I just discovered that although this works perfectly when tested on sites like phpliveregex.com, and locally when parsing a normal string of UTF-8 content, it doesn't work properly with email fields, because browsers convert the content of fields of that type to plain Latin (Punycode). So an email address like bob@china.中國 or bob@russia.рф gets converted to bob@china.xn--fiqz9s or bob@russia.xn--p1ai before it is received by the server. The only thing I was really missing from my original filter was allowing hyphens and digits in the domain extension.
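To see why the ASCII-only patterns from the question fall short for the converted form, here is a minimal sketch that runs both of them against the converted addresses. The constant names and the matches_pattern helper are hypothetical; the patterns are the question's, with quotes and delimiters added.

```php
<?php
// Patterns quoted from the question.
const TLD_2_6  = '/^[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,6}\z/i';
const TLD_2_20 = '/^[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,20}\z/i';

// Hypothetical helper: true when $address matches $pattern.
function matches_pattern(string $pattern, string $address): bool
{
    return preg_match($pattern, $address) === 1;
}

foreach (['bob@example.com', 'bob@china.xn--fiqz9s', 'bob@russia.xn--p1ai'] as $address) {
    printf(
        "%-22s {2,6}: %-8s {2,20}: %s\n",
        $address,
        matches_pattern(TLD_2_6, $address) ? 'match' : 'no match',
        matches_pattern(TLD_2_20, $address) ? 'match' : 'no match'
    );
}
```

Widening the length limit alone does not help, because [a-z] cannot match the hyphens and digits in extensions such as xn--p1ai.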

Here is the final version:

preg_match('/^[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z0-9-]{2,20}\z/i', $email); // $email: the address being checked
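For input that does not come through a browser email field (such as an existing list of addresses), the same conversion would have to be done on the server before applying an ASCII-only pattern. The sketch below assumes PHP's intl extension is available for idn_to_ascii; the constant and function names are hypothetical. The pattern follows the final version, with the hyphen placed last in each bracket so it is taken literally rather than as a range.

```php
<?php
// ASCII-only pattern modelled on the final version above.
const ASCII_EMAIL = '/^[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z0-9-]{2,20}\z/i';

// Hypothetical helper: convert a non-ASCII domain to its ASCII (Punycode)
// form, then apply the ASCII-only pattern.
function normalize_and_check_email(string $email): bool
{
    $at = strrpos($email, '@');
    if ($at === false) {
        return false; // no "@" at all
    }
    $local  = substr($email, 0, $at);
    $domain = substr($email, $at + 1);

    // Only convert when the domain contains bytes outside the ASCII range.
    if (preg_match('/[^\x00-\x7F]/', $domain) === 1) {
        if (!function_exists('idn_to_ascii')) {
            return false; // assumption: without intl we cannot convert
        }
        $domain = idn_to_ascii($domain, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46);
        if ($domain === false) {
            return false; // the domain could not be converted
        }
    }
    return preg_match(ASCII_EMAIL, $local . '@' . $domain) === 1;
}

var_dump(normalize_and_check_email('bob@russia.xn--p1ai'));
var_dump(normalize_and_check_email('bob@russia.рф')); // needs the intl extension
```

Only the domain is converted here. A non-ASCII local part would still be rejected by this pattern.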

Source: https://stackoverflow.com/questions/35638351/preg-match-verification-of-non-english-email-addresses-international-domain-nam

Tags

php

regex

idn