If you want to create your own library, you need the table of permitted code points (IANA: Repository of IDN Practices, IDN Character Validation Guidance, IDNA Parameters) and the table of
Unicode Script properties (UNIDATA/Scripts.txt).
Gmail adopts the Unicode Consortium's “Highly Restricted” specification (Protecting Gmail in a global world).
The following combinations of Unicode scripts are permitted:
- Single script
- Latin + Han + Hiragana + Katakana
- Latin + Han + Bopomofo
- Latin + Han + Hangul
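The combinations above can be checked with PCRE script classes. What follows is a minimal sketch, not Gmail's implementation: the function name `is_highly_restricted_label` and the short `$singleScripts` list are assumptions made for illustration, and Common and Inherited characters (digits, hyphens, combining marks) are assumed to be acceptable in every combination.

```php
<?php
// Hypothetical helper: true if $label uses one of the script combinations listed above.
function is_highly_restricted_label(string $label): bool
{
    // Assumption: Common and Inherited characters may appear alongside any script.
    $neutral = '\p{Common}\p{Inherited}';

    // The three permitted mixed-script combinations.
    $classes = [
        '\p{Latin}\p{Han}\p{Hiragana}\p{Katakana}',
        '\p{Latin}\p{Han}\p{Bopomofo}',
        '\p{Latin}\p{Han}\p{Hangul}',
    ];

    // Scripts accepted on their own. Deliberately short; extend it for real use.
    $singleScripts = ['Latin', 'Han', 'Hiragana', 'Katakana', 'Hangul',
                      'Cyrillic', 'Greek', 'Arabic', 'Hebrew', 'Thai'];
    foreach ($singleScripts as $script) {
        $classes[] = '\p{' . $script . '}';
    }

    // Accept the label as soon as one combination covers every character.
    foreach ($classes as $class) {
        if (preg_match('/^[' . $class . $neutral . ']+\z/u', $label) === 1) {
            return true;
        }
    }
    return false;
}

var_dump(is_highly_restricted_label("example"));                      // Latin only
var_dump(is_highly_restricted_label("\u{65E5}\u{672C}\u{30C9}abc"));  // Han + Katakana + Latin
var_dump(is_highly_restricted_label("ex\u{0430}mple"));               // Latin + Cyrillic "a"
```

Because `preg_match()` returns `false` for malformed UTF-8, such input is rejected as well. A label that mixes Latin with Cyrillic, the classic homograph case, is not covered by any of the combinations.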
You may need to pay attention to the special script property values (Common, Inherited, Unknown), since some characters have multiple properties or wrong properties.
For example, U+3099 (COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK) has two properties ("Katakana" and "Hiragana"), and the PCRE functions classify it as "Inherited". Another example is U+2A708. Although the right script property of U+2A708 (a combination of U+30C8 KATAKANA LETTER TO and U+30E2 KATAKANA LETTER MO) is "Katakana", the Unicode specification misclassifies it as "Han".
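To see how your own PCRE build classifies such characters, you can probe them directly. This is a small sketch; the helper name `matching_scripts` is hypothetical. No output is shown because the result for U+3099 may differ between PCRE versions (newer PCRE2 releases may take Script_Extensions into account when matching a bare script name).

```php
<?php
// Hypothetical helper: list which of the given script names match the single character $c.
function matching_scripts(string $c, array $scripts): array
{
    $matched = [];
    foreach ($scripts as $script) {
        // Anchored so that exactly one character with this property must match.
        if (preg_match('/^\p{' . $script . '}\z/u', $c) === 1) {
            $matched[] = $script;
        }
    }
    return $matched;
}

$scripts = ['Hiragana', 'Katakana', 'Han', 'Inherited', 'Common'];
foreach (["\u{3099}", "\u{2A708}"] as $c) {
    // Prints the matching script names for each probed character.
    echo implode(', ', matching_scripts($c, $scripts)), PHP_EOL;
}
```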
You may also need to consider IDN homograph attacks. Google Chrome's IDN policy adopts a blacklist of characters.
My recommendation is to use Zend\Validator\Hostname. This library uses the tables of permitted code points for Japanese and Chinese.
If you use Symfony, consider upgrading the application to version 2.5, which adopts egulias/email-validator (Manual).
You need an extra validation step to check whether the string is a well-formed byte sequence. See my report for the details.
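One way to perform that check is sketched below. It relies on the fact that `preg_match()` with the `/u` modifier returns `false` when the subject is not valid UTF-8; the function name `is_well_formed_utf8` is an assumption for illustration, and `mb_check_encoding()` would be an alternative if the mbstring extension is available.

```php
<?php
// Hypothetical helper: true if $s is a well-formed UTF-8 byte sequence.
function is_well_formed_utf8(string $s): bool
{
    // The empty pattern matches any valid subject; invalid UTF-8 makes preg_match() return false.
    return preg_match('//u', $s) === 1;
}

var_dump(is_well_formed_utf8("caf\u{E9}@example.com")); // valid UTF-8
var_dump(is_well_formed_utf8("\xC0\xAF@example.com"));  // overlong encoding of "/"
```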
Don't forget XSS and SQL injection. The following address is a valid email address according to RFC 5322.
// From Japanese tutorial
// http://blog.tokumaru.org/2013/11/xsssqlrfc5322.html
"><script>alert('or/**/1=1#')</script>"@example.jp
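Validation therefore does not replace escaping at output time. Below is a minimal sketch of the HTML side, using the address above; the variable names are illustrative, and the SQL side (normally handled with bound parameters in prepared statements) is not shown.

```php
<?php
// The RFC 5322-valid address from the example above.
$email = '"><script>alert(\'or/**/1=1#\')</script>"@example.jp';

// Escape when writing into HTML, regardless of whether the address passed validation.
$safe = htmlspecialchars($email, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

echo $safe, PHP_EOL;
```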
I think using idn_to_ascii for validation is questionable, since idn_to_ascii accepts almost all characters.
// Print every code point that idn_to_ascii() accepts on its own.
for ($i = 0; $i < 0x110000; ++$i) {
    $c = utf8_chr($i);
    if ($c !== '' && false !== idn_to_ascii($c)) {
        $number = strtoupper(dechex($i));
        $length = strlen($number);
        if ($i < 0x10000) {
            // Pad to four hex digits (U+0041 rather than U+41).
            $number = str_repeat('0', 4 - $length).$number;
        }
        $idn = $c.'example.com';
        echo 'U+'.$number.' ';
        echo ' '.$idn.' '. idn_to_ascii($idn);
        echo PHP_EOL;
    }
}

// Encode a code point as UTF-8; returns '' for out-of-range values and surrogates.
function utf8_chr($code_point) {
    if ($code_point < 0 || 0x10FFFF < $code_point || (0xD800 <= $code_point && $code_point <= 0xDFFF)) {
        return '';
    }
    if ($code_point < 0x80) {
        $hex[0] = $code_point;
        $ret = chr($hex[0]);
    } else if ($code_point < 0x800) {
        // Two-byte sequence: the lead byte is 110xxxxx (0xC0).
        $hex[0] = 0xC0 | $code_point >> 6;
        $hex[1] = 0x80 | $code_point & 0x3F;
        $ret = chr($hex[0]).chr($hex[1]);
    } else if ($code_point < 0x10000) {
        $hex[0] = 0xE0 | $code_point >> 12;
        $hex[1] = 0x80 | $code_point >> 6 & 0x3F;
        $hex[2] = 0x80 | $code_point & 0x3F;
        $ret = chr($hex[0]).chr($hex[1]).chr($hex[2]);
    } else {
        $hex[0] = 0xF0 | $code_point >> 18;
        $hex[1] = 0x80 | $code_point >> 12 & 0x3F;
        $hex[2] = 0x80 | $code_point >> 6 & 0x3F;
        $hex[3] = 0x80 | $code_point & 0x3F;
        $ret = chr($hex[0]).chr($hex[1]).chr($hex[2]).chr($hex[3]);
    }
    return $ret;
}
If you want to validate a domain by Unicode Script properties, use the PCRE functions.
The following code shows how to get the name of a character's Unicode Script property. If you want to check Unicode Script properties in JavaScript, use mathiasbynens/unicode-data.
function get_unicode_script_name($c) {
    // http://php.net/manual/regexp.reference.unicode.php
    $names = [
        'Arabic', 'Armenian', 'Avestan', 'Balinese', 'Bamum', 'Batak', 'Bengali',
        'Bopomofo', 'Brahmi', 'Braille', 'Buginese', 'Buhid', 'Canadian_Aboriginal',
        'Carian', 'Chakma', 'Cham', 'Cherokee', 'Common', 'Coptic', 'Cuneiform',
        'Cypriot', 'Cyrillic', 'Deseret', 'Devanagari', 'Egyptian_Hieroglyphs',
        'Ethiopic', 'Georgian', 'Glagolitic', 'Gothic', 'Greek', 'Gujarati',
        'Gurmukhi', 'Han', 'Hangul', 'Hanunoo', 'Hebrew', 'Hiragana', 'Imperial_Aramaic',
        'Inherited', 'Inscriptional_Pahlavi', 'Inscriptional_Parthian', 'Javanese',
        'Kaithi', 'Kannada', 'Katakana', 'Kayah_Li', 'Kharoshthi', 'Khmer', 'Lao', 'Latin',
        'Lepcha', 'Limbu', 'Linear_B', 'Lisu', 'Lycian', 'Lydian', 'Malayalam', 'Mandaic',
        'Meetei_Mayek', 'Meroitic_Cursive', 'Meroitic_Hieroglyphs', 'Miao', 'Mongolian',
        'Myanmar', 'New_Tai_Lue', 'Nko', 'Ogham', 'Old_Italic', 'Old_Persian',
        'Old_South_Arabian', 'Old_Turkic', 'Ol_Chiki', 'Oriya', 'Osmanya', 'Phags_Pa',
        'Phoenician', 'Rejang', 'Runic', 'Samaritan', 'Saurashtra', 'Sharada', 'Shavian',
        'Sinhala', 'Sora_Sompeng', 'Sundanese', 'Syloti_Nagri', 'Syriac', 'Tagalog',
        'Tagbanwa', 'Tai_Le', 'Tai_Tham', 'Tai_Viet', 'Takri', 'Tamil', 'Telugu', 'Thaana',
        'Thai', 'Tibetan', 'Tifinagh', 'Ugaritic', 'Vai', 'Yi'
    ];
    // Return the first script whose \p{...} class matches the character.
    foreach ($names as $name) {
        $pattern = '/\p{'.$name.'}/u';
        if (preg_match($pattern, $c)) {
            return $name;
        }
    }
    return '';
}