How can I write a regular expression that matches all valid Spanish and Arabic words?
In English I know it is a-zA-Z, in Hebrew it is א-ת
, i
The range a-zA-Z for English words is unacceptably simple and naïve. It leaves out all manner of letters with accents and other special marks that are used in loan words, etc. For instance, it won't match the word "naïve", from my first sentence. Use the \p{Latin} script instead.
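A quick way to see the difference is sketched below. It assumes an engine with ES2018 Unicode property escapes, where the script is spelled \p{Script=Latin} and the u flag is required; the names asciiWord and latinWord are only illustrative.

```javascript
// ASCII-only range versus the Latin script.
// Assumption: the engine supports ES2018 Unicode property escapes (u flag).
const asciiWord = /^[a-zA-Z]+$/;
const latinWord = /^\p{Script=Latin}+$/u;

// "naïve" written with a precomposed ï (U+00EF)
const word = "na\u00EFve";

console.log(asciiWord.test(word)); // false: ï is outside a-zA-Z
console.log(latinWord.test(word)); // true: ï belongs to the Latin script
```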
The range א-ת for Hebrew words is also wrong. It leaves out Hebrew presentation forms, cantillation marks, Yiddish digraphs, and more. Use the \p{Hebrew} script instead.
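The same comparison for Hebrew, again as a sketch that assumes native ES2018 property escapes. The two sample code points (a Yiddish digraph and a presentation form) are my own picks for illustration.

```javascript
// The א-ת range (U+05D0..U+05EA) versus the Hebrew script.
// Assumption: the engine supports ES2018 Unicode property escapes (u flag).
const hebrewRange = /^[\u05D0-\u05EA]+$/;
const hebrewScript = /^\p{Script=Hebrew}+$/u;

const plainWord = "\u05E9\u05DC\u05D5\u05DD"; // שלום, letters inside the range
const yiddishDigraph = "\u05F0";              // HEBREW LIGATURE YIDDISH DOUBLE VAV
const presentationForm = "\uFB2A";            // HEBREW LETTER SHIN WITH SHIN DOT

for (const sample of [plainWord, yiddishDigraph, presentationForm]) {
  // Print whether each pattern accepts the sample.
  console.log(sample, hebrewRange.test(sample), hebrewScript.test(sample));
}
```

Only the first sample passes the hard-coded range; all three pass the script test.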
The range А-Яа-яёЁ for Russian is again incomplete and wrong. Use the \p{Cyrillic} script instead.
The Spanish alphabet uses the same 26 letters as English, plus ñÑ. But again, don't hardcode these into a range. Many Spanish words use accented vowels. Use the \p{Latin} script to match Spanish words. Regexes won't help you distinguish Spanish from English.
For Arabic, use the \p{Arabic} script.
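Putting the Latin and Arabic scripts together, here is a minimal sketch of a word extractor. It assumes native ES2018 property escapes; extractWords and wordPattern are hypothetical names, and the sample text is made up.

```javascript
// Extract runs of Latin-script or Arabic-script characters from mixed text.
// Assumption: the engine supports ES2018 Unicode property escapes (u flag).
const wordPattern = /\p{Script=Latin}+|\p{Script=Arabic}+/gu;

function extractWords(text) {
  // String.prototype.match returns null when nothing matches.
  return text.match(wordPattern) || [];
}

// "mañana, canción مرحبا hello 123", spelled with escapes for clarity
const sample =
  "ma\u00F1ana, canci\u00F3n \u0645\u0631\u062D\u0628\u0627 hello 123";

console.log(extractWords(sample));
// [ 'mañana', 'canción', 'مرحبا', 'hello' ]
```

Note that "hello" is matched just like the Spanish words, which is the point made above: the script tells you the writing system, not the language. Combining marks whose script is Inherited (Arabic vowel signs, for example) are not covered by this sketch.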
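You said you're using JavaScript. Historically, JavaScript had very little built-in support for Unicode, so you needed the XRegExp library and its Unicode addon to use the Unicode scripts I mentioned above in your regular expressions. Engines that implement ES2018 support Unicode property escapes natively when the regex carries the u flag, but there the script has to be spelled out as \p{Script=Latin} (or \p{sc=Latin}); the bare \p{Latin} form is XRegExp syntax and is a syntax error in a native u-flag regex. For older engines, XRegExp is still the way to go.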
Always favor Unicode scripts over Unicode blocks. Blocks match up poorly with the code points in a particular script. Blocks very often leave out many important code points that fall outside of their incomplete range, and include many code points that have not been assigned any character. Scripts include all relevant code points, and no more.
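To make the block-versus-script mismatch concrete, here is a small sketch using Arabic. As far as I know, native JavaScript property escapes do not expose blocks, so the Arabic block (U+0600..U+06FF) is hard-coded as a range here; the variable names are illustrative and ES2018 support is assumed.

```javascript
// Arabic block (hard-coded range) versus Arabic script.
// Assumption: the engine supports ES2018 Unicode property escapes (u flag).
const arabicBlock = /^[\u0600-\u06FF]$/;
const arabicScript = /^\p{Script=Arabic}$/u;

const arabicComma = "\u060C";      // inside the block, but Script=Common
const supplementLetter = "\u0750"; // Arabic Supplement block, Script=Arabic

console.log(arabicBlock.test(arabicComma));       // true
console.log(arabicScript.test(arabicComma));      // false
console.log(arabicBlock.test(supplementLetter));  // false
console.log(arabicScript.test(supplementLetter)); // true
```

The block accepts a punctuation mark shared with other scripts and rejects a genuine Arabic letter that happens to live in a different block.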