Concrete JavaScript Regex for Accented Characters (Diacritics)

庸人自扰 2020-11-22 17:22

I've looked on Stack Overflow (replacing characters.. eh, how JavaScript doesn't follow the Unicode standard concerning RegExp, etc.) and haven't really found a concrete

9 answers
  • 2020-11-22 17:34

    What about this?

    ^([a-zA-Z]|[à-ú]|[À-Ú])+$
    

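    It matches words made up of unaccented Latin letters plus the accented letters that fall within à-ú and À-Ú. Letters outside those ranges, such as ü or ÿ, are not matched.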
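
    To see what these ranges actually cover, here is a small check, offered as a sketch rather than a recommendation. The ranges are the ones from the pattern above (à-ú is U+00E0 to U+00FA, À-Ú is U+00C0 to U+00DA); the name `wordPattern` and the sample words are illustrative assumptions.

```javascript
// Same ranges as the pattern above, written with \u escapes so the
// source file's encoding cannot change what they mean.
const wordPattern = /^([a-zA-Z]|[\u00E0-\u00FA]|[\u00C0-\u00DA])+$/;

const samples = {
  plain: "resume",
  eAcute: "caf\u00E9",      // "café": é is U+00E9, inside à-ú
  uUmlaut: "\u00FCber",     // "über": ü is U+00FC, just past ú (U+00FA)
  divisionSign: "\u00F7",   // "÷" is U+00F7, inside à-ú
};

// Run every sample through the pattern and collect the outcomes.
const results = {};
for (const [name, word] of Object.entries(samples)) {
  results[name] = wordPattern.test(word);
}
console.log(results);
```

    With these inputs, "über" is rejected while the division sign is accepted, which is why the answers below spell out their ranges more carefully.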

  • 2020-11-22 17:38

    From this wiki: https://en.wikipedia.org/wiki/List_of_Unicode_characters#Basic_Latin

    For Latin letters, I use

    /^[A-Za-zÀ-ÖØ-öø-ÿ]+$/
    

    It excludes hyphens and special characters.
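
    A detail worth checking in classes like this one: the ASCII range A-z is wider than A-Za-z, because six punctuation characters ([, \, ], ^, _ and the backtick) sit between Z (U+005A) and a (U+0061). The sketch below compares the two spellings; the variable names and sample inputs are illustrative assumptions.

```javascript
// Two spellings of a Latin-letter class; only the ASCII part differs.
const wideRange = /^[A-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF]+$/;
const strictRange = /^[A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF]+$/;

// "François" is written with an escape for ç (U+00E7).
const inputs = ["Fran\u00E7ois", "snake_case", "a^b"];

// Test each input against both classes side by side.
const comparison = inputs.map((text) => ({
  text,
  wide: wideRange.test(text),
  strict: strictRange.test(text),
}));
console.log(comparison);
```

    Both classes accept "François", but only the A-z spelling lets "snake_case" and "a^b" through.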

  • 2020-11-22 17:39

    The accented Latin range \u00C0-\u017F was not quite enough for my database of names, so I extended the regex to

    [a-zA-Z\u00C0-\u024F]
    [a-zA-Z\u00C0-\u024F\u1E00-\u1EFF] // includes even more Latin chars
    

    I added these Unicode blocks (\u00C0-\u024F covers three adjacent blocks at once):

    • \u00C0-\u00FF Latin-1 Supplement
    • \u0100-\u017F Latin Extended-A
    • \u0180-\u024F Latin Extended-B
    • \u1E00-\u1EFF Latin Extended Additional

    Note that \u00C0-\u00FF is actually only a part of Latin-1 Supplement. It skips the non-printing control characters and all symbols except for the awkwardly placed multiplication sign × (\u00D7) and division sign ÷ (\u00F7). This variant excludes those two as well:

    [a-zA-Z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u024F] // exclude ×÷
    

    If you need more code points, you can find more ranges on Wikipedia's List of Unicode characters. For example, you could also add Latin Extended-C, D, and E, but I left them out because only historians seem interested in them now, and the D and E sets don't even render correctly in my browser.

    The original regex stopping at \u017F borked on the name "Șenol". According to FontSpace's Unicode Analyzer, that first character is \u0218, LATIN CAPITAL LETTER S WITH COMMA BELOW. (Yeah, it's usually spelled with a cedilla-S \u015E, "Şenol." But I'm not flying to Turkey to go tell him, "You're spelling your name wrong!")
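
    A quick way to confirm that behaviour, as a sketch: the name is written with an explicit \u0218 escape so the check does not depend on how the text was typed or normalized. The variable names are illustrative assumptions.

```javascript
// Ranges from the answer above.
const upToExtendedA = /^[a-zA-Z\u00C0-\u017F]+$/;
const upToExtendedB = /^[a-zA-Z\u00C0-\u024F]+$/;

// "Șenol" with S-comma-below (U+0218, Latin Extended-B).
const commaBelow = "\u0218enol";
// "Şenol" with S-cedilla (U+015E, Latin Extended-A).
const cedilla = "\u015Eenol";

console.log(upToExtendedA.test(commaBelow)); // false
console.log(upToExtendedA.test(cedilla));    // true
console.log(upToExtendedB.test(commaBelow)); // true
```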

  • 2020-11-22 17:39

    Which of these three approaches is most suited for the task?

    Depends on the task :-) To match exactly all Latin characters and their accented versions, the Unicode ranges probably provide the best solution. They might be extended to all non-whitespace characters, which could be done using the \S character class.

    I'm forcing a field in a UI to match the format: last_name, first_name (last [comma space] first)

    The most basic problem I see here is not diacritics but whitespace. Some names consist of multiple words, e.g. when they include titles. So you should go with the most generic pattern, that is, allow everything except the comma that separates the last name from the first:

    /[^,]+,\s[^,]+/
    

    But your second solution with the . character class is just as fine; you might only need to take care of multiple commas then.
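
    For validating a whole field rather than finding a match somewhere inside it, one option is to anchor the pattern, which also rejects a second comma. This is a sketch under that assumption; `lastFirst`, `isLastFirst` and the sample names are hypothetical.

```javascript
// Anchored variant of the pattern above: the whole input must be
// "<anything but commas>, <anything but commas>".
const lastFirst = /^[^,]+,\s[^,]+$/;

function isLastFirst(input) {
  return lastFirst.test(input);
}

console.log(isLastFirst("van der Berg, Jan"));     // true
console.log(isLastFirst("Nu\u00F1ez, Jos\u00E9")); // true
console.log(isLastFirst("Smith,John"));            // false: no space after the comma
console.log(isLastFirst("Smith, John, Jr."));      // false: second comma
```

    Without the ^ and $ anchors, the last input would pass, because the pattern would match the "Smith, John" prefix.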

  • 2020-11-22 17:41

    The XRegExp library has a plugin named Unicode that helps solve tasks like this.

    <script src="xregexp.js"></script>
    <script src="addons/unicode/unicode-base.js"></script>
    <script>
      var unicodeWord = XRegExp("^\\p{L}+$");
    
      unicodeWord.test("Русский"); // true
      unicodeWord.test("日本語"); // true
      unicodeWord.test("العربية"); // true
    </script>
    

    It's mentioned in the comments to the question, but it's easy to miss. I noticed it only after I submitted this answer.
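
    In engines that implement ES2018 Unicode property escapes, a similar check works without a library by adding the u flag. This is a sketch of that alternative, not part of XRegExp; `nativeUnicodeWord` is an illustrative name.

```javascript
// \p{L} matches any code point in the Unicode "Letter" category.
// The u flag is required for \p{...} to be recognised.
const nativeUnicodeWord = /^\p{L}+$/u;

console.log(nativeUnicodeWord.test("Русский"));          // true
console.log(nativeUnicodeWord.test("日本語"));            // true
console.log(nativeUnicodeWord.test("r\u00E9sum\u00E9")); // true
console.log(nativeUnicodeWord.test("abc123"));           // false: digits are not letters
```

    Combining marks belong to the Mark category rather than Letter, so text stored in decomposed form (base letter plus combining accent) may need a class such as [\p{L}\p{M}] instead.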

  • 2020-11-22 17:41

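    You can strip the diacritics from a string with:

    var str = "résumé";
    // NFD splits each accented letter into a base letter plus combining marks;
    // replace() then drops the combining marks (U+0300-U+036F).
    str.normalize('NFD').replace(/[\u0300-\u036f]/g, ''); // returns "resume"
    

    This removes all the diacritical marks; you can then run your regex on the result.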

    Reference:

    https://thread.engineering/2018-08-29-searching-and-sorting-text-with-diacritical-marks-in-javascript/
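
    Wrapped as a helper, the approach might look like the sketch below. `stripDiacritics` and `isPlainWord` are hypothetical names, and the ASCII-only pattern inside `isPlainWord` is an assumption about what the follow-up regex would be. Letters with no canonical decomposition, such as ø, pass through unchanged and would still need their own handling.

```javascript
// Decompose to NFD, then drop combining marks in U+0300-U+036F.
function stripDiacritics(text) {
  return text.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
}

// Hypothetical follow-up check: ASCII letters only, after stripping.
function isPlainWord(text) {
  return /^[a-zA-Z]+$/.test(stripDiacritics(text));
}

console.log(stripDiacritics("r\u00E9sum\u00E9")); // "resume"
console.log(stripDiacritics("\u0218enol"));       // "Senol"
console.log(isPlainWord("Cr\u00E8me"));           // true
console.log(isPlainWord("S\u00F8ren"));           // false: ø does not decompose
```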
