Regular expression for a language tag (as defined by BCP47)

情书的邮戳 2021-02-05 18:52

I need a regular expression for a language tag as defined by BCP 47.

I know that the full BNF syntax is available at http://www.rfc-editor.org/rfc/bcp/bcp47.txt and th

4 Answers
  • 2021-02-05 18:52

    Looks like this:

    ^((?<grandfathered>(en-GB-oed|i-ami|i-bnn|i-default|i-enochian|i-hak|i-klingon|i-lux|
    i-mingo|i-navajo|i-pwn|i-tao|i-tay|i-tsu|sgn-BE-FR|sgn-BE-NL|sgn-CH-DE)|(art-lojban|
    cel-gaulish|no-bok|no-nyn|zh-guoyu|zh-hakka|zh-min|zh-min-nan|zh-xiang))|((?<language>
    ([A-Za-z]{2,3}(-(?<extlang>[A-Za-z]{3}(-[A-Za-z]{3}){0,2}))?)|[A-Za-z]{4}|[A-Za-z]{5,8})
    (-(?<script>[A-Za-z]{4}))?(-(?<region>[A-Za-z]{2}|[0-9]{3}))?(-(?<variant>[A-Za-z0-9]{5,8}
    |[0-9][A-Za-z0-9]{3}))*(-(?<extension>[0-9A-WY-Za-wy-z](-[A-Za-z0-9]{2,8})+))*
    (-(?<privateUse>x(-[A-Za-z0-9]{1,8})+))?)|(?<privateUse>x(-[A-Za-z0-9]{1,8})+))$
    

    Here is the code to generate it (in C#):

    var regular = "(art-lojban|cel-gaulish|no-bok|no-nyn|zh-guoyu|zh-hakka|zh-min|zh-min-nan|zh-xiang)";
    var irregular = "(en-GB-oed|i-ami|i-bnn|i-default|i-enochian|i-hak|i-klingon|i-lux|i-mingo|i-navajo|i-pwn|i-tao|i-tay|i-tsu|sgn-BE-FR|sgn-BE-NL|sgn-CH-DE)";
    var grandfathered = "(?<grandfathered>" + irregular + "|" + regular + ")";
    var privateUse = "(?<privateUse>x(-[A-Za-z0-9]{1,8})+)";
    var singleton = "[0-9A-WY-Za-wy-z]";
    var extension = "(?<extension>" + singleton + "(-[A-Za-z0-9]{2,8})+)";
    var variant = "(?<variant>[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3})";
    var region = "(?<region>[A-Za-z]{2}|[0-9]{3})";
    var script = "(?<script>[A-Za-z]{4})";
    var extlang = "(?<extlang>[A-Za-z]{3}(-[A-Za-z]{3}){0,2})";
    var language = "(?<language>([A-Za-z]{2,3}(-" + extlang + ")?)|[A-Za-z]{4}|[A-Za-z]{5,8})";
    var langtag = "(" + language + "(-" + script + ")?" + "(-" + region + ")?" + "(-" + variant + ")*" + "(-" + extension + ")*" + "(-" + privateUse + ")?" + ")";
    var languageTag = @"^(" + grandfathered + "|" + langtag + "|" + privateUse + ")$";
    
    Console.WriteLine(languageTag);
    

    I cannot guarantee its correctness (I may have made typos), but it works fine on the examples in Appendix A of RFC 5646, the syntax half of BCP 47.

    Depending on your environment, you might need to remove the named capturing groups "?<...>".
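
    As a rough check of the claim about Appendix A, here is a minimal JavaScript sketch that rebuilds the same pattern without named groups and runs it over a handful of the appendix examples. The constant names mirror the C# variables; `alnum`, `wellFormed`, `malformed` and `results` are names chosen for this illustration only.

```javascript
// Pieces mirror the C# variables above, with the named groups removed.
const alnum = "[A-Za-z0-9]";
const regular = "(art-lojban|cel-gaulish|no-bok|no-nyn|zh-guoyu|zh-hakka|zh-min|zh-min-nan|zh-xiang)";
const irregular = "(en-GB-oed|i-ami|i-bnn|i-default|i-enochian|i-hak|i-klingon|i-lux|i-mingo|i-navajo|i-pwn|i-tao|i-tay|i-tsu|sgn-BE-FR|sgn-BE-NL|sgn-CH-DE)";
const grandfathered = "(" + irregular + "|" + regular + ")";
const privateUse = "(x(-" + alnum + "{1,8})+)";
const extension = "([0-9A-WY-Za-wy-z](-" + alnum + "{2,8})+)";
const variant = "(" + alnum + "{5,8}|[0-9]" + alnum + "{3})";
const region = "([A-Za-z]{2}|[0-9]{3})";
const script = "([A-Za-z]{4})";
const extlang = "([A-Za-z]{3}(-[A-Za-z]{3}){0,2})";
const language = "(([A-Za-z]{2,3}(-" + extlang + ")?)|[A-Za-z]{4}|[A-Za-z]{5,8})";
const langtag = "(" + language + "(-" + script + ")?(-" + region + ")?(-" + variant + ")*(-" + extension + ")*(-" + privateUse + ")?)";
const languageTag = new RegExp("^(" + grandfathered + "|" + langtag + "|" + privateUse + ")$");

// A few tags that RFC 5646 Appendix A lists as well-formed ...
const wellFormed = ["de", "i-enochian", "zh-Hant", "zh-cmn-Hans-CN", "sl-rozaj-biske", "de-CH-1901", "en-US-u-islamcal", "x-whatever"];
// ... and two that it lists as not well-formed.
const malformed = ["de-419-DE", "a-DE"];

// Map every sample tag to whether the pattern accepts it.
const results = Object.fromEntries(
  [...wellFormed, ...malformed].map((tag) => [tag, languageTag.test(tag)])
);
console.log(results);
```

    The pattern only checks the syntax. A tag such as ar-a-aaa-b-bbb-a-ccc, which the appendix lists as invalid because it repeats a singleton, still passes.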

  • 2021-02-05 19:10

    An optimized version that works in PHP. All alternatives sit inside one non-capturing group so that ^ and $ anchor every branch, and the stand-alone private-use branch is named privateUseOnly because PCRE rejects duplicate group names by default.

    /^(?:(?<grandfathered>(?:en-GB-oed|i-(?:ami|bnn|default|enochian|hak|klingon|lux|mingo|navajo|pwn|t(?:a[oy]|su))|sgn-(?:BE-(?:FR|NL)|CH-DE))|(?:art-lojban|cel-gaulish|no-(?:bok|nyn)|zh-(?:guoyu|hakka|min(?:-nan)?|xiang)))|(?<language>(?:[A-Za-z]{2,3}(?:-(?<extlang>[A-Za-z]{3}(?:-[A-Za-z]{3}){0,2}))?)|[A-Za-z]{4}|[A-Za-z]{5,8})(?:-(?<script>[A-Za-z]{4}))?(?:-(?<region>[A-Za-z]{2}|[0-9]{3}))?(?:-(?<variant>[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3}))*(?:-(?<extension>[0-9A-WY-Za-wy-z](?:-[A-Za-z0-9]{2,8})+))*(?:-(?<privateUse>x(?:-[A-Za-z0-9]{1,8})+))?|(?<privateUseOnly>x(?:-[A-Za-z0-9]{1,8})+))$/Di
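
    One thing to watch when trimming groups out of a pattern like this: alternation binds more loosely than ^ and $, so removing the outer group changes what the anchors apply to. A minimal JavaScript sketch, using a hypothetical two-branch pattern as a stand-in for the long expression (`loose` and `strict` are illustrative names):

```javascript
// Hypothetical two-branch pattern standing in for the full language-tag regex.
const loose = /^en-GB-oed|[a-z]{2,3}$/i;      // "^" belongs to branch 1 only, "$" to branch 2 only
const strict = /^(?:en-GB-oed|[a-z]{2,3})$/i; // both anchors apply to both branches

console.log(loose.test("en-GB-oed-junk"));  // true: branch 1 has no end anchor
console.log(strict.test("en-GB-oed-junk")); // false
console.log(loose.test("!!!en"));           // true: branch 2 has no start anchor
console.log(strict.test("!!!en"));          // false
```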
    
  • 2021-02-05 19:15

    JavaScript engines can reject duplicate named capture groups with a SyntaxError, so you have to rename the second use of ?<privateUse> to e.g. ?<privateUse1>. The result compiles to:

    /^((?<grandfathered>(en-GB-oed|i-ami|i-bnn|i-default|i-enochian|i-hak|i-klingon|i-lux|i-mingo|i-navajo|i-pwn|i-tao|i-tay|i-tsu|sgn-BE-FR|sgn-BE-NL|sgn-CH-DE)|(art-lojban|cel-gaulish|no-bok|no-nyn|zh-guoyu|zh-hakka|zh-min|zh-min-nan|zh-xiang))|((?<language>([A-Za-z]{2,3}(-(?<extlang>[A-Za-z]{3}(-[A-Za-z]{3}){0,2}))?)|[A-Za-z]{4}|[A-Za-z]{5,8})(-(?<script>[A-Za-z]{4}))?(-(?<region>[A-Za-z]{2}|[0-9]{3}))?(-(?<variant>[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3}))*(-(?<extension>[0-9A-WY-Za-wy-z](-[A-Za-z0-9]{2,8})+))*(-(?<privateUse>x(-[A-Za-z0-9]{1,8})+))?)|(?<privateUse1>x(-[A-Za-z0-9]{1,8})+))$/
    

    Here's a way to construct it:

    // The first call yields "privateUse", later calls "privateUse1", "privateUse2", ...
    let privateUseUsed = 0
    const privateUse = () => "(?<privateUse" + (privateUseUsed++ || "") + ">x(-[A-Za-z0-9]{1,8})+)"
    const grandfathered = "(?<grandfathered>" +
          /* irregular */ (
            "en-GB-oed" +
              "|" + "i-(?:ami|bnn|default|enochian|hak|klingon|lux|mingo|navajo|pwn|tao|tay|tsu)" +
              "|" + "sgn-(?:BE-FR|BE-NL|CH-DE)"
          ) +
          "|" + /* regular */ (
            "art-lojban|cel-gaulish|no-bok|no-nyn|zh-guoyu|zh-hakka|zh-min|zh-min-nan|zh-xiang"
          ) +
          ")"
    const langtag = "(" +
          "(?<language>" + (
            "([A-Za-z]{2,3}(-" +
              "(?<extlang>[A-Za-z]{3}(-[A-Za-z]{3}){0,2})" +
              ")?)|[A-Za-z]{4,8})"
          ) +
          "(-" + "(?<script>[A-Za-z]{4})" + ")?" +
          "(-" + "(?<region>[A-Za-z]{2}|[0-9]{3})" + ")?" +
          "(-" + "(?<variant>[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3})" + ")*" +
          "(-" + "(?<extension>" + (
            /* singleton */ "[0-9A-WY-Za-wy-z]" +
              "(-[A-Za-z0-9]{2,8})+)"
          ) +
          ")*" +
          "(-" + privateUse() + ")?" +
          ")"
    const languageTagReStr = "^(" + grandfathered + "|" + langtag + "|" + privateUse() + ")$";
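
    The point of keeping the names is that the subtags can be read back from the match. A minimal sketch, assuming an engine with named-group support and using a deliberately simplified pattern (language, script and region only) rather than the full expression; `tagParts` is a name made up for this example:

```javascript
// Simplified stand-in for the full pattern: language, optional script, optional region.
const tagParts = new RegExp(
  "^(?<language>[A-Za-z]{2,3})" +
  "(?:-(?<script>[A-Za-z]{4}))?" +
  "(?:-(?<region>[A-Za-z]{2}|[0-9]{3}))?$"
);

// exec() returns a match object whose .groups property holds the named captures.
const full = tagParts.exec("zh-Hant-TW");
console.log(full.groups.language, full.groups.script, full.groups.region); // zh Hant TW

// Groups that did not take part in the match are undefined.
const bare = tagParts.exec("de");
console.log(bare.groups.language, bare.groups.script); // de undefined
```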
    

    Edit: it turns out older versions of Firefox don't support named capture groups, so you have to strip them out with .replace(/\?<[A-Za-z][A-Za-z0-9]*>/g, '') or just leave them out in the first place:

    const grandfathered = "(" +
          /* irregular */ "(en-GB-oed|i-ami|i-bnn|i-default|i-enochian|i-hak|i-klingon|i-lux|i-mingo|i-navajo|i-pwn|i-tao|i-tay|i-tsu|sgn-BE-FR|sgn-BE-NL|sgn-CH-DE)" +
          "|" +
          /* regular */ "(art-lojban|cel-gaulish|no-bok|no-nyn|zh-guoyu|zh-hakka|zh-min|zh-min-nan|zh-xiang)" +
          ")";
    const langtag = "(" +
          "(" + (
            "([A-Za-z]{2,3}(-" +
              "([A-Za-z]{3}(-[A-Za-z]{3}){0,2})" +
              ")?)|[A-Za-z]{4}|[A-Za-z]{5,8})"
          ) +
          "(-" + "([A-Za-z]{4})" + ")?" +
          "(-" + "([A-Za-z]{2}|[0-9]{3})" + ")?" +
          "(-" + "([A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3})" + ")*" +
          "(-" + "(" + (
            /* singleton */ "[0-9A-WY-Za-wy-z]" +
              "(-[A-Za-z0-9]{2,8})+)"
          ) +
          ")*" +
          "(-" + "(x(-[A-Za-z0-9]{1,8})+)" + ")?" +
          ")";
    const languageTag = RegExp("^(" + grandfathered + "|" + langtag + "|" + "(x(-[A-Za-z0-9]{1,8})+)" + ")$");
    

    Test with languageTag.test('en-us')
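
    If you would rather keep one named pattern string and derive the plain one from it, the stripping step can be sketched like this. `stripGroupNames`, `named` and `plain` are hypothetical names, and the short pattern is only a stand-in for the full expression:

```javascript
// Turn every "(?<name>" into "(" so the pattern works without named-group support.
// Lookbehinds "(?<=" and "(?<!" are untouched because a name must start with a letter, "_" or "$".
// This is a naive textual rewrite: it does not account for an escaped "\(" followed by "?<name>".
function stripGroupNames(source) {
  return source.replace(/\(\?<[A-Za-z_$][A-Za-z0-9_$]*>/g, "(");
}

const named = "^(?<language>[A-Za-z]{2,3})(-(?<region>[A-Za-z]{2}|[0-9]{3}))?$";
const plain = stripGroupNames(named);

console.log(plain); // ^([A-Za-z]{2,3})(-([A-Za-z]{2}|[0-9]{3}))?$
console.log(new RegExp(plain).test("en-us")); // true
```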

  • 2021-02-05 19:18

    If using a CLDR-based function set, like PHP's intl extension, you can check if a locale exists in the intl database using a function like:

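    <?php
    function is_locale(string $locale = ''): bool {
        // Standardise the input into ICU's canonical form (e.g. en_US)
        $locale = locale_canonicalize($locale);

        // Load the list of locales known to the default ICU bundle
        $locales = resourcebundle_locales('');

        // Report whether the canonical form is in that list
        return is_array($locales) && in_array($locale, $locales, true);
    }
    ?>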
    

    It takes about half a millisecond to load and search the data, so it won't be too much of a performance hit.

    Of course, it will only find locales present in the CLDR data bundled with the PHP version in use; that data is refreshed with subsequent PHP releases.
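
    For comparison, a similar library-based check can be sketched in JavaScript with the built-in Intl object. This is an illustration only, not part of the PHP approach: `canonicalTag` is a hypothetical helper, and Intl applies its own locale-identifier rules, so it may reject some tags that the regular expressions in this thread accept.

```javascript
// Return the canonical form of a tag, or null when Intl considers it malformed.
function canonicalTag(tag) {
  try {
    // getCanonicalLocales throws a RangeError for structurally invalid tags.
    return Intl.getCanonicalLocales(tag)[0];
  } catch (err) {
    if (err instanceof RangeError) return null;
    throw err; // anything else is unexpected, so let it propagate
  }
}

console.log(canonicalTag("en-us")); // en-US
console.log(canonicalTag("a-DE"));  // null

// supportedLocalesOf returns the subset of the requested locales that this
// runtime has data for; the result depends on the ICU data it was built with.
console.log(Intl.DateTimeFormat.supportedLocalesOf(["en-US"]));
```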
