How to convert IETF BCP 47 language identifier to ISO-639-2?

半世苍凉 submitted on 2019-12-05 11:13:29

BCP 47 identifiers start with a 2-letter ISO 639-1 code or a 3-letter ISO 639-2, 639-3 or 639-5 language code; see the Syntax section of RFC 5646:

Language-Tag  = langtag             ; normal language tags
              / privateuse          ; private use tag
              / grandfathered       ; grandfathered tags

langtag       = language
                ["-" script]
                ["-" region]
                *("-" variant)
                *("-" extension)
                ["-" privateuse]

language      = 2*3ALPHA            ; shortest ISO 639 code
                ["-" extlang]       ; sometimes followed by
                                    ; extended language subtags
              / 4ALPHA              ; or reserved for future use
              / 5*8ALPHA            ; or registered language subtag
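
The `language` production above can be read as a rule on the length of the first subtag. A minimal sketch of that reading follows; `primary_subtag_kind` is a hypothetical helper name, and extlang subtags are ignored for simplicity:

```python
import re

# Mirrors the "language" production quoted above: 2 to 8 ASCII letters.
_PRIMARY = re.compile(r'[A-Za-z]{2,8}')

def primary_subtag_kind(tag):
    """Label the first subtag of a langtag by its length (hypothetical helper)."""
    primary = tag.split('-', 1)[0]
    if not _PRIMARY.fullmatch(primary):
        raise ValueError('not a language subtag: %r' % primary)
    if len(primary) <= 3:
        return 'iso639'       # 2*3ALPHA: shortest ISO 639 code
    if len(primary) == 4:
        return 'reserved'     # 4ALPHA: reserved for future use
    return 'registered'       # 5*8ALPHA: registered language subtag
```

A private-use tag such as `x-private` has a one-letter first subtag, so this sketch rejects it with `ValueError` rather than labelling it.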

I don't expect Apple to use the privateuse or grandfathered forms, so you can assume that the first subtag is an ISO 639-1, ISO 639-2, ISO 639-3 or ISO 639-5 language code. Simply map the 2-letter ISO 639-1 codes to 3-letter ISO 639-* codes.

You can use the pycountry package for this:

import pycountry

# Older pycountry releases spelled these alpha2= and .terminology;
# current releases use alpha_2= and .alpha_3.
lang = pycountry.languages.get(alpha_2=two_letter_code)
three_letter_code = lang.alpha_3

Demo:

>>> import pycountry
>>> lang = pycountry.languages.get(alpha_2='aa')
>>> lang.alpha_3
'aar'

where alpha_3 is the terminology (T) form, the preferred 3-letter code; there is also a bibliographic (B) form, which differs for only 22 entries. See ISO 639-2 B and T codes. The package doesn't include entries from ISO 639-5, however; that list overlaps and conflicts with 639-2 in places, and I don't think Apple uses such codes at all.
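
If a consumer needs the bibliographic form instead, the conversion is a small lookup with a pass-through default. The sketch below lists only a handful of the differing pairs as an illustration; `T_TO_B` and `to_bibliographic` are hypothetical names, and the table should be completed from the official ISO 639-2 code list before being relied on:

```python
# Partial table: ISO 639-2 terminology (T) code -> bibliographic (B) code.
# Only some of the pairs that differ are listed here.
T_TO_B = {
    'deu': 'ger',  # German
    'fra': 'fre',  # French
    'zho': 'chi',  # Chinese
    'nld': 'dut',  # Dutch
    'ell': 'gre',  # Modern Greek
    'ces': 'cze',  # Czech
    'cym': 'wel',  # Welsh
}

def to_bibliographic(t_code):
    """Return the B code for a T code; most codes are the same in both forms."""
    return T_TO_B.get(t_code, t_code)
```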

From RFC5646/BCP47:

Language-Tag  = langtag             ; normal language tags
              / privateuse          ; private use tag
              / grandfathered       ; grandfathered tags

langtag       = language
                ["-" script]
                ["-" region]
                *("-" variant)
                *("-" extension)
                ["-" privateuse]

language      = 2*3ALPHA            ; shortest ISO 639 code
                ["-" extlang]       ; sometimes followed by
                                    ; extended language subtags
              / 4ALPHA              ; or reserved for future use
              / 5*8ALPHA            ; or registered language subtag

privateuse    = "x" 1*("-" (1*8alphanum))

grandfathered = irregular           ; non-redundant tags registered
              / regular             ; during the RFC 3066 era

It looks like the first segment of most BCP 47 tags is a valid ISO 639 code, though not necessarily the three-letter variant. A few kinds of BCP 47 tag do not start with an ISO 639 code: private-use tags beginning with x-, and a number of legacy tags (including those beginning with i-) that match the grandfathered portion of the grammar:

irregular     = "en-GB-oed"         ; irregular tags do not match
              / "sgn-BE-FR"         ; also includes i- prefixed codes
              / "sgn-BE-NL"
              / "sgn-CH-DE"

regular       = "art-lojban"        ; these tags match the 'langtag'
              / "cel-gaulish"       ; production, but their subtags
              / "no-bok"            ; are not extended language
              / "no-nyn"            ; or variant subtags: their meaning
              / "zh-guoyu"          ; is defined by their registration
              / "zh-hakka"          ; and all of these are deprecated
              / "zh-min"            ; in favor of a more modern
              / "zh-min-nan"        ; subtag or sequence of subtags
              / "zh-xiang"
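
To separate these cases before looking at the language subtag, one option is a small classifier. This is a sketch under assumptions: `classify_tag` is a hypothetical helper, the sets contain only the tags shown in the excerpts above, and because the irregular list there is abbreviated, i- prefixed tags are matched by prefix rather than enumerated:

```python
# Tags copied from the grammar excerpts above, lowercased for comparison.
IRREGULAR = {'en-gb-oed', 'sgn-be-fr', 'sgn-be-nl', 'sgn-ch-de'}
REGULAR = {'art-lojban', 'cel-gaulish', 'no-bok', 'no-nyn', 'zh-guoyu',
           'zh-hakka', 'zh-min', 'zh-min-nan', 'zh-xiang'}

def classify_tag(tag):
    """Return 'privateuse', 'grandfathered' or 'langtag' (hypothetical helper)."""
    lowered = tag.lower()  # BCP 47 tags are compared case-insensitively
    if lowered.startswith('x-'):
        return 'privateuse'
    if lowered.startswith('i-') or lowered in IRREGULAR or lowered in REGULAR:
        return 'grandfathered'
    return 'langtag'
```

Only tags classified as `langtag` are worth passing on to an ISO 639 lookup.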

A good start would be something like the following:

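def extract_iso_code(bcp_identifier):
    # The primary language subtag is everything before the first hyphen;
    # partition() also copes with bare tags such as 'en'.
    language, _, _ = bcp_identifier.partition('-')
    if 2 <= len(language) <= 3:
        # this is a valid ISO-639 code or is grandfathered
        return language
    else:
        # handle non-ISO codes
        raise ValueError(bcp_identifier)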

Conversion from the 2-character variant to the 3-character variant should be easy enough to handle since the mapping is well known.
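
Putting the two steps together, a minimal sketch of the whole conversion might look like this. The mapping is an illustrative subset only, and `ISO_639_1_TO_2` and `bcp47_to_iso639_2` are hypothetical names; a real application would load the full ISO 639-1 to ISO 639-2 table, for example from pycountry:

```python
# Illustrative subset of the ISO 639-1 -> ISO 639-2 (terminology) mapping.
ISO_639_1_TO_2 = {
    'aa': 'aar', 'de': 'deu', 'en': 'eng', 'es': 'spa',
    'fr': 'fra', 'ja': 'jpn', 'zh': 'zho',
}

def bcp47_to_iso639_2(tag):
    """Map the primary subtag of a BCP 47 tag to a 3-letter code (sketch)."""
    primary = tag.partition('-')[0].lower()
    if not (primary.isascii() and primary.isalpha()):
        raise ValueError('unsupported tag: %r' % tag)
    if len(primary) == 2:
        try:
            return ISO_639_1_TO_2[primary]
        except KeyError:
            raise ValueError('no mapping for %r' % primary) from None
    if len(primary) == 3:
        # Already a 3-letter code. It may come from ISO 639-3 or 639-5
        # rather than 639-2; this sketch does not check which.
        return primary
    raise ValueError('unsupported tag: %r' % tag)
```

For example, `bcp47_to_iso639_2('zh-Hant-TW')` looks up `zh` and returns `'zho'`, while a private-use tag such as `x-private` raises `ValueError`.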
