How to detect whether a character belongs to a Right To Left language?

失恋的感觉 2020-11-28 06:02

What is a good way to tell whether a string contains text in a right-to-left language?

I have found this question which suggests the following approach:



        
5 answers
  • 2020-11-28 06:21

    Unicode characters have different properties associated with them. These properties cannot be derived from the code point; you need a table that tells you if a character has a certain property or not.

    You are interested in characters with bidirectional property "R" or "AL" (RandALCat).

    A RandALCat character is a character with unambiguously right-to-left directionality.

    Here's the complete list as of Unicode 3.2 (from RFC 3454):

    D. Bidirectional tables
    
    D.1 Characters with bidirectional property "R" or "AL"
    
    ----- Start Table D.1 -----
    05BE
    05C0
    05C3
    05D0-05EA
    05F0-05F4
    061B
    061F
    0621-063A
    0640-064A
    066D-066F
    0671-06D5
    06DD
    06E5-06E6
    06FA-06FE
    0700-070D
    0710
    0712-072C
    0780-07A5
    07B1
    200F
    FB1D
    FB1F-FB28
    FB2A-FB36
    FB38-FB3C
    FB3E
    FB40-FB41
    FB43-FB44
    FB46-FBB1
    FBD3-FD3D
    FD50-FD8F
    FD92-FDC7
    FDF0-FDFC
    FE70-FE74
    FE76-FEFC
    ----- End Table D.1 -----
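
    As a rough sketch of the table lookup described above, the ranges of Table D.1 can be stored as a sorted array and searched with a binary search. The class name BidiTable and the tuple layout are assumptions made for illustration; the data is the Unicode 3.2 list, so newer right-to-left scripts are not covered.

    ```csharp
    using System;

    static class BidiTable
    {
        // Inclusive (first, last) code point ranges copied from Table D.1
        // (Unicode 3.2); single code points are ranges of length one.
        static readonly (int First, int Last)[] Ranges =
        {
            (0x05BE, 0x05BE), (0x05C0, 0x05C0), (0x05C3, 0x05C3), (0x05D0, 0x05EA),
            (0x05F0, 0x05F4), (0x061B, 0x061B), (0x061F, 0x061F), (0x0621, 0x063A),
            (0x0640, 0x064A), (0x066D, 0x066F), (0x0671, 0x06D5), (0x06DD, 0x06DD),
            (0x06E5, 0x06E6), (0x06FA, 0x06FE), (0x0700, 0x070D), (0x0710, 0x0710),
            (0x0712, 0x072C), (0x0780, 0x07A5), (0x07B1, 0x07B1), (0x200F, 0x200F),
            (0xFB1D, 0xFB1D), (0xFB1F, 0xFB28), (0xFB2A, 0xFB36), (0xFB38, 0xFB3C),
            (0xFB3E, 0xFB3E), (0xFB40, 0xFB41), (0xFB43, 0xFB44), (0xFB46, 0xFBB1),
            (0xFBD3, 0xFD3D), (0xFD50, 0xFD8F), (0xFD92, 0xFDC7), (0xFDF0, 0xFDFC),
            (0xFE70, 0xFE74), (0xFE76, 0xFEFC),
        };

        // Binary search over the sorted, non-overlapping ranges.
        public static bool IsRandALCat(int codepoint)
        {
            int lo = 0, hi = Ranges.Length - 1;
            while (lo <= hi)
            {
                int mid = (lo + hi) / 2;
                if (codepoint < Ranges[mid].First) hi = mid - 1;
                else if (codepoint > Ranges[mid].Last) lo = mid + 1;
                else return true;
            }
            return false;
        }
    }
    ```

    With this sketch, BidiTable.IsRandALCat(0x05D0) (HEBREW LETTER ALEF) is true and BidiTable.IsRandALCat(0x0041) (LATIN CAPITAL LETTER A) is false.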
    

    Here's some code to get the complete list as of Unicode 6.0:

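    using System;
    using System.Globalization;
    using System.Linq;
    using System.Net;

    var url = "http://www.unicode.org/Public/6.0.0/ucd/UnicodeData.txt";

    // Each record is a semicolon-separated line: field 0 is the code point
    // in hex and field 4 is the bidirectional class.
    var query = from record in new WebClient().DownloadString(url).Split('\n')
                where !string.IsNullOrEmpty(record)
                let properties = record.Split(';')
                where properties[4] == "R" || properties[4] == "AL"
                select int.Parse(properties[0], NumberStyles.AllowHexSpecifier);

    // Print every matching code point as at least four hex digits.
    foreach (var codepoint in query)
    {
        Console.WriteLine(codepoint.ToString("X4"));
    }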
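
    The query above yields one code point per line, whereas Table D.1 lists ranges. A possible way to collapse the output into ranges is sketched below; the RangeCompressor class and ToRanges method are hypothetical names, and the input is assumed to be strictly ascending (UnicodeData.txt is ordered by code point).

    ```csharp
    using System.Collections.Generic;

    static class RangeCompressor
    {
        // Collapses a strictly ascending sequence of code points into
        // inclusive (first, last) ranges of consecutive values.
        public static List<(int First, int Last)> ToRanges(IEnumerable<int> sortedCodepoints)
        {
            var ranges = new List<(int First, int Last)>();
            foreach (var cp in sortedCodepoints)
            {
                int last = ranges.Count - 1;
                if (last >= 0 && ranges[last].Last == cp - 1)
                {
                    // cp directly follows the current range, so extend it.
                    ranges[last] = (ranges[last].First, cp);
                }
                else
                {
                    // Gap (or first element): start a new one-element range.
                    ranges.Add((cp, cp));
                }
            }
            return ranges;
        }
    }
    ```

    Passing the query result to RangeCompressor.ToRanges would give a list that can be printed in the same "first-last" style as the RFC 3454 table.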
    

    Note that these values are Unicode code points. Strings in C#/.NET are UTF-16 encoded and need to be converted to Unicode code points first (see Char.ConvertToUtf32). Here's a method that checks if a string contains at least one RandALCat character:

    static bool IsAnyCharacterRightToLeft(string s)
    {
        for (var i = 0; i < s.Length; i += char.IsSurrogatePair(s, i) ? 2 : 1)
        {
            var codepoint = char.ConvertToUtf32(s, i);
            if (IsRandALCat(codepoint))
            {
                return true;
            }
        }
        return false;
    }
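
    To make the UTF-16 point concrete, here is a small sketch that enumerates the code points of a string, combining each surrogate pair into one value. The class and method names are assumptions for illustration. Note that char.ConvertToUtf32 throws an ArgumentException if the string contains an unpaired surrogate.

    ```csharp
    using System;
    using System.Collections.Generic;

    static class CodePoints
    {
        // Yields the Unicode code points of a UTF-16 string. A surrogate
        // pair occupies two chars but produces a single code point.
        public static IEnumerable<int> Enumerate(string s)
        {
            for (int i = 0; i < s.Length; i += char.IsSurrogatePair(s, i) ? 2 : 1)
            {
                yield return char.ConvertToUtf32(s, i);
            }
        }
    }
    ```

    For example, the string "a\U00010800" has a Length of 3 but yields only two code points, 0x61 and 0x10800. This is why indexing the string char by char is not enough for characters outside the Basic Multilingual Plane.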
    
  • 2020-11-28 06:21

    In my regex implementation I could use neither \u nor \x escapes, nor {} named language groups.

    So I built my own pattern programmatically from all the "R" and "AL" (RandALCat) bidirectional characters listed in UnicodeData.txt.

    [־׀׃א-״؛-ي٭-ە‏ײַ-ﳝﶈ-ﷺﺂ-ﻼ]
    

    This should be reasonably comprehensive; so far I have tested it on Arabic and Hebrew text.

  • 2020-11-28 06:22

    You can try using "named blocks" in regular expressions. Just pick out the blocks that are right to left, and form the regex. For example:

    \p{IsArabic}|\p{IsHebrew}
    

    If that regex matches, the string contains at least one Hebrew or Arabic character.
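
    A minimal sketch of how this could be used in .NET, where \p{IsHebrew} covers the block U+0590-U+05FF and \p{IsArabic} covers U+0600-U+06FF. The class and method names are assumptions. Because whole blocks are matched, characters such as Arabic-Indic digits also count, even though their bidirectional class is not "R" or "AL".

    ```csharp
    using System.Text.RegularExpressions;

    static class NamedBlockCheck
    {
        // Matches any single character from the Arabic or Hebrew block.
        static readonly Regex RtlBlocks = new Regex(@"\p{IsArabic}|\p{IsHebrew}");

        // True if the string contains at least one such character.
        public static bool ContainsHebrewOrArabic(string s) => RtlBlocks.IsMatch(s);
    }
    ```

    Other right-to-left blocks, for example \p{IsSyriac} or \p{IsThaana}, could be added to the alternation in the same way.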

  • 2020-11-28 06:26

    EDIT:

    This is what I use now. It includes the vowelization characters and everything else in Hebrew and Arabic:

    [\u0591-\u07FF]
    

    OLD ANSWER:

    If you need to detect RTL language in a sentence, this simplified RegEx will probably be enough:

    [א-ת؀-ۿ]
    

    Anyone writing in Hebrew has to use at least one of these characters, and the same holds for Arabic.

    It does not include vowelization characters, so if you need to catch whole words or absolutely all RTL characters, you had better use one of the other answers. Vowelization characters in Hebrew are very rare outside poetry. I don't know about Arabic texts.
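
    The difference between the two patterns can be sketched in .NET as follows. The class and field names are assumptions, and the literal-character pattern from the old answer is assumed to be equivalent to U+05D0-U+05EA plus U+0600-U+06FF.

    ```csharp
    using System.Text.RegularExpressions;

    static class RangePatterns
    {
        // Current pattern: everything from the first Hebrew accent (U+0591)
        // up to U+07FF, including vowelization marks.
        public static readonly Regex Broad = new Regex(@"[\u0591-\u07FF]");

        // Old pattern: Hebrew letters alef..tav plus the Arabic block,
        // written with escapes instead of literal characters.
        public static readonly Regex Simple = new Regex(@"[\u05D0-\u05EA\u0600-\u06FF]");
    }
    ```

    A Hebrew vowel point on its own, such as "\u05B8" (qamats), is matched by Broad but not by Simple, while an Arabic letter such as "\u0627" (alef) is matched by both.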

  • 2020-11-28 06:40

    All "AL" or "R" code points of Unicode 6.0 (from http://www.unicode.org/Public/6.0.0/ucd/UnicodeData.txt), where c is the code point to test:

    bool hasRandALCat = 0;
    if(c >= 0x5BE && c <= 0x10B7F)
    {
        if(c <= 0x85E)
        {
            if(c == 0x5BE)                        hasRandALCat = 1;
            else if(c == 0x5C0)                   hasRandALCat = 1;
            else if(c == 0x5C3)                   hasRandALCat = 1;
            else if(c == 0x5C6)                   hasRandALCat = 1;
            else if(0x5D0 <= c && c <= 0x5EA)     hasRandALCat = 1;
            else if(0x5F0 <= c && c <= 0x5F4)     hasRandALCat = 1;
            else if(c == 0x608)                   hasRandALCat = 1;
            else if(c == 0x60B)                   hasRandALCat = 1;
            else if(c == 0x60D)                   hasRandALCat = 1;
            else if(c == 0x61B)                   hasRandALCat = 1;
            else if(0x61E <= c && c <= 0x64A)     hasRandALCat = 1;
            else if(0x66D <= c && c <= 0x66F)     hasRandALCat = 1;
            else if(0x671 <= c && c <= 0x6D5)     hasRandALCat = 1;
            else if(0x6E5 <= c && c <= 0x6E6)     hasRandALCat = 1;
            else if(0x6EE <= c && c <= 0x6EF)     hasRandALCat = 1;
            else if(0x6FA <= c && c <= 0x70D)     hasRandALCat = 1;
            else if(c == 0x710)                   hasRandALCat = 1;
            else if(0x712 <= c && c <= 0x72F)     hasRandALCat = 1;
            else if(0x74D <= c && c <= 0x7A5)     hasRandALCat = 1;
            else if(c == 0x7B1)                   hasRandALCat = 1;
            else if(0x7C0 <= c && c <= 0x7EA)     hasRandALCat = 1;
            else if(0x7F4 <= c && c <= 0x7F5)     hasRandALCat = 1;
            else if(c == 0x7FA)                   hasRandALCat = 1;
            else if(0x800 <= c && c <= 0x815)     hasRandALCat = 1;
            else if(c == 0x81A)                   hasRandALCat = 1;
            else if(c == 0x824)                   hasRandALCat = 1;
            else if(c == 0x828)                   hasRandALCat = 1;
            else if(0x830 <= c && c <= 0x83E)     hasRandALCat = 1;
            else if(0x840 <= c && c <= 0x858)     hasRandALCat = 1;
            else if(c == 0x85E)                   hasRandALCat = 1;
        }
        else if(c == 0x200F)                      hasRandALCat = 1;
        else if(c >= 0xFB1D)
        {
            if(c == 0xFB1D)                       hasRandALCat = 1;
            else if(0xFB1F <= c && c <= 0xFB28)   hasRandALCat = 1;
            else if(0xFB2A <= c && c <= 0xFB36)   hasRandALCat = 1;
            else if(0xFB38 <= c && c <= 0xFB3C)   hasRandALCat = 1;
            else if(c == 0xFB3E)                  hasRandALCat = 1;
            else if(0xFB40 <= c && c <= 0xFB41)   hasRandALCat = 1;
            else if(0xFB43 <= c && c <= 0xFB44)   hasRandALCat = 1;
            else if(0xFB46 <= c && c <= 0xFBC1)   hasRandALCat = 1;
            else if(0xFBD3 <= c && c <= 0xFD3D)   hasRandALCat = 1;
            else if(0xFD50 <= c && c <= 0xFD8F)   hasRandALCat = 1;
            else if(0xFD92 <= c && c <= 0xFDC7)   hasRandALCat = 1;
            else if(0xFDF0 <= c && c <= 0xFDFC)   hasRandALCat = 1;
            else if(0xFE70 <= c && c <= 0xFE74)   hasRandALCat = 1;
            else if(0xFE76 <= c && c <= 0xFEFC)   hasRandALCat = 1;
            else if(0x10800 <= c && c <= 0x10805) hasRandALCat = 1;
            else if(c == 0x10808)                 hasRandALCat = 1;
            else if(0x1080A <= c && c <= 0x10835) hasRandALCat = 1;
            else if(0x10837 <= c && c <= 0x10838) hasRandALCat = 1;
            else if(c == 0x1083C)                 hasRandALCat = 1;
            else if(0x1083F <= c && c <= 0x10855) hasRandALCat = 1;
            else if(0x10857 <= c && c <= 0x1085F) hasRandALCat = 1;
            else if(0x10900 <= c && c <= 0x1091B) hasRandALCat = 1;
            else if(0x10920 <= c && c <= 0x10939) hasRandALCat = 1;
            else if(c == 0x1093F)                 hasRandALCat = 1;
            else if(c == 0x10A00)                 hasRandALCat = 1;
            else if(0x10A10 <= c && c <= 0x10A13) hasRandALCat = 1;
            else if(0x10A15 <= c && c <= 0x10A17) hasRandALCat = 1;
            else if(0x10A19 <= c && c <= 0x10A33) hasRandALCat = 1;
            else if(0x10A40 <= c && c <= 0x10A47) hasRandALCat = 1;
            else if(0x10A50 <= c && c <= 0x10A58) hasRandALCat = 1;
            else if(0x10A60 <= c && c <= 0x10A7F) hasRandALCat = 1;
            else if(0x10B00 <= c && c <= 0x10B35) hasRandALCat = 1;
            else if(0x10B40 <= c && c <= 0x10B55) hasRandALCat = 1;
            else if(0x10B58 <= c && c <= 0x10B72) hasRandALCat = 1;
            else if(0x10B78 <= c && c <= 0x10B7F) hasRandALCat = 1;
        }
    }
    