Why do those Thai characters display on the web page with a long tail?

北荒 2021-02-01 16:54

ด้้้้้็็็็็้้้้้็็็็็้้้้้็็็็็้้้้้็็็็็้้้้้็็็็็้้้้้็็็็็้้้้้็็็็็้้้้้дด็็็็็้้้้้็็็็้้้้้็็็็็้้้้้็็็็็้้้้้็็็็็้้้้้

I found some interesting characters just

4 answers
  •  暖寄归人
    2021-02-01 17:14

    The characters you mention are encoded in UTF-8, where each of them takes 3 bytes. The respective Unicode code points are:

    • DO DEK 0x0e14

    • MAI THO 0x0e49

    • MAITAIKHU 0x0e47
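
    As a quick check of the 3-byte claim, the following minimal C# sketch encodes each of the three code points as UTF-8 and prints the byte count. It assumes a compiler that supports top-level statements (C# 9 or later); the variable names are illustrative only.

    ```csharp
    using System;
    using System.Text;

    // The three code points from the list above.
    int[] codePoints = { 0x0E14, 0x0E49, 0x0E47 };

    foreach (int cp in codePoints)
    {
        // ConvertFromUtf32 turns a code point into its UTF-16 string form.
        string s = char.ConvertFromUtf32(cp);
        byte[] utf8 = Encoding.UTF8.GetBytes(s);
        // BitConverter.ToString formats the bytes as hex pairs separated by dashes.
        Console.WriteLine($"U+{cp:X4}: {utf8.Length} bytes ({BitConverter.ToString(utf8)})");
    }
    ```

    Each line should report 3 bytes, for example `U+0E14: 3 bytes (E0-B8-94)`, since UTF-8 uses three bytes for every code point from U+0800 to U+FFFF.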

    The latter two are in the category Mn (Mark, Nonspacing), so in rendering they are combined with the preceding base character instead of taking up space of their own. MAI THO also has a Canonical_Combining_Class of 107, while MAITAIKHU's is 0.
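
    The general category can be checked with `CharUnicodeInfo.GetUnicodeCategory`. This is a small sketch, again assuming top-level statements (C# 9 or later); it looks only at the category, not at Canonical_Combining_Class.

    ```csharp
    using System;
    using System.Globalization;

    // DO DEK, MAI THO, MAITAIKHU
    char[] chars = { '\u0E14', '\u0E49', '\u0E47' };

    foreach (char c in chars)
    {
        // Returns the Unicode general category as a UnicodeCategory enum value.
        UnicodeCategory cat = CharUnicodeInfo.GetUnicodeCategory(c);
        Console.WriteLine($"U+{(int)c:X4}: {cat}");
    }
    ```

    DO DEK should be reported as `OtherLetter` and the other two as `NonSpacingMark`.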

    Your example starts with a single base character and stacks many nonspacing marks on top of it, which produces the long tail.

    Compare with this C# code:

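    // The three characters from the question, given by code point
    char DODEK = (char)0x0e14;      // base letter
    char MAITHO = (char)0x0e49;     // nonspacing mark
    char MAITAIKHU = (char)0x0e47;  // nonspacing mark
    
    string thai = new string(new char[] { DODEK, MAITHO, MAITAIKHU });
    // Length counts UTF-16 code units; for these BMP characters that equals the number of code points
    Console.WriteLine("number of code points: " + thai.Length);
    
    // StringInfo groups a base character and its combining marks into one text element
    var si = new System.Globalization.StringInfo(thai);
    Console.WriteLine("number of text elements: " + si.LengthInTextElements);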
    

    Output:

    number of code points: 3
    number of text elements: 1
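
    To see the same effect on a longer string, the sketch below builds a made-up input (one DO DEK with 20 marks stacked on it, followed by a plain DO DEK) and walks it with `StringInfo.GetTextElementEnumerator`. The input and variable names are assumptions for illustration, not the string from the question, and the code again assumes top-level statements (C# 9 or later).

    ```csharp
    using System;
    using System.Globalization;
    using System.Text;

    // Hypothetical input: DO DEK + 20 alternating nonspacing marks + a second, plain DO DEK.
    var sb = new StringBuilder();
    sb.Append('\u0E14');
    for (int i = 0; i < 20; i++)
    {
        sb.Append(i % 2 == 0 ? '\u0E49' : '\u0E47');
    }
    sb.Append('\u0E14');
    string text = sb.ToString();

    // Each text element is a base character together with the marks attached to it.
    TextElementEnumerator e = StringInfo.GetTextElementEnumerator(text);
    while (e.MoveNext())
    {
        string element = e.GetTextElement();
        Console.WriteLine($"text element at index {e.ElementIndex}: {element.Length} UTF-16 code units");
    }
    ```

    The first text element should span 21 UTF-16 code units and the second just 1, so the whole tail of marks belongs to the first letter.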
    

    See also the .NET StringInfo class.
