Using Unicode characters bigger than 2 bytes with .NET

说谎 2020-12-15 08:32

I'm using this code to generate U+10FFFC:

var s = Encoding.UTF8.GetString(new byte[] {0xF4,0x8F,0xBF,0xBC});

I know it's for

4 answers
  •  醉梦人生
    2020-12-15 09:19

    While @R. Martinho Fernandes's answer is correct, his AsCodePoints extension method has two issues:

    1. It will throw an ArgumentException on invalid code points (high surrogate without low surrogate or vice versa).
    2. You can't use char static methods that take (char) or (string, int) (such as char.IsNumber()) if you only have int code points.
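    Both issues can be reproduced with a short sketch. The class `SurrogateIssuesDemo` and its two helper methods are hypothetical names used only for illustration; the framework calls involved are `char.ConvertToUtf32`, `char.ConvertFromUtf32`, and `char.IsNumber`.

    ```csharp
    using System;

    // Hypothetical demo class; not part of the original answer.
    public static class SurrogateIssuesDemo {

        // Issue 1: char.ConvertToUtf32(string, int) throws ArgumentException
        // when the char at the index is an unpaired surrogate.
        public static bool ThrowsOnLoneSurrogate() {
            string lone = "a\uD800b"; // high surrogate followed by 'b', not a low surrogate
            try {
                char.ConvertToUtf32(lone, 1);
                return false;
            } catch (ArgumentException) {
                return true;
            }
        }

        // Issue 2: char.IsNumber has (char) and (string, int) overloads but none
        // taking an int code point, so an int has to be turned back into a string.
        // Note: char.ConvertFromUtf32 itself throws for values that are not valid
        // Unicode scalar values (surrogates, or anything above 0x10FFFF).
        public static bool IsNumberViaRoundTrip(int codePoint) {
            string s = char.ConvertFromUtf32(codePoint);
            return char.IsNumber(s, 0);
        }
    }
    ```

    Calling `SurrogateIssuesDemo.IsNumberViaRoundTrip(0x1D7D9)` classifies a supplementary-plane code point this way, at the cost of allocating a string per code point. Keeping the string index around avoids that round trip.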

    I've split the code into two methods. One (CodePointInts) is similar to the original but returns the Unicode replacement character (U+FFFD) for invalid code points. The other (CodePointIndexes) returns an IEnumerable of structs with more useful fields:

    StringCodePointExtensions.cs:

    using System.Collections.Generic;
    using System.Linq;

    public static class StringCodePointExtensions {
    
        const char ReplacementCharacter = '\ufffd';
    
        // Yields one entry per code point (or per unpaired surrogate) in s.
        // Index is the position of the first UTF-16 code unit of that entry.
        public static IEnumerable<CodePointIndex> CodePointIndexes(this string s) {
            for (int i = 0; i < s.Length; i++) {
                if (char.IsHighSurrogate(s, i)) {
                    if (i + 1 < s.Length && char.IsLowSurrogate(s, i + 1)) {
                        // Valid surrogate pair: skip over the low surrogate
                        yield return CodePointIndex.Create(i, true, true);
                        i++;
                        continue;
    
                    } else {
                        // High surrogate without low surrogate
                        yield return CodePointIndex.Create(i, false, false);
                        continue;
                    }
    
                } else if (char.IsLowSurrogate(s, i)) {
                    // Low surrogate without high surrogate
                    yield return CodePointIndex.Create(i, false, false);
                    continue;
                }
    
                // Ordinary BMP character
                yield return CodePointIndex.Create(i, true, false);
            }
        }
    
        // Yields the code points of s as ints, substituting U+FFFD
        // for unpaired surrogates instead of throwing.
        public static IEnumerable<int> CodePointInts(this string s) {
            return s
                .CodePointIndexes()
                .Select(
                cpi => {
                    if (cpi.Valid) {
                        return char.ConvertToUtf32(s, cpi.Index);
                    } else {
                        return (int)ReplacementCharacter;
                    }
                });
        }
    }
    
    

    CodePointIndex.cs:

    public struct CodePointIndex {
        public int Index;
        public bool Valid;
        public bool IsSurrogatePair;
    
        public static CodePointIndex Create(int index, bool valid, bool isSurrogatePair) {
            return new CodePointIndex {
                Index = index,
                Valid = valid,
                IsSurrogatePair = isSurrogatePair,
            };
        }
    }
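
    The point of returning indexes rather than ints is that the `(string, int)` overloads of the `char` static methods remain usable. A minimal sketch of that usage follows. So that it compiles on its own, it uses a condensed stand-in (`Walk`, a hypothetical name) for `CodePointIndexes()` that yields tuples instead of `CodePointIndex`, which assumes C# 7 or later; `CodePointUsageSketch`, `CountNumbers`, and `Split` are likewise illustrative names.

    ```csharp
    using System;
    using System.Collections.Generic;

    // Hypothetical usage sketch; not part of the original answer.
    public static class CodePointUsageSketch {

        // Condensed stand-in for CodePointIndexes(): same three fields, as a tuple.
        static IEnumerable<(int Index, bool Valid, bool IsSurrogatePair)> Walk(string s) {
            for (int i = 0; i < s.Length; i++) {
                if (char.IsSurrogate(s, i)) {
                    // True only for a high surrogate immediately followed by a low one.
                    bool pair = char.IsSurrogatePair(s, i);
                    yield return (i, pair, pair);
                    if (pair) i++; // skip the low surrogate
                } else {
                    yield return (i, true, false);
                }
            }
        }

        // Counts code points that char.IsNumber(string, int) classifies as numbers.
        public static int CountNumbers(string s) {
            int count = 0;
            foreach (var cp in Walk(s)) {
                if (cp.Valid && char.IsNumber(s, cp.Index)) count++;
            }
            return count;
        }

        // Splits s into one string per code point; unpaired surrogates become U+FFFD.
        public static List<string> Split(string s) {
            var parts = new List<string>();
            foreach (var cp in Walk(s)) {
                if (!cp.Valid) parts.Add("\uFFFD");
                else parts.Add(s.Substring(cp.Index, cp.IsSurrogatePair ? 2 : 1));
            }
            return parts;
        }
    }
    ```

    With the real extension method, the loops would read `foreach (var cp in s.CodePointIndexes())` and use the same `Index`, `Valid`, and `IsSurrogatePair` fields.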
    

    To the extent possible under law, the person who associated CC0 with this work has waived all copyright and related or neighboring rights to this work.
