Split JavaScript string into array of codepoints? (taking into account “surrogate pairs” but not “grapheme clusters”)

佛祖请我去吃肉 2020-12-10 12:19

Splitting a JavaScript string into "characters" can be done trivially, but there are problems if you care about Unicode (and you should care about Unicode).
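
To make the problem concrete, here is a minimal sketch showing how the trivial split tears a character outside the Basic Multilingual Plane into two lone surrogates. The sample string and the variable names are assumptions chosen for illustration, not taken from the original question.

```javascript
// "A", then U+1D468 (stored as the surrogate pair D835 DC68), then "B".
const sample = 'A\uD835\uDC68B';

// split('') works on UTF-16 code units, so the surrogate pair is cut in half.
const units = sample.split('');

console.log(sample.length); // 4 code units, although there are only 3 code points
console.log(units.length);  // 4
console.log(units[1] === '\uD835' && units[2] === '\uDC68'); // true
```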

JavaScr

4 answers
  • 2020-12-10 12:29

    Another method using codePointAt:

    String.prototype.toCodePoints = function () {
      var arCP = [];
      for (var i = 0; i < this.length; i += 1) {
        var cP = this.codePointAt(i);
        arCP.push(cP);
        if (cP >= 0x10000) {
          i += 1;
        }
      }
      return arCP;
    }
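
    If extending String.prototype is not desirable, the same loop can be written as a standalone function. This is only a sketch: the free function name `toCodePoints` and the sample input are assumptions made for illustration.

    ```javascript
    // Standalone variant of the loop above (hypothetical helper, not a library API).
    function toCodePoints(str) {
      const codePoints = [];
      for (let i = 0; i < str.length; i += 1) {
        const cp = str.codePointAt(i);
        codePoints.push(cp);
        // Code points at or above 0x10000 occupy two UTF-16 code units,
        // so step over the low surrogate as well.
        if (cp >= 0x10000) {
          i += 1;
        }
      }
      return codePoints;
    }

    console.log(toCodePoints('A\uD835\uDC68B')); // [65, 119912, 66]
    ```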
    
  • 2020-12-10 12:34

    @bobince's answer has (luckily) become a bit dated; you can now simply use

    var chars = Array.from( text )
    

    to obtain a list of single-codepoint strings which does respect astral / 32bit / surrogate Unicode characters.
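
    If numeric code points are wanted rather than one-character strings, the optional mapping argument of Array.from can do the conversion in the same pass. A small sketch, with a sample string assumed for illustration:

    ```javascript
    const text = 'A\uD835\uDC68B';

    // Array.from uses the string iterator, which yields one string per code point.
    const chars = Array.from(text);

    // The mapping function turns each single-code-point string into its number.
    const codePoints = Array.from(text, (ch) => ch.codePointAt(0));

    console.log(chars.length); // 3
    console.log(codePoints);   // [65, 119912, 66]
    ```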

  • 2020-12-10 12:35

    In ECMAScript 6 you'll be able to use a string as an iterable to get code points, or you could match a string against /./ug, or you could call codePointAt(i) repeatedly.

    Unfortunately, for..of syntax and regexp flags can't be polyfilled, and calling a polyfilled codePointAt() would be super slow (O(n²)), so we can't realistically use this approach for a while yet.
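
    In engines that do support ES6 natively, the first two options might look like the sketch below. The sample string and variable names are assumptions. Note that with the u flag `.` still skips line terminators, so the sketch also shows `[\s\S]` as one way to match every code point.

    ```javascript
    const text = 'A\uD835\uDC68\nB';

    // Option 1: for...of walks the string one code point at a time.
    const viaIterator = [];
    for (const ch of text) {
      viaIterator.push(ch);
    }

    // Option 2: with the u flag, a regexp matches whole code points.
    // /./gu does not match the newline; /[\s\S]/gu matches everything.
    const viaDot = text.match(/./gu) || [];
    const viaRegex = text.match(/[\s\S]/gu) || [];

    console.log(viaIterator.length); // 4
    console.log(viaDot.length);      // 3 (the newline is skipped)
    console.log(viaRegex.length);    // 4
    ```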

    So doing it the manual way:

    String.prototype.toCodePoints= function() {
        var chars = [];
        for (var i= 0; i<this.length; i++) {
            var c1= this.charCodeAt(i);
            if (c1>=0xD800 && c1<0xDC00 && i+1<this.length) {
                var c2= this.charCodeAt(i+1);
                if (c2>=0xDC00 && c2<0xE000) {
                    chars.push(0x10000 + ((c1-0xD800)<<10) + (c2-0xDC00));
                    i++;
                    continue;
                }
            }
            chars.push(c1);
        }
        return chars;
    }
    

    For the inverse to this see https://stackoverflow.com/a/3759300/18936
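
    The linked answer is not reproduced here. As a separate sketch, in an ES6 environment the inverse can be built on the built-in String.fromCodePoint; the helper name `fromCodePoints` is hypothetical and this is not necessarily the approach the linked answer takes.

    ```javascript
    // Hypothetical helper: turn an array of code points back into a string.
    function fromCodePoints(codePoints) {
      let result = '';
      for (const cp of codePoints) {
        // String.fromCodePoint emits a surrogate pair for values above 0xFFFF
        // and throws a RangeError for values outside 0..0x10FFFF.
        result += String.fromCodePoint(cp);
      }
      return result;
    }

    console.log(fromCodePoints([65, 0x1D468, 66]) === 'A\uD835\uDC68B'); // true
    ```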

  • 2020-12-10 12:36

    Along the lines of @John Frazer's answer, one can use this even more succinct form of string iteration:

    const chars = [...text]
    

    e.g., with:

    const text = 'A\uD835\uDC68B\uD835\uDC69C\uD835\uDC6A'
    const chars = [...text] // ["A", "𝑨", "B", "𝑩", "C", "𝑪"]