Remove non-utf8 characters from string

心在旅途 2020-11-22 11:56

I'm having a problem removing non-UTF-8 characters from a string; they are not displayed properly. The characters look like this: 0x97 0x61 0x6C 0x6F (hex representation).

18 answers
  •  粉色の甜心
    2020-11-22 12:31

    So the rules are that the lead octet of a multi-octet UTF-8 sequence has the high bit set as a marker, followed by 1 to 3 more set bits (and then a 0 bit) that indicate how many additional octets follow; each of those additional octets must have its high two bits set to 10.

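    A Python sketch of this (it checks only the bit structure, not overlong or out-of-range encodings, and simply drops whatever is malformed) would be:

    def strip_invalid_utf8(data):
        out = bytearray()      # validated output
        pending = bytearray()  # octets of the sequence being collected
        cont = 0               # continuation octets still expected
        for ch in data:        # iterating over bytes yields ints
            if cont and (ch >> 6) == 0b10:  # high 2 bits are 10
                # acceptable continuation of a multi-octet char
                pending.append(ch)
                cont -= 1
                if cont == 0:
                    out += pending  # sequence complete, keep it
                continue
            # any unfinished sequence is malformed: drop it, examine ch afresh
            cont = 0
            if ch < 0x80:  # high bit clear: plain ASCII
                out.append(ch)
                continue
            c = (ch << 1) & 0xFF  # strip the high bit marker
            while c & 0x80:  # each further set bit means another octet
                c = (c << 1) & 0xFF
                cont += 1
            if cont == 0 or cont > 3:
                # stray continuation octet, or more than 4 octets: skip it
                cont = 0
            else:
                pending = bytearray([ch])
        # a sequence still pending here was never terminated and is dropped
        return bytes(out)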
    

    This same logic should be translatable to PHP. However, it's not clear what kind of stripping should be done once you hit a malformed character.
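To make the stripping choice concrete, here is a minimal sketch that leans on Python's built-in UTF-8 decoder instead of a hand-written loop. The function names `strip_ignore` and `strip_replace` are hypothetical, chosen only for illustration. The first drops malformed bytes silently, and the second substitutes U+FFFD so the damage stays visible. Which one is appropriate depends on whether losing data silently is acceptable.

```python
def strip_ignore(data: bytes) -> str:
    """Decode as UTF-8, silently dropping any malformed bytes."""
    return data.decode("utf-8", errors="ignore")


def strip_replace(data: bytes) -> str:
    """Decode as UTF-8, marking malformed input with U+FFFD."""
    return data.decode("utf-8", errors="replace")


raw = bytes([0x97, 0x61, 0x6C, 0x6F])  # the bytes from the question
print(strip_ignore(raw))         # alo
print(repr(strip_replace(raw)))  # '\ufffdalo'
```

Here 0x97 is a continuation octet with no lead octet before it, so it is the only byte affected; the remaining three bytes are plain ASCII.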
