Remove non-utf8 characters from string

后端未结

关注

 18  1420

心在旅途 2020-11-22 11:56

Im having a problem with removing non-utf8 characters from string, which are not displaying properly. Characters are like this 0x97 0x61 0x6C 0x6F (hex representation)

18条回答

粉色の甜心 (楼主)

2020-11-22 12:31

So the rules are that the first UTF-8 octlet has the high bit set as a marker, and then 1 to 4 bits to indicate how many additional octlets; then each of the additional octlets must have the high two bits set to 10.

The pseudo-python would be:

newstring = ''
cont = 0
for each ch in string:
  if cont:
    if (ch >> 6) != 2: # high 2 bits are 10
      # do whatever, e.g. skip it, or skip whole point, or?
    else:
      # acceptable continuation of multi-octlet char
      newstring += ch
    cont -= 1
  else:
    if (ch >> 7): # high bit set?
      c = (ch << 1) # strip the high bit marker
      while (c & 1): # while the high bit indicates another octlet
        c <<= 1
        cont += 1
        if cont > 4:
           # more than 4 octels not allowed; cope with error
      if !cont:
        # illegal, do something sensible
      newstring += ch # or whatever
if cont:
  # last utf-8 was not terminated, cope

This same logic should be translatable to php. However, its not clear what kind of stripping is to be done once you get a malformed character.

0 讨论(0)

查看其它18个回答