How to determine if a String contains invalid encoded characters

眼角桃花 2020-12-02 11:38

Usage scenario

We have implemented a web service that our web frontend developers use (via a PHP API) internally to display product data. On the webs

10 Answers
  • 2020-12-02 12:03

    I've been working on a similar "guess the encoding" problem. The best solution involves knowing the encoding. Barring that, you can make educated guesses to distinguish between UTF-8 and ISO-8859-1.

    To answer the general question of how to detect if a string is properly encoded UTF-8, you can verify the following things:

    1. No byte is 0x00, 0xC0, 0xC1, or in the range 0xF5-0xFF.
    2. Tail bytes (0x80-0xBF) are always preceded by a head byte 0xC2-0xF4 or another tail byte.
    3. Head bytes should correctly predict the number of tail bytes (e.g., any byte in 0xC2-0xDF should be followed by exactly one byte in the range 0x80-0xBF).

    If a string passes all those tests, then it's interpretable as valid UTF-8. That doesn't guarantee that it is UTF-8, but it's a good predictor.
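
    The three checks above can be sketched in Java roughly as follows. The class and method names (Utf8Check, looksLikeUtf8) are made up for illustration. Note that this sketch only applies the rules as listed, so it still accepts some overlong three-byte forms and encoded surrogates; a strict decoder would reject those.

    ```java
    final class Utf8Check {
        /** Returns true if the bytes pass the three structural checks listed above. */
        static boolean looksLikeUtf8(byte[] bytes) {
            int i = 0;
            while (i < bytes.length) {
                int b = bytes[i] & 0xFF;   // read the signed byte as 0..255
                int tails;                 // number of tail bytes the head byte predicts
                if (b == 0x00 || b == 0xC0 || b == 0xC1 || b >= 0xF5) {
                    return false;          // rule 1: byte values that should never appear
                } else if (b < 0x80) {
                    tails = 0;             // plain ASCII
                } else if (b < 0xC0) {
                    return false;          // rule 2: tail byte without a preceding head byte
                } else if (b < 0xE0) {
                    tails = 1;             // head byte 0xC2-0xDF
                } else if (b < 0xF0) {
                    tails = 2;             // head byte 0xE0-0xEF
                } else {
                    tails = 3;             // head byte 0xF0-0xF4
                }
                // rule 3: the predicted number of tail bytes must actually follow
                for (int k = 1; k <= tails; k++) {
                    if (i + k >= bytes.length) {
                        return false;      // sequence is cut off at the end of the input
                    }
                    int t = bytes[i + k] & 0xFF;
                    if (t < 0x80 || t > 0xBF) {
                        return false;      // not a tail byte
                    }
                }
                i += tails + 1;
            }
            return true;
        }
    }
    ```

    For example, the ISO-8859-1 bytes of "hél" (0x68 0xE9 0x6C) fail the check, because 0xE9 announces two tail bytes that never arrive.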

    Legal input in ISO-8859-1 will likely have no control characters (0x00-0x1F and 0x80-0x9F) other than line separators. Looks like 0x7F isn't defined in ISO-8859-1 either.
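
    A minimal sketch of that heuristic in Java, with hypothetical names (Latin1Check, looksLikeLatin1Text). It reads "line separators" strictly as LF and CR, so a tab also counts as suspicious here; that is an assumption you may want to relax.

    ```java
    final class Latin1Check {
        /** Returns true if no byte is a control code other than LF or CR. */
        static boolean looksLikeLatin1Text(byte[] bytes) {
            for (byte raw : bytes) {
                int b = raw & 0xFF;  // read the signed byte as 0..255
                boolean lineSeparator = (b == 0x0A || b == 0x0D);
                // C0 controls, DEL, and the C1 range 0x80-0x9F
                boolean control = b <= 0x1F || b == 0x7F || (b >= 0x80 && b <= 0x9F);
                if (control && !lineSeparator) {
                    return false;
                }
            }
            return true;
        }
    }
    ```

    This is only a plausibility test. The UTF-8 encoding of "é" (0xC3 0xA9) passes it, since both bytes are printable ISO-8859-1 characters, while the UTF-8 encoding of "€" (0xE2 0x82 0xAC) fails because 0x82 falls in the C1 range.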

    (I'm basing this off of Wikipedia pages for UTF-8 and ISO-8859-1.)

  • 2020-12-02 12:03

    You need to set up the character encoding from the start. Try sending the proper Content-Type header, for example Content-Type: text/html; charset=utf-8, to declare the right encoding. For standards conformance, UTF-8 and UTF-16 are the expected encodings for web services. Examine your response headers.

    Also, on the server side, in case the browser does not properly handle the encoding sent by the server, you can force the encoding by allocating a new String. You can also check each byte of the string with each_byte & 0x80; a nonzero result means the byte is outside the ASCII range.

    
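    boolean utfEncoded = true;
    // getBytes() without an argument uses the platform default charset.
    byte[] strBytes = queryString.getBytes();
    // This test only tells ASCII from non-ASCII bytes; it does not validate UTF-8 sequences.
    for (int i = 0; i < strBytes.length; i++) {
        if ((strBytes[i] & 0x80) != 0) {
            // high bit set: non-ASCII byte, keep scanning
            continue;
        } else {
            /* treat the string as non utf encoded */
            utfEncoded = false;
            break;
        }
    }
    
    // Otherwise re-decode the bytes as ISO-8859-1 (may throw UnsupportedEncodingException).
    String realQueryString = utfEncoded ?
        queryString : new String(queryString.getBytes(), "iso-8859-1");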
    

    Also, take a look at this article; I hope it helps.
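
    As a stricter alternative to the high-bit test, here is a hedged sketch that works on the raw request bytes: it tries a strict UTF-8 decode and falls back to ISO-8859-1 only if that fails. It swaps in the JDK's CharsetDecoder rather than the byte loop from this answer, and the class and method names (DecodeWithFallback, decode) are invented for the example. It also assumes you still have the original bytes; once they have been decoded into a String with the wrong charset, information may already be lost.

    ```java
    import java.nio.ByteBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.CodingErrorAction;
    import java.nio.charset.StandardCharsets;

    final class DecodeWithFallback {
        /** Decodes raw bytes as UTF-8 if they are well-formed, else as ISO-8859-1. */
        static String decode(byte[] raw) {
            try {
                // REPORT makes the decoder throw instead of substituting U+FFFD
                return StandardCharsets.UTF_8.newDecoder()
                        .onMalformedInput(CodingErrorAction.REPORT)
                        .onUnmappableCharacter(CodingErrorAction.REPORT)
                        .decode(ByteBuffer.wrap(raw))
                        .toString();
            } catch (CharacterCodingException e) {
                // Every byte sequence is decodable as ISO-8859-1, so this cannot fail
                return new String(raw, StandardCharsets.ISO_8859_1);
            }
        }
    }
    ```

    With this, both the UTF-8 bytes and the ISO-8859-1 bytes of "café" decode to the same string. Input that happens to be valid in both encodings is still treated as UTF-8, so this remains a guess rather than a guarantee.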

  • 2020-12-02 12:09

    The following regular expression might be of interest to you:

    http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/185624

    I use it in Ruby as follows:

    module Encoding
        UTF8RGX = /\A(
            [\x09\x0A\x0D\x20-\x7E]            # ASCII
          | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
          |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
          | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
          |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
          |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
          | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
          |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
        )*\z/x unless defined? UTF8RGX
    
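        # Prints every line that is not valid UTF-8 (with its line number)
        # and returns true only if all lines are valid.
        def self.utf8_file?(fileName)
          count = 0
          valid = true
          File.foreach(fileName) do |l|
            count += 1
            unless utf8_string?(l)
              puts "#{count}: #{l}"
              valid = false
            end
          end
          valid
        end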
    
        def self.utf8_string?(a_string)
          UTF8RGX === a_string
        end
    
    end
    
  • 2020-12-02 12:11

    Replace all control characters with an empty string:

    value = value.replaceAll("\\p{Cntrl}", "");
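
    To make the effect of that one-liner concrete, here is a small sketch with hypothetical helper names (ControlCharDemo, stripControls, stripControlsKeepWhitespace). In java.util.regex, \p{Cntrl} is the POSIX class [\x00-\x1F\x7F], so it also removes tabs and line breaks; the second method is one possible variant that keeps them.

    ```java
    final class ControlCharDemo {
        /** Removes every ASCII control character, including tab, LF and CR. */
        static String stripControls(String value) {
            return value.replaceAll("\\p{Cntrl}", "");
        }

        /** Variant (an assumption, not from the answer): keeps tab, LF and CR. */
        static String stripControlsKeepWhitespace(String value) {
            // character-class intersection: control characters except \t, \n, \r
            return value.replaceAll("[\\p{Cntrl}&&[^\\t\\n\\r]]", "");
        }

        public static void main(String[] args) {
            String sample = "a\tb\nc\u0000d";
            System.out.println(stripControls(sample).length());
            System.out.println(stripControlsKeepWhitespace(sample).length());
        }
    }
    ```

    Note that by default this class is ASCII-only, so characters in the U+0080 to U+009F range are left untouched.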
    