How to determine if a String contains invalid encoded characters


Usage scenario

We have implemented a webservice that our web frontend developers use (via a php api) internally to display product data. On the webs

10 Answers

执念已碎

2020-12-02 12:03
I've been working on a similar "guess the encoding" problem. The best solution involves knowing the encoding. Barring that, you can make educated guesses to distinguish between UTF-8 and ISO-8859-1.

To answer the general question of how to detect if a string is properly encoded UTF-8, you can verify the following things:
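1. No byte is 0xC0, 0xC1, or in the range 0xF5-0xFF. (0x00 is technically legal UTF-8, but it should not appear in ordinary text, so treat it as suspicious too.)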
2. Tail bytes (0x80-0xBF) are always preceded by a head byte 0xC2-0xF4 or another tail byte.
3. Head bytes should correctly predict the number of tail bytes (e.g., any byte in 0xC2-0xDF should be followed by exactly one byte in the range 0x80-0xBF).
If a string passes all those tests, then it's interpretable as valid UTF-8. That doesn't guarantee that it is UTF-8, but it's a good predictor.
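
The three checks above can be sketched in Java as follows. This is a minimal sketch, not a full validator: the class and method names are made up for illustration, it rejects 0x00 as the list does, and it does not catch overlong three- and four-byte forms or encoded surrogates.

```java
import java.nio.charset.StandardCharsets;

class Utf8Rules {
    /** Returns true if the bytes pass the three structural checks listed above. */
    static boolean looksLikeUtf8(byte[] bytes) {
        int i = 0;
        while (i < bytes.length) {
            int b = bytes[i] & 0xFF;   // read the byte as an unsigned value
            int tails;                 // tail bytes announced by this head byte
            if (b == 0x00) {
                return false;          // rule 1: NUL is not expected in text
            } else if (b < 0x80) {
                tails = 0;             // plain ASCII
            } else if (b >= 0xC2 && b <= 0xDF) {
                tails = 1;
            } else if (b >= 0xE0 && b <= 0xEF) {
                tails = 2;
            } else if (b >= 0xF0 && b <= 0xF4) {
                tails = 3;
            } else {
                // rule 1 (0xC0, 0xC1, 0xF5-0xFF) or rule 2 (tail byte without a head byte)
                return false;
            }
            // rule 3: the announced number of tail bytes must actually follow
            for (int k = 1; k <= tails; k++) {
                if (i + k >= bytes.length) return false;
                int t = bytes[i + k] & 0xFF;
                if (t < 0x80 || t > 0xBF) return false;
            }
            i += tails + 1;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(looksLikeUtf8(utf8));   // true
        System.out.println(looksLikeUtf8(latin1)); // false: 0xE9 announces two tail bytes that never come
    }
}
```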

Legal input in ISO-8859-1 will likely have no control characters (0x00-0x1F and 0x80-0x9F) other than line separators. Looks like 0x7F isn't defined in ISO-8859-1 either.
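
A rough sketch of that ISO-8859-1 plausibility test, again with hypothetical names. Allowing tab in addition to the line separators is my own assumption; adjust the whitelist to your data.

```java
import java.nio.charset.StandardCharsets;

class Latin1Heuristic {
    /** Returns true if no byte is a control code other than tab, LF or CR. */
    static boolean plausibleLatin1(byte[] bytes) {
        for (byte raw : bytes) {
            int b = raw & 0xFF;
            boolean allowed = b == 0x09 || b == 0x0A || b == 0x0D;
            boolean c0Control = b <= 0x1F && !allowed;      // 0x00-0x1F
            boolean delOrC1 = b >= 0x7F && b <= 0x9F;       // 0x7F and 0x80-0x9F
            if (c0Control || delOrC1) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // Genuine Latin-1 text passes.
        System.out.println(plausibleLatin1("caf\u00e9\n".getBytes(StandardCharsets.ISO_8859_1)));
        // The euro sign in UTF-8 is E2 82 AC; 0x82 falls in the C1 control range.
        System.out.println(plausibleLatin1("\u20ac".getBytes(StandardCharsets.UTF_8)));
    }
}
```

Keep in mind that this is only a hint: many UTF-8 sequences (for example C3 A9 for "é") contain no control bytes and still pass as plausible Latin-1.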

(I'm basing this off of Wikipedia pages for UTF-8 and ISO-8859-1.)

天命终不由人

2020-12-02 12:03
You need to set up the character encoding from the start. Try sending the proper Content-Type header, for example Content-Type: text/html; charset=utf-8, to declare the right encoding. The web services conformance standards name UTF-8 and UTF-16 as the proper encodings. Examine your response headers.
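
When examining the headers on the consuming side, a small helper like the one sketched here can pull the declared charset out of a Content-Type value. The class and method names are hypothetical, and the header string is assumed to come from your HTTP client (for instance URLConnection.getContentType()).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

class ContentTypeCharset {
    /** Extracts the charset parameter from a Content-Type value, or returns the fallback. */
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) return fallback;
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    return fallback;   // illegal or unsupported charset name
                }
            }
        }
        return fallback;               // no charset parameter present
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=utf-8", StandardCharsets.ISO_8859_1));
        System.out.println(charsetOf("text/html", StandardCharsets.ISO_8859_1));
    }
}
```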

Also, on the server side, in case the browser does not properly handle the encoding sent by the server, you can force the encoding by allocating a new String. As a rough check, you can test each byte of the string with b & 0x80: a nonzero result means the byte is outside the ASCII range, which in UTF-8 only happens inside a multi-byte sequence.
```
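import java.nio.charset.StandardCharsets;

boolean utfEncoded = true;
// getBytes() without arguments uses the platform default charset
byte[] strBytes = queryString.getBytes();
for (int i = 0; i < strBytes.length; i++) {   // arrays expose a length field, not a method
    if ((strBytes[i] & 0x80) == 0) {
        /* high bit clear: plain ASCII byte, treat the string as non utf encoded */
        /* caveat: a single ASCII byte is enough to fail this check */
        utfEncoded = false;
        break;
    }
}

// Re-interpret the bytes as ISO-8859-1 when the check fails
String realQueryString = utfEncoded ?
    queryString : new String(queryString.getBytes(), StandardCharsets.ISO_8859_1);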
```
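
The high-bit test only separates ASCII from non-ASCII bytes, so it cannot confirm that the input really is UTF-8. A stricter alternative, which is a swapped-in technique and not the approach of the snippet above, is to let a CharsetDecoder report malformed input. The sketch assumes you still have the raw bytes before they were turned into a String; the class and method names are made up.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictDecode {
    /** Decodes as UTF-8 if the bytes are well-formed, otherwise falls back to ISO-8859-1. */
    static String decodeUtf8OrLatin1(byte[] raw) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)        // throw instead of substituting U+FFFD
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return decoder.decode(ByteBuffer.wrap(raw)).toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8: every byte sequence is decodable as ISO-8859-1
            return new String(raw, StandardCharsets.ISO_8859_1);
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        // Both calls yield the same four-character string.
        System.out.println(decodeUtf8OrLatin1(utf8).equals(decodeUtf8OrLatin1(latin1)));
    }
}
```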
Also, take a look at this article; I hope it helps.

离开以前

2020-12-02 12:09

The following regular expression might be of interest to you:

http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/185624

I use it in Ruby as follows:

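```
# Originally written for Ruby 1.8 as "module Encoding". On Ruby 1.9+ that name
# belongs to a built-in class, so the module is renamed here, and the pattern
# is compiled as a binary (/n) regexp and matched against the raw bytes.
module UTF8Check
    UTF8RGX = /\A(
        [\x09\x0A\x0D\x20-\x7E]            # ASCII
      | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
      |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
      | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
      |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
      |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
      | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
      |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
    )*\z/nx unless defined? UTF8RGX

    # Prints every line that is not valid UTF-8, prefixed with its line number.
    # Returns true only if every line is valid.
    def self.utf8_file?(file_name)
      valid = true
      File.foreach(file_name, mode: 'rb').with_index(1) do |line, number|
        next if utf8_string?(line)
        puts "#{number}: #{line}"
        valid = false
      end
      valid
    end

    def self.utf8_string?(a_string)
      UTF8RGX === a_string.b   # String#b returns a binary copy of the string
    end

end
```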


夕颜

2020-12-02 12:11
Replace all control characters with an empty string:
```
value = value.replaceAll("\\p{Cntrl}", "");
```
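
Note that in Java's default mode \p{Cntrl} matches 0x00-0x1F and 0x7F, so the one-liner also removes tabs and line breaks. If those should survive, a character-class intersection can exclude them. This is a sketch with a hypothetical helper name; it cleans up stray control characters but does not tell you whether the text was decoded with the right charset.

```java
class ControlChars {
    /** Removes ASCII control characters except tab, line feed and carriage return. */
    static String stripControls(String value) {
        // [A&&[^B]] is Java regex syntax for "in A but not in B"
        return value.replaceAll("[\\p{Cntrl}&&[^\\t\\n\\r]]", "");
    }

    public static void main(String[] args) {
        String cleaned = stripControls("a\u0000b\tc\n");
        System.out.println(cleaned.equals("ab\tc\n")); // true
    }
}
```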
