How to determine if a String contains invalid encoded characters

眼角桃花 2020-12-02 11:38

Usage scenario

We have implemented a webservice that our web frontend developers use (via a php api) internally to display product data. On the webs

10 answers
  • 2020-12-02 11:53

    You can use a CharsetDecoder configured to throw an exception if invalid chars are found:

     CharsetDecoder utf8Decoder = StandardCharsets.UTF_8.newDecoder()
          .onMalformedInput(CodingErrorAction.REPORT)
          .onUnmappableCharacter(CodingErrorAction.REPORT);
    

    See CodingErrorAction.REPORT
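
    As a minimal sketch of how such a decoder can be used to validate raw bytes: the class name StrictUtf8 and the method isValidUtf8 are made up for illustration and are not part of the answer above. It relies on the one-argument decode(ByteBuffer) convenience method, which throws a CharacterCodingException when the action is REPORT.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class StrictUtf8 {
    // Returns true if the bytes are well-formed UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        // Decoders are stateful and not thread-safe, so create one per call.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes)); // decodes the whole buffer in one go
            return true;
        } catch (CharacterCodingException e) {
            return false; // malformed or unmappable input was reported
        }
    }

    public static void main(String[] args) {
        byte[] good = "\u00E4\u20AC".getBytes(StandardCharsets.UTF_8);
        byte[] bad = {(byte) 0xC3, (byte) 0x28}; // lead byte followed by a non-continuation byte
        System.out.println(isValidUtf8(good));
        System.out.println(isValidUtf8(bad));
    }
}
```

    Creating a fresh decoder per call costs a little, but it avoids sharing mutable decoder state between threads.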

  • 2020-12-02 11:53

    URLDecoder will decode to a given encoding. This should flag errors appropriately. However the documentation states:

    There are two possible ways in which this decoder could deal with illegal strings. It could either leave illegal characters alone or it could throw an IllegalArgumentException. Which approach the decoder takes is left to the implementation.

    So you should probably try it. Note also (from the decode() method documentation):

    The World Wide Web Consortium Recommendation states that UTF-8 should be used. Not doing so may introduce incompatibilities.

    so there's something else to think about!

    EDIT: The URLCodec class in Apache Commons Codec claims to throw appropriate exceptions for bad encodings.
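
    To "try it" in a way that does not depend on which of the two behaviours your runtime picks, a sketch like the following treats both an IllegalArgumentException and a silently substituted U+FFFD replacement character as a failure. The names UrlDecodeProbe and decodeOrNull are hypothetical, and the Charset overload of URLDecoder.decode assumes Java 10 or later.

```java
import java.net.URLDecoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class UrlDecodeProbe {
    // Returns the decoded value, or null if the input could not be decoded
    // cleanly in the given charset.
    static String decodeOrNull(String encoded, Charset charset) {
        String decoded;
        try {
            decoded = URLDecoder.decode(encoded, charset);
        } catch (IllegalArgumentException e) {
            return null; // the implementation chose to reject the input
        }
        // Some implementations replace undecodable bytes with U+FFFD instead.
        // Caveat: this also rejects input that legitimately contains U+FFFD.
        return decoded.indexOf('\uFFFD') >= 0 ? null : decoded;
    }

    public static void main(String[] args) {
        System.out.println(decodeOrNull("%C3%A4", StandardCharsets.UTF_8));
        System.out.println(decodeOrNull("%E4", StandardCharsets.UTF_8));
    }
}
```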

  • 2020-12-02 11:53

    You might want to include a known parameter in your requests, e.g. "...&encTest=ä€", to safely differentiate between the different encodings.
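
    A rough sketch of how the server could evaluate such a marker parameter. It assumes you can get at the raw, still percent-encoded value (for example by parsing request.getQueryString() yourself), and the candidate charset list is only a guess at what clients might send. EncodingProbe and detectCharset are made-up names.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical helper, for illustration only.
final class EncodingProbe {
    // The known marker value "ä€", written with escapes so the source file
    // encoding does not matter.
    static final String EXPECTED = "\u00E4\u20AC";

    // Assumed candidates; adjust to the clients you actually see.
    static final String[] CANDIDATES = {"UTF-8", "windows-1252", "ISO-8859-15"};

    // Returns the first candidate charset under which the raw marker value
    // decodes to EXPECTED, or null if none matches.
    static String detectCharset(String rawEncTest) {
        for (String charset : CANDIDATES) {
            try {
                if (EXPECTED.equals(URLDecoder.decode(rawEncTest, charset))) {
                    return charset;
                }
            } catch (UnsupportedEncodingException | IllegalArgumentException e) {
                // Charset not available, or input rejected: try the next one.
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(detectCharset("%C3%A4%E2%82%AC"));
        System.out.println(detectCharset("%E4%80"));
    }
}
```

    The euro sign is what makes the marker useful: it has no code point in ISO-8859-1 and sits at different byte values in windows-1252 and ISO-8859-15, so the charsets cannot be confused with each other.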

  • 2020-12-02 11:59

    Use UTF-8 as the default everywhere you have control (database, memory, and UI).

    Sticking to a single charset avoids a lot of problems, and it can even improve your web server's performance, because a lot of processing power and memory is otherwise wasted on converting between encodings.

  • 2020-12-02 12:00

    I asked the same question,

    Handling Character Encoding in URI on Tomcat

    I recently found a solution and it works pretty well for me. You might want to give it a try. Here is what you need to do:

    1. Leave your URI encoding as Latin-1. On Tomcat, add URIEncoding="ISO-8859-1" to the Connector in server.xml.
    2. If you have to URL-decode manually, use Latin-1 as the charset there as well.
    3. Use the fixEncoding() function to fix up encodings.

    For example, to get a parameter from the query string:

      String name = fixEncoding(request.getParameter("name"));
    

    You can always do this. A string that already has the correct encoding is not changed.

    The code is attached. Good luck!

     import java.nio.charset.StandardCharsets;

     public static String fixEncoding(String latin1) {
         // Recover the raw bytes that the container decoded as Latin-1
         byte[] bytes = latin1.getBytes(StandardCharsets.ISO_8859_1);
         if (!validUTF8(bytes))
             return latin1; // not UTF-8, so keep the Latin-1 interpretation
         return new String(bytes, StandardCharsets.UTF_8);
     }
    
     public static boolean validUTF8(byte[] input) {
         int i = 0;
         // Skip a UTF-8 byte order mark (EF BB BF) if present
         if (input.length >= 3 && (input[0] & 0xFF) == 0xEF
                 && (input[1] & 0xFF) == 0xBB && (input[2] & 0xFF) == 0xBF) {
             i = 3;
         }

         for (int n = input.length; i < n; ++i) {
             int octet = input[i];
             if ((octet & 0x80) == 0) {
                 continue; // ASCII
             }

             // The leading byte determines how many trailing bytes follow
             int end;
             if ((octet & 0xE0) == 0xC0) {
                 end = i + 1; // 2-byte sequence
             } else if ((octet & 0xF0) == 0xE0) {
                 end = i + 2; // 3-byte sequence
             } else if ((octet & 0xF8) == 0xF0) {
                 end = i + 3; // 4-byte sequence (outside the BMP)
             } else {
                 // A stray trailing byte or an invalid leading byte
                 return false;
             }

             if (end >= n) {
                 return false; // sequence is truncated at the end of the input
             }

             while (i < end) {
                 i++;
                 octet = input[i];
                 if ((octet & 0xC0) != 0x80) {
                     return false; // not a valid trailing byte
                 }
             }
         }
         return true;
     }
    

    EDIT: Your approach doesn't work, for several reasons. When there are encoding errors, you can't rely on what you get from Tomcat. Sometimes you get � or ?. Other times you get nothing at all and getParameter() returns null. And even if you check for "?", what happens when your query string contains a legitimate "?"?

    Besides, you shouldn't reject any request. This is not your user's fault. As I mentioned in my original question, the browser may encode the URL in either UTF-8 or Latin-1, and the user has no control over that. You need to accept both. Switching your servlet to Latin-1 preserves all the bytes, even if they are wrong, which gives us a chance to fix them up or to throw them away.

    The solution I posted here is not perfect but it's the best one we found so far.
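
    As a variation on the same idea, here is a sketch that swaps the hand-written validator for a strict CharsetDecoder from the JDK. This is not the code I use above, only an illustration; the class name EncodingFix is made up. Note two differences in behaviour: the JDK decoder also rejects overlong sequences and encoded surrogates, and it does not strip a leading BOM.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical variant of fixEncoding(), for illustration only.
final class EncodingFix {
    static String fixEncoding(String latin1) {
        // A string that was really decoded as Latin-1 cannot contain chars above U+00FF.
        // If it does, getBytes() would turn them into '?', so leave the string alone.
        for (int i = 0; i < latin1.length(); i++) {
            if (latin1.charAt(i) > 0xFF) {
                return latin1;
            }
        }
        byte[] bytes = latin1.getBytes(StandardCharsets.ISO_8859_1);
        CharsetDecoder strictUtf8 = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            // Succeeds only if the raw bytes are well-formed UTF-8
            return strictUtf8.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            return latin1; // not UTF-8, so keep the Latin-1 interpretation
        }
    }

    public static void main(String[] args) {
        // "ä" sent as UTF-8 (C3 A4) but decoded as Latin-1 arrives as "Ã¤"
        System.out.println(fixEncoding("\u00C3\u00A4"));
        // "ä" sent as Latin-1 (E4) is not valid UTF-8 and stays as it is
        System.out.println(fixEncoding("\u00E4"));
    }
}
```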

  • This is what I used to check the encoding:

    CharsetDecoder ebcdicDecoder = Charset.forName("IBM1047").newDecoder();
    ebcdicDecoder.onMalformedInput(CodingErrorAction.REPORT);
    ebcdicDecoder.onUnmappableCharacter(CodingErrorAction.REPORT);

    CharBuffer out = CharBuffer.wrap(new char[3200]);
    // endOfInput = true: all remaining bytes are in this buffer
    CoderResult result = ebcdicDecoder.decode(ByteBuffer.wrap(bytes), out, true);
    if (result.isUnderflow())
    {
        // All input was consumed; flush any state the decoder still holds
        result = ebcdicDecoder.flush(out);
    }
    if (result.isUnderflow())
    {
        System.out.println("EBCDIC decoded successfully");
    }
    else
    {
        // isError() (malformed or unmappable input) or isOverflow() (output buffer too small)
        System.out.println("Cannot decode EBCDIC");
    }
    

    Edit: updated with Vouze's suggestion
