Calculating length in UTF-8 of Java String without actually encoding it

你的背包 2020-12-03 04:45

Does anyone know if the standard Java library (any version) provides a means of calculating the length of the binary encoding of a string (specifically UTF-8 in this case) w

4 answers
  • 2020-12-03 04:49

    Here's an implementation based on the UTF-8 specification:

    public class Utf8LenCounter {
      public static int length(CharSequence sequence) {
        int count = 0;
        for (int i = 0, len = sequence.length(); i < len; i++) {
          char ch = sequence.charAt(i);
          if (ch <= 0x7F) {
            count++;
          } else if (ch <= 0x7FF) {
            count += 2;
          } else if (Character.isHighSurrogate(ch)) {
            count += 4;
            ++i;
          } else {
            count += 3;
          }
        }
        return count;
      }
    }
    

    This implementation is not tolerant of malformed strings.
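
    As a rough sketch of what a tolerant variant might look like: the version below checks that a high surrogate is really followed by a low surrogate before counting 4 bytes, and counts 1 byte for any unpaired surrogate. The 1-byte choice is an assumption picked to match String.getBytes, which substitutes '?' for malformed input; another encoder policy could call for a different count. The class name LenientUtf8Length is hypothetical and not part of the answer above.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical class name, used only for this illustration.
final class LenientUtf8Length {
    private LenientUtf8Length() {}

    /** Counts UTF-8 bytes, charging 1 byte for each unpaired surrogate. */
    static int length(CharSequence s) {
        int count = 0;
        for (int i = 0, len = s.length(); i < len; i++) {
            char ch = s.charAt(i);
            if (ch <= 0x7F) {
                count++;                 // ASCII: 1 byte
            } else if (ch <= 0x7FF) {
                count += 2;              // 2-byte sequence
            } else if (Character.isHighSurrogate(ch)
                    && i + 1 < len
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                count += 4;              // well-formed surrogate pair: 4 bytes
                i++;                     // skip the low surrogate
            } else if (Character.isSurrogate(ch)) {
                count++;                 // assumption: lone surrogate becomes '?'
            } else {
                count += 3;              // rest of the BMP: 3 bytes
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Compare the count with the real encoder for a few samples,
        // including two strings that contain unpaired surrogates.
        String[] samples = {"abc", "h\u00E9llo", "\u20AC", "a\uD83D\uDE00", "\uD83D", "\uDE00x"};
        for (String s : samples) {
            int encoded = s.getBytes(StandardCharsets.UTF_8).length;
            System.out.println(length(s) + " counted, " + encoded + " encoded");
        }
    }
}
```

    The extra bounds check also keeps the loop from skipping a valid character that happens to follow a lone high surrogate.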

    Here's a JUnit 4 test for verification:

    import java.nio.charset.Charset;
    import org.junit.Assert;
    import org.junit.Test;

    public class LenCounterTest {
      @Test public void testUtf8Len() {
        Charset utf8 = Charset.forName("UTF-8");
        AllCodepointsIterator iterator = new AllCodepointsIterator();
        while (iterator.hasNext()) {
          String test = new String(Character.toChars(iterator.next()));
          Assert.assertEquals(test.getBytes(utf8).length,
                              Utf8LenCounter.length(test));
        }
      }
    
      private static class AllCodepointsIterator {
        private static final int MAX = 0x10FFFF; //see http://unicode.org/glossary/
        private static final int SURROGATE_FIRST = 0xD800;
        private static final int SURROGATE_LAST = 0xDFFF;
        private int codepoint = 0;
        public boolean hasNext() { return codepoint < MAX; }
        public int next() {
          int ret = codepoint;
          codepoint = next(codepoint);
          return ret;
        }
        private int next(int codepoint) {
          while (codepoint++ < MAX) {
            if (codepoint == SURROGATE_FIRST) { codepoint = SURROGATE_LAST + 1; }
            if (!Character.isDefined(codepoint)) { continue; }
            return codepoint;
          }
          return MAX;
        }
      }
    }
    

    Please excuse the compact formatting.

  • 2020-12-03 05:00

    You can loop through the string:

    /**
     * Deprecated: only handles characters below U+0800; three-byte characters and surrogate pairs throw.
     */
    @Deprecated
    public int countUTF8Length(String str)
    {
        int count = 0;
        for (int i = 0; i < str.length(); ++i)
        {
            char c = str.charAt(i);
            if (c < 0x80)
            {
                count++;
            } else if (c < 0x800)
            {
                count +=2;
            } else
            {
                throw new UnsupportedOperationException("not implemented yet");
            }
        }
        return count;
    }
    
  • 2020-12-03 05:02

    Using Guava's Utf8:

    // Utf8 is com.google.common.base.Utf8
    int length = Utf8.encodedLength("some really long string");
    
  • 2020-12-03 05:05

    The best method I could come up with is to use CharsetEncoder to write repeatedly into the same temporary buffer:

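    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.Charset;
    import java.nio.charset.CharsetEncoder;
    import java.nio.charset.CoderResult;

    public int getEncodedLength(CharBuffer src, CharsetEncoder encoder)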
        throws CharacterCodingException
    {
        // CharsetEncoder.flush fails if encode is not called with >0 chars
        if (!src.hasRemaining())
            return 0;
    
        // encode into a byte buffer that is repeatedly overwritten
        final ByteBuffer outputBuffer = ByteBuffer.allocate(1024);
    
        // encoding loop
        int bytes = 0;
        CoderResult status;
        do
        {
            status = encoder.encode(src, outputBuffer, true);
            if (status.isError())
                status.throwException();
            bytes += outputBuffer.position();
    
            outputBuffer.clear();
        }
        while (status.isOverflow());
    
        // flush any remaining buffered state
        status = encoder.flush(outputBuffer);
        if (status.isError() || status.isOverflow())
            status.throwException();
        bytes += outputBuffer.position();
    
        return bytes;
    }
    
    public int getUtf8Length(String str) throws CharacterCodingException
    {
        return getEncodedLength(CharBuffer.wrap(str),
            Charset.forName("UTF-8").newEncoder());
    }
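
    One thing worth noting about the code above: a freshly created encoder reports malformed input, so a string with an unpaired surrogate makes it throw. The following is a minimal sketch, under the assumption that you would rather get the same count as String.getBytes, of the same scratch-buffer loop with the encoder switched to CodingErrorAction.REPLACE. The class name ReplacingUtf8Length and the 16-byte buffer size are illustrative choices, not part of the answer.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical class name, used only for this illustration.
final class ReplacingUtf8Length {
    private ReplacingUtf8Length() {}

    static int length(CharSequence text) {
        if (text.length() == 0) {
            return 0; // nothing to encode
        }
        // REPLACE substitutes the encoder's replacement bytes for bad input
        // instead of reporting an error.
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        CharBuffer src = CharBuffer.wrap(text);
        // Deliberately tiny scratch buffer: its contents are discarded after
        // every pass, so only the running total is kept.
        ByteBuffer scratch = ByteBuffer.allocate(16);
        int bytes = 0;
        CoderResult status;
        do {
            status = encoder.encode(src, scratch, true);
            bytes += scratch.position(); // bytes produced in this pass
            scratch.clear();             // reuse the same buffer
        } while (status.isOverflow());
        // Flush any bytes the encoder may still be holding back.
        encoder.flush(scratch);
        bytes += scratch.position();
        return bytes;
    }
}
```

    For example, ReplacingUtf8Length.length("a\uD83D\uDE00") counts 1 byte for the letter plus 4 for the surrogate pair. Memory use stays constant regardless of input size, but the encoder still does the full encoding work, so this saves allocation rather than CPU time.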
    