Calculating length in UTF-8 of Java String without actually encoding it

Does anyone know if the standard Java library (any version) provides a means of calculating the length of the binary encoding of a string (specifically UTF-8 in this case) w

4 Answers

耶瑟儿～

2020-12-03 04:49

Here's an implementation based on the UTF-8 specification:

public class Utf8LenCounter {
  public static int length(CharSequence sequence) {
    int count = 0;
    for (int i = 0, len = sequence.length(); i < len; i++) {
      char ch = sequence.charAt(i);
      if (ch <= 0x7F) {
        count++;
      } else if (ch <= 0x7FF) {
        count += 2;
      } else if (Character.isHighSurrogate(ch)) {
        count += 4;
        ++i;
      } else {
        count += 3;
      }
    }
    return count;
  }
}

This implementation is not tolerant of malformed strings.

Here's a JUnit 4 test for verification:

import java.nio.charset.Charset;
import org.junit.Assert;
import org.junit.Test;

public class LenCounterTest {
  @Test public void testUtf8Len() {
    Charset utf8 = Charset.forName("UTF-8");
    AllCodepointsIterator iterator = new AllCodepointsIterator();
    while (iterator.hasNext()) {
      String test = new String(Character.toChars(iterator.next()));
      Assert.assertEquals(test.getBytes(utf8).length,
                          Utf8LenCounter.length(test));
    }
  }

  private static class AllCodepointsIterator {
    private static final int MAX = 0x10FFFF; //see http://unicode.org/glossary/
    private static final int SURROGATE_FIRST = 0xD800;
    private static final int SURROGATE_LAST = 0xDFFF;
    private int codepoint = 0;
    public boolean hasNext() { return codepoint < MAX; }
    public int next() {
      int ret = codepoint;
      codepoint = next(codepoint);
      return ret;
    }
    private int next(int codepoint) {
      while (codepoint++ < MAX) {
        if (codepoint == SURROGATE_FIRST) { codepoint = SURROGATE_LAST + 1; }
        if (!Character.isDefined(codepoint)) { continue; }
        return codepoint;
      }
      return MAX;
    }
  }
}

Please excuse the compact formatting.
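
As noted above, this implementation is not tolerant of malformed strings: a high surrogate is always counted as 4 bytes and the following char is skipped, whether or not it is a low surrogate. Below is a minimal sketch of a lenient variant. The class name `LenientUtf8Length` is hypothetical, and it assumes you want to match `String.getBytes`, which replaces each unpaired surrogate with a single `?` byte when encoding to UTF-8.

```java
// Hypothetical lenient variant; not part of the original answer.
final class LenientUtf8Length {
  static int length(CharSequence sequence) {
    int count = 0;
    for (int i = 0, len = sequence.length(); i < len; i++) {
      char ch = sequence.charAt(i);
      if (ch <= 0x7F) {
        count++;
      } else if (ch <= 0x7FF) {
        count += 2;
      } else if (Character.isHighSurrogate(ch)
          && i + 1 < len
          && Character.isLowSurrogate(sequence.charAt(i + 1))) {
        // Well-formed surrogate pair: one supplementary code point, 4 bytes.
        count += 4;
        i++;
      } else if (Character.isSurrogate(ch)) {
        // Unpaired surrogate: assumed to be replaced by a single '?' byte.
        count++;
      } else {
        count += 3;
      }
    }
    return count;
  }
}
```

For example, `"a" + (char) 0xD800 + "b"` is counted as 3 bytes here, whereas the original loop would count 5 because it treats the lone surrogate as a pair and skips the `b`.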

不知归路

2020-12-03 05:00

You can loop through the String:

/**
 * Deprecated: doesn't support surrogate characters.
 */
@Deprecated
public int countUTF8Length(String str)
{
    int count = 0;
    for (int i = 0; i < str.length(); ++i)
    {
        char c = str.charAt(i);
        if (c < 0x80)
        {
            count++;
        } else if (c < 0x800)
        {
            count +=2;
        } else
        {
            throw new UnsupportedOperationException("not implemented yet");
        }
    }
    return count;
}
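
To lift the surrogate limitation, one option is to iterate over code points instead of chars. This is a minimal sketch, assuming Java 8 or later for `String.codePoints()`; the class name `CodePointUtf8Length` is made up for illustration. Note that unpaired surrogates show up as their own code points here and are counted as 3 bytes, which differs from what `String.getBytes` produces for malformed input.

```java
// Hypothetical helper; assumes Java 8+ for String.codePoints().
final class CodePointUtf8Length {
    static int length(String str) {
        return str.codePoints()
                // Byte count per code point, following the UTF-8 range boundaries.
                .map(cp -> cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4)
                .sum();
    }
}
```

Because each surrogate pair arrives as a single code point, there is no index bookkeeping to get wrong.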

半阙折子戏

2020-12-03 05:02
Using Guava's Utf8:
```
Utf8.encodedLength("some really long string")
```

鱼传尺愫

2020-12-03 05:05

The best method I could come up with is to use CharsetEncoder to write repeatedly into the same temporary buffer:

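import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;

public int getEncodedLength(CharBuffer src, CharsetEncoder encoder)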
    throws CharacterCodingException
{
    // CharsetEncoder.flush fails if encode is not called with >0 chars
    if (!src.hasRemaining())
        return 0;

    // encode into a byte buffer that is repeatedly overwritten
    final ByteBuffer outputBuffer = ByteBuffer.allocate(1024);

    // encoding loop
    int bytes = 0;
    CoderResult status;
    do
    {
        status = encoder.encode(src, outputBuffer, true);
        if (status.isError())
            status.throwException();
        bytes += outputBuffer.position();

        outputBuffer.clear();
    }
    while (status.isOverflow());

    // flush any remaining buffered state
    status = encoder.flush(outputBuffer);
    if (status.isError() || status.isOverflow())
        status.throwException();
    bytes += outputBuffer.position();

    return bytes;
}

public int getUtf8Length(String str) throws CharacterCodingException
{
    return getEncodedLength(CharBuffer.wrap(str),
        Charset.forName("UTF-8").newEncoder());
}
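
One thing to be aware of with the encoder approach: an encoder obtained from `newEncoder()` reports malformed input by default, so an unpaired surrogate ends up as a `CharacterCodingException`. If you would rather get the same count as `String.getBytes`, a possible variation is to configure the encoder to replace bad input. The sketch below is self-contained; the class name `EncodedLengthDemo`, the method name and the 64-byte scratch size are arbitrary choices for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;

// Hypothetical demo class; a sketch, not a tuned implementation.
final class EncodedLengthDemo {
    static int encodedLength(CharSequence text, Charset charset) {
        if (text.length() == 0)
            return 0;

        // Replace malformed or unmappable input instead of reporting an error.
        CharsetEncoder encoder = charset.newEncoder()
            .onMalformedInput(CodingErrorAction.REPLACE)
            .onUnmappableCharacter(CodingErrorAction.REPLACE);

        CharBuffer src = CharBuffer.wrap(text);
        // Small scratch buffer that is overwritten on every pass.
        ByteBuffer scratch = ByteBuffer.allocate(64);

        int bytes = 0;
        CoderResult status;
        do
        {
            status = encoder.encode(src, scratch, true);
            bytes += scratch.position();
            scratch.clear();
        }
        while (status.isOverflow());

        // Flush any remaining encoder state, counting whatever it emits.
        do
        {
            status = encoder.flush(scratch);
            bytes += scratch.position();
            scratch.clear();
        }
        while (status.isOverflow());

        return bytes;
    }
}
```

Usage would look like `EncodedLengthDemo.encodedLength(str, StandardCharsets.UTF_8)`. Since the charset is a parameter, the same loop can be pointed at other encodings as well.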
