I created the following to truncate a string in Java to a new string with a given number of bytes.
String truncatedValue = "";
String curren
This may not be the most efficient solution, but it works:
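// Requires: import java.nio.charset.StandardCharsets;
public static String substring(String s, int byteLimit) {
    // Assumes the limit is in UTF-8 bytes; the no-argument getBytes() would use the platform default charset.
    if (s.getBytes(StandardCharsets.UTF_8).length <= byteLimit) {
        return s;
    }
    // Every char encodes to at least one byte, so at most byteLimit chars can fit.
    int n = Math.max(0, Math.min(byteLimit, s.length()));
    while (n > 0 && s.substring(0, n).getBytes(StandardCharsets.UTF_8).length > byteLimit) {
        n--;
    }
    // Do not leave half of a surrogate pair at the end.
    if (n > 0 && Character.isHighSurrogate(s.charAt(n - 1))) {
        n--;
    }
    return s.substring(0, n);
}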
As noted, Peter Lawrey's solution has a major performance disadvantage (~3,500 ms for 10,000 calls). Rex Kerr's was much better (~500 ms for 10,000 calls), but the result was not accurate: it cut much more than it needed to (in one example it left 3500 bytes instead of the 4000 allowed). Here is my solution (~250 ms for 10,000 calls), assuming that the maximum length of a UTF-8 character is 4 bytes (thanks, Wikipedia):
public static String cutWord(String word, int dbLimit) throws UnsupportedEncodingException {
    double MAX_UTF8_CHAR_LENGTH = 4.0;
    if (word.length() > dbLimit) {
        word = word.substring(0, dbLimit);
    }
    if (word.length() > dbLimit / MAX_UTF8_CHAR_LENGTH) {
        int residual = word.getBytes("UTF-8").length - dbLimit;
        if (residual > 0) {
            int tempResidual = residual, start, end = word.length();
            while (tempResidual > 0) {
                start = end - ((int) Math.ceil((double) tempResidual / MAX_UTF8_CHAR_LENGTH));
                tempResidual = tempResidual - word.substring(start, end).getBytes("UTF-8").length;
                end = start;
            }
            word = word.substring(0, end);
        }
    }
    return word;
}
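To make the 4-byte assumption above concrete, here is a small sketch that prints how many Java chars and how many UTF-8 bytes a few sample characters take. The class name `Utf8Widths` and the helper `utf8Length` are made up for this illustration and are not part of the answer above.

```java
import java.nio.charset.StandardCharsets;

class Utf8Widths {
    // Number of bytes the given text occupies when encoded as UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // U+0061 (a), U+00E9 (e with acute), U+20AC (euro sign), U+1F600 (an emoji)
        String[] samples = { "a", "\u00E9", "\u20AC", "\uD83D\uDE00" };
        for (String sample : samples) {
            System.out.println(sample.length() + " char(s) -> " + utf8Length(sample) + " byte(s)");
        }
    }
}
```

Note that the only 4-byte case is a character above U+FFFF, which occupies two Java chars (a surrogate pair), so a single Java char never contributes more than 3 bytes on its own.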
I think Rex Kerr's solution has 2 bugs.
Please find my corrected version below:
public String cut(String s, int charLimit) throws UnsupportedEncodingException {
    byte[] utf8 = s.getBytes("UTF-8");
    if (utf8.length <= charLimit) {
        return s;
    }
    int n16 = 0;
    boolean extraLong = false;
    int i = 0;
    while (i < charLimit) {
        // Unicode characters above U+FFFF need 2 words in utf16
        extraLong = ((utf8[i] & 0xF0) == 0xF0);
        if ((utf8[i] & 0x80) == 0) {
            i += 1;
        } else {
            int b = utf8[i];
            while ((b & 0x80) > 0) {
                ++i;
                b = b << 1;
            }
        }
        if (i <= charLimit) {
            n16 += (extraLong) ? 2 : 1;
        }
    }
    return s.substring(0, n16);
}
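The inner while loop in the method above counts the leading 1 bits of the starter byte, which in valid UTF-8 equals the number of bytes in the sequence. As a rough sketch of the same idea in isolation (the class `Utf8LeadByte` and its two methods are hypothetical names, and valid UTF-8 input is assumed):

```java
class Utf8LeadByte {
    // Total number of bytes in the UTF-8 sequence that starts with leadByte.
    // Assumes leadByte really is a starter byte of valid UTF-8.
    static int sequenceLength(byte leadByte) {
        if ((leadByte & 0x80) == 0) {
            return 1; // 0xxxxxxx: plain ASCII
        }
        // Move the byte into the top 8 bits of an int, invert it, and count
        // leading zeros: that is the number of leading 1 bits in the byte.
        return Integer.numberOfLeadingZeros(~(leadByte << 24));
    }

    // Number of Java chars the decoded character occupies:
    // 2 for 4-byte sequences (code points above U+FFFF), otherwise 1.
    static int utf16Units(byte leadByte) {
        return sequenceLength(leadByte) == 4 ? 2 : 1;
    }
}
```

For example, the euro sign starts with the byte 0xE2 (11100010), which has three leading 1 bits and therefore begins a 3-byte sequence.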
I still thought this was far from efficient. So if you don't really need the String representation of the result and the byte array will do, you can use this:
// Requires: import java.util.Arrays;
private byte[] cutToBytes(String s, int charLimit) throws UnsupportedEncodingException {
    byte[] utf8 = s.getBytes("UTF-8");
    if (utf8.length <= charLimit) {
        return utf8;
    }
    // utf8[end] is the first byte that would be dropped. If it is a
    // continuation byte (10xxxxxx), the cut falls inside a UTF-8 sequence,
    // so back up to that sequence's starter byte and drop it as well.
    int end = Math.max(charLimit, 0);
    while (end > 0 && (utf8[end] & 0xC0) == 0x80) {
        --end;
    }
    return Arrays.copyOf(utf8, end);
}
Funny thing is that with a realistic 20-500 byte limit they perform pretty much the same IF you create a string from the byte array again.
Please note that both methods assume valid UTF-8 input, which is a safe assumption after using Java's getBytes() function.
You can also use the regular expression below to remove leading and trailing whitespace, including the double-byte (full-width) space character, written here as \u3000:
stringtoConvert = stringtoConvert.replaceAll("^[\\s\\u3000]*", "").replaceAll("[\\s\\u3000]*$", "");
A binary search approach in Scala:
import scala.annotation.tailrec

private def bytes(s: String) = s.getBytes("UTF-8")

def truncateToByteLength(string: String, length: Int): String =
  if (length <= 0 || string.isEmpty) ""
  else {
    // Invariant: the first goodLen chars fit in `length` bytes; badLen is treated as too long.
    @tailrec
    def loop(badLen: Int, goodLen: Int, good: String): String = {
      assert(badLen > goodLen, s"""badLen is $badLen but goodLen is $goodLen ("$good")""")
      if (badLen == goodLen + 1) good
      else {
        val mid = goodLen + (badLen - goodLen) / 2
        val midStr = string.take(mid)
        if (bytes(midStr).length > length)
          loop(mid, goodLen, good)
        else
          loop(badLen, mid, midStr)
      }
    }
    loop(string.length * 2, 0, "")
  }
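For comparison with the Java answers here, the same binary search could be sketched in Java roughly as follows. This is only an illustration: the class `BinarySearchTruncate` and its method names are invented, and the surrogate-pair check at the end is an extra that the Scala version above does not have.

```java
import java.nio.charset.StandardCharsets;

class BinarySearchTruncate {
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Longest prefix of s (measured in chars) whose UTF-8 encoding fits in maxBytes.
    static String truncate(String s, int maxBytes) {
        if (maxBytes <= 0 || s.isEmpty()) {
            return "";
        }
        int good = 0;              // a prefix length known to fit
        int bad = s.length() + 1;  // a prefix length treated as too long
        while (bad - good > 1) {
            int mid = good + (bad - good) / 2;
            if (utf8Length(s.substring(0, mid)) > maxBytes) {
                bad = mid;
            } else {
                good = mid;
            }
        }
        // Extra step (not in the Scala version): do not split a surrogate pair.
        if (good > 0 && good < s.length()
                && Character.isHighSurrogate(s.charAt(good - 1))
                && Character.isLowSurrogate(s.charAt(good))) {
            good--;
        }
        return s.substring(0, good);
    }
}
```

Each probe re-encodes a prefix, so this does O(log n) encodings of up to n chars; it is simple rather than fast.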
Why not convert to bytes and walk forward, obeying UTF-8 character boundaries as you go, until you've got the max number of bytes, then convert those bytes back into a string?
Or you could just cut the original string if you keep track of where the cut should occur:
// Assuming that Java will always produce valid UTF8 from a string, so no error checking!
// (Is this always true, I wonder?)
import java.nio.charset.StandardCharsets;

public class UTF8Cutter {
    public static String cut(String s, int n) {
        // Encode explicitly as UTF-8; the no-argument getBytes() uses the platform default charset.
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        if (utf8.length < n) n = utf8.length;
        int n16 = 0;      // number of UTF-16 chars to keep
        int advance = 1;  // UTF-16 chars used by the current character
        int i = 0;
        while (i < n) {
            advance = 1;
            if ((utf8[i] & 0x80) == 0) i += 1;
            else if ((utf8[i] & 0xE0) == 0xC0) i += 2;
            else if ((utf8[i] & 0xF0) == 0xE0) i += 3;
            else { i += 4; advance = 2; }
            if (i <= n) n16 += advance;
        }
        return s.substring(0, n16);
    }
}
Note: edited to fix bugs on 2014-08-25
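The first idea in this answer (encode, stop at a character boundary, map back to the string) can also be left to the JDK rather than done by hand. The sketch below swaps in a different technique from the code above: it uses `java.nio.charset.CharsetEncoder` with an output buffer of exactly the allowed size. The class name `EncoderTruncate` is made up, and it relies on my understanding that the JDK's UTF-8 encoder stops before a character that does not fit completely instead of writing part of it.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

class EncoderTruncate {
    // Longest prefix of s whose UTF-8 encoding is at most maxBytes long.
    static String truncate(String s, int maxBytes) {
        if (maxBytes <= 0) {
            return "";
        }
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
        ByteBuffer out = ByteBuffer.allocate(maxBytes);
        CharBuffer in = CharBuffer.wrap(s);
        // Encoding stops when the output buffer cannot hold the next whole
        // character (or when the input is exhausted or malformed).
        encoder.encode(in, out, true);
        // in.position() is the number of chars that were fully encoded.
        return s.substring(0, in.position());
    }
}
```

Because the three-argument `encode` reports problems through its return value rather than throwing, an unpaired surrogate in the input simply ends the result at that point in this sketch.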