I'm using the Zemanta API, which accepts up to 8 KB of text per call. I'm extracting the text to send to Zemanta from Web pages using JavaScript, so I'm looking for a fun
If you are using a single-byte encoding, yes, 8192 characters = 8192 bytes. If you are using UTF-16, each character(*) takes 2 bytes, so 8192 bytes hold only 4096 characters.
(*: Actually UTF-16 code units rather than characters, which is a slightly different thing in the face of surrogates, but let's not worry about that because JavaScript doesn't: its string lengths and indices count code units.)
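For the UTF-16 case, truncation comes down to cutting at a code-unit count, since every code unit takes two bytes. Here is a minimal sketch, assuming a hypothetical helper name `truncateByBytesUTF16` (not part of the answer above) and assuming you would rather not cut a surrogate pair in half:

```javascript
// Hypothetical helper: truncate so that the UTF-16 encoding fits in n bytes.
function truncateByBytesUTF16(chars, n) {
    var units = Math.max(0, Math.floor(n / 2)); // each UTF-16 code unit is 2 bytes
    if (units >= chars.length) return chars;    // already fits
    // If the last kept unit is a high surrogate (0xD800-0xDBFF), its partner
    // would be cut off, so drop the high surrogate too.
    if (units > 0) {
        var last = chars.charCodeAt(units - 1);
        if (last >= 0xD800 && last <= 0xDBFF) units--;
    }
    return chars.substring(0, units);
}
```

If surrogates really don't matter to you, `chars.substring(0, Math.floor(n / 2))` alone does the job.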
If you are using UTF-8, there's a quick trick you can use to implement a UTF-8 encoder/decoder in JS with minimal code:
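// Encode a JS string as a "byte string": one character (code 0-255) per UTF-8 byte.
function toBytesUTF8(chars) {
    return unescape(encodeURIComponent(chars));
}

// Decode such a byte string back to a normal JS string; throws URIError on invalid UTF-8.
function fromBytesUTF8(bytes) {
    return decodeURIComponent(escape(bytes));
}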
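To see what these two helpers give you, here is a small sketch that measures the UTF-8 size of a few strings. The name `byteLengthUTF8` is just an illustrative helper, not something from the answer itself; the two encoder/decoder functions are repeated so the snippet runs on its own.

```javascript
// The two helpers from above, repeated so this snippet is self-contained.
function toBytesUTF8(chars) { return unescape(encodeURIComponent(chars)); }
function fromBytesUTF8(bytes) { return decodeURIComponent(escape(bytes)); }

// Hypothetical helper: size of a string once encoded as UTF-8.
function byteLengthUTF8(chars) {
    return toBytesUTF8(chars).length;
}

console.log(byteLengthUTF8("abc"));    // 3 (ASCII: one byte each)
console.log(byteLengthUTF8("\u00E9")); // 2 (e with acute accent)
console.log(byteLengthUTF8("\u20AC")); // 3 (euro sign)

// Round trip: decoding the byte string gives back the original text.
console.log(fromBytesUTF8(toBytesUTF8("na\u00EFve \u20AC")) === "na\u00EFve \u20AC"); // true
```

Note that `encodeURIComponent` throws a URIError if the string contains a lone surrogate, so text scraped from arbitrary pages may need a try-catch around the encoding step as well.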
Now you can truncate with:
function truncateByBytesUTF8(chars, n) {
    var bytes = toBytesUTF8(chars).substring(0, n);
    while (true) {
        try {
            return fromBytesUTF8(bytes);
        } catch (e) {
            // The cut fell inside a multibyte sequence: drop one more byte and retry.
        }
        bytes = bytes.substring(0, bytes.length - 1);
    }
}
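(The reason for the try-catch there is that if you truncate the bytes in the middle of a multibyte character sequence you'll get an invalid UTF-8 stream and decodeURIComponent will complain by throwing a URIError. Since a UTF-8 sequence is at most four bytes long, the loop drops at most three bytes before it succeeds.)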
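If you'd rather not rely on catching exceptions, a variation is to look at the bytes directly: in UTF-8 every continuation byte has the bit pattern 10xxxxxx, so you can back the cut point up until it no longer lands on one. This is a sketch of that alternative, not the method above; the name `truncateByBytesUTF8NoThrow` is made up for illustration.

```javascript
// Hypothetical alternative: find a safe cut point by inspecting the bytes
// instead of retrying on URIError.
function truncateByBytesUTF8NoThrow(chars, n) {
    var bytes = unescape(encodeURIComponent(chars)); // one char per UTF-8 byte
    if (bytes.length <= n) return chars;             // already fits
    var end = Math.max(n, 0);
    // bytes[end] is the first byte that would be dropped. If it is a
    // continuation byte (10xxxxxx), the cut is mid-sequence: move it back.
    while (end > 0 && (bytes.charCodeAt(end) & 0xC0) === 0x80) {
        end--;
    }
    return decodeURIComponent(escape(bytes.substring(0, end)));
}

console.log(truncateByBytesUTF8NoThrow("a\u20ACb", 3)); // "a" (the euro sign needs bytes 2-4)
console.log(truncateByBytesUTF8NoThrow("a\u20ACb", 4)); // "a" followed by the euro sign
```

For the 8 KB limit in the question you would call it as `truncateByBytesUTF8NoThrow(pageText, 8192)`, where `pageText` stands for whatever text you extracted from the page.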
If it's another multibyte encoding such as Shift-JIS or Big5, you're on your own.