Using JavaScript to truncate text to a certain size (8 KB)

南笙 2021-01-12 05:57

I'm using the Zemanta API, which accepts up to 8 KB of text per call. I'm extracting the text to send to Zemanta from Web pages using JavaScript, so I'm looking for a fun

4 answers

借酒劲吻你 (original poster)

2021-01-12 06:31
If you are using a single-byte encoding, yes, 8192 characters = 8192 bytes. If you are using UTF-16, every character(*) takes two bytes, so 8192 bytes = 4096 characters.

(*) Actually UTF-16 code units rather than characters, which is a slightly different thing in the face of surrogates, but let's not worry about that because JavaScript doesn't: its string length counts code units too.
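To make the byte arithmetic concrete, here is a small sketch that measures the same string under UTF-8 and UTF-16. The helper names `utf8ByteLength` and `utf16ByteLength` are made up for illustration, and the sketch assumes a runtime that provides `TextEncoder` (current browsers and recent Node.js).

```javascript
// Hypothetical helpers, not part of the original answer.
// Assumes TextEncoder is available as a global.

// Number of bytes the string occupies once encoded as UTF-8.
function utf8ByteLength(str) {
    return new TextEncoder().encode(str).length;
}

// Number of bytes in UTF-16: two per code unit, and String.length
// already counts code units (a surrogate pair counts as 2).
function utf16ByteLength(str) {
    return str.length * 2;
}

console.log(utf8ByteLength("héllo"));   // 6  (é takes two bytes)
console.log(utf16ByteLength("héllo"));  // 10
console.log(utf8ByteLength("日本語"));   // 9  (three bytes each)
console.log(utf16ByteLength("日本語"));  // 6
```

So an 8192-byte budget can mean anything from 8192 characters of plain ASCII in UTF-8 down to a few thousand characters of CJK text.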

If you are using UTF-8, there's a quick trick you can use to implement a UTF-8 encoder/decoder in JS with minimal code:
```
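// Returns a "byte string": each character's code (0-255) is one UTF-8 byte.
function toBytesUTF8(chars) {
    return unescape(encodeURIComponent(chars));
}
// Inverse of toBytesUTF8; throws URIError if the bytes are not valid UTF-8.
function fromBytesUTF8(bytes) {
    return decodeURIComponent(escape(bytes));
}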
```
Now you can truncate with:
```
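function truncateByBytesUTF8(chars, n) {
    // Encode to a byte string, then cut at the byte limit.
    var bytes = toBytesUTF8(chars).substring(0, n);
    while (true) {
        try {
            // Throws if the cut landed inside a multibyte sequence.
            return fromBytesUTF8(bytes);
        } catch (e) {}
        // Drop one trailing byte and retry; the empty string always decodes.
        bytes = bytes.substring(0, bytes.length - 1);
    }
}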
```
(The reason for the try-catch there is that if you truncate the bytes in the middle of a multibyte character sequence you'll get an invalid UTF-8 stream and decodeURIComponent will complain.)
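In runtimes that provide `TextEncoder` and `TextDecoder`, the same idea can be sketched without `escape`/`unescape` and without the retry loop. This is an alternative under that assumption, not the approach above: the function name `truncateUtf8` is hypothetical, and instead of catching decode errors it backs up past UTF-8 continuation bytes (those of the form `10xxxxxx`) so the cut always falls on a character boundary.

```javascript
// Hypothetical alternative; assumes TextEncoder/TextDecoder globals exist.
function truncateUtf8(str, maxBytes) {
    var bytes = new TextEncoder().encode(str);
    if (bytes.length <= maxBytes) {
        return str; // already fits, nothing to cut
    }
    var end = Math.max(0, maxBytes);
    // bytes[end] is the first byte that would be dropped. If it is a
    // continuation byte (10xxxxxx), the cut is mid-character: step back
    // until the first dropped byte starts a new character.
    while (end > 0 && (bytes[end] & 0xC0) === 0x80) {
        end--;
    }
    return new TextDecoder().decode(bytes.subarray(0, end));
}

console.log(truncateUtf8("héllo", 2)); // "h"  (é would be split, so it is dropped)
console.log(truncateUtf8("héllo", 3)); // "hé"
```

For the 8 KB limit in the question, the call would be `truncateUtf8(text, 8192)`, assuming the API counts UTF-8 bytes.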

If it's another multibyte encoding such as Shift-JIS or Big5, you're on your own.