Using JavaScript to truncate text to a certain size (8 KB)

南笙 2021-01-12 05:57

I'm using the Zemanta API, which accepts up to 8 KB of text per call. I'm extracting the text to send to Zemanta from Web pages using JavaScript, so I'm looking for a fun

4 Answers
  •  借酒劲吻你
    2021-01-12 06:31

    If you are using a single-byte encoding, yes, 8192 characters = 8192 bytes. If you are using UTF-16, each character(*) takes 2 bytes, so 8192 bytes hold only 4096 characters.

    (*) Strictly, that counts UTF-16 code units, which is a slightly different thing from code points in the face of surrogates, but let's not worry about that, because JavaScript string lengths count code units too.
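
    To make the byte arithmetic concrete, here is a minimal sketch that measures one string in both encodings. The helper names byteLengthUTF16 and byteLengthUTF8 are made up for illustration, and it assumes an environment where TextEncoder is available as a global (current browsers and recent Node.js versions).

```javascript
// Illustrative helpers (hypothetical names, not part of any API).
function byteLengthUTF16(str) {
    // str.length counts UTF-16 code units; each code unit is 2 bytes.
    return str.length * 2;
}
function byteLengthUTF8(str) {
    // TextEncoder always encodes to UTF-8 and returns a Uint8Array.
    return new TextEncoder().encode(str).length;
}

// "héllo" followed by an emoji that needs a surrogate pair in UTF-16.
var sample = "h\u00e9llo\ud83d\ude00";
console.log(sample.length);             // 7 code units
console.log(Array.from(sample).length); // 6 code points
console.log(byteLengthUTF16(sample));   // 14 bytes
console.log(byteLengthUTF8(sample));    // 10 bytes
```

    The same text has a different size in each encoding, so a byte limit only makes sense once you know which encoding the receiving API counts in.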

    If you are using UTF-8, there's a quick trick you can use to implement a UTF-8 encoder/decoder in JS with minimal code:

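    // Returns a "byte string": one character (0x00-0xFF) per UTF-8 byte.
    function toBytesUTF8(chars) {
        return unescape(encodeURIComponent(chars));
    }
    // Inverse of toBytesUTF8; throws a URIError if bytes is not valid UTF-8.
    function fromBytesUTF8(bytes) {
        return decodeURIComponent(escape(bytes));
    }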
    

    Now you can truncate with:

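    function truncateByBytesUTF8(chars, n) {
        // Cut the byte string at n bytes; this may split a multibyte sequence.
        var bytes = toBytesUTF8(chars).substring(0, n);
        while (true) {
            try {
                return fromBytesUTF8(bytes);
            } catch (e) {
                // Invalid UTF-8: drop the last byte and try again.
                bytes = bytes.substring(0, bytes.length - 1);
            }
        }
    }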
    

    (The reason for the try-catch there is that if you truncate the bytes in the middle of a multibyte character sequence you'll get an invalid UTF-8 stream and decodeURIComponent will throw a URIError.)
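
    The mid-sequence cut described above can also be avoided without try-catch. What follows is a different technique from the escape/unescape trick, sketched under the assumption that TextEncoder and TextDecoder are available as globals: it looks at the raw bytes and steps back past UTF-8 continuation bytes (those of the form 10xxxxxx) until the cut lands on a character boundary. The name truncateUTF8 is hypothetical.

```javascript
// Sketch: truncate str so that its UTF-8 encoding fits in maxBytes.
// Assumes maxBytes is a non-negative integer.
function truncateUTF8(str, maxBytes) {
    var bytes = new TextEncoder().encode(str);
    if (bytes.length <= maxBytes) {
        return str; // already fits
    }
    var end = maxBytes;
    // bytes[end] is the first byte that gets dropped. If it is a
    // continuation byte (10xxxxxx), the cut falls inside a multibyte
    // character, so move back to the start of that character.
    while (end > 0 && (bytes[end] & 0xC0) === 0x80) {
        end--;
    }
    return new TextDecoder("utf-8").decode(bytes.subarray(0, end));
}

// "a", the euro sign (3 bytes in UTF-8), "b": 5 bytes in total.
var text = "a\u20acb";
console.log(truncateUTF8(text, 2)); // "a"  (the euro sign does not fit)
console.log(truncateUTF8(text, 4)); // "a€"
console.log(truncateUTF8(text, 8)); // "a€b"
```

    For the 8 KB limit in the question, the call would look something like truncateUTF8(pageText, 8192), where pageText stands for whatever text was extracted from the page.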

    If it's another multibyte encoding such as Shift-JIS or Big5, you're on your own.
