I have a string containing binary data in JavaScript. Now I want to read, for example, an integer from it. So I get the first 4 characters, use charCodeAt
, do s
An improvement to Borgar's solution:
...
do {
st.unshift( ch & 0xFF ); // push byte to stack
ch = ch >> 8; // shift value down by 1 byte
}
while ( ch );
// add stack contents to result
// done because chars have "wrong" endianness
re = re.concat( st );
...
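The loop above can be tried in isolation. A minimal sketch, assuming a hypothetical helper name `codeToBytes` (it does not appear in Borgar's answer) that wraps just the byte-splitting step for a single char code:

```javascript
// Hypothetical helper: split one char code into bytes, most significant byte first.
function codeToBytes(ch) {
  var st = [];
  do {
    st.unshift(ch & 0xFF); // put the lowest remaining byte at the front of the stack
    ch = ch >> 8;          // shift value down by 1 byte
  } while (ch);
  return st;
}

console.log(codeToBytes("A".charCodeAt(0)));      // [ 65 ]
console.log(codeToBytes("\u017C".charCodeAt(0))); // "ż" is 0x017C -> [ 1, 124 ]
```

Note that these are the bytes of the UTF-16 code unit that charCodeAt returns, not the bytes of a UTF-8 encoding.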
While @Borgar answers the question correctly, his solution is pretty slow. It took me a while to track that down (I used his function somewhere in a larger project), so I thought I would share my insight.
I ended up with something like @Kadm's version. It's not just a few percent faster, it's about 500 times faster (no exaggeration!). I wrote a little benchmark, so you can see for yourself :)
function stringToBytesFaster ( str ) {
    var ch, st, re = [], j = 0;
    for ( var i = 0; i < str.length; i++ ) {
        ch = str.charCodeAt(i);
        if ( ch < 127 ) {
            re[j++] = ch & 0xFF;
        } else {
            st = []; // clear stack
            do {
                st.push( ch & 0xFF ); // push byte to stack
                ch = ch >> 8; // shift value down by 1 byte
            }
            while ( ch );
            // add stack contents to result
            // done because chars have "wrong" endianness
            st = st.reverse();
            for ( var k = 0; k < st.length; ++k )
                re[j++] = st[k];
        }
    }
    // return an array of bytes
    return re;
}
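The benchmark itself is not reproduced here. A rough sketch of how such a comparison might be set up follows. `stringToBytesSlow` is a reconstruction from the unshift/concat fragment quoted earlier and may differ from Borgar's actual function, `stringToBytesFast` is a condensed variant of the function above, and `timeIt`, the input size, and the sample text are arbitrary assumptions. Timings depend on the engine, so no numbers are claimed.

```javascript
// Assumed reconstruction of the slower variant (unshift + concat per character).
function stringToBytesSlow(str) {
  var ch, st, re = [];
  for (var i = 0; i < str.length; i++) {
    ch = str.charCodeAt(i);
    st = [];
    do {
      st.unshift(ch & 0xFF);
      ch = ch >> 8;
    } while (ch);
    re = re.concat(st); // allocates a new result array on every iteration
  }
  return re;
}

// Condensed variant of the faster function: write into the result by index.
function stringToBytesFast(str) {
  var ch, st, re = [], j = 0;
  for (var i = 0; i < str.length; i++) {
    ch = str.charCodeAt(i);
    if (ch < 128) {
      re[j++] = ch;
    } else {
      st = [];
      do {
        st.push(ch & 0xFF);
        ch = ch >> 8;
      } while (ch);
      // copy the stack back to front, most significant byte first
      for (var k = st.length - 1; k >= 0; k--) re[j++] = st[k];
    }
  }
  return re;
}

// Hypothetical timing helper: elapsed milliseconds for a single call.
function timeIt(fn, input) {
  var start = Date.now();
  fn(input);
  return Date.now() - start;
}

// Arbitrary input: a mixed ASCII / non-ASCII string repeated 2000 times.
var input = new Array(2001).join("za\u017C\u00F3\u0142\u0107 ");
console.log("slow: " + timeIt(stringToBytesSlow, input) + " ms");
console.log("fast: " + timeIt(stringToBytesFast, input) + " ms");
```

The gap should widen as the input grows, because the concat variant copies the whole result array for every character.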
One nice and quick hack is to use a combination of encodeURI and unescape:
t=[];
for(s=unescape(encodeURI("zażółć gęślą jaźń")),i=0;i<s.length;++i)
t.push(s.charCodeAt(i));
t
[122, 97, 197, 188, 195, 179, 197, 130, 196, 135, 32, 103, 196, 153, 197, 155, 108, 196, 133, 32, 106, 97, 197, 186, 197, 132]
Perhaps some explanation of why the heck it works is necessary, so let me split it into steps:
encodeURI("zażółć gęślą jaźń")
returns
"za%C5%BC%C3%B3%C5%82%C4%87%20g%C4%99%C5%9Bl%C4%85%20ja%C5%BA%C5%84"
which, if you look closely, is the original string in which every character with a value above 127 has been replaced by the hexadecimal representation of one or more bytes (its UTF-8 encoding). For example, the letter "ż" became "%C5%BC". encodeURI also escapes some regular ASCII characters, such as spaces, but that does not matter. What matters is that at this point each byte of the UTF-8 encoded original string is either represented verbatim (as is the case with "z", "a", "g", or "j") or as a percent-encoded sequence of bytes (as is the case with "ż", which was originally the two bytes 197 and 188 and got converted to %C5 and %BC).
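This step is easy to check character by character. A small sketch using only the built-in encodeURI:

```javascript
console.log(encodeURI("z"));      // z       (ASCII letter, left as is)
console.log(encodeURI(" "));      // %20     (ASCII, but still escaped)
console.log(encodeURI("\u017C")); // %C5%BC  ("ż", two UTF-8 bytes)
```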
Now, we apply unescape:
unescape("za%C5%BC%C3%B3%C5%82%C4%87%20g%C4%99%C5%9Bl%C4%85%20ja%C5%BA%C5%84")
which gives
"zaÅ¼Ã³ÅÄ gÄÅlÄ jaÅºÅ"
If you are not a native Polish speaker, you might not notice that this result is in fact very different from the original "zażółć gęślą jaźń". For starters, it has a different number of characters :) Surely you can tell that these strange versions of the capital letter A do not belong to the standard ASCII set. In fact, this "Å" has the value 197 (which is exactly C5 in hexadecimal).
Now, if you are like me, you would ask yourself: wait a minute... if this is really a sequence of bytes with values 122, 97, 197, 188, and JS is really using UTF, then why do I see these "Å¼" characters and not the original "ż"?
Well, the thing is (I believe) that this sequence 122, 97, 197, 188 (which we see when applying charCodeAt) is not a sequence of bytes but a sequence of character codes. The character "Å" has the code 197, but in UTF-8 it is actually a two-byte sequence: C3 85.
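The difference between codes and bytes can be checked directly. A short sketch with built-ins only (the variable name is just for illustration):

```javascript
var decoded = unescape("%C5%BC");            // what unescape makes of the two bytes of "ż"
console.log(decoded.length);                 // 2
console.log(decoded.charCodeAt(0));          // 197
console.log(decoded.charCodeAt(1));          // 188
console.log(decoded.charAt(0) === "\u00C5"); // true, the first character is "Å"
console.log(encodeURI("\u00C5"));            // %C3%85, the UTF-8 bytes of "Å" itself
```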
So the trick works because unescape treats the numbers in a percent-encoded string as character codes, not as byte values. To be more specific: unescape knows nothing about multibyte characters, so it decodes the bytes one by one. That works just fine for values below 128, but not so well for values above 127 that belong to a multibyte sequence: in such cases unescape simply returns a character whose code is equal to the requested byte value. This "bug" is actually a useful feature.
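Putting the two steps together, the hack can be wrapped in a function. A minimal sketch; the name `utf8Bytes` is hypothetical, and keep in mind that unescape is a legacy function that engines still ship but that is marked as deprecated:

```javascript
// Hypothetical wrapper around the encodeURI + unescape trick.
function utf8Bytes(str) {
  var s = unescape(encodeURI(str)); // one character per UTF-8 byte
  var bytes = [];
  for (var i = 0; i < s.length; ++i) {
    bytes.push(s.charCodeAt(i));    // each code is now a byte value (0..255)
  }
  return bytes;
}

console.log(utf8Bytes("\u017C")); // "ż" -> [ 197, 188 ]
console.log(utf8Bytes("za"));     // [ 122, 97 ]
```

One caveat: encodeURI throws a URIError if the string contains a lone surrogate, so input from untrusted sources may need a try/catch.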