For a vector of logical values, why does R allocate 4 bytes, when a bit vector would consume 1 bit per entry? (See this question for examples.)
Now, I realize that
Other answers have gotten at the (likely) architectural reasons that logical vectors are implemented taking the same space as integers. I wanted to point out the bit package, which implements a one-bit (no NA) logical.
Knowing a little something about R and S-Plus, I'd say that R most likely did it to be compatible with S-Plus, and S-Plus most likely did it because it was the easiest thing to do...
Basically, a logical vector is identical to an integer vector, so sum and other algorithms for integers work pretty much unchanged on logical vectors.
In 64-bit S-Plus, the integers are 64-bit and thus also the logical vectors! That's 8 bytes per logical value...
@Iterator is of course correct that a logical vector should be represented in a more compact form. Since there is already a raw vector type which is 1-byte, it would seem like a very simple change to use that one for logicals too. And 2 bits per value would of course be even better - I'd probably keep them as two separate bit vectors (TRUE/FALSE and NA/Valid), and the NA bit vector could be NULL if there are no NAs...
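To make the two-bit-vector idea concrete, here is a rough sketch in C. It is not how R stores logicals; the type packed_lgl, the LGL_* constants and all function names are hypothetical, made up for illustration. The NA plane is only allocated the first time an NA is stored, so it stays NULL for vectors without NAs.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical layout (NOT R's actual representation):
   one bit plane for TRUE/FALSE and an optional one marking NA entries. */
typedef struct {
    size_t   n;      /* number of logical entries */
    uint8_t *value;  /* bit i set -> entry i is TRUE */
    uint8_t *na;     /* bit i set -> entry i is NA; NULL while there are no NAs */
} packed_lgl;

enum { LGL_FALSE = 0, LGL_TRUE = 1, LGL_NA = 2 };

/* Number of bytes needed to hold n bits (at least one byte). */
size_t packed_lgl_bytes(size_t n) {
    size_t bytes = (n + 7) / 8;
    return bytes ? bytes : 1;
}

/* Allocate a vector of n entries, all FALSE, with no NA plane yet. */
packed_lgl *packed_lgl_new(size_t n) {
    packed_lgl *v = malloc(sizeof *v);
    if (!v) return NULL;
    v->n = n;
    v->na = NULL;
    v->value = calloc(packed_lgl_bytes(n), 1);
    if (!v->value) { free(v); return NULL; }
    return v;
}

/* Read entry i; the NA plane, if present, takes precedence. */
int packed_lgl_get(const packed_lgl *v, size_t i) {
    uint8_t mask = (uint8_t)(1u << (i % 8));
    if (v->na && (v->na[i / 8] & mask)) return LGL_NA;
    return (v->value[i / 8] & mask) ? LGL_TRUE : LGL_FALSE;
}

/* Write entry i; returns 0 on success, -1 if the NA plane cannot be allocated. */
int packed_lgl_set(packed_lgl *v, size_t i, int x) {
    uint8_t mask = (uint8_t)(1u << (i % 8));
    if (x == LGL_NA) {
        if (!v->na) {
            v->na = calloc(packed_lgl_bytes(v->n), 1);
            if (!v->na) return -1;
        }
        v->na[i / 8] |= mask;
        return 0;
    }
    if (v->na) v->na[i / 8] &= (uint8_t)~mask;  /* entry is valid again */
    if (x == LGL_TRUE) v->value[i / 8] |= mask;
    else               v->value[i / 8] &= (uint8_t)~mask;
    return 0;
}

void packed_lgl_free(packed_lgl *v) {
    if (!v) return;
    free(v->value);
    free(v->na);
    free(v);
}
```

With this layout a vector without NAs costs 1 bit per entry and one with NAs costs 2 bits, at the price of a mask and shift on every access.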
Anyway, that's mostly a dream since there are so many RAPI packages (packages that use the R C/FORTRAN APIs) out there that would break...
Without knowing R at all, I suspect it is for much the same reason as in C: it's way faster to load a value whose size equals the processor's native word size.
Loading a single bit would be slow, especially from a bitfield, since you'd have to mask out the bits that do not apply to your particular query. With a whole word you can just load it into a register and be done with it. Since the size difference usually is not a problem, the default implementation is to use a word-sized variable. If the user wants something else, there is always the option of doing the required bit-shifting manually.
Concerning packing, at least on some processors it will cause a fault to read from a non-aligned address. So while you might declare a structure with a single byte in it surrounded by two ints, the byte might be padded to be 4 bytes in size regardless. Again, I don't know anything about R in particular, but I suspect the behaviour might be the same for performance reasons.
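The padding effect can be checked with a small C sketch. The struct name padded and the helper functions are made up for illustration, and the exact numbers depend on the compiler and ABI: with a 4-byte, 4-byte-aligned int one would typically expect the byte to occupy a 4-byte slot and the whole struct to take 12 bytes, but the C standard does not guarantee that.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical example struct: one byte surrounded by two ints. */
struct padded {
    int  a;
    char b;  /* the single byte */
    int  c;
};

/* Bytes from the start of b to the start of c: b itself plus any padding. */
size_t slot_for_b(void) {
    return offsetof(struct padded, c) - offsetof(struct padded, b);
}

/* Total size of the struct, including all padding. */
size_t total_size(void) {
    return sizeof(struct padded);
}

/* Print the layout chosen by the current compiler and target. */
void print_layout(void) {
    printf("slot for b: %zu bytes, struct size: %zu bytes\n",
           slot_for_b(), total_size());
}
```

Calling print_layout() shows what your own toolchain does; anything above 1 for the slot is padding inserted to keep c aligned.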
Addressing a single bit in an array is quite a bit more involved. Say you have an array bitfield and want to address bit x in it; the code would be something like this:
int b = (bitfield[x / 8] >> (x % 8)) & 1;
to obtain either 0 or 1 for the bit you requested. Compare that with the straightforward array addressing needed to obtain value number x from a boolean array: bool a = array[x];
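As a minimal sketch of that bit addressing, assuming bitfield is an array of 8-bit bytes with bit 0 stored in the least significant position (bit_get and bit_set are illustrative names, not from any library):

```c
#include <stddef.h>
#include <stdint.h>

/* Read bit x: pick byte x / 8, shift the wanted bit down, mask the rest off. */
int bit_get(const uint8_t *bitfield, size_t x) {
    return (bitfield[x / 8] >> (x % 8)) & 1;
}

/* Write bit x: build a one-bit mask, then OR it in or AND it out. */
void bit_set(uint8_t *bitfield, size_t x, int on) {
    uint8_t mask = (uint8_t)(1u << (x % 8));
    if (on) bitfield[x / 8] |= mask;
    else    bitfield[x / 8] &= (uint8_t)~mask;
}
```

Every read costs a divide (or shift), a load, a shift and a mask, and every write is a read-modify-write of the containing byte, which is the overhead the plain array[x] access avoids.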