Why do logicals (booleans) in R require 4 bytes?

后端 未结 3 1459
情歌与酒
情歌与酒 2020-12-08 10:13

For a vector of logical values, why does R allocate 4 bytes, when a bit vector would consume 1 bit per entry? (See this question for examples.)

Now, I realize that

相关标签:
3条回答
  • 2020-12-08 10:20

    Other answers have gotten at the (likely) architectural reasons that logical vectors are implemented taking the same space as integers. I wanted to point out the bit package which implements a one-bit (no NA) logical.

    0 讨论(0)
  • 2020-12-08 10:28

    Knowing a little something about R and S-Plus, I'd say that R most likely did it to be compatible with S-Plus, and S-Plus most likely did it because it was the easiest thing to do...

    Basically, a logical vector is identical to an integer vector, so sum and other algorithms for integers work pretty much unchanged on logical vectors.

    In 64-bit S-Plus, the integers are 64-bit and thus also the logical vectors! That's 8 bytes per logical value...

    @Iterator is of course correct that a logical vector should be represented in a more compact form. Since there is already a raw vector type which is 1-byte, it would seem like a very simple change to use that one for logicals too. And 2 bits per value would of course be even better - I'd probably keep them as two separate bit vectors (TRUE/FALSE and NA/Valid), and the NA bit vector could be NULL if there are no NAs...

    Anyway, that's mostly a dream since there are so many RAPI packages (packages that use the R C/FORTRAN APIs) out there that would break...

    0 讨论(0)
  • 2020-12-08 10:42

    Without knowing R at all, I suspect for much the same reason as C does, because it's way faster to load a size equal to the processors native word size.

    Loading a single bit would be slow, especially from a bitfield since you'd have to mask out the bits that do not apply to your particular query. With a whole word you can just load it in a registry and be done with it. Since the size difference usually is not a problem the default implementation is to use a word sized variable. If the user wants something else there is always the option to do the bit-shifting required manually.

    Concerning packing, at least on some processors it will cause a fault to read from a non-aligned address. So while you might declare a structure with a single byte in it surrounded by two int the byte might be padded to be 4 bytes in size regardless. Again, I don't know anything about R in particular, but I suspect the behaviour might be the same for performance reasons.

    Addressing a single byte in an array is quite more involved, say you have an array bitfield and want to address bit x in it, the code would be something like this:

    bit b = (bitfield[x/8] >> (x % 8)) & 1
    

    to obtain either 0 or 1 for the bit you requested. In comparison to the straightforward array addressing of from a boolean array obtaining value number x: bool a = array[x]

    0 讨论(0)
提交回复
热议问题