Why are types always a certain size no matter their value?

谎友^ 2021-01-30 15:22

Implementations might differ in the actual sizes of types, but on most, types like unsigned int and float are always 4 bytes. But why does a type always occupy a certain…

19 answers
  •  说谎 (original poster)
    2021-01-30 16:10

    The compiler is supposed to produce assembler (and ultimately machine code) for some machine, and generally C++ tries to be sympathetic to that machine.

    Being sympathetic to the underlying machine means roughly: making it easy to write C++ code which will map efficiently onto the operations the machine can execute quickly. So, we want to provide access to the data types and operations that are fast and "natural" on our hardware platform.

    Concretely, consider a specific machine architecture. Let's take the current Intel x86 family.

    The Intel® 64 and IA-32 Architectures Software Developer’s Manual, vol. 1, section 3.4.1 says:

    The 32-bit general-purpose registers EAX, EBX, ECX, EDX, ESI, EDI, EBP, and ESP are provided for holding the following items:

    • Operands for logical and arithmetic operations

    • Operands for address calculations

    • Memory pointers

    So, we want the compiler to use these EAX, EBX etc. registers when it compiles simple C++ integer arithmetic. This means that when I declare an int, it should be something compatible with these registers, so that I can use them efficiently.

    The registers are always the same size (here, 32 bits), so my int variables will always be 32 bits as well. I'll use the same layout (little-endian) so that I don't have to do a conversion every time I load a variable value into a register, or store a register back into a variable.
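    As a rough sketch of what "always the same size" means in code: sizeof depends only on the type, never on the value stored. The helper names storage_size and bytes_of below are made up for illustration and are not part of the original answer.

```cpp
#include <array>
#include <climits>
#include <cstddef>
#include <cstring>

// Storage occupied by an int holding `value`. The result is the same for
// every value, because sizeof is determined by the type alone.
std::size_t storage_size(int value) {
    return sizeof value;
}

// Copy the raw bytes of an int into an array so the in-memory layout can be
// inspected. The array length is sizeof(int), fixed at compile time.
std::array<unsigned char, sizeof(int)> bytes_of(int value) {
    std::array<unsigned char, sizeof(int)> out{};
    std::memcpy(out.data(), &value, sizeof(int));
    return out;
}
```

    On an x86 target, bytes_of(1) would be expected to hold the 1 in its first element (little-endian), but that byte order is a property of the platform, not of C++ itself.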

    Using godbolt we can see exactly what the compiler does for some trivial code:

    int square(int num) {
        return num * num;
    }
    

    compiles (with GCC 8.1 and -fomit-frame-pointer -O3 for simplicity) to:

    square(int):
      imul edi, edi
      mov eax, edi
      ret
    

    this means:

    1. the int num parameter was passed in register EDI, meaning it's exactly the size and layout Intel expects for a native register. The function doesn't have to convert anything
    2. the multiplication is a single instruction (imul), which is very fast
    3. returning the result is simply a matter of copying it to another register (the caller expects the result to be put in EAX)

    Edit: we can add a relevant comparison to show the difference that using a non-native layout makes. The simplest case is storing values in something other than native width.

    Using godbolt again, we can compare a simple native multiplication

    unsigned mult (unsigned x, unsigned y)
    {
        return x*y;
    }
    
    mult(unsigned int, unsigned int):
      mov eax, edi
      imul eax, esi
      ret
    

    with the equivalent code for a non-standard width

    struct pair {
        unsigned x : 31;
        unsigned y : 31;
    };
    
    unsigned mult (pair p)
    {
        return p.x*p.y;
    }
    
    mult(pair):
      mov eax, edi
      shr rdi, 32
      and eax, 2147483647
      and edi, 2147483647
      imul eax, edi
      ret
    

    All the extra instructions are concerned with converting the input format (two 31-bit unsigned integers) into the format the processor can handle natively. If we wanted to store the result back into a 31-bit value, there would be another one or two instructions to do this.

    This extra complexity means you'd only bother with this when the space saving is very important. In this case we're only saving two bits compared to using the native unsigned or uint32_t type, which would have generated much simpler code.
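    The same conversion work can be written out by hand. This is only a sketch: the layout (x in bits 0..30, y in bits 32..62 of a 64-bit word) is an assumption chosen to mirror the assembly shown for mult(pair), and the names kMask31, pack, mult_packed and mult_native are hypothetical.

```cpp
#include <cstdint>

// Mask with the low 31 bits set (2147483647, as in the assembly above).
constexpr std::uint64_t kMask31 = 0x7FFFFFFFu;

// Assumed layout: x in bits 0..30, y in bits 32..62. Values wider than
// 31 bits are truncated, as a 31-bit field would do.
constexpr std::uint64_t pack(std::uint32_t x, std::uint32_t y) {
    return (x & kMask31) | ((y & kMask31) << 32);
}

// Non-native case: each field must be shifted and masked out of the packed
// word before the multiply can use it.
constexpr std::uint32_t mult_packed(std::uint64_t p) {
    const std::uint32_t x = static_cast<std::uint32_t>(p & kMask31);
    const std::uint32_t y = static_cast<std::uint32_t>((p >> 32) & kMask31);
    return x * y;
}

// Native case: the operands are already in register-sized form.
constexpr std::uint32_t mult_native(std::uint32_t x, std::uint32_t y) {
    return x * y;
}
```

    For inputs that fit in 31 bits, mult_packed(pack(x, y)) and mult_native(x, y) give the same result; the packed version just does more work to get there.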


    A note on dynamic sizes:

    The example above is still fixed-width values rather than variable-width, but the width (and alignment) no longer match the native registers.

    The x86 platform has several native sizes, including 8-bit and 16-bit in addition to the main 32-bit (I'm glossing over 64-bit mode and various other things for simplicity).

    These types (char, int8_t, uint8_t, int16_t etc.) are also directly supported by the architecture - partly for backwards compatibility with the older 8086/286/386 and later instruction sets.

    Choosing the smallest natural fixed-size type that will suffice can certainly be good practice - loads and stores are still quick, single instructions, you still get full-speed native arithmetic, and you can even improve performance by reducing cache misses.
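    A small sketch of the space side of that trade-off, using the fixed-width types from <cstdint>. The helpers footprint_bytes and sum are hypothetical names introduced here for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Bytes needed to store `count` elements of type T contiguously.
template <typename T>
constexpr std::size_t footprint_bytes(std::size_t count) {
    return count * sizeof(T);
}

// Sum the elements using a 32-bit accumulator. Each element is widened to
// register width for the arithmetic, whatever its storage width is.
template <typename T>
std::uint32_t sum(const std::vector<T>& values) {
    return std::accumulate(values.begin(), values.end(), std::uint32_t{0});
}
```

    With these, 1024 values stored as std::uint8_t take 1024 bytes against 4096 bytes as std::uint32_t, so four times as many elements fit in the same amount of cache, while every element can still be reached by plain indexing.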

    This is very different to variable-length encoding - I've worked with some of these, and they're horrible. Every load becomes a loop instead of a single instruction. Every store is also a loop. Every structure is variable-length, so you can't use arrays naturally.
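    To make the "every load becomes a loop" point concrete, here is a minimal sketch of one common variable-length scheme, a base-128 varint (7 payload bits per byte, high bit set when more bytes follow). It is chosen purely as an illustration and is not necessarily the encoding the answer has in mind; encode_varint and decode_varint are hypothetical names, and the decoder assumes well-formed input.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Append `value` to `out` as a base-128 varint: low 7 bits first, with the
// high bit of each byte set while more bytes follow.
void encode_varint(std::uint32_t value, std::vector<std::uint8_t>& out) {
    while (value >= 0x80) {
        out.push_back(static_cast<std::uint8_t>((value & 0x7F) | 0x80));
        value >>= 7;
    }
    out.push_back(static_cast<std::uint8_t>(value));
}

// Decode one varint starting at `pos` and advance `pos` past it.
// Sketch only: no bounds checking, input is assumed to be well formed.
std::uint32_t decode_varint(const std::vector<std::uint8_t>& in, std::size_t& pos) {
    std::uint32_t value = 0;
    int shift = 0;
    while (true) {
        const std::uint8_t byte = in[pos++];
        value |= static_cast<std::uint32_t>(byte & 0x7F) << shift;
        if ((byte & 0x80) == 0) {
            break;  // high bit clear: this was the last byte
        }
        shift += 7;
    }
    return value;
}
```

    Because each encoded value occupies between one and five bytes, the position of the n-th value cannot be computed by index arithmetic; the only way to find it is to decode everything before it.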


    A further note on efficiency

    In subsequent comments, you've been using the word "efficient", as far as I can tell with respect to storage size. We do sometimes choose to minimize storage size - it can be important when we're saving very large numbers of values to files, or sending them over a network. The trade-off is that we need to load those values into registers to do anything with them, and performing the conversion isn't free.

    When we discuss efficiency, we need to know what we're optimizing, and what the trade-offs are. Using non-native storage types is one way to trade processing speed for space, and sometimes makes sense. Using variable-length storage (for arithmetic types at least) trades more processing speed (and code complexity and developer time) for an often-minimal further saving of space.

    The speed penalty you pay for this means it's only worthwhile when you need to absolutely minimize bandwidth or long-term storage, and for those cases it's usually easier to use a simple and natural format - and then just compress it with a general-purpose system (like zip, gzip, bzip2, xz or whatever).


    tl;dr

    Each platform has one architecture, but you can come up with an essentially unlimited number of different ways to represent data. It's not reasonable for any language to provide an unlimited number of built-in data types. So, C++ provides implicit access to the platform's native, natural set of data types, and allows you to code any other (non-native) representation yourself.
