Implementations may differ in the actual sizes of types, but on most, types like unsigned int and float are always 4 bytes. But why does a type always occupy a certain size regardless of the value?
Because types fundamentally represent storage, and they are defined in terms of the maximum value they can hold, not the current value.
A very simple analogy would be a house: a house has a fixed size, regardless of how many people live in it, and there is also a building code which stipulates the maximum number of people who can live in a house of a certain size.
However, even if a single person is living in a house which can accommodate 10, the size of the house is not going to be affected by the current number of occupants.
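A minimal C++ sketch of this point: sizeof reports the storage reserved for the type, and is identical whatever value the object currently holds.

#include <climits>
#include <iostream>

int main() {
    int a = 1;         // small value
    int b = INT_MAX;   // the largest value an int can hold
    // sizeof depends only on the type, never on the current value:
    std::cout << sizeof(a) << ' ' << sizeof(b) << '\n';  // prints the same number twice
}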
Computer memory is subdivided into consecutively-addressed chunks of a certain size (often 8 bits, and referred to as bytes), and most computers are designed to efficiently access sequences of bytes that have consecutive addresses.
If an object's address never changes within the object's lifetime, then code given its address can quickly access the object in question. An essential limitation with this approach, however, is that if an address is assigned to object X, and then another address is assigned to object Y which is N bytes away, then X will not be able to grow larger than N bytes within the lifetime of Y, unless either X or Y is moved. In order for X to move, it would be necessary that everything in the universe that holds X's address be updated to reflect the new one, and likewise for Y to move. While it's possible to design a system to facilitate such updates (both Java and .NET manage it pretty well), it's much more efficient to work with objects that will stay in the same location throughout their lifetime, which in turn generally requires that their size remain constant.
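To make that concrete, here's a small sketch (the Record type is hypothetical, purely for illustration): two members placed at fixed, consecutive offsets, where the first can never grow without disturbing the second.

#include <cstddef>
#include <cstdio>

struct Record {
    int x;  // lives at a fixed offset for Record's whole lifetime
    int y;  // placed a fixed number of bytes after x, so x cannot grow
};

int main() {
    Record r{1, 2};
    // Any code holding &r.x or &r.y can keep using those addresses,
    // because they never change during r's lifetime.
    std::printf("offset of x: %zu, offset of y: %zu\n",
                offsetof(Record, x), offsetof(Record, y));
}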
The compiler is supposed to produce assembler (and ultimately machine code) for some machine, and generally C++ tries to be sympathetic to that machine.
Being sympathetic to the underlying machine means roughly: making it easy to write C++ code which will map efficiently onto the operations the machine can execute quickly. So, we want to provide access to the data types and operations that are fast and "natural" on our hardware platform.
Concretely, consider a specific machine architecture. Let's take the current Intel x86 family.
The Intel® 64 and IA-32 Architectures Software Developer's Manual, vol. 1, section 3.4.1 says:
The 32-bit general-purpose registers EAX, EBX, ECX, EDX, ESI, EDI, EBP, and ESP are provided for holding the following items:
• Operands for logical and arithmetic operations
• Operands for address calculations
• Memory pointers
So, we want the compiler to use these EAX, EBX etc. registers when it compiles simple C++ integer arithmetic. This means that when I declare an int, it should be something compatible with these registers, so that I can use them efficiently.
The registers are always the same size (here, 32 bits), so my int variables will always be 32 bits as well. I'll use the same layout (little-endian) so that I don't have to do a conversion every time I load a variable value into a register, or store a register back into a variable.
Using godbolt we can see exactly what the compiler does for some trivial code:
int square(int num) {
return num * num;
}
compiles (with GCC 8.1 and -fomit-frame-pointer -O3 for simplicity) to:
square(int):
imul edi, edi
mov eax, edi
ret
This means:

- the int num parameter was passed in register EDI, meaning it's exactly the size and layout Intel expects for a native register, so the function doesn't have to convert anything
- the multiplication is a single instruction (imul), which is very fast
- returning the result is just a copy into another register (the mov into EAX)

Edit: we can add a relevant comparison to show the difference that using a non-native layout makes. The simplest case is storing values in something other than native width.
Using godbolt again, we can compare a simple native multiplication
unsigned mult (unsigned x, unsigned y)
{
return x*y;
}
mult(unsigned int, unsigned int):
mov eax, edi
imul eax, esi
ret
with the equivalent code for a non-standard width
struct pair {
unsigned x : 31;
unsigned y : 31;
};
unsigned mult (pair p)
{
return p.x*p.y;
}
mult(pair):
mov eax, edi
shr rdi, 32
and eax, 2147483647
and edi, 2147483647
imul eax, edi
ret
All the extra instructions are concerned with converting the input format (two 31-bit unsigned integers) into the format the processor can handle natively. If we wanted to store the result back into a 31-bit value, there would be another one or two instructions to do this.
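As a sketch of that store cost (set_x is a hypothetical helper; the exact instruction count depends on the compiler): writing into a 31-bit field forces a mask-and-merge rather than a plain 32-bit store.

#include <cstdint>

struct pair {        // same struct as above
    unsigned x : 31;
    unsigned y : 31;
};

void set_x(pair& p, unsigned v) {
    // The compiler typically masks v to 31 bits and merges it with the
    // neighbouring bits, instead of emitting a single full-width store.
    p.x = v;
}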
This extra complexity means you'd only bother with this when the space saving is very important. In this case we're only saving two bits compared to using the native unsigned or uint32_t type, which would have generated much simpler code.
The example above is still fixed-width values rather than variable-width, but the width (and alignment) no longer match the native registers.
The x86 platform has several native sizes, including 8-bit and 16-bit in addition to the main 32-bit (I'm glossing over 64-bit mode and various other things for simplicity).
These types (char, int8_t, uint8_t, int16_t etc.) are also directly supported by the architecture, partly for backwards compatibility with the older 8086/286/386 etc. instruction sets.
It's certainly the case that choosing the smallest natural fixed-size type that will suffice can be good practice: loads and stores are still quick single instructions, you still get full-speed native arithmetic, and you can even improve performance by reducing cache misses.
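For instance (a rough sketch, with made-up element counts): shrinking the element type keeps loads and stores single instructions while fitting four times as many elements per cache line.

#include <cstdint>
#include <vector>

int main() {
    // If the values are known to fit in 8 bits, the smaller native type
    // quarters the footprint: ~1 MB instead of ~4 MB for a million elements.
    std::vector<std::int8_t>  small(1000000);
    std::vector<std::int32_t> large(1000000);
    long long sum = 0;
    for (auto v : small) sum += v;  // still full-speed native arithmetic
    return static_cast<int>(sum);
}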
This is very different to variable-length encoding - I've worked with some of these, and they're horrible. Every load becomes a loop instead of a single instruction. Every store is also a loop. Every structure is variable-length, so you can't use arrays naturally.
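To illustrate, here's a sketch of a LEB128-style variable-length load (the function and format are illustrative, not any specific system I worked with): reading one value is a data-dependent loop, and you only learn how big the value was after decoding it.

#include <cstddef>
#include <cstdint>

// 7 payload bits per byte; a set high bit means "more bytes follow".
std::uint64_t load_varint(const std::uint8_t* p, std::size_t* consumed) {
    std::uint64_t value = 0;
    int shift = 0;
    std::size_t i = 0;
    while (true) {                      // a loop where a fixed-width load
        std::uint8_t byte = p[i++];     // would be a single instruction
        value |= static_cast<std::uint64_t>(byte & 0x7F) << shift;
        if ((byte & 0x80) == 0) break;  // high bit clear: last byte
        shift += 7;
    }
    *consumed = i;  // element size is only known after the fact,
    return value;   // which is why arrays can't be indexed naturally
}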
In subsequent comments, you've been using the word "efficient", as far as I can tell with respect to storage size. We do sometimes choose to minimize storage size - it can be important when we're saving very large numbers of values to files, or sending them over a network. The trade-off is that we need to load those values into registers to do anything with them, and performing the conversion isn't free.
When we discuss efficiency, we need to know what we're optimizing, and what the trade-offs are. Using non-native storage types is one way to trade processing speed for space, and sometimes makes sense. Using variable-length storage (for arithmetic types at least), trades more processing speed (and code complexity and developer time) for an often-minimal further saving of space.
The speed penalty you pay for this means it's only worthwhile when you need to absolutely minimize bandwidth or long-term storage, and for those cases it's usually easier to use a simple and natural format, and then just compress it with a general-purpose system (like zip, gzip, bzip2, xz or whatever).
Each platform has one architecture, but you can come up with an essentially unlimited number of different ways to represent data. It's not reasonable for any language to provide an unlimited number of built-in data types. So, C++ provides implicit access to the platform's native, natural set of data types, and allows you to code any other (non-native) representation yourself.
Why does a type have only one size associated with it when the space required to represent the value might be smaller than that size?
Primarily because of alignment requirements.
As per basic.align/1:
Object types have alignment requirements which place restrictions on the addresses at which an object of that type may be allocated.
Think of a building that has many floors and each floor has many rooms.
Each room is like a type: a fixed space, capable of holding at most N people or objects.
With the room sizes known beforehand, the structural design of the building can be laid out soundly.
If the rooms were not aligned, the building's skeleton wouldn't be well-structured.
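In C++ terms, alignof exposes the requirement, and the compiler inserts padding so every member lands on an allowed address. A minimal sketch (the typical values in the comments can differ per platform):

#include <iostream>

struct Padded {
    char c;  // 1 byte
    int  i;  // must start on an address divisible by alignof(int)
};

int main() {
    // Padding is typically inserted after 'c' so that 'i' is properly
    // aligned, making sizeof(Padded) 8 rather than 5.
    std::cout << alignof(int) << ' ' << sizeof(Padded) << '\n';  // typically: 4 8
}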
Java uses classes called BigInteger and BigDecimal to do exactly this, as apparently does C++'s GMP class interface (thanks Digital Trauma). You can easily do it yourself in pretty much any language if you want.
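For example, a quick sketch using GMP's C++ interface (mpz_class; link with -lgmpxx -lgmp), where the storage grows to fit whatever value you compute:

#include <gmpxx.h>
#include <iostream>

int main() {
    mpz_class n = 1;
    for (int i = 2; i <= 100; ++i)
        n *= i;              // 100! overflows every built-in integer type
    std::cout << n << '\n';  // mpz_class simply grew its storage as needed
}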
CPUs have always had the ability to use BCD (Binary Coded Decimal), which is designed to support operations of any length (but you tend to manually operate on one byte at a time, which would be SLOW by today's GPU standards).
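A rough sketch of what byte-at-a-time packed-BCD addition looks like in software (illustrative only, not any particular CPU's BCD instructions): every operation becomes a loop over digit pairs with manual carries.

#include <cstddef>
#include <cstdint>

// Add two packed-BCD numbers (two decimal digits per byte, least
// significant byte first). Note the per-byte loop and manual carry.
void bcd_add(const std::uint8_t* a, const std::uint8_t* b,
             std::uint8_t* out, std::size_t n) {
    int carry = 0;
    for (std::size_t i = 0; i < n; ++i) {
        int lo = (a[i] & 0x0F) + (b[i] & 0x0F) + carry;
        carry = lo > 9;
        if (carry) lo -= 10;
        int hi = (a[i] >> 4) + (b[i] >> 4) + carry;
        carry = hi > 9;
        if (carry) hi -= 10;
        out[i] = static_cast<std::uint8_t>((hi << 4) | lo);
    }
}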
The reason we don't use these or other similar solutions? Performance. Your most highly performant languages can't afford to go expanding a variable in the middle of some tight loop operation--it would be very non-deterministic.
In mass storage and transport situations, packed values are often the ONLY type of value you would use. For example, a music/video packet being streamed to your computer might spend a bit to specify if the next value is 2 bytes or 4 bytes as a size optimization.
Once it's on your computer where it can be used, though, memory is cheap but the speed and complication of resizable variables are not. That's really the only reason.
There are pretty substantial runtime performance benefits to fixed-size types. If you were to operate on variable-size types, you would have to decode each number before doing the operation (machine code instructions are typically fixed-width), do the operation, then find a space in memory big enough to hold the result. Those are very expensive operations. It's much easier to simply store all of the data slightly inefficiently.
This is not always how it is done. Consider Google's Protobuf protocol. Protobufs are designed to transmit data very efficiently. Decreasing the number of bytes transmitted is worth the cost of additional instructions when operating on the data. Accordingly, protobufs use an encoding which encodes integers in 1, 2, 3, 4, or 5 bytes, and smaller integers take fewer bytes. Once the message is received, however, it is unpacked into a more traditional fixed-size integer format which is easier to operate on. It's only during network transmission that they use such a space-efficient variable length integer.
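As a sketch of that wire format (encode_varint is a hypothetical helper following the protobuf varint scheme of 7 payload bits per byte plus a continuation bit):

#include <cstddef>
#include <cstdint>

// Small values take 1 byte; a full 32-bit value takes up to 5.
std::size_t encode_varint(std::uint32_t value, std::uint8_t* out) {
    std::size_t n = 0;
    while (value >= 0x80) {
        out[n++] = static_cast<std::uint8_t>(value) | 0x80;  // more to come
        value >>= 7;
    }
    out[n++] = static_cast<std::uint8_t>(value);  // final byte, high bit clear
    return n;
}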