Implementations might differ in the actual sizes of types, but on most, types like unsigned int and float are 4 bytes. But why does a type always occupy a certain amount of memory, regardless of the value stored in it?
It can be less. Consider the function:
int foo()
{
    int bar = 1;
    int baz = 42;
    return bar + baz;
}
It compiles to this assembly code (g++, x64, details stripped):
movl    $43, %eax
ret
Here, bar and baz end up using zero bytes to represent.
There are objects in the C++ standard library that, in some sense, have variable size, such as std::vector. However, these all dynamically allocate the extra memory they need. If you take sizeof(std::vector<int>), you will get a constant that has nothing to do with the memory managed by the object, and if you allocate an array or structure containing std::vector<int>, it will reserve this base size rather than putting the extra storage in the same array or structure. There are a few pieces of C syntax that support something like this, notably variable-length arrays and structures, but C++ chose not to support them.
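A quick way to see this (the exact numbers are implementation details, so treat the values in the comments as merely typical for a 64-bit implementation):

#include <iostream>
#include <vector>

int main()
{
    std::vector<int> small;              // empty vector
    std::vector<int> big(1'000'000, 7);  // a million ints, allocated on the heap

    // Both objects have the same size: just the pointer/size bookkeeping the
    // vector itself holds. The million ints live in separately allocated
    // storage that sizeof() does not see.
    std::cout << sizeof(small) << '\n';              // e.g. 24 on a typical 64-bit system
    std::cout << sizeof(big) << '\n';                // same value
    std::cout << sizeof(std::vector<int>) << '\n';   // a compile-time constant
}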
The language standard defines object size that way so that compilers can generate efficient code. For example, if int happens to be 4 bytes long on some implementation, and you declare a as a pointer to or array of int values, then a[i] translates into the pseudocode, "dereference the address a + 4×i." This can be done in constant time, and it is such a common and important operation that many instruction-set architectures, including x86 and the DEC PDP machines on which C was originally developed, can do it in a single machine instruction.
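A small sketch of what that indexing means; the 4-byte figure in the comment is only an assumption that holds on typical implementations, the code itself uses sizeof(int):

#include <cassert>
#include <cstdint>

int main()
{
    int a[8] = {10, 20, 30, 40, 50, 60, 70, 80};
    int i = 3;

    // a[i] is defined as *(a + i): advance i elements, not i bytes.
    assert(a[i] == *(a + i));

    // In terms of raw addresses, the element lives at a + i * sizeof(int),
    // which is "dereference the address a + 4*i" when sizeof(int) == 4.
    auto base = reinterpret_cast<std::uintptr_t>(a);
    auto elem = reinterpret_cast<std::uintptr_t>(&a[i]);
    assert(elem == base + i * sizeof(int));
}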
One common real-world example of data stored consecutively as variable-length units is strings encoded as UTF-8. (However, the underlying type of a UTF-8 string to the compiler is still char, which has width 1. This allows ASCII strings to be interpreted as valid UTF-8, and a lot of library code such as strlen() and strncpy() to continue to work.) The encoding of any UTF-8 codepoint can be one to four bytes long, and therefore, if you want the fifth UTF-8 codepoint in a string, it could begin anywhere from the fifth byte to the seventeenth byte of the data. The only way to find it is to scan from the beginning of the string and check the size of each codepoint. If you want to find the fifth grapheme, you also need to check the character classes. If you wanted to find the millionth UTF-8 character in a string, you'd need to run this loop a million times! If you know you will need to work with indices often, you can traverse the string once and build an index of it, or you can convert to a fixed-width encoding, such as UCS-4. Finding the millionth UCS-4 character in a string is just a matter of adding four million to the address of the array.
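As a rough illustration of that scan, here is a hypothetical helper (utf8_offset_of is my own name, and it assumes well-formed UTF-8 with no validation) that derives each codepoint's length from its leading byte:

#include <cstddef>
#include <string>

// Byte offset of the n-th codepoint (0-based) in a well-formed UTF-8 string.
// The codepoint length follows from the leading byte:
//   0xxxxxxx -> 1 byte, 110xxxxx -> 2, 1110xxxx -> 3, 11110xxx -> 4.
std::size_t utf8_offset_of(const std::string& s, std::size_t n)
{
    std::size_t offset = 0;
    for (std::size_t cp = 0; cp < n && offset < s.size(); ++cp) {
        unsigned char lead = static_cast<unsigned char>(s[offset]);
        if      (lead < 0x80) offset += 1;
        else if (lead < 0xE0) offset += 2;
        else if (lead < 0xF0) offset += 3;
        else                  offset += 4;
    }
    return offset;  // O(n): every codepoint before the n-th has to be walked
}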
Another complication with variable-length data is that, when you allocate it, you either need to allocate as much memory as it could ever possibly use, or else dynamically reallocate as needed. Allocating for the worst case could be extremely wasteful. If you need a consecutive block of memory, reallocating could force you to copy all the data over to a different location, but allowing the memory to be stored in non-consecutive chunks complicates the program logic.
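std::vector shows the reallocate-and-copy strategy in action; when exactly the buffer moves depends on the implementation's growth policy, so the output is only indicative:

#include <iostream>
#include <vector>

int main()
{
    std::vector<int> v;
    const int* last = v.data();

    for (int i = 0; i < 100; ++i) {
        v.push_back(i);
        if (v.data() != last) {
            // Capacity was exhausted: the vector allocated a bigger block
            // elsewhere and copied (or moved) every existing element into it.
            std::cout << "reallocated at size " << v.size()
                      << ", new capacity " << v.capacity() << '\n';
            last = v.data();
        }
    }
}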
So, it's possible to have variable-length bignums instead of fixed-width short int, int, long int and long long int, but it would be inefficient to allocate and use them. Additionally, all mainstream CPUs are designed to do arithmetic on fixed-width registers, and none have instructions that directly operate on some kind of variable-length bignum. Those would need to be implemented in software, much more slowly.
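To make the "implemented in software" point concrete, here is a hypothetical sketch of adding two variable-length bignums stored as 32-bit limbs (the bignum alias and add function are illustrative, not a real library): a loop with carry propagation replaces what is a single add instruction for a fixed-width int.

#include <cstdint>
#include <vector>

// Little-endian limbs: value = sum of limbs[i] * 2^(32*i).
using bignum = std::vector<std::uint32_t>;

bignum add(const bignum& a, const bignum& b)
{
    bignum result;
    std::uint64_t carry = 0;
    for (std::size_t i = 0; i < a.size() || i < b.size() || carry; ++i) {
        std::uint64_t sum = carry;
        if (i < a.size()) sum += a[i];
        if (i < b.size()) sum += b[i];
        result.push_back(static_cast<std::uint32_t>(sum));  // keep the low 32 bits
        carry = sum >> 32;                                   // propagate the rest
    }
    return result;
}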
In the real world, most (but not all) programmers have decided that the benefits of UTF-8 encoding, especially compatibility, are important, and that we so rarely care about anything other than scanning a string from front to back or copying blocks of memory that the drawbacks of variable width are acceptable. We could use packed, variable-width elements similar to UTF-8 for other things. But we very rarely do, and they aren’t in the standard library.
There are a few reasons. One is the added complexity of handling arbitrary-sized numbers and the performance hit this causes, because the compiler can no longer optimize based on the assumption that every int is exactly X bytes long.
A second one is that storing simple types this way means they need an additional byte to hold the length. So, a value of 255 or less actually needs two bytes in this new system, not one, and in the worst case you now need 5 bytes instead of 4. This means that the performance win in terms of memory used is less than you might think and in some edge cases might actually be a net loss.
A third reason is that computer memory is generally addressable in words, not bytes (but see the footnote). Words are a multiple of bytes, usually 4 on 32-bit systems and 8 on 64-bit systems. You usually can't read an individual byte; you read a word and extract the nth byte from that word. This means both that extracting individual bytes from a word takes a bit more effort than just reading the entire word, and that it is very efficient if the entire memory is evenly divided into word-sized (i.e., 4-byte) chunks. If you have arbitrarily sized integers floating around, you might end up with one part of an integer in one word and another part in the next word, necessitating two reads to get the full integer (there is a small sketch of this after the footnotes below).
Footnote: To be more precise, while memory is addressed in bytes, most systems ignore the 'uneven' addresses, i.e., addresses 0, 1, 2 and 3 all read the same word; 4, 5, 6 and 7 read the next word, and so on.
On an unrelated note, this is also why 32-bit systems had a max of 4 GB of memory. The registers used to address locations in memory are usually large enough to hold a word, i.e. 4 bytes, which has a max value of (2^32)-1 = 4294967295. 4294967296 bytes is 4 GB.
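Here is that sketch: a toy calculation, assuming 4-byte words, of how many word reads a 4-byte integer needs depending on where it happens to start.

#include <cstddef>
#include <iostream>

int main()
{
    const std::size_t word_size = 4;   // assumed word size in bytes
    const std::size_t int_size  = 4;   // assumed sizeof(int)

    for (std::size_t offset : {0u, 4u, 6u}) {
        std::size_t first_word = offset / word_size;
        std::size_t last_word  = (offset + int_size - 1) / word_size;
        std::cout << "int at byte offset " << offset << " touches "
                  << (last_word - first_word + 1) << " word(s)\n";
        // offsets 0 and 4 -> 1 word; offset 6 -> 2 words, i.e. two memory reads
    }
}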
so why would myInt not just occupy 1 byte of memory?
Because you told it to use that much. When you use an unsigned int, the standard and the common platform conventions together dictate that 4 bytes will be used and that its available range will be 0 to 4,294,967,295. If you were to use an unsigned char instead, you would probably only be using the 1 byte that you're looking for (depending on the standard; C++ normally follows these standards).
If it weren't for these standards, you'd have to keep this in mind: how is the compiler or CPU supposed to know to only use 1 byte instead of 4? Later on in your program you might add to or multiply that value, which would require more space. Whenever you make a memory allocation, the OS has to find, map, and give you that space (potentially swapping memory to virtual RAM as well); this can take a long time. If you allocate the memory beforehand, you won't have to wait for another allocation to be completed.
As for the reason why we use 8 bits per byte, you can take a look at this: What is the history of why bytes are eight bits?
On a side note, you could allow the integer to overflow; but should you use a signed integer, the C/C++ standards state that integer overflow results in undefined behavior (see Integer overflow).
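A small illustration of the difference: unsigned wrap-around is well defined by the standard, while the signed case is the undefined one, so the code below only points at it in a comment.

#include <iostream>
#include <limits>

int main()
{
    unsigned char small = 200;
    small = static_cast<unsigned char>(small + 100);  // 300 doesn't fit in 1 byte:
    std::cout << +small << '\n';                      // wraps to 300 - 256 = 44

    unsigned int big = 200;
    big = big + 100;                                  // plenty of room in 4 bytes
    std::cout << big << '\n';                         // 300

    int s = std::numeric_limits<int>::max();
    // s + 1 would be signed integer overflow: undefined behavior, so don't do it.
    std::cout << s << '\n';
}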
Because in a language like C++, a design goal is that simple operations compile down to simple machine instructions.
All mainstream CPU instruction sets work with fixed-width types, and if you want variable-width types, you have to use multiple machine instructions to handle them.
As for why the underlying computer hardware is that way: It's because it's simpler, and more efficient for many cases (but not all).
Imagine the computer as a piece of tape:
| xx | xx | xx | xx | xx | xx | xx | xx | xx | xx | xx | xx | xx | ...
If you simply tell the computer to look at the first byte on the tape, xx, how does it know whether the type stops there or continues on to the next byte? If you have a number like 255 (hexadecimal FF) or a number like 65535 (hexadecimal FFFF), the first byte is FF either way.
So how do you know? You have to add additional logic, and "overload" the meaning of at least one bit or byte value to indicate that the value continues to the next byte. That logic is never "free": either you emulate it in software, or you add a bunch of additional transistors to the CPU to do it.
The fixed-width types of languages like C and C++ reflect that.
It doesn't have to be this way, and more abstract languages which are less concerned with mapping to maximally efficient code are free to use variable-width encodings (also known as "Variable Length Quantities" or VLQ) for numeric types.
Further reading: If you search for "variable length quantity" you can find some examples of where that kind of encoding is actually efficient and worth the additional logic. It's usually when you need to store a huge amount of values which might be anywhere within a large range, but most values tend towards some small sub-range.
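As a rough sketch of such an encoding, here is one common VLQ convention, 7 payload bits per byte with the high bit meaning "more bytes follow" (details vary between formats, so this is illustrative rather than any particular standard):

#include <cstdint>
#include <vector>

// Encode an unsigned 64-bit value as a little-endian base-128 VLQ:
// each byte carries 7 bits of payload; the high bit says "keep reading".
std::vector<std::uint8_t> vlq_encode(std::uint64_t value)
{
    std::vector<std::uint8_t> out;
    do {
        std::uint8_t byte = value & 0x7F;
        value >>= 7;
        if (value != 0) byte |= 0x80;  // continuation bit
        out.push_back(byte);
    } while (value != 0);
    return out;  // 0..127 take 1 byte, 128..16383 take 2 bytes, and so on
}

std::uint64_t vlq_decode(const std::vector<std::uint8_t>& bytes)
{
    std::uint64_t value = 0;
    int shift = 0;
    for (std::uint8_t byte : bytes) {
        value |= static_cast<std::uint64_t>(byte & 0x7F) << shift;
        shift += 7;
        if ((byte & 0x80) == 0) break;  // last byte of this value
    }
    return value;
}

Note how this lines up with the earlier overhead point: 255 already needs two bytes here, while values up to 127 need only one.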
Note that if a compiler can prove that it can get away with storing the value in a smaller amount of space without breaking any code (for example it's a variable only visible internally within a single translation unit), and its optimization heuristics suggest that it'll be more efficient on the target hardware, it's entirely allowed to optimize it accordingly and store it in a smaller amount of space, so long as the rest of the code works "as if" it did the standard thing.
But when the code has to interoperate with other code that might be compiled separately, sizes have to stay consistent, or every piece of code has to follow the same convention.
Because if it's not consistent, there's this complication: what if I have int x = 255; but then later in the code I do x = y? If int could be variable-width, the compiler would have to know ahead of time to pre-allocate the maximum amount of space it'll need. That's not always possible, because what if y is an argument passed in from another piece of code that's compiled separately?
The compiler is allowed to make a lot of changes to your code, as long as things still work (the "as-if" rule).
It would be possible to use an 8-bit literal move instruction instead of the longer (32/64-bit) one required to move a full int. However, you would need two instructions to complete the load, since you would have to set the register to zero first before doing the load. It is simply more efficient (at least according to the main compilers) to handle the value as 32 bits. Actually, I've yet to see an x86/x86_64 compiler that would do an 8-bit load without inline assembly.
However, things are different when it comes to 64 bits. When designing the previous extension (from 16 to 32 bits) of their processors, Intel made a mistake. Here is a good representation of what they look like. The main takeaway here is that when you write to AL or AH, the other is not affected (fair enough, that was the point, and it made sense back then). But it gets interesting when they expanded it to 32 bits. If you write the bottom bits (AL, AH or AX), nothing happens to the upper 16 bits of EAX, which means that if you want to promote a char to an int, you need to clear those upper bits first, yet you have no way of actually using only these top 16 bits, making this "feature" more of a pain than anything.
Now with 64 bits, AMD did a much better job. If you touch anything in the lower 32 bits, the upper 32 bits are simply set to 0. This leads to some actual optimizations that you can see in this godbolt example. You can see that loading something of 8 bits or 32 bits is done the same way, but when you use 64-bit variables, the compiler uses a different instruction depending on the actual size of your literal.
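You can reproduce this locally with something like the following; the exact output depends on the compiler and optimization flags, and the comments only describe what mainstream x86-64 compilers typically emit:

#include <cstdint>

std::uint8_t  ret8()        { return 42; }  // typically: mov eax, 42  (a 32-bit move)
std::uint32_t ret32()       { return 42; }  // typically: mov eax, 42  (same instruction)
std::uint64_t ret64_small() { return 42; }  // typically: mov eax, 42  (upper 32 bits zeroed for free)
std::uint64_t ret64_big()
{
    return 0x1234567890ABCDEF;              // needs the 64-bit form: movabs rax, imm64
}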
So you can see here that compilers can totally change the actual size of your variable inside the CPU if it produces the same result, but it makes no sense to do so for smaller types.