Implementations may differ in the actual sizes of types, but on most, types like `unsigned int` and `float` are always 4 bytes. But why does a type always occupy a certain amount of memory?
The C++ standard library has objects that in some sense have variable size, such as `std::vector`. However, these all dynamically allocate the extra memory they will need. If you take `sizeof(std::vector<int>)`, you will get a constant that has nothing to do with the memory managed by the object, and if you allocate an array or structure containing `std::vector<int>`, it will reserve this base size rather than putting the extra storage in the same array or structure. There are a few pieces of C syntax that support something like this, notably variable-length arrays and structures with flexible array members, but C++ did not choose to support them.
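A quick way to see this is to print `sizeof` for two vectors of very different lengths (a minimal sketch; the exact value, often 24 bytes on a 64-bit implementation, is implementation-specific):

```cpp
#include <iostream>
#include <vector>

int main() {
    std::vector<int> small;               // empty
    std::vector<int> big(1000000, 42);    // one million elements on the heap

    // Both print the same number: the size of the vector object itself
    // (roughly a few pointers' worth), not the dynamically allocated
    // storage it manages.
    std::cout << sizeof(small) << '\n';
    std::cout << sizeof(big)   << '\n';
}
```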
The language standard defines object size that way so that compilers can generate efficient code. For example, if `int` happens to be 4 bytes long on some implementation, and you declare `a` as a pointer to or array of `int` values, then `a[i]` translates into the pseudocode, “dereference the address a + 4×i.” This can be done in constant time, and is such a common and important operation that many instruction-set architectures, including x86 and the DEC PDP machines on which C was originally developed, can do it in a single machine instruction.
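As a rough sketch of that lowering (the exact instruction chosen is up to the compiler and ABI), indexing an `int` array is a single scaled-index load on x86-64:

```cpp
#include <cstddef>

int element(const int* a, std::size_t i) {
    // a[i] is defined as *(a + i). Where sizeof(int) == 4, this becomes
    // "load from address a + 4*i", which x86-64 can express in one
    // instruction, e.g.  mov eax, dword ptr [rdi + rsi*4]
    return a[i];
}
```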
One common real-world example of data stored consecutively as variable-length units is strings encoded as UTF-8. (However, to the compiler, the underlying type of a UTF-8 string is still `char`, which has width 1. This allows ASCII strings to be interpreted as valid UTF-8, and a lot of library code such as `strlen()` and `strncpy()` to continue to work.) The encoding of any UTF-8 codepoint can be one to four bytes long, and therefore, if you want the fifth UTF-8 codepoint in a string, it could begin anywhere from the fifth byte to the seventeenth byte of the data. The only way to find it is to scan from the beginning of the string and check the size of each codepoint. If you want to find the fifth grapheme, you also need to check the character classes. If you wanted to find the millionth UTF-8 codepoint in a string, you’d need to run this loop a million times! If you know you will need to work with indices often, you can traverse the string once and build an index of it, or you can convert to a fixed-width encoding, such as UCS-4. Finding the millionth UCS-4 character in a string is just a matter of adding four million to the address of the array.
Another complication with variable-length data is that, when you allocate it, you either need to allocate as much memory as it could ever possibly use, or else dynamically reallocate as needed. Allocating for the worst case could be extremely wasteful. If you need a consecutive block of memory, reallocating could force you to copy all the data over to a different location, but allowing the memory to be stored in non-consecutive chunks complicates the program logic.
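You can watch the reallocate-and-copy strategy at work in `std::vector`, which grows its buffer by an implementation-defined factor; this sketch just reports each time the capacity changes:

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

int main() {
    std::vector<int> v;
    std::size_t last_capacity = 0;

    // Appending one element at a time: whenever capacity runs out, the
    // vector allocates a larger block and moves every element into it,
    // exactly the reallocation cost described above.
    for (int i = 0; i < 100; ++i) {
        v.push_back(i);
        if (v.capacity() != last_capacity) {
            last_capacity = v.capacity();
            std::cout << "reallocated: capacity = " << last_capacity << '\n';
        }
    }
}
```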
So, it’s possible to have variable-length bignums instead of fixed-width `short int`, `int`, `long int` and `long long int`, but it would be inefficient to allocate and use them. Additionally, all mainstream CPUs are designed to do arithmetic on fixed-width registers, and none have instructions that directly operate on some kind of variable-length bignum. Those would need to be implemented in software, much more slowly.
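For a sense of the software cost, here is a minimal sketch of bignum addition over base-2^32 “digits” (the representation and the `add` function are illustrative, not any particular library’s API): where a fixed-width add is one machine instruction, this is a whole loop with explicit carry propagation.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// A bignum as a variable-length vector of 32-bit digits, least
// significant first, so its length depends on its value.
using Bignum = std::vector<std::uint32_t>;

Bignum add(const Bignum& a, const Bignum& b) {
    Bignum sum;
    std::uint64_t carry = 0;
    for (std::size_t i = 0; i < a.size() || i < b.size() || carry; ++i) {
        std::uint64_t total = carry;
        if (i < a.size()) total += a[i];
        if (i < b.size()) total += b[i];
        sum.push_back(static_cast<std::uint32_t>(total)); // keep low 32 bits
        carry = total >> 32;                              // carry into next digit
    }
    return sum;
}
```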
In the real world, most (but not all) programmers have decided that the benefits of UTF-8 encoding, especially compatibility, are important, and that we so rarely care about anything other than scanning a string from front to back or copying blocks of memory that the drawbacks of variable width are acceptable. We could use packed, variable-width elements similar to UTF-8 for other things. But we very rarely do, and they aren’t in the standard library.