What are the problems of a zero-terminated string that length-prefixed strings overcome?
I was reading the book Write Great Code vol. 1 and I had that question in mind.<
One problem is that with zero-terminated strings you have to keep finding the end of the string repeatedly. The classic example where this is inefficient is concatenating into a buffer:
char buf[1024] = "first";
strcat(buf, "second");
strcat(buf, "third");
strcat(buf, "fourth");
On every call to strcat
the program has to start from the beginning of the string and find the terminator to know where to start appending. This means the function spends more and more time finding the place to append as the string grows longer.
With a length-prefixed string the equivalent of the strcat
function would know where the end is immediately, and would just update the length after appending to it.
There are pros and cons to each way of representing strings and whether they cause problems for you depend on what you are doing with strings, and which operations need to be efficient. The problem described above can be overcome by manually keeping track of the end of the string as it grows, so by changing the code you can avoid the performance cost.
One problem is that you can not store null characters (value zero) in a zero terminated string. This makes it impossible to store some character encodings as well as encrypted data.
Length-prefixed strings do not suffer that limitation.
It is best summarized in The Most Expensive One-byte Mistake by Poul-Henning Kamp.
Security: Marco A. already hit this pretty hard. Over and under-running string buffers is still a major route for attacks by hackers.
Compiler Development Costs: Big costs are associated with optimizing compilers for null terminating strings that would have been easier with the address and length format.
Hardware Development Costs: Hardware development costs are also large for string specific instructions associated with null terminating strings.
What are the problems of a zero-terminated string that length-prefixed strings overcome?
None whatsoever.
It's just eye candy.
Length-prefixed strings have, as part of their structure, information on how long the string is. If you want to do the same with zero-terminated strings you can use a helper variable;
lpstring = "foobar"; // saves '6' somewhere "inside" lpstring
ztstring = "foobar";
ztlength = 6; // saves '6' in a helper variable
Lots of C library functions work with zero-terminated strings and cannot use anything past the '\0'
byte. That's an issue with the functions themselves, not the string structure. If you need functions which deal with zero-terminated strings with embedded zeroes, write your own.
A few more bonus features that can be implemented with length-prefixed strings:
It's possible to have multiple styles of length prefix, identifiable through one or more bits of the first byte identified by the string pointer/reference. In exchange for a little extra time determining string length, one could e.g. use a single-byte prefix for short strings and longer prefixes for longer strings. If one uses a lot of 1-3 byte strings that could save more than 50% on overall memory consumption for such strings compared with using a fixed four-byte prefix; such a format could also accommodate strings whose length exceeded the range of 32-bit integers.
One may store variable-length strings within bounds-checked buffers at a cost of only one or two bits in the length prefix. The number N combined with the other bits would indicate one of three things:
An N-byte string
(Optional) An N-byte buffer holding a zero-length string
An N-byte buffer which, if its last byte B is less than 248, holds a string of length N-B-1; if the 248 or more, the preceding B-247 bytes would store the difference between the buffer size and the string length. Note that if the length of the string is precisely N-1, the string will be followed by a NUL byte, and if it's less than that the byte following the string will be unused and could be set to NUL.
Using such an approach, one would need to initialize strong buffers before use (to indicate their length), but would then no longer need to pass the length of a string buffer to a routine that was going to store data there.
One may use certain prefix values to indicate various special things. For example, one may have a prefix that indicates that it is not followed by a string, but rather by a string-data pointer and two integers giving buffer size and current length. If methods that operate on strings call a method to get the data pointer, buffer size, and length, one may pass such a method a reference to a portion of a string cheaply provided that the string itself will outlive the method call.
One may extend the above feature with a bit to indicate that the string data is in a region that was generated by malloc
and may be resized if needed; additionally, one could safely have methods that sometimes return a dynamically-generated string allocated on the heap, and sometimes return an immutable static string, and have the recipient perform a "free this string if it isn't static".
I don't know if any prefixed-string implementations implement all those bonus features, but they can all be accommodated for very little cost in storage space, relatively little cost in code, and less cost in time than would be required to use NUL-terminated strings whose length was neither known nor short.
First a clarification: C++ strings (i.e. std::string
) aren't weren't required to end with zero until C++11. They always provided access to a zero-terminated C string though.
C-style strings end with a 0 character for historical reasons.
The problems you're referring to are mainly bound to security issues: zero ended strings need to have a zero terminator. If they lack it (for whatever reason), the string's length becomes unreliable and they can lead to buffer overrun problems (which a malicious attacker can exploit by writing arbitrary data in places where it shouldn't be.. DEP helps in mitigating these issues but it's off-topic here).