Question
I'm writing a language interpreter in C, and my string type contains a length attribute, like so:
struct String
{
char* characters;
size_t length;
};
Because of this, I have to spend a lot of time in my interpreter handling this kind of string manually since C doesn't include built-in support for it. I've considered switching to simple null-terminated strings just to comply with the underlying C, but there seem to be a lot of reasons not to:
Bounds-checking is built-in if you use "length" instead of looking for a null.
You have to traverse the entire string to find its length.
You have to do extra stuff to handle a null character in the middle of a null-terminated string.
Null-terminated strings deal poorly with Unicode.
Non-null-terminated strings can intern more, i.e. the characters for "Hello, world" and "Hello" can be stored in the same place, just with different lengths. This can't be done with null-terminated strings.
String slice (note: strings are immutable in my language). Obviously the second is slower (and more error-prone: think about adding error-checking of begin and end to both functions).
struct String slice(struct String in, size_t begin, size_t end)
{
struct String out;
out.characters = in.characters + begin;
out.length = end - begin;
return out;
}
char* slice(char* in, size_t begin, size_t end)
{
char* out = malloc(end - begin + 1);
for(size_t i = 0; i < end - begin; i++)
out[i] = in[i + begin];
out[end - begin] = '\0';
return out;
}
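On the error-checking remark above, here is a minimal sketch of what a checked version of the length-based slice might look like. The name slice_checked, the bool return and the out parameter are assumptions made for illustration, and the struct uses const char* (a small deviation from the question's struct) because the strings are described as immutable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct String
{
    const char* characters;
    size_t length;
};

/* Hypothetical checked variant of slice(): writes the result to *out and
   returns false instead of producing an out-of-bounds slice. */
bool slice_checked(struct String in, size_t begin, size_t end, struct String* out)
{
    assert(out != NULL);                 /* programming error, not user input */
    if (begin > end || end > in.length)  /* both checks are O(1) */
        return false;
    out->characters = in.characters + begin;  /* shares storage, no copy */
    out->length = end - begin;
    return true;
}
```

The null-terminated version would have to call strlen (or scan up to end) before it could perform the same check, since the length is not stored anywhere.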
After all this, my thinking is no longer about whether I should use null-terminated strings: I'm thinking about why C uses them!
So my question is: are there any benefits to null-termination that I'm missing?
Answer 1:
The usual solution is to do both - keep the length and maintain the null terminator. It's not much extra work and means that you are always ready to pass the string to any function.
Null-terminated strings are often a drain on performance, for the obvious reason that the time taken to discover the length depends on the length. On the plus side, they are the standard way of representing strings in C, so you have little choice but to support them if you want to use most C libraries.
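A minimal sketch of the "do both" approach, assuming a hypothetical constructor named string_from_bytes: the length is stored, and the buffer always carries one extra '\0' so it can be handed to C library functions directly.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct String
{
    char* characters;   /* always has a '\0' at characters[length] */
    size_t length;      /* does not count the terminator */
};

/* Hypothetical constructor: copies `length` bytes and appends a terminator.
   On allocation failure the result has characters == NULL. */
struct String string_from_bytes(const char* bytes, size_t length)
{
    assert(bytes != NULL);
    struct String out = { NULL, 0 };
    char* buf = malloc(length + 1);     /* +1 for the terminator */
    if (buf == NULL)
        return out;
    memcpy(buf, bytes, length);
    buf[length] = '\0';
    out.characters = buf;
    out.length = length;
    return out;
}
```

One trade-off worth noting: once every string must end in '\0', a slice can no longer share storage with its parent unless it happens to be a tail of it.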
Answer 2:
From Joel's Back to Basics:
Why do C strings work this way? It's because the PDP-7 microprocessor, on which UNIX and the C programming language were invented, had an ASCIZ string type. ASCIZ meant "ASCII with a Z (zero) at the end."
Is this the only way to store strings? No, in fact, it's one of the worst ways to store strings. For non-trivial programs, APIs, operating systems, class libraries, you should avoid ASCIZ strings like the plague.
Answer 3:
One benefit is that with null-termination any tail of a null-terminated string is also a null-terminated string. If you need to pass a substring starting at the Nth character (provided there's no buffer overrun) into some string-handling function, no problem: just pass the offset address. When the size is stored in some other way, you would need to construct a new string.
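As a small illustration of the tail property, assuming a hypothetical helper named tail: the result is just the original pointer plus an offset, and it can be passed straight to standard functions such as strlen or strcmp.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns the tail of `s` starting at character `n`.
   The caller must guarantee n <= strlen(s); nothing is copied or allocated. */
const char* tail(const char* s, size_t n)
{
    assert(n <= strlen(s));   /* guards against running past the terminator */
    return s + n;
}
```

For example, tail("Hello, world", 7) points at "world" inside the same buffer.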
Answer 4:
One advantage of nul-terminated strings is that if you are walking through a string character-by-character, you only need to keep a single pointer to address the string:
while (*s)
{
*s = toupper(*s);
s++;
}
whereas for strings without sentinels, you need to keep two bits of state around: either a pointer and index:
while (i < s.length)
{
s.data[i] = toupper(s.data[i]);
i++;
}
...or a current pointer and a limit:
s_end = s + length;
while (s < s_end)
{
*s = toupper(*s);
s++;
}
When CPU registers were a scarce resource (and compilers were worse at allocating them), this was important. Now, not so much.
Answer 5:
Slightly offtopic, but there's a more efficient way to do length-prefixed strings than the way you describe. Create a struct like this (valid in C99 and up):
struct String
{
size_t length;
char characters[];
};
This creates a struct that has the length at the start, with the 'characters' element usable as a char* just as you would with your current struct. The difference, however, is that you can allocate only a single item on the heap for each string, instead of two. Allocate your strings like this:
mystr = malloc(sizeof(struct String) + strlen(cstring));
I.e., the size of the struct (which is just the size_t) plus enough space to put the actual string after it.
If you don't want to use C99, you can also do this with "char characters[1]" and subtract 1 from the length of the string to allocate.
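A fuller sketch of the single-allocation layout, written with the standard C99 flexible array member spelling (char characters[];). The constructor name string_new is an assumption for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct String
{
    size_t length;
    char characters[];   /* C99 flexible array member; must be last */
};

/* Hypothetical constructor: one malloc holds both header and characters. */
struct String* string_new(const char* cstring)
{
    assert(cstring != NULL);
    size_t length = strlen(cstring);
    struct String* s = malloc(sizeof *s + length);
    if (s == NULL)
        return NULL;
    s->length = length;
    memcpy(s->characters, cstring, length);  /* no terminator is stored */
    return s;
}
```

A single free(s) releases the whole string. The cost is that the characters live next to the length, so two strings cannot point into the same character storage the way the pointer-plus-length struct in the question can.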
Answer 6:
Lengths have their problems too.
The length takes extra storage (not such an issue now, but a big factor 30 years ago).
Every time you alter a string you have to update the length, so you get reduced performance across the board.
With a NUL-terminated string you can still use a length or store a pointer to the last character, so if you are doing lots of string manipulations, you can still equal the performance of string-with-length.
NUL-terminated strings are much simpler: the NUL terminator is just a convention used by functions like strcat to determine the end of the string. So you can store them in a regular char array rather than having to use a struct.
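The point about keeping a pointer to the end can be sketched as follows. The helper append_at is hypothetical; it assumes the caller tracks the position of the current terminator, so repeated appends do not rescan the string from the start the way repeated strcat calls would.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: appends `src` at `end` (the current terminator of a
   NUL-terminated buffer) and returns the new end. `limit` is one past the
   last byte of the buffer. Returns NULL, leaving the buffer unchanged,
   if `src` does not fit. */
char* append_at(char* end, const char* limit, const char* src)
{
    assert(end < limit && *end == '\0');
    size_t n = strlen(src);
    if (n >= (size_t)(limit - end))   /* need n bytes plus the terminator */
        return NULL;
    memcpy(end, src, n + 1);          /* copies the terminator too */
    return end + n;
}
```

Typical use would start with char buf[16] = ""; char* end = buf; and then call end = append_at(end, buf + sizeof buf, "Hello"); for each piece, checking for NULL.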
Answer 7:
Just throwing out some hypotheticals:
- there's no way to get a "wrong" implementation of null terminated strings. A standardized struct however could have vendor-specific implementations.
- no structs are required. Null terminated strings are "built-in" so to speak, by virtue of being a special case of a char*.
Answer 8:
Although I prefer the array + len method in most cases, there are valid reasons for using null-terminated.
Take a 32-bit system, where a pointer and a size_t are 4 bytes each.
To store a 7-byte string:
char * + size_t + 8 bytes = 16 bytes
To store a 7-byte null-terminated string:
char * + 8 bytes = 12 bytes
Null-terminated arrays don't need to be immutable like your strings do. I can happily truncate a C string by simply placing a null char in it. In your code, you would need to create a new string, which involves allocating memory.
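A minimal sketch of in-place truncation, assuming a hypothetical helper named truncate_at. It only works on writable storage.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: shortens a writable C string to at most `n`
   characters by overwriting one byte; no allocation takes place. */
void truncate_at(char* s, size_t n)
{
    assert(s != NULL);
    if (n < strlen(s))
        s[n] = '\0';
}
```

The argument must be a modifiable array such as char buf[] = "Hello, world"; passing a string literal would be undefined behavior, because literals must not be written to.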
Depending on how the strings are used, your strings may never be able to match the performance possible with C strings.
Answer 9:
You're absolutely right that 0-termination is a poor method with respect to type checking, and with respect to performance for some operations. The answers on this page already summarize the origins and uses for it.
I liked the way Delphi stored strings. I believe it maintains a length/maxlength just before the (variable-length) string. This way the strings can still be null-terminated for compatibility.
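The general idea of a header stored in front of a null-terminated buffer can be sketched in C as below. This is not Delphi's actual layout; the struct Header and the hstr_* names are hypothetical and exist only to illustrate the technique.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical header stored immediately before the characters. */
struct Header
{
    size_t length;     /* characters in use, excluding the terminator */
    size_t capacity;   /* characters available, excluding the terminator */
};

/* Returns a pointer to the characters; the header sits just before it. */
char* hstr_new(const char* cstring, size_t capacity)
{
    assert(cstring != NULL);
    size_t length = strlen(cstring);
    if (capacity < length)
        capacity = length;
    struct Header* h = malloc(sizeof *h + capacity + 1);
    if (h == NULL)
        return NULL;
    h->length = length;
    h->capacity = capacity;
    char* chars = (char*)(h + 1);        /* first byte after the header */
    memcpy(chars, cstring, length + 1);  /* keep the terminator */
    return chars;
}

/* O(1) length lookup: step back from the characters to the header. */
size_t hstr_length(const char* chars)
{
    return ((const struct Header*)chars - 1)->length;
}

void hstr_free(char* chars)
{
    if (chars != NULL)
        free((struct Header*)chars - 1);
}
```

Because hstr_new returns a plain char* that is null-terminated, the result can be passed to any C library function, while hstr_length avoids the scan that strlen would need. Such a pointer must only be released with hstr_free, never with free directly.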
My concerns with your mechanism:
- an additional pointer
- immutability is in the core parts of your language; normally string types are not immutable, so if you ever reconsider, it'll be tough. You'd need to implement a 'create copy on change' mechanism
- use of malloc (hardly efficient, but may be included here just for ease?)
Good luck; writing your own interpreter can be very educational in understanding mainly the grammar and syntax of programming languages! (At least, it was for me.)
Answer 10:
I think the main reason is that the standard says nothing concrete about the size of any type other than char. But sizeof(char) = 1, and that is definitely not enough for a string size.
Source: https://stackoverflow.com/questions/1253291/why-null-terminated-strings-or-null-terminated-vs-characters-length-storage