I'm writing a language interpreter in C, and my string type contains a length attribute, like so:
struct String
{
char* charac
You're absolutely right that 0-termination is poor with respect to type checking and, for some operations, performance. The answers on this page already summarize its origins and uses.
I liked the way Delphi stored strings. I believe it maintains a length/maxlength before the (variable-length) string. This way the strings can still be null-terminated for compatibility.
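A minimal sketch of that general idea in C, assuming a hidden header placed directly in front of the character data. This is not Delphi's actual layout; the names `Header`, `hstr_new`, `hstr_length`, and `hstr_free` are hypothetical and exist only for illustration.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical header stored immediately before the characters. */
struct Header {
    size_t length;    /* characters in use, not counting the terminator */
    size_t capacity;  /* characters that fit, not counting the terminator */
};

/* Allocate header + characters + terminator in one block and return a
   pointer to the characters, so the result can be passed to C functions. */
static char *hstr_new(const char *src)
{
    size_t n = strlen(src);
    struct Header *h = malloc(sizeof *h + n + 1);
    if (h == NULL)
        return NULL;
    h->length = n;
    h->capacity = n;
    char *chars = (char *)(h + 1);
    memcpy(chars, src, n + 1);  /* copies the terminator too */
    return chars;
}

/* Step back over the header to read the stored length in O(1). */
static size_t hstr_length(const char *chars)
{
    return ((const struct Header *)chars - 1)->length;
}

/* Free the whole block, starting at the header. */
static void hstr_free(char *chars)
{
    if (chars != NULL)
        free((struct Header *)chars - 1);
}
```

The pointer returned by `hstr_new` can be handed to functions such as `strlen` or `printf` unchanged, while `hstr_length` avoids the scan. Only pointers obtained from `hstr_new` may be passed to `hstr_length` and `hstr_free`.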
My concerns with your mechanism:
- an additional pointer
- immutability is in the core parts of your language; normally string types are not immutable, so if you ever reconsider then it'll be tough. You'd need to implement a 'create copy on change' mechanism
- use of malloc (hardly efficient, but may be included here just for ease?)
Good luck; writing your own interpreter can be very educational, mainly for understanding the grammar and syntax of programming languages! (At least, it was for me.)
The usual solution is to do both - keep the length and maintain the null terminator. It's not much extra work and means that you are always ready to pass the string to any function.
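A minimal sketch of doing both, assuming a separately allocated buffer. The type `LString` and the helpers `lstring_init` and `lstring_free` are hypothetical names, not taken from the question.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical type: explicit length plus a trailing NUL. */
struct LString {
    size_t length;  /* number of characters, not counting the terminator */
    char *data;     /* holds length + 1 bytes; data[length] is always '\0' */
};

/* Copy the first n bytes of src and append a terminator.
   Returns 0 on success, -1 if allocation fails. */
static int lstring_init(struct LString *s, const char *src, size_t n)
{
    s->data = malloc(n + 1);
    if (s->data == NULL) {
        s->length = 0;
        return -1;
    }
    memcpy(s->data, src, n);
    s->data[n] = '\0';
    s->length = n;
    return 0;
}

/* Release the buffer and reset the fields. */
static void lstring_free(struct LString *s)
{
    free(s->data);
    s->data = NULL;
    s->length = 0;
}
```

The cost is one extra byte per string; in return, `s.data` can go straight to any function that expects a C string, and `s.length` is available without a scan.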
Null-terminated strings are often a drain on performance, for the obvious reason that the time taken to discover the length depends on the length. On the plus side, they are the standard way of representing strings in C, so you have little choice but to support them if you want to use most C libraries.
One advantage of nul-terminated strings is that if you are walking through a string character-by-character, you only need to keep a single pointer to address the string:
while (*s)
{
*s = toupper((unsigned char)*s);
s++;
}
whereas for strings without sentinels, you need to keep two bits of state around: either a pointer and an index:
while (i < s.length)
{
s.data[i] = toupper((unsigned char)s.data[i]);
i++;
}
...or a current pointer and a limit:
s_end = s + length;
while (s < s_end)
{
*s = toupper((unsigned char)*s);
s++;
}
When CPU registers were a scarce resource (and compilers were worse at allocating them), this was important. Now, not so much.
Although I prefer the array + len method in most cases, there are valid reasons for using null-terminated.
Take a 32-bit system, where a pointer and a size_t are 4 bytes each.
To store a 7 byte string:
char * + size_t + 8 bytes = 16 bytes
To store a 7 byte null-term string:
char * + 8 bytes = 12 bytes
Null-terminated arrays don't need to be immutable like your strings do. I can happily truncate a C string by simply placing a null char. In your code, you would need to create a new string, which involves allocating memory.
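As a rough illustration of that in-place truncation, here is a small sketch. The helper name `cstr_truncate` is hypothetical, and it assumes the buffer is writable (an array or heap memory, not a string literal).

```c
#include <string.h>

/* Hypothetical helper: shorten a mutable C string to at most n characters.
   No allocation or copying happens; one byte is overwritten. */
static void cstr_truncate(char *s, size_t n)
{
    if (strlen(s) > n)
        s[n] = '\0';
}
```

Calling `cstr_truncate(buf, 5)` on a buffer holding "interpreter" leaves "inter" in the same storage. The bytes after the new terminator are still allocated; they are simply ignored by string functions.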
Depending on how the strings are used, your strings may never be able to match the performance possible with C strings.
Slightly off-topic, but there's a more efficient way to do length-prefixed strings than the way you describe. Create a struct like this, using a flexible array member (valid in C99 and up; the older "char characters[0]" spelling is a GNU extension):
struct String
{
size_t length;
char characters[];
};
This creates a struct that has the length at the start, with the 'characters' element usable as a char* just as you would with your current struct. The difference, however, is that you can allocate only a single item on the heap for each string, instead of two. Allocate your strings like this:
mystr = malloc(sizeof(struct String) + strlen(cstring));
I.e., the size of the struct (which is just the size_t) plus enough space to put the actual string after it.
If you don't want to use C99, you can also do this with "char characters[1]" and subtract 1 from the length of the string to allocate.
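A fuller sketch of the single-allocation approach, assuming C99. The names `FlexString` and `flexstring_new` are hypothetical, and reserving one extra byte for a NUL terminator is an added assumption so that the characters can also be passed to ordinary C string functions.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical type: length header followed directly by the characters. */
struct FlexString {
    size_t length;
    char characters[];  /* C99 flexible array member */
};

/* Allocate header and characters in one block and copy cstring into it.
   Returns NULL if allocation fails. */
static struct FlexString *flexstring_new(const char *cstring)
{
    size_t n = strlen(cstring);
    /* sizeof *s covers the header; n + 1 covers the text and terminator. */
    struct FlexString *s = malloc(sizeof *s + n + 1);
    if (s == NULL)
        return NULL;
    s->length = n;
    memcpy(s->characters, cstring, n + 1);
    return s;
}
```

A single `free(s)` releases both the length and the characters, since they live in the same block.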
Lengths have their problems too.
The length takes extra storage (not such an issue now, but a big factor 30 years ago).
Every time you alter a string you have to update the length, so you get reduced performance across the board.
With a NUL-terminated string you can still use a length or store a pointer to the last character, so if you are doing lots of string manipulations, you can still equal the performance of string-with-length.
NUL-terminated strings are much simpler - The NUL terminator is just a convention used by methods like strcat to determine the end of the string. So you can store them in a regular char array rather than having to use a struct.
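The point about keeping a pointer to the end of a NUL-terminated string can be sketched as follows. The helper `append_at` is a hypothetical name, and the sketch assumes the caller has already made sure the buffer has room for the appended text and its terminator.

```c
#include <string.h>

/* Hypothetical helper: copy src to the position of the current terminator
   and return a pointer to the new terminator. Unlike strcat, it does not
   rescan the destination from the start on every call. */
static char *append_at(char *end, const char *src)
{
    size_t n = strlen(src);
    memcpy(end, src, n + 1);  /* includes the terminator */
    return end + n;
}
```

Starting with `char buf[32] = ""; char *end = buf;`, repeated calls of the form `end = append_at(end, "foo");` build the string in time proportional to the text appended, and `end - buf` gives the current length at any point.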