I have been looking at LLVM lately, and I find it to be quite an interesting architecture. However, looking through the tutorial and the reference material, I can't see any examples of how I might implement a string data type.
There is a lot of documentation about integers, reals, and other number types, and even arrays, functions and structures, but AFAIK nothing about strings. Would I have to add a new data type to the backend? Is there a way to use built-in data types? Any insight would be appreciated.
What is a string? An array of characters.
What is a character? An integer.
So while I'm no LLVM expert by any means, I would guess that if, e.g., you wanted to represent some 8-bit character set, you'd use an array of i8 (8-bit integers), or a pointer to i8. And indeed, if we have a simple hello world C program:
    #include <stdio.h>

    int main() {
        puts("Hello, world!");
        return 0;
    }
And we compile it using llvm-gcc and dump the generated LLVM assembly:
    $ llvm-gcc -S -emit-llvm hello.c
    $ cat hello.s
    ; ModuleID = 'hello.c'
    target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128"
    target triple = "x86_64-linux-gnu"
    @.str = internal constant [14 x i8] c"Hello, world!\00"         ; <[14 x i8]*> [#uses=1]

    define i32 @main() {
    entry:
            %retval = alloca i32            ; <i32*> [#uses=2]
            %tmp = alloca i32               ; <i32*> [#uses=2]
            %"alloca point" = bitcast i32 0 to i32          ; <i32> [#uses=0]
            %tmp1 = getelementptr [14 x i8]* @.str, i32 0, i64 0            ; <i8*> [#uses=1]
            %tmp2 = call i32 @puts( i8* %tmp1 ) nounwind            ; <i32> [#uses=0]
            store i32 0, i32* %tmp, align 4
            %tmp3 = load i32* %tmp, align 4         ; <i32> [#uses=1]
            store i32 %tmp3, i32* %retval, align 4
            br label %return

    return:         ; preds = %entry
            %retval4 = load i32* %retval            ; <i32> [#uses=1]
            ret i32 %retval4
    }

    declare i32 @puts(i8*)
Notice the reference to the puts function declared at the end of the file. In C, puts is
int puts(const char *s)
In LLVM, it is
i32 @puts(i8*)
The correspondence should be clear.
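To make the "array of small integers" point concrete from the C side, here is a minimal sketch. The names `hello` and `print_as_integers` are made up for illustration, and the byte values mentioned assume an ASCII-compatible platform with 8-bit `char`.

```c
#include <stddef.h>
#include <stdio.h>

/* Same literal as in hello.c. sizeof counts the trailing '\0',
 * which is why the IR declares the global as [14 x i8]. */
static const char hello[] = "Hello, world!";

/* Hypothetical helper: print every byte of the array as a plain integer. */
static void print_as_integers(const char *s, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        printf("%d ", (unsigned char)s[i]);
    printf("\n");
}
```

Calling `print_as_integers(hello, sizeof hello)` from `main` prints 14 numbers: 13 character codes followed by the terminating 0, matching the `c"Hello, world!\00"` initializer.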
As an aside, the generated LLVM is very verbose here because I compiled without optimizations. If you turn those on, the unnecessary instructions disappear:
    $ llvm-gcc -O2 -S -emit-llvm hello.c
    $ cat hello.s
    ; ModuleID = 'hello.c'
    target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128"
    target triple = "x86_64-linux-gnu"
    @.str = internal constant [14 x i8] c"Hello, world!\00"         ; <[14 x i8]*> [#uses=1]

    define i32 @main() nounwind {
    entry:
            %tmp2 = tail call i32 @puts( i8* getelementptr ([14 x i8]* @.str, i32 0, i64 0) ) nounwind              ; <i32> [#uses=0]
            ret i32 0
    }

    declare i32 @puts(i8*)
[To follow up on other answers which explain what strings are, here is some implementation help]
Using the C interface, the calls you'll want are something like:
    LLVMValueRef llvmGenLocalStringVar(const char* data, int len)
    {
        LLVMValueRef glob = LLVMAddGlobal(mod, LLVMArrayType(LLVMInt8Type(), len), "string");

        // set as internal linkage and constant
        LLVMSetLinkage(glob, LLVMInternalLinkage);
        LLVMSetGlobalConstant(glob, TRUE);

        // Initialize with string:
        LLVMSetInitializer(glob, LLVMConstString(data, len, TRUE));

        return glob;
    }
Think about how a string is represented in common languages:
- C: a pointer to a character. You don't have to do anything special.
- C++: string is a complex object with a constructor, destructor, and copy constructor. On the inside, it usually holds essentially a C string.
- Java/C#/...: a string is a complex object holding an array of characters.
LLVM's name is very self-explanatory: it really is "low level". You have to implement strings however you want them to be. It would be silly for LLVM to force anyone into a specific implementation.
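As one illustration of "however you want them to be", here is a rough sketch in C of a length-plus-buffer string, the kind of layout a higher-level language's runtime might choose. Everything here (`Str`, `str_new`, `str_concat`, `str_free`) is hypothetical and not part of LLVM. On a 64-bit target, a front end could describe such a value to LLVM as a structure type along the lines of `{ i64, i8* }`.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical representation: explicit length plus a heap buffer. */
typedef struct {
    size_t len;   /* number of bytes stored */
    char  *data;  /* heap-allocated bytes, NOT NUL-terminated */
} Str;

/* Copy len bytes from src. On allocation failure, data is NULL and len is 0. */
static Str str_new(const char *src, size_t len)
{
    Str s = { 0, NULL };
    s.data = malloc(len > 0 ? len : 1);  /* never request 0 bytes */
    if (s.data != NULL) {
        if (len > 0)
            memcpy(s.data, src, len);
        s.len = len;
    }
    return s;
}

/* Build a new string holding the bytes of a followed by the bytes of b. */
static Str str_concat(Str a, Str b)
{
    Str r = { 0, NULL };
    r.data = malloc(a.len + b.len > 0 ? a.len + b.len : 1);
    if (r.data != NULL) {
        if (a.len > 0)
            memcpy(r.data, a.data, a.len);
        if (b.len > 0)
            memcpy(r.data + a.len, b.data, b.len);
        r.len = a.len + b.len;
    }
    return r;
}

/* Release the buffer and reset the string to the empty state. */
static void str_free(Str *s)
{
    free(s->data);
    s->data = NULL;
    s->len = 0;
}
```

Because the length is stored explicitly, such a string can contain embedded zero bytes and its length is known without scanning, at the cost of needing a conversion step before handing it to C functions like `puts` that expect a terminating NUL.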