How can I implement a string data type in LLVM?

Anonymous (unverified), submitted 2019-12-03 02:45:02

Question:

I have been looking at LLVM lately, and I find it to be quite an interesting architecture. However, looking through the tutorial and the reference material, I can't see any examples of how I might implement a string data type.

There is a lot of documentation about integers, reals, and other number types, and even arrays, functions and structures, but AFAIK nothing about strings. Would I have to add a new data type to the backend? Is there a way to use built-in data types? Any insight would be appreciated.

Answer 1:

What is a string? An array of characters.

What is a character? An integer.

So while I'm no LLVM expert by any means, I would guess that if, e.g., you wanted to represent some 8-bit character set, you'd use an array of i8 (8-bit integers), or a pointer to i8. And indeed, if we have a simple hello world C program:

    #include <stdio.h>

    int main() {
            puts("Hello, world!");
            return 0;
    }

And we compile it using llvm-gcc and dump the generated LLVM assembly:

    $ llvm-gcc -S -emit-llvm hello.c
    $ cat hello.s
    ; ModuleID = 'hello.c'
    target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128"
    target triple = "x86_64-linux-gnu"
    @.str = internal constant [14 x i8] c"Hello, world!\00"         ; <[14 x i8]*> [#uses=1]

    define i32 @main() {
    entry:
            %retval = alloca i32            ; <i32*> [#uses=2]
            %tmp = alloca i32               ; <i32*> [#uses=2]
            %"alloca point" = bitcast i32 0 to i32          ; <i32> [#uses=0]
            %tmp1 = getelementptr [14 x i8]* @.str, i32 0, i64 0            ; <i8*> [#uses=1]
            %tmp2 = call i32 @puts( i8* %tmp1 ) nounwind            ; <i32> [#uses=0]
            store i32 0, i32* %tmp, align 4
            %tmp3 = load i32* %tmp, align 4         ; <i32> [#uses=1]
            store i32 %tmp3, i32* %retval, align 4
            br label %return

    return:         ; preds = %entry
            %retval4 = load i32* %retval            ; <i32> [#uses=1]
            ret i32 %retval4
    }

    declare i32 @puts(i8*)

Notice the reference to the puts function declared at the end of the file. In C, puts is

int puts(const char *s) 

In LLVM, it is

i32 @puts(i8*) 

The correspondence should be clear.
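To make the `[14 x i8]` in the IR above concrete, here is a minimal C sketch that measures the same literal. The names `hello`, `hello_array_len`, `hello_visible_len`, and `char_bits` are made up for this illustration. It assumes a platform where `char` is 8 bits wide, which holds for the x86_64 target shown above but is not guaranteed by the C standard.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* The same literal the compiler emitted as @.str in the IR above. */
static const char hello[] = "Hello, world!";

/* 13 visible characters plus the terminating '\0' give 14 elements,
   which is why the global has type [14 x i8]. */
static_assert(sizeof(hello) == 14, "literal should occupy 14 chars");

/* Number of array elements, including the terminating '\0'. */
size_t hello_array_len(void) {
    return sizeof(hello) / sizeof(hello[0]);
}

/* Number of characters before the terminator, as strlen reports it. */
size_t hello_visible_len(void) {
    return strlen(hello);
}

/* Width of one element in bits (assumed to be 8 here, matching i8). */
int char_bits(void) {
    return CHAR_BIT;
}
```

The extra element is the `\00` at the end of `c"Hello, world!\00"`. It is there because `puts` expects a NUL-terminated `char *`.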

As an aside, the generated LLVM is very verbose here because I compiled without optimizations. If you turn those on, the unnecessary instructions disappear:

    $ llvm-gcc -O2 -S -emit-llvm hello.c
    $ cat hello.s
    ; ModuleID = 'hello.c'
    target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128"
    target triple = "x86_64-linux-gnu"
    @.str = internal constant [14 x i8] c"Hello, world!\00"         ; <[14 x i8]*> [#uses=1]

    define i32 @main() nounwind  {
    entry:
            %tmp2 = tail call i32 @puts( i8* getelementptr ([14 x i8]* @.str, i32 0, i64 0) ) nounwind              ; <i32> [#uses=0]
            ret i32 0
    }

    declare i32 @puts(i8*)


Answer 2:

[To follow up on other answers which explain what strings are, here is some implementation help]

Using the C interface, the calls you'll want are something like:

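    #include <llvm-c/Core.h>

    // `mod` is assumed to be an LLVMModuleRef created elsewhere.
    LLVMValueRef llvmGenLocalStringVar(const char* data, int len) {
      LLVMValueRef glob = LLVMAddGlobal(mod, LLVMArrayType(LLVMInt8Type(), len), "string");

      // set as internal linkage and constant (LLVMBool is an int, so 1 means true)
      LLVMSetLinkage(glob, LLVMInternalLinkage);
      LLVMSetGlobalConstant(glob, 1);

      // Initialize with string. The last argument is DontNullTerminate, so
      // `len` must already count the trailing '\0' if a C string is wanted.
      LLVMSetInitializer(glob, LLVMConstString(data, len, 1));

      return glob;
    }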


Answer 3:

Think about how a string is represented in common languages:

  • C: a pointer to a character. You don't have to do anything special.
  • C++: string is a complex object with a constructor, destructor, and copy constructor. On the inside, it usually holds essentially a C string.
  • Java/C#/...: a string is a complex object holding an array of characters.

LLVM's name is very self-explanatory. It really is "low level". You have to implement strings however you want them to be. It would be silly for LLVM to force anyone into a specific implementation.
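As one illustration of "however you want them to be", here is a minimal C sketch of a length-plus-pointer string, a layout a front end might lower to something like `{ i64, i8* }` on a 64-bit target. The type `Str` and the helpers `str_from_cstr`, `str_slice`, and `str_eq` are hypothetical names invented for this sketch. They are not part of LLVM or of any particular language runtime.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical runtime representation: a byte count plus a pointer.
   The data does not have to be NUL-terminated. */
typedef struct {
    size_t len;        /* number of bytes in data */
    const char *data;  /* pointer to the first byte */
} Str;

/* Wrap a C string without copying; the caller keeps ownership. */
Str str_from_cstr(const char *s) {
    assert(s != NULL);
    Str out = { strlen(s), s };
    return out;
}

/* View the range [start, start + len) without copying, clamped to bounds. */
Str str_slice(Str s, size_t start, size_t len) {
    if (start > s.len) start = s.len;
    if (len > s.len - start) len = s.len - start;
    Str out = { len, s.data + start };
    return out;
}

/* Byte-wise equality; the lengths are compared first. */
bool str_eq(Str a, Str b) {
    return a.len == b.len && memcmp(a.data, b.data, a.len) == 0;
}
```

Storing the length makes slicing and length queries constant-time, at the cost of needing a copy (or a spare terminator byte) before handing the data to C functions such as `puts`. A C-style `i8*` makes the opposite trade.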


