How can I provide garbage collection for an interpreted language implemented in C?

前端 未结 3 581
后悔当初
后悔当初 2021-01-20 04:28

If I were to implement a garbage collected interpreted language in C, how can I go about providing precise (i.e. not conservative) garbage collection without writing my own

相关标签:
3条回答
  • 2021-01-20 05:11

    If you want a precise GC (not a conservative one, like Boehm's GC, which performs quite well in practice) you should track local pointer (to GC-ed data) variables, or else invoke the GC only with a nearly empty call stack when you are sure there are no such local variables (btw, the GCC compiler has such a mark&sweep garbage collector - with marking routines generated by some specialized gengtype C++ code generator; that GGC is invoked only between passes). Of course you should also track global (including static, or thread local) pointer (to GC-ed data) variables also.

    Alternatively, have some bytecode virtual machine (like OCaml or NekoVM have), then the local GC-ed variables are those in the stacks and/or registers of your bytecode VM, and you trigger your GC at specific and carefully chosen points of your VM interpreter. (See this explanation of Ocaml GC).

    You should read more about Garbage Collection techniques, see the GC handbook.

    If your GC is copying generational, you need to implement write barriers (to handle mutation of old data pointing to new zone). You could use my old Qish GC (which I don't maintain much anymore), or Ravenbrook's MPS, or write your own generational copying GC (this is not that hard in theory, but debugging GCs is a nightmare in practice, so it is a lot of work).

    You probably want to use some macro tricks (like my Qish does) to help keeping your local variables. See the Living in harmony with the garbage collector section of Ocaml documentation as an example (or look inside Qish).

    Notice that a generational copying GC is not friendly to handle in manually written C code (because you need to explicitly keep local pointers, and because you need a write barrier to remember when an old value is modified to have a pointer to the new generation). If you want to do that, your C code should be in A-normal form (you cannot code x=f(g(y),z); but you need to code temp=g(y); x=f(temp,z); and add temp as a local variable, assuming that x, y, z are local GC-ed variables and that both f and g return a GC-ed pointer). In practice it is much easier to generate the C code. See my MELT domain specific language (to extend and customize GCC) as an example.

    If your language is genuinely multithreaded (several mutator threads allocating in parallel), then coding a GC becomes quite tricky. It might take several months of work (and is probably a nightmare to debug).

    Actually, I would today recommend using Boehm's GC (notice that it is multithread friendly). A naive mark&sweep handcoded GC would probably not be faster than Boehm's GC. And you won't be able (and I don't recommend) to use GGC, the garbage collector internal to GCC (which, IMNSHO, is not very good; it was a dirty hack design many years ago).

    BTW, you might consider customizing -e.g. with MELT- the GCC compiler (by adding some application specific __attribute__ or #pragma) to help your GC. With some work, you could generate some of the marking routines, etc. However, that approach might be quite painful (I really don't know). Notice that MELT (free software, GPLv3+) contains a copying generational GC whose old generation is the GGC heap, so you could at least look inside the code of melt-runtime.cc

    PS. I also recommend Queinnec's book: Lisp In Small Pieces; it has some interesting material about GC and their connection to programming languages, and it is a great book to read when you are implementing an interpreter. Scott's book on Programming Languages Pragmatics is also worth reading.

    0 讨论(0)
  • 2021-01-20 05:17

    For C programs, there are 2 options: the Boehm GC which replaces malloc (it is a conservative GC, so perhaps not exactly what you're looking for but it's either that or...), or write your own.

    But writing your own isn't that hard. Do the mark-sweep algorithm. The root set for marking will be your symbol table. And you'll need another table or linked-list to track all of the allocated memory that can be freed. When you sweep through the list of allocations, free anything without a mark.

    The actual coding will of course be more complicated because you have to iterate through these two kinds of data structures, but the algorithm itself is very simple. You can do it.


    A few years ago, I found myself on the same search and these were (and AFAIK still are) the results. Writing your own is tremendously rewarding and worthwhile.

    In practice, a great many other issues will arise as Basile's answer touches upon.

    If the garbage collector is called from deep in the call-stack (by an allocation routine that needs more memory, perhaps), then care must be taken about any allocations whose handles are still held in local variables of C functions in the call stack, and not saved-out to their symbol-table or database locations. In my postscript interpreter I dealt with this by using a temporary stack which all allocators pushed to. This stack was cleared by the main loop after all subroutines had returned, and it was considered part of the root set during marking. In my APL interpreter, I call the GC everytime around the mainloop. For the little programs in little languages, the speed issues are less vital than the more-dreaded memory leakage, at least among the circles which have influenced me.

    0 讨论(0)
  • 2021-01-20 05:31

    When implementing such a language, your interpreter needs to keep track of all objects in the program it's running, including knowledge of their types and what part of the data is a reference to other data. Then it's trivial for you to walk all the data and implement whatever sort of garbage collector you like. No bogus hacks like trying to determine where the C implementation's "heap"/"stack"/etc. are located or guessing at what might be a pointer is needed, because you're dealing exactly with the data whose structure you know.

    0 讨论(0)
提交回复
热议问题