If I were to implement a garbage collected interpreted language in C, how can I go about providing precise (i.e. not conservative) garbage collection without writing my own
If you want a precise GC (not a conservative one, like Boehm's GC, which performs quite well in practice) you should track local pointer (to GC-ed data) variables, or else invoke the GC only with a nearly empty call stack when you are sure there are no such local variables (btw, the GCC compiler has such a mark&sweep garbage collector - with marking routines generated by some specialized gengtype
C++ code generator; that GGC is invoked only between passes). Of course you should also track global (including static, or thread local) pointer (to GC-ed data) variables also.
Alternatively, have some bytecode virtual machine (like OCaml or NekoVM have), then the local GC-ed variables are those in the stacks and/or registers of your bytecode VM, and you trigger your GC at specific and carefully chosen points of your VM interpreter. (See this explanation of Ocaml GC).
You should read more about Garbage Collection techniques, see the GC handbook.
If your GC is copying generational, you need to implement write barriers (to handle mutation of old data pointing to new zone). You could use my old Qish GC (which I don't maintain much anymore), or Ravenbrook's MPS, or write your own generational copying GC (this is not that hard in theory, but debugging GCs is a nightmare in practice, so it is a lot of work).
You probably want to use some macro tricks (like my Qish does) to help keeping your local variables. See the Living in harmony with the garbage collector section of Ocaml documentation as an example (or look inside Qish).
Notice that a generational copying GC is not friendly to handle in manually written C code (because you need to explicitly keep local pointers, and because you need a write barrier to remember when an old value is modified to have a pointer to the new generation). If you want to do that, your C code should be in A-normal form (you cannot code x=f(g(y),z);
but you need to code temp=g(y); x=f(temp,z);
and add temp
as a local variable, assuming that x
, y
, z
are local GC-ed variables and that both f
and g
return a GC-ed pointer). In practice it is much easier to generate the C code. See my MELT domain specific language (to extend and customize GCC) as an example.
If your language is genuinely multithreaded (several mutator threads allocating in parallel), then coding a GC becomes quite tricky. It might take several months of work (and is probably a nightmare to debug).
Actually, I would today recommend using Boehm's GC (notice that it is multithread friendly). A naive mark&sweep handcoded GC would probably not be faster than Boehm's GC. And you won't be able (and I don't recommend) to use GGC, the garbage collector internal to GCC (which, IMNSHO, is not very good; it was a dirty hack design many years ago).
BTW, you might consider customizing -e.g. with MELT- the GCC compiler (by adding some application specific __attribute__
or #pragma
) to help your GC. With some work, you could generate some of the marking routines, etc. However, that approach might be quite painful (I really don't know). Notice that MELT (free software, GPLv3+) contains a copying generational GC whose old generation is the GGC heap, so you could at least look inside the code of melt-runtime.cc
PS. I also recommend Queinnec's book: Lisp In Small Pieces; it has some interesting material about GC and their connection to programming languages, and it is a great book to read when you are implementing an interpreter. Scott's book on Programming Languages Pragmatics is also worth reading.