Efficient incremental hash computation during program interpretation

Question

I'd like to write a recursively memoizing Scheme interpreter. At any point during evaluation, the interpreter should be able to detect when it receives as arguments a pair of expression and environment that it has previously seen.

Plain memoization of eval and apply is inefficient. Every call to eval/apply would require a hash table lookup, and both computing the hash and confirming a match would require walking the entire (possibly large) arguments.

For example, assume that the interpreter evaluates the program

(car (list A))

where A evaluates to a big object. When the interpreter evaluates the application (list A), it first evaluates list and A individually. Before it applies list to A, it looks up in its hash table whether it has seen this application before, walking the entire A object to compute a hash. Later on, when the memoizing interpreter applies car to the list containing A, it computes a hash for this list which again involves walking the entire A object.

Instead, I want to build an interpreter that incrementally builds up approximately unique hashes, avoiding recomputation where possible and providing a guarantee that collisions are unlikely.

For example, one could recursively extend each object that the interpreter operates on with the MD5 of its value, or, if it is a compound object, with the MD5 of its component hashes. An environment might store the hash for each of its variable/value entries, and the hash of the environment might be computed as a function of the individual hashes. Then, if an entry in the environment changes, it is not necessary to rewalk the entire environment to compute the new hash of the environment. Instead, only the hash of the changed variable/value pair needs to be recomputed and the global hash of the set of entry hashes needs to be updated.
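The scheme described above can be sketched in a few lines. This is only an illustration under assumptions that are not in the question: the host language is Python, values are immutable once built, and the names `Atom`, `Pair`, `Env`, and `md5_bytes` are hypothetical. The environment combines its entry hashes by addition modulo 2^128, which is one possible order-independent choice; it is weaker than MD5 itself against deliberately constructed collisions.

```python
import hashlib

MOD = 2 ** 128  # an MD5 digest is 128 bits wide


def md5_bytes(*parts: bytes) -> bytes:
    """MD5 over length-prefixed parts, so part boundaries stay unambiguous."""
    h = hashlib.md5()
    for part in parts:
        h.update(len(part).to_bytes(8, "big"))
        h.update(part)
    return h.digest()


class Atom:
    """Leaf value; its digest is computed once, from its printed form."""
    def __init__(self, value):
        self.value = value
        self.digest = md5_bytes(b"atom", repr(value).encode("utf-8"))


class Pair:
    """Compound value; its digest uses only the two cached child digests."""
    def __init__(self, car, cdr):
        self.car, self.cdr = car, cdr
        self.digest = md5_bytes(b"pair", car.digest, cdr.digest)


class Env:
    """Environment whose hash is the sum of its entry hashes (mod 2**128)."""
    def __init__(self):
        self.entries = {}
        self.combined = 0

    def _entry_hash(self, name, value) -> int:
        raw = md5_bytes(b"entry", name.encode("utf-8"), value.digest)
        return int.from_bytes(raw, "big")

    def set(self, name, value):
        # Remove the old entry's contribution, then add the new one.
        # No other entry is touched, so an update costs O(1) hash work.
        if name in self.entries:
            old = self._entry_hash(name, self.entries[name])
            self.combined = (self.combined - old) % MOD
        self.entries[name] = value
        self.combined = (self.combined + self._entry_hash(name, value)) % MOD
```

With this layout, building the list that contains a big object `A` hashes only 32 bytes of child digests rather than rewalking `A`, and two environments with the same bindings get the same `combined` value regardless of insertion order.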

Does there exist related work on incrementally building up approximately unique hashes, in particular in the context of recursive memoization and/or program evaluation?

Answer 1:

Note that if expressions are immutable (no self-modifying code allowed) then you can use EQ equality on them. If environments are immutable, you can treat them likewise. EQ equality is fast since you're just taking the bits from the machine pointer to be a hash.
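As a rough illustration of identity-based lookup, here is a minimal sketch in Python, where `id()` plays the role of the machine pointer. The names `IdentityMemo` and `MISSING` are hypothetical, and the sketch assumes expressions and environments are never mutated after creation.

```python
MISSING = object()  # sentinel, so that a cached value of None is still a hit


class IdentityMemo:
    """Memo table keyed by object identity, the analogue of EQ in Lisp."""

    def __init__(self):
        self._table = {}

    def lookup(self, expr, env):
        hit = self._table.get((id(expr), id(env)))
        return MISSING if hit is None else hit[2]

    def store(self, expr, env, value):
        # Keep expr and env referenced from the table: an id is only unique
        # while its object is alive, so dropping them could let a later
        # object reuse the same id and produce a false hit.
        self._table[(id(expr), id(env))] = (expr, env, value)
```

The lookup never inspects the contents of `expr` or `env`, so its cost does not depend on their size. The price is that structurally equal but distinct objects miss the cache, and that this particular table keeps every key alive.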

The problem, then, is assignments, which mutate environments and cause the values of expressions to change. If they are allowed, the question is how to deal with them.

One way would be to make a note of environments that contain destructive code in their lexical scopes and somehow annotate them so that the evaluator can recognize such "polluted environments" and not do the caching for them.
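One possible reading of this idea, sketched in Python: scan each scope's body for assignment forms when the scope is created, and refuse to cache anything evaluated under a marked scope. The representation (S-expressions as nested lists of strings), the restriction to `set!`, and the names `Scope`, `contains_assignment`, and `cacheable` are all assumptions made for the example.

```python
def contains_assignment(expr) -> bool:
    """True if the S-expression contains a (set! ...) form anywhere.

    Deliberately conservative: quoted data that merely looks like an
    assignment is also flagged.
    """
    if isinstance(expr, list):
        if expr and expr[0] == "set!":
            return True
        return any(contains_assignment(sub) for sub in expr)
    return False


class Scope:
    """A lexical scope annotated with a 'polluted' flag."""

    def __init__(self, body, parent=None):
        self.parent = parent
        self.polluted = contains_assignment(body)
        if self.polluted:
            # An inner set! may target a variable of any enclosing scope,
            # so every ancestor is marked as well.
            ancestor = parent
            while ancestor is not None:
                ancestor.polluted = True
                ancestor = ancestor.parent


def cacheable(scope) -> bool:
    """Cache only when no scope on the lookup chain can be mutated."""
    while scope is not None:
        if scope.polluted:
            return False
        scope = scope.parent
    return True
```

The check errs on the side of not caching: a single `set!` disables memoization for its own scope, every enclosing scope, and every scope nested under those.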

By the way, you obviously want hash tables with weak semantics for this so that any objects that become garbage do not pile up in memory.
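In Python, for instance, the standard library's `weakref.WeakKeyDictionary` gives this behaviour. The sketch below is an assumption-laden illustration, not a prescribed design: `Node` and `WeakMemo` are hypothetical names, and keys are compared by identity because `Node` does not override `__eq__` or `__hash__`.

```python
import gc
import weakref


class Node:
    """Stand-in for an expression or environment object."""
    def __init__(self, label):
        self.label = label


class WeakMemo:
    """Two-level memo table: expr -> env -> value, weak in both keys."""

    def __init__(self):
        self._by_expr = weakref.WeakKeyDictionary()

    def store(self, expr, env, value):
        per_env = self._by_expr.get(expr)
        if per_env is None:
            per_env = weakref.WeakKeyDictionary()
            self._by_expr[expr] = per_env
        # Caveat: if value holds a strong reference to expr or env,
        # the entry keeps its own key alive and is never collected.
        per_env[env] = value

    def lookup(self, expr, env, default=None):
        per_env = self._by_expr.get(expr)
        return default if per_env is None else per_env.get(env, default)

    def __len__(self):
        return len(self._by_expr)
```

Once the interpreter drops its last reference to an expression, the corresponding entry disappears from the table without any explicit eviction logic.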

Source: https://stackoverflow.com/questions/10308058/efficient-incremental-hash-computation-during-program-interpretation

Tags

hash

lisp

scheme

interpreter

memoization