Question
I'm working on an online editor for a datatype that consists of nested lists of strings. Traffic can become unbearable if I transfer the entire structure every time a single value changes. So, to reduce traffic, I've thought of applying a diff tool. The problem is: how do I find and report the diff of two trees? For example:
["ah","bh",["ha","he",["li","no","pz"],"ka",["kat","xe"]],"po","xi"] ->
["ah","bh",["ha","he",["li","no","pz"],"ka",["rag","xe"]],"po","xi"]
There, the only change is "kat" -> "rag", deep down in the tree. Most diff tools work on flat lists, files, etc., but not on trees. I couldn't find any literature on this specific problem. What is the minimal way to report such a change, and what is an efficient algorithm to find it?
Answer 1:
XML is a tree-like data structure in common use, often used to describe structured documents or other hierarchical objects whose changes over time need to be monitored. So it should be unsurprising that most of the recent work in tree diffing has been in the context of XML.
Here's a 2006 survey with a lot of possibly useful links: Change Detection in XML Trees
One of the more interesting links from the above is Kyriakos Komvoteas' thesis, which was accompanied by an open source implementation called TreePatch that now seems to be defunct.
Another survey article, by Daniel Ehrenberg, with a bunch more references. (That one comes from a question on http://cstheory.stackexchange.com)
Good luck.
Answer 2:
Finding the difference between two trees is a lot like searching a tree. The only difference is that you know you will have to get to the bottom of both of them. You could walk both trees simultaneously and, when you hit a difference, change one tree to match the other (if your goal is to end up with identical trees without sending the whole tree over every time).
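The simultaneous walk can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not a general tree-diff algorithm: the names `diff` and `patch` are hypothetical, a change is reported as an (index path, new value) pair, and any list whose length changed is reported as one whole-subtree replacement rather than as a minimal sequence of insertions and deletions.

```python
def diff(old, new, path=()):
    """Return a list of (path, new_value) replacements that turn old into new."""
    # Two lists of equal length: descend and compare element by element.
    if isinstance(old, list) and isinstance(new, list) and len(old) == len(new):
        changes = []
        for i, (a, b) in enumerate(zip(old, new)):
            changes.extend(diff(a, b, path + (i,)))
        return changes
    # Anything else (two strings, or subtrees of different shape or length):
    # report the whole subtree as replaced, or nothing if it is unchanged.
    return [] if old == new else [(path, new)]


def patch(tree, changes):
    """Apply the replacements to tree in place and return the patched tree."""
    for path, value in changes:
        if not path:
            return value  # the root itself was replaced
        node = tree
        for i in path[:-1]:
            node = node[i]
        node[path[-1]] = value
    return tree


old = ["ah", "bh", ["ha", "he", ["li", "no", "pz"], "ka", ["kat", "xe"]], "po", "xi"]
new = ["ah", "bh", ["ha", "he", ["li", "no", "pz"], "ka", ["rag", "xe"]], "po", "xi"]

changes = diff(old, new)
print(changes)                     # [((2, 4, 0), 'rag')]
print(patch(old, changes) == new)  # True
```

For the example in the question, only the path `(2, 4, 0)` and the string `"rag"` would need to be sent. Handling insertions and deletions compactly needs one of the proper tree-edit algorithms covered in the surveys linked in this thread.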
Some links that I've found on diff'ing 2 trees:
How can i diff two trees to determine parental changes?
Detect differences between tree structures
Diff algorithms
Hope that those links will be useful to you. :)
Answer 3:
- You can use any general diff algorithm; it is not hard to find a ready-to-use library.
- If you can use the zlib library, I can suggest another solution. With a small trick, this library can be used to send a highly compressed difference between any two binaries. Let's call them A and B (and the difference Bc).
Side 1:
- Init ZLIB stream
- Compress A->Ac with Z_SYNC_FLUSH (we don't need the result, so Ac can be freed)
- Compress B->Bc with Z_SYNC_FLUSH
- Deinit ZLIB stream
We compress block A first with a special flag that forces zlib to process and output all pending data without resetting the compression state. When we then compress block B, the compressor already knows the subsequences of A and compresses B very efficiently (if the two have a lot in common). Bc is the only data to send.
Side 2:
- Init ZLIB stream
- Compress A->Ac with Z_SYNC_FLUSH
- Deinit ZLIB stream
We need to decompress exactly the same blocks as we compressed. That is why we need Ac.
- Init ZLIB stream again
- Decompress Ac->A with Z_SYNC_FLUSH
- Decompress Bc->B with Z_SYNC_FLUSH
- Deinit ZLIB stream
Now we can decompress Ac->A (we have to, because we compressed it on the other side, and it lets the decompressor learn all the subsequences of block A) and finally Bc->B.
This is a somewhat unusual and tricky use of zlib, but Bc in this case is not just a compressed block B; it is effectively the compressed difference between blocks A and B. It is very efficient when the size of the zlib dictionary (the sliding window, at most 32 KB) is comparable to the size of block A. For huge blocks of data it will be less efficient.
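The two sides described above can be sketched with Python's standard `zlib` module. This is a minimal sketch, assuming both sides already hold identical bytes for A; the function names `encode_delta` and `decode_delta` are hypothetical, and the trees are serialized with `json` purely for illustration.

```python
import json
import zlib


def encode_delta(a: bytes, b: bytes) -> bytes:
    """Side 1: prime the compressor with A, then return Bc for B."""
    comp = zlib.compressobj()
    comp.compress(a)
    comp.flush(zlib.Z_SYNC_FLUSH)  # this output is Ac; it is discarded
    # The compression state still remembers A, so B can refer back to it.
    return comp.compress(b) + comp.flush(zlib.Z_SYNC_FLUSH)


def decode_delta(a: bytes, bc: bytes) -> bytes:
    """Side 2: rebuild Ac from the local copy of A, then decode Bc."""
    comp = zlib.compressobj()
    ac = comp.compress(a) + comp.flush(zlib.Z_SYNC_FLUSH)
    decomp = zlib.decompressobj()
    decomp.decompress(ac)          # teaches the decompressor about A
    return decomp.decompress(bc)   # yields B


old = ["ah", "bh", ["ha", "he", ["li", "no", "pz"], "ka", ["kat", "xe"]], "po", "xi"]
new = ["ah", "bh", ["ha", "he", ["li", "no", "pz"], "ka", ["rag", "xe"]], "po", "xi"]
a = json.dumps(old).encode()
b = json.dumps(new).encode()

bc = encode_delta(a, b)
print(len(b), len(bc))             # size of B versus size of what is sent
print(decode_delta(a, bc) == b)    # True
```

The stream is never finished with a final flush, because only the sync-flushed chunks are needed. Side 2 compresses A only to obtain a valid Ac to feed to its decompressor, as in the steps above.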
Source: https://stackoverflow.com/questions/19256028/how-to-correctly-diff-trees-that-is-nested-lists-of-strings