I need an algorithm that can compare two text files and highlight their difference and ( even better!) can compute their difference in a meaningful way (like two similar fi
If you need a finer granularity than lines, you can use Levenshtein distance. Levenshtein distance is a straight-forward measure on how to similar two texts are.
You can also use it to extract the edit logs and can a very fine-grained diff, similar to that on the edit history pages of SO.
Be warned though that Levenshtein distance can be quite CPU- and memory-intensive to calculate, so using difflib,as Douglas Leder suggested, is most likely going to be faster.
Cf. also this answer.
There are a number of distance metrics, as paradoja mentioned there is the Levenshtein distance, but there is also NYSIIS and Soundex. In terms of Python implementations, I have used py-editdist and ADVAS before. Both are nice in the sense that you get a single number back as a score. Check out ADVAS first, it implements a bunch of algorithms.
My current understanding is that the best solution to the Shortest Edit Script (SES) problem is Myers "middle-snake" method with the Hirschberg linear space refinement.
The Myers algorithm is described in:
E. Myers, ``An O(ND) Difference Algorithm and Its Variations,''
Algorithmica 1, 2 (1986), 251-266.
The GNU diff utility uses the Myers algorithm.
The "similarity score" you speak of is called the "edit distance" in the literature which is the number of inserts or deletes necessary to transform one sequence into the other.
Note that a number of people have cited the Levenshtein distance algorithm but that is, albeit easy to implement, not the optimal solution as it is inefficient (requires the use of a possibly huge n*m matrix) and does not provide the "edit script" which is the sequence of edits that could be used to transform one sequence into the other and vice versa.
For a good Myers / Hirschberg implementation look at:
http://www.ioplex.com/~miallen/libmba/dl/src/diff.c
The particular library that it is contained within is no longer maintained but to my knowledge the diff.c module itself is still correct.
Mike
I can recommend to take a look at Neil Fraser's code and articles:
google-diff-match-patch
Currently available in Java, JavaScript, C++ and Python. Regardless of language, each library features the same API and the same functionality. All versions also have comprehensive test harnesses.
Neil Fraser: Diff Strategies - for theory and implementation notes
In Python, there is difflib, as also others have suggested.
difflib
offers the SequenceMatcher class, which can be used to give you a similarity ratio. Example function:
def text_compare(text1, text2, isjunk=None):
return difflib.SequenceMatcher(isjunk, text1, text2).ratio()
Take a look at the Fuzzy module. It has fast (written in C) based algorithms for soundex, NYSIIS and double-metaphone.
A good introduction can be found at: http://www.informit.com/articles/article.aspx?p=1848528