Text difference algorithm


I need an algorithm that can compare two text files and highlight their differences and (even better!) compute their difference in a meaningful way (like two similar fi


11 answers

太阳男子

2020-11-27 11:22

If you need a finer granularity than lines, you can use Levenshtein distance. Levenshtein distance is a straightforward measure of how similar two texts are.
You can also use it to extract the edit log and get a very fine-grained diff, similar to the one on the edit history pages of SO. Be warned, though, that Levenshtein distance can be quite CPU- and memory-intensive to calculate, so using difflib, as Douglas Leder suggested, is most likely going to be faster.

Cf. also this answer.
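
To make the "edit log" idea concrete, here is a minimal sketch of the textbook dynamic-programming Levenshtein computation with a backtrace. The function name `levenshtein_ops` and the tuple format of the operations are my own choices for illustration, not part of any library, and the full matrix it builds is exactly the memory cost warned about above.

```python
def levenshtein_ops(a, b):
    """Return (distance, ops), where ops is one optimal edit log turning a into b."""
    n, m = len(a), len(b)
    # dist[i][j] = edit distance between the prefixes a[:i] and b[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # delete a[i-1]
                             dist[i][j - 1] + 1,         # insert b[j-1]
                             dist[i - 1][j - 1] + cost)  # keep or replace

    # Walk back from the bottom-right corner to recover the edits.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        same = i > 0 and j > 0 and a[i - 1] == b[j - 1]
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (0 if same else 1):
            ops.append(("keep" if same else "replace", a[i - 1], b[j - 1]))
            i -= 1
            j -= 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("delete", a[i - 1], None))
            i -= 1
        else:
            ops.append(("insert", None, b[j - 1]))
            j -= 1
    ops.reverse()
    return dist[n][m], ops


if __name__ == "__main__":
    distance, ops = levenshtein_ops("kitten", "sitting")
    print(distance)
    for op in ops:
        print(op)
```

It works on any pair of sequences, so passing lists of lines instead of strings gives a line-level log.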

借酒劲吻你

2020-11-27 11:24

There are a number of distance metrics. As paradoja mentioned, there is the Levenshtein distance, but there are also phonetic algorithms such as NYSIIS and Soundex. In terms of Python implementations, I have used py-editdist and ADVAS before. Both are nice in the sense that you get a single number back as a score. Check out ADVAS first; it implements a bunch of algorithms.

故里飘歌

2020-11-27 11:25

My current understanding is that the best solution to the Shortest Edit Script (SES) problem is Myers' "middle snake" method with the Hirschberg linear-space refinement.

The Myers algorithm is described in:

E. Myers, "An O(ND) Difference Algorithm and Its Variations,"
Algorithmica 1, 2 (1986), 251-266.

The GNU diff utility uses the Myers algorithm.

The "similarity score" you speak of is called the "edit distance" in the literature: the number of inserts or deletes necessary to transform one sequence into the other.

Note that a number of people have cited the Levenshtein distance algorithm. Although it is easy to implement, it is not the optimal solution: the textbook version is inefficient (it fills a possibly huge n*m matrix), and the distance on its own does not give you the "edit script", which is the sequence of edits that could be used to transform one sequence into the other and vice versa.
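
For a feel of how the O(ND) approach avoids the n*m matrix, here is a rough sketch of the basic greedy search from the Myers paper. It only returns the length D of the shortest edit script (inserts and deletes only); it is not the middle-snake / Hirschberg variant and it does not recover the script itself. The function name `myers_distance` is made up for this example.

```python
def myers_distance(a, b):
    """Length of the shortest insert/delete edit script between a and b."""
    n, m = len(a), len(b)
    max_d = n + m
    # v[k] = furthest x reached so far on diagonal k, where k = x - y.
    # Only O(D) diagonals are ever touched, instead of an n*m table.
    v = {1: 0}
    for d in range(max_d + 1):
        for k in range(-d, d + 1, 2):
            if k == -d or (k != d and v[k - 1] < v[k + 1]):
                x = v[k + 1]        # come from diagonal k+1: insert b[y]
            else:
                x = v[k - 1] + 1    # come from diagonal k-1: delete a[x]
            y = x - k
            # Follow the "snake": matching elements cost nothing.
            while x < n and y < m and a[x] == b[y]:
                x += 1
                y += 1
            v[k] = x
            if x >= n and y >= m:
                return d
    return max_d


if __name__ == "__main__":
    # The example pair used in the Myers paper.
    print(myers_distance("ABCABBA", "CBABAC"))
```

Because substitutions are not allowed here, the result equals len(a) + len(b) minus twice the length of the longest common subsequence, so it will generally differ from the Levenshtein distance for the same inputs.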

For a good Myers / Hirschberg implementation look at:

http://www.ioplex.com/~miallen/libmba/dl/src/diff.c

The particular library that it is contained within is no longer maintained but to my knowledge the diff.c module itself is still correct.

Mike

别那么骄傲

2020-11-27 11:29

I recommend taking a look at Neil Fraser's code and articles:

google-diff-match-patch

Currently available in Java, JavaScript, C++ and Python. Regardless of language, each library features the same API and the same functionality. All versions also have comprehensive test harnesses.

Neil Fraser: Diff Strategies - for theory and implementation notes

野的像风

2020-11-27 11:31
In Python, there is difflib, as others have also suggested.

difflib offers the SequenceMatcher class, which can be used to give you a similarity ratio. Example function:
```
import difflib

def text_compare(text1, text2, isjunk=None):
    # ratio() is in [0, 1]; 1.0 means the two sequences are identical
    return difflib.SequenceMatcher(isjunk, text1, text2).ratio()
```
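
For the "highlight the difference" part of the question, difflib can also produce the changes themselves. The sketch below uses `difflib.unified_diff` for a line-level diff and `SequenceMatcher.get_opcodes()` for character-level spans; the helper names `line_diff` and `char_changes` and the "old"/"new" labels are just placeholders I picked.

```python
import difflib


def line_diff(old_text, new_text):
    """Return a unified diff of two strings as a list of lines."""
    return list(difflib.unified_diff(
        old_text.splitlines(), new_text.splitlines(),
        fromfile="old", tofile="new", lineterm=""))


def char_changes(old, new):
    """Return (tag, old_piece, new_piece) for every span that is not equal."""
    matcher = difflib.SequenceMatcher(None, old, new)
    return [(tag, old[i1:i2], new[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]


if __name__ == "__main__":
    print("\n".join(line_diff("a\nb\nc", "a\nx\nc")))
    print(char_changes("abcd", "abxd"))
```

The opcode tags are "replace", "delete", "insert" and "equal", which map fairly directly onto highlighting in a UI.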
误落风尘

2020-11-27 11:33

Take a look at the Fuzzy module. It has fast, C-based implementations of Soundex, NYSIIS and Double Metaphone.

A good introduction can be found at: http://www.informit.com/articles/article.aspx?p=1848528
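
To show what a phonetic code looks like, here is a small pure-Python sketch of American Soundex. This is not the Fuzzy module's code or API, just an illustration written for this answer; the `soundex` function name is my own, and it assumes plain ASCII letters.

```python
# Map each coded consonant to its Soundex digit.
_CODES = {}
for digit, letters in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
    for ch in letters:
        _CODES[ch] = str(digit)


def soundex(word):
    """Return the 4-character American Soundex code of word ("" if no letters)."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue          # h and w do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code           # a vowel resets prev, so a repeated code is kept
    return (result + "000")[:4]


if __name__ == "__main__":
    for name in ["Robert", "Rupert", "Ashcraft", "Tymczak"]:
        print(name, soundex(name))
```

Two words that sound alike get the same code, so a phonetic comparison is a yes/no match rather than a graded distance.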

