How to count differences between two files on Linux?


I need to work with large files and must find the differences between two of them. I don't need the differing content itself, only the number of differences.

To find the number of dif


7 Answers

孤街浪徒

2020-12-23 13:57
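If you want to count the number of changed regions (hunks), use this:
```
diff -U 0 file1 file2 | grep '^@' | wc -l
```
With `-U 0`, adjacent changed lines are grouped into a single hunk, so this counts groups of changed lines rather than individual lines. Doesn't John's answer double count the different lines?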
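To make the hunk-versus-line distinction concrete, here is a minimal sketch. The sample files and the variable names are assumptions made up for this illustration; they are not part of the answer above.

```shell
# Hypothetical sample files, created only for this demonstration.
tmp=$(mktemp -d)
printf 'a\nb\nc\nd\ne\n' > "$tmp/file1"
printf 'a\nB\nC\nd\nE\n' > "$tmp/file2"

# Every hunk header in unified format starts with "@@".
# diff exits with status 1 when the files differ, so "|| true" keeps
# the assignment safe in scripts that run under "set -e" or pipefail.
hunks=$(diff -U 0 "$tmp/file1" "$tmp/file2" | grep -c '^@@' || true)

echo "hunks: $hunks"   # prints "hunks: 2"
```

Three lines differ here, but lines 2 and 3 are adjacent and share one hunk, so the count is 2.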
执念已碎

2020-12-23 13:58
```
diff -U 0 file1 file2 | grep -v '^@' | wc -l
```
Subtract 2 from the result to account for the two file-name header lines at the top of the diff listing. Unified format is probably a bit faster than side-by-side format.
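The subtraction can also be folded into the pipeline. The following is a sketch under the assumption that the first two output lines are always the `---`/`+++` file headers; the sample files and the `changed` variable are hypothetical.

```shell
# Hypothetical sample files, created only for this demonstration.
tmp=$(mktemp -d)
printf 'a\nb\nc\nd\ne\n' > "$tmp/file1"
printf 'a\nB\nC\nd\nE\n' > "$tmp/file2"

# tail -n +3 skips the two file-name header lines; the grep then counts
# only removed ("-") and added ("+") lines, ignoring "@@" hunk headers.
# "|| true" covers diff's exit status 1 and grep's exit status 1 on a zero count.
changed=$(diff -U 0 "$tmp/file1" "$tmp/file2" | tail -n +3 | grep -c '^[-+]' || true)

echo "changed lines: $changed"   # prints "changed lines: 6"
```

Note that a modified line is counted twice this way, once as a removal and once as an addition: three modified lines give 6.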
猫巷女王i

2020-12-23 14:07
Since every output line that differs starts with a `<` or `>` character in diff's default format, I would suggest this:
```
diff file1 file2 | grep '^[<>]' | wc -l
```
By using only `'^<'` or only `'^>'` as the pattern, you can count the differing lines from just one of the files.
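As a rough illustration of the combined and per-file counts, assuming two small made-up files (all file and variable names here are hypothetical):

```shell
# Hypothetical sample files, created only for this demonstration.
tmp=$(mktemp -d)
printf 'a\nb\nc\n' > "$tmp/file1"
printf 'a\nX\nc\nd\n' > "$tmp/file2"

# In diff's default format, "<" marks lines from the first file
# and ">" marks lines from the second file.
# "|| true" covers diff's exit status 1 when the files differ.
both=$(diff "$tmp/file1" "$tmp/file2" | grep -c '^[<>]' || true)
only_first=$(diff "$tmp/file1" "$tmp/file2" | grep -c '^<' || true)
only_second=$(diff "$tmp/file1" "$tmp/file2" | grep -c '^>' || true)

echo "both: $both, first: $only_first, second: $only_second"
# prints "both: 3, first: 1, second: 2"
```

Quoting the pattern keeps the shell from interpreting `<`, `>` or the brackets before grep sees them.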
北海茫月

2020-12-23 14:13

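If using Linux/Unix, what about `comm -23 file1 file2` to print lines in file1 that aren't in file2, `comm -23 file1 file2 | wc -l` to count them, and similarly `comm -13 file1 file2` for lines that are only in file2? Note that `comm` expects both files to be sorted, and that each numeric option suppresses a column (1: lines only in file1, 2: lines only in file2, 3: lines in both) rather than selecting one.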
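A minimal sketch of the `comm` approach, assuming unsorted input that is sorted into temporary copies first. The sample data and names are invented for illustration.

```shell
# Use one collation order for both sort and comm so they agree on ordering.
export LC_ALL=C

# Hypothetical sample files, created only for this demonstration.
tmp=$(mktemp -d)
printf 'pear\napple\nfig\n' > "$tmp/file1"
printf 'fig\nkiwi\napple\nlime\n' > "$tmp/file2"

# comm requires sorted input.
sort "$tmp/file1" > "$tmp/file1.sorted"
sort "$tmp/file2" > "$tmp/file2.sorted"

# -23 suppresses columns 2 and 3, leaving lines found only in the first file.
only_first=$(comm -23 "$tmp/file1.sorted" "$tmp/file2.sorted" | wc -l | tr -d ' ')
# -13 suppresses columns 1 and 3, leaving lines found only in the second file.
only_second=$(comm -13 "$tmp/file1.sorted" "$tmp/file2.sorted" | wc -l | tr -d ' ')

echo "only in file1: $only_first, only in file2: $only_second"
# prints "only in file1: 1, only in file2: 2"
```

Because the files are sorted first, this compares them as collections of lines and ignores line order, which differs from what `diff` reports.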


粉色の甜心

2020-12-23 14:13

Here is a way to count any kind of difference between two files, with a regex that specifies what counts as a unit of difference. Here the regex is `.`, which matches any character except newline:

git diff --patience --word-diff=porcelain --word-diff-regex=. file1 file2 | pcre2grep -M "^@[\s\S]*" | pcre2grep -M --file-offsets "(^-.*\n)(^\+.*\n)?|(^\+.*\n)" | wc -l

An excerpt from man git-diff :

--patience
           Generate a diff using the "patience diff" algorithm.
--word-diff[=<mode>]
           Show a word diff, using the <mode> to delimit changed words. By default, words are delimited by whitespace; see --word-diff-regex below.
           porcelain
               Use a special line-based format intended for script consumption. Added/removed/unchanged runs are printed in the usual unified diff
               format, starting with a +/-/` ` character at the beginning of the line and extending to the end of the line. Newlines in the input
               are represented by a tilde ~ on a line of its own.
--word-diff-regex=<regex>
           Use <regex> to decide what a word is, instead of considering runs of non-whitespace to be a word. Also implies --word-diff unless it
           was already enabled.
           Every non-overlapping match of the <regex> is considered a word. Anything between these matches is considered whitespace and ignored(!)
           for the purposes of finding differences. You may want to append |[^[:space:]] to your regular expression to make sure that it matches
           all non-whitespace characters. A match that contains a newline is silently truncated(!) at the newline.
           For example, --word-diff-regex=.  will treat each character as a word and, correspondingly, show differences character by character.

pcre2grep is part of the pcre2-utils package on Ubuntu 20.04.
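For the narrower case where the two files have the same length and differ only by substituted bytes (no insertions or deletions), a different and much simpler technique is `cmp -l`, which prints one line per differing byte. This is not the git-based method above, only a sketch of an alternative; the sample files and the `bytes` variable are assumptions.

```shell
# Hypothetical sample files of equal length, created only for this demonstration.
tmp=$(mktemp -d)
printf 'hello world\n' > "$tmp/file1"
printf 'hellO wOrld\n' > "$tmp/file2"

# cmp -l lists every differing byte position, one per line.
# cmp exits with status 1 when the files differ, hence "|| true".
bytes=$(cmp -l "$tmp/file1" "$tmp/file2" | wc -l | tr -d ' ' || true)

echo "differing bytes: $bytes"   # prints "differing bytes: 2"
```

Unlike a diff, `cmp` does not realign after an inserted or deleted byte, so a single insertion makes every following byte count as different.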


爱一瞬间的悲伤

2020-12-23 14:14
I believe the correct solution is in this answer, that is:
```
$ diff -y --suppress-common-lines a b | grep '^' | wc -l
1
```
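A small sketch of the side-by-side count, assuming GNU diff (the `-y` and `--suppress-common-lines` options are GNU extensions) and two made-up files named after the `a` and `b` in the command above:

```shell
# Hypothetical sample files, created only for this demonstration.
tmp=$(mktemp -d)
printf 'a\nb\nc\n' > "$tmp/a"
printf 'a\nX\nc\nd\n' > "$tmp/b"

# Side-by-side output prints one row per differing line pair; with
# --suppress-common-lines only those rows remain, so wc -l counts them.
# "|| true" covers diff's exit status 1 when the files differ.
count=$(diff -y --suppress-common-lines "$tmp/a" "$tmp/b" | wc -l | tr -d ' ' || true)

echo "differences: $count"   # prints "differences: 2"
```

A modified line appears as a single row here, so it is counted once rather than twice. The `grep '^'` stage in the command above matches every line and can be dropped without changing the count.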