I have two files; let us call them md5s1.txt and md5s2.txt. Each contains the output of a command like:
find . -type f -print0 | xargs -0 md5sum | sort > md5s.txt
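For reference, each line of that output is a 32-character hex digest, two spaces, then the path; for example (the hash below is the well-known md5 of an empty file, the path is made up):
d41d8cd98f00b204e9800998ecf8427e  ./some/empty-file.txt
That fixed-width hash column is why the hash can be isolated with cut -c -32.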
If you are looking for duplicate files, fdupes can do this for you:
$ fdupes --recurse .
On Ubuntu you can install it with:
$ sudo apt-get install fdupes
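A minimal usage sketch, assuming the hypothetical directories folder1/ and folder2/ used elsewhere in this thread; fdupes prints each group of identical files as a blank-line-separated block:
$ fdupes --recurse folder1/ folder2/
$ fdupes --recurse --summarize folder1/ folder2/
The second form only reports how many duplicates there are and how much space they waste.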
Easy starter:
diff <(cut -d' ' -f1 md5s1.txt) <(cut -d' ' -f1 md5s2.txt)
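Since both files were produced with sort, the hash column is already in order, so comm works as well; a sketch of that alternative, which prints hashes unique to md5s1.txt in column 1 and hashes unique to md5s2.txt in column 2:
comm -3 <(cut -d' ' -f1 md5s1.txt) <(cut -d' ' -f1 md5s2.txt)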
Also, consider just comparing the directory trees directly:
diff -EwburqN folder1/ folder2/
(-r recurses, -q only reports which files differ, -N treats absent files as empty, and -E/-w/-b ignore various whitespace differences; -u has no visible effect here because -q suppresses the diffs themselves.)
Compare only the md5 column by running diff on <(cut -c -32 md5sums.sort.XXX), and tell diff to print just the line numbers of added or removed lines, using --old-line-format/--new-line-format='%dn'$'\n'. Pipe this into ed -s md5sums.sort.XXX so that ed prints only those lines from the md5sums.sort.XXX file (-s suppresses ed's byte-count diagnostic, which would otherwise end up in the output).
# print the new-file line numbers of added lines, then have ed
# echo those lines from the full checksum file
diff \
    --new-line-format='%dn'$'\n' \
    --old-line-format='' \
    --unchanged-line-format='' \
    <(cut -c -32 md5sums.sort.old) \
    <(cut -c -32 md5sums.sort.new) \
    | ed -s md5sums.sort.new \
    > files-added
# same idea, but print the old-file line numbers of removed lines
diff \
    --new-line-format='' \
    --old-line-format='%dn'$'\n' \
    --unchanged-line-format='' \
    <(cut -c -32 md5sums.sort.old) \
    <(cut -c -32 md5sums.sort.new) \
    | ed -s md5sums.sort.old \
    > files-removed
The problem with ed is that it loads the entire file into memory, which can be a problem if you have a lot of checksums. Instead of piping the output of diff into ed, pipe it into the following command, which uses much less memory because it streams the checksum file.
diff … | (
    lnum=0
    # for each line number that diff emits, advance through the full
    # checksum file on fd 3 until that line is reached, then print it
    while read -r lprint; do
        while [ "$lnum" -lt "$lprint" ]; do
            IFS= read -r line <&3
            lnum=$((lnum+1))
        done
        printf '%s\n' "$line"
    done
) 3<md5sums.sort.XXX
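For example, the files-added pipeline from above rewritten to use this filter instead of ed; the diff flags and inputs are unchanged, only the consumer differs:
diff \
    --new-line-format='%dn'$'\n' \
    --old-line-format='' \
    --unchanged-line-format='' \
    <(cut -c -32 md5sums.sort.old) \
    <(cut -c -32 md5sums.sort.new) \
| (
    lnum=0
    while read -r lprint; do
        while [ "$lnum" -lt "$lprint" ]; do
            IFS= read -r line <&3
            lnum=$((lnum+1))
        done
        printf '%s\n' "$line"
    done
) 3<md5sums.sort.new > files-added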