diff files comparing only first n characters of each line

日久生厌 2021-02-19 05:30

I have two files; let us call them md5s1.txt and md5s2.txt. Both contain the output of a command like:

find -type f -print0 | xargs -0 md5sum | sort > md5s.txt

3 answers
  • 2021-02-19 06:09

    If you are looking for duplicate files, fdupes can do this for you. It needs at least one directory to scan, for example the current one:

    $ fdupes --recurse .
    

    On Ubuntu you can install it with:

    $ sudo apt-get install fdupes
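
If the md5sum listing from the question already exists, duplicates can also be read straight off it without rescanning the disk. The sketch below is an illustration only: the function name `list_duplicate_hashes` is made up here, and it assumes GNU `uniq`, whose `-w` option limits the comparison to the first N characters and whose `-D` option prints every line of each duplicate group.

```shell
# list_duplicate_hashes FILE
# FILE is assumed to hold `md5sum` output: a 32-character hash, two spaces, a path.
# Prints every line whose hash occurs more than once.
list_duplicate_hashes() {
    # uniq only detects adjacent repeats, so sort first (harmless if already sorted).
    # -w 32 compares only the hash; -D prints all members of each duplicate group.
    sort "$1" | uniq -w 32 -D
}
```

Usage would look like `list_duplicate_hashes md5s1.txt`. Unlike fdupes, this only trusts the recorded checksums and never looks at the files themselves.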
    
  • 2021-02-19 06:27

    Easy starter:

    diff <(cut -d' ' -f1 md5s1.txt)  <(cut -d' ' -f1 md5s2.txt)
    

    Also, consider just comparing the two folders directly:

    diff -EwburqN folder1/ folder2/
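
The cut-based comparison above keys on the first space-delimited field. The same idea extends to "the first n characters of each line", as the question title puts it. Here is a minimal sketch; the helper name `diff_prefix` is hypothetical, and it uses temporary files instead of process substitution so that it also runs under a plain POSIX sh (assuming `mktemp` is available).

```shell
# diff_prefix N OLD NEW
# Compare OLD and NEW looking only at the first N characters of each line.
# The body runs in a subshell so its variables do not leak into the caller.
diff_prefix() (
    n=$1
    tmp_old=$(mktemp) || exit 2
    tmp_new=$(mktemp) || exit 2
    # Keep only characters 1..N of every line.
    cut -c "1-$n" "$2" > "$tmp_old"
    cut -c "1-$n" "$3" > "$tmp_new"
    diff "$tmp_old" "$tmp_new"
    status=$?
    rm -f "$tmp_old" "$tmp_new"
    # Same exit status as diff: 0 = same prefixes, 1 = differences, 2 = trouble.
    exit "$status"
)
```

For md5sum listings, `diff_prefix 32 md5s1.txt md5s2.txt` would compare the checksum columns only. Note that the output then shows the differing prefixes, not the file names that go with them.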
    
  • 2021-02-19 06:33

    Compare only the MD5 column by running diff on <(cut -c -32 md5sums.sort.XXX), and tell diff to print just the line numbers of added or removed lines with --new-line-format='%dn'$'\n' or --old-line-format='%dn'$'\n'. Pipe those numbers into ed -s md5sums.sort.XXX: ed treats each bare line number as a command to print that line, so only the matching full lines of md5sums.sort.XXX come out. The -s option keeps ed from also printing the file's byte count into the result.

    # Lines whose checksum appears only in the new listing.
    diff \
        --new-line-format='%dn'$'\n' \
        --old-line-format='' \
        --unchanged-line-format='' \
        <(cut -c -32 md5sums.sort.old) \
        <(cut -c -32 md5sums.sort.new) \
        | ed -s md5sums.sort.new \
        > files-added
    # Lines whose checksum appears only in the old listing.
    diff \
        --new-line-format='' \
        --old-line-format='%dn'$'\n' \
        --unchanged-line-format='' \
        <(cut -c -32 md5sums.sort.old) \
        <(cut -c -32 md5sums.sort.new) \
        | ed -s md5sums.sort.old \
        > files-removed
    

    The problem with ed is that it loads the entire file into memory, which can hurt if you have a lot of checksums. Instead of piping the output of diff into ed, pipe it into the following loop, which reads the file sequentially and uses much less memory.

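    diff … | (
        lnum=0
        # Each input line is a line number; advance through the file on fd 3 until we reach it.
        while read -r lprint; do
            while [ "$lnum" -lt "$lprint" ]; do IFS= read -r line <&3; lnum=$((lnum + 1)); done
            printf '%s\n' "$line"
        done
    ) 3<md5sums.sort.XXX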
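
As a possible alternative to the shell loop (not part of the original answer), the same "print the lines whose numbers arrive on stdin" step can be sketched with awk. The function name `print_numbered_lines` is made up for illustration. It relies on standard awk behaviour: a `var=value` operand placed between file operands is assigned just before the next file is read, and `-` stands for standard input. Memory use grows with the number of requested line numbers, not with the size of the listing.

```shell
# print_numbered_lines FILE
# Reads line numbers (one per line) on stdin and prints those lines of FILE.
print_numbered_lines() {
    awk '
        pass == 1 { wanted[$1 + 0]; next }   # stdin: remember each requested line number
        (FNR in wanted)                      # FILE: print a line if its number was requested
    ' pass=1 - pass=2 "$1"
}
```

It would slot in as `diff … | print_numbered_lines md5sums.sort.XXX`. Using the explicit `pass` variable rather than the common `NR == FNR` idiom keeps the behaviour correct when diff reports no line numbers at all.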
    