Comparing two files in the Linux terminal

一生所求 2020-12-22 17:31

There are two files called "a.txt" and "b.txt"; both contain a list of words. Now I want to check which words are extra in "a.txt"

10 answers
  • 2020-12-22 17:42

    You can also use colordiff, which displays the output of diff with colors.

    As for vimdiff, it also lets you compare files over SSH, for example:

    vimdiff /var/log/secure scp://192.168.1.25/var/log/secure
    

    Extracted from: http://www.sysadmit.com/2016/05/linux-diferencias-entre-dos-archivos.html

  • 2020-12-22 17:43

    Use comm -13 (requires sorted files); it prints the lines that appear only in the second file:

    $ cat file1
    one
    two
    three
    
    $ cat file2
    one
    two
    three
    four
    
    $ comm -13 <(sort file1) <(sort file2)
    four
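
Since the question is about extra words, the same idea can be run in both directions. The following is a minimal sketch, assuming the a.txt and b.txt names from the question and made-up sample words. The a.sorted and b.sorted file names are hypothetical; sorting into files instead of using <(...) is a choice made here so the sketch also runs in shells without process substitution.

```shell
# Work in a scratch directory so the sample files do not clobber anything.
cd "$(mktemp -d)" || exit 1

# Sample word lists (illustrative data, not taken from the question).
printf '%s\n' one two three four > a.txt
printf '%s\n' one two three five > b.txt

# comm needs sorted input; LC_ALL=C keeps sort and comm on the same collation.
LC_ALL=C sort a.txt > a.sorted
LC_ALL=C sort b.txt > b.sorted

# -23 hides columns 2 and 3, leaving the lines found only in a.txt.
only_in_a=$(LC_ALL=C comm -23 a.sorted b.sorted)
# -13 hides columns 1 and 3, leaving the lines found only in b.txt.
only_in_b=$(LC_ALL=C comm -13 a.sorted b.sorted)

echo "extra in a.txt: $only_in_a"
echo "extra in b.txt: $only_in_b"
```

With this sample data the first line reports four and the second reports five.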
    
  • 2020-12-22 17:43

    You can use awk for this. Test files:

    $ cat a.txt
    one
    two
    three
    four
    four
    $ cat b.txt
    three
    two
    one
    

    The awk:

    $ awk '
    NR==FNR {                    # process b.txt  or the first file
        seen[$0]                 # hash words to hash seen
        next                     # next word in b.txt
    }                            # process a.txt  or all files after the first
    !($0 in seen)' b.txt a.txt   # if word is not hashed to seen, output it
    

    Duplicates are output:

    four
    four
    

    To avoid duplicates, add each newly encountered word in a.txt to the seen hash:

    $ awk '
    NR==FNR {
        seen[$0]
        next
    }
    !($0 in seen) {              # if word is not hashed to seen
        seen[$0]                 # hash unseen a.txt words to seen to avoid duplicates 
        print                    # and output it
    }' b.txt a.txt
    

    Output:

    four
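
The duplicate suppression can also be written more compactly. This is only a sketch of one possible variant, not the answer's own code: the second array name, printed, is an assumption introduced here, and the test files are recreated in a scratch directory.

```shell
# Recreate the test files from above in a scratch directory.
cd "$(mktemp -d)" || exit 1
printf '%s\n' one two three four four > a.txt
printf '%s\n' three two one > b.txt

# First file (b.txt): remember every word.
# Second file (a.txt): print a word only if b.txt lacks it and it has not
# been printed before; printed[$0]++ evaluates to 0 the first time a word is met.
extra=$(awk 'NR==FNR { seen[$0]; next }
             !($0 in seen) && !printed[$0]++' b.txt a.txt)

echo "$extra"
```

Keeping a separate printed array leaves seen as a pure record of b.txt, at the cost of a second hash.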
    

    If the word lists are comma-separated, like:

    $ cat a.txt
    four,four,three,three,two,one
    five,six
    $ cat b.txt
    one,two,three
    

    you have to do a couple of extra laps (for loops):

    awk -F, '                    # comma-separated input
    NR==FNR {
        for(i=1;i<=NF;i++)       # loop all comma-separated fields
            seen[$i]
        next
    }
    {
        for(i=1;i<=NF;i++)
            if(!($i in seen)) {
                 seen[$i]        # this time we buffer output (below):
                 buffer=buffer (buffer==""?"":",") $i
            }
        if(buffer!="") {         # output non-empty buffers after each record in a.txt
            print buffer
            buffer=""
        }
    }' b.txt a.txt
    

    Output this time:

    four
    five,six
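
If the grouping by input line does not matter, a simpler route may be to split the words onto separate lines first and then reuse comm. This is a sketch under that assumption, swapping the awk loops for tr, sort -u and comm; the a.words and b.words file names are hypothetical.

```shell
# Recreate the comma-separated test files in a scratch directory.
cd "$(mktemp -d)" || exit 1
printf '%s\n' 'four,four,three,three,two,one' 'five,six' > a.txt
printf '%s\n' 'one,two,three' > b.txt

# Put each comma-separated word on its own line, then sort and de-duplicate.
tr ',' '\n' < a.txt | LC_ALL=C sort -u > a.words
tr ',' '\n' < b.txt | LC_ALL=C sort -u > b.words

# Words that occur in a.txt but nowhere in b.txt, one per line.
extra=$(LC_ALL=C comm -23 a.words b.words)
echo "$extra"
```

Unlike the awk version, the result is sorted and printed one word per line (five, four, six), so the information about which record a word came from is lost.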
    
  • 2020-12-22 17:45

    If you have vim installed, try this:

    vimdiff file1 file2
    

    or

    vim -d file1 file2
    

    You will find it fantastic.

  • 2020-12-22 17:51

    Sort them and use comm:

    comm -23 <(sort a.txt) <(sort b.txt)
    

    comm compares two sorted input files and by default prints three columns: lines unique to the first file, lines unique to the second file, and lines present in both. The options -1, -2 and -3 suppress the corresponding columns, so comm -23 a b lists only the entries that are unique to a. The <(...) syntax sorts the files on the fly; if they are already sorted, you don't need it.
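
To make the three columns concrete, here is a small sketch with made-up, already sorted sample files (the a.sorted and b.sorted names and their contents are assumptions for illustration). It shows the default output and the -12 combination, which keeps only the shared lines.

```shell
# Two small word lists, already in sorted order (illustrative data).
cd "$(mktemp -d)" || exit 1
printf '%s\n' four one three two > a.sorted
printf '%s\n' five one three two > b.sorted

# No flags: column 1 = only in a, column 2 = only in b, column 3 = in both.
# Column 2 is indented by one tab character, column 3 by two.
columns=$(LC_ALL=C comm a.sorted b.sorted)
printf '%s\n' "$columns"

# -12 hides columns 1 and 2, leaving only the lines common to both files.
common=$(LC_ALL=C comm -12 a.sorted b.sorted)
printf '%s\n' "$common"
```

Setting LC_ALL=C is a precaution chosen here: comm expects its input to be ordered by the same collation it uses itself.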

  • 2020-12-22 17:52

    If you prefer the diff output style from git diff, you can use it with the --no-index flag to compare files not in a git repository:

    git diff --no-index a.txt b.txt
    

    Using a couple of files with around 200k file name strings in each, I benchmarked this approach against some of the other answers here, using the built-in time command:

    git diff --no-index a.txt b.txt
    # ~1.2s
    
    comm -23 <(sort a.txt) <(sort b.txt)
    # ~0.2s
    
    diff a.txt b.txt
    # ~2.6s
    
    sdiff a.txt b.txt
    # ~2.7s
    
    vimdiff a.txt b.txt
    # ~3.2s
    

    comm seems to be the fastest by far, while git diff --no-index appears to be the fastest approach for diff-style output.


    Update 2018-03-25: You can actually omit the --no-index flag unless you are inside a git repository and want to compare untracked files within that repository. From the man pages:

    This form is to compare the given two paths on the filesystem. You can omit the --no-index option when running the command in a working tree controlled by Git and at least one of the paths points outside the working tree, or when running the command outside a working tree controlled by Git.
