Deleting lines from one file which are in another file

余生分开走 2020-11-28 01:46

I have a file f1:

line1
line2
line3
line4
..
..

I want to delete all the lines which are in another file f2:

9 answers
  • 2020-11-28 02:02

    If the exclude file is small enough to fit in memory, you can use AWK's associative arrays.

    awk 'NR == FNR { list[tolower($0)]=1; next } { if (! list[tolower($0)]) print }' exclude-these.txt from-this.txt 
    

    The output will be in the same order as the "from-this.txt" file. The tolower() function makes it case-insensitive, if you need that.

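    The algorithmic complexity should be roughly O(m + n), where m is the number of lines in exclude-these.txt and n is the number of lines in from-this.txt, since each array lookup takes constant time on average.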
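
    As a minimal sketch of the same idea without the case folding, the following wraps a case-sensitive variant in a shell function. The function name remove_listed and the sample file contents are made up for illustration. It uses awk's `in` operator, so looking up a line does not add a new entry to the array.

```shell
# Hypothetical sample files, created in a temporary directory.
tmp=$(mktemp -d)
printf '%s\n' line1 line2 line3 line4 > "$tmp/f1"
printf '%s\n' line2 line4 > "$tmp/f2"

# remove_listed EXCLUDE_FILE INPUT_FILE (illustrative helper, not a standard tool)
# First file (NR == FNR): remember every line as an array key.
# Second file: print only the lines that are not among those keys.
remove_listed() {
    awk 'NR == FNR { seen[$0]; next } !($0 in seen)' "$1" "$2"
}

result=$(remove_listed "$tmp/f2" "$tmp/f1")
printf '%s\n' "$result"
rm -rf "$tmp"
```

    With this sample data the output should be line1 and line3, in the order they appear in f1. One caveat: if the exclude file is empty, NR == FNR stays true while the second file is read, so nothing is printed at all.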

  • 2020-11-28 02:03

    Seems to be a job suitable for the SQLite shell:

    create table file1(line text);
    create index if1 on file1(line ASC);
    create table file2(line text);
    create index if2 on file2(line ASC);
    -- comment: if you have | in your files then specify “ .separator ××any_improbable_string×× ”
    .import 'file1.txt' file1
    .import 'file2.txt' file2
    .output result.txt
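    -- prints the lines of file2 that are not in file1; swap the table names for the opposite direction
    select * from file2 where line not in (select line from file1);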
    .q
    
  • 2020-11-28 02:06

    Some timing comparisons between various other answers:

    $ for n in {1..10000}; do echo $RANDOM; done > f1
    $ for n in {1..10000}; do echo $RANDOM; done > f2
    $ time comm -23 <(sort f1) <(sort f2) > /dev/null
    
    real    0m0.019s
    user    0m0.023s
    sys     0m0.012s
    $ time ruby -e 'puts File.readlines("f1") - File.readlines("f2")' > /dev/null
    
    real    0m0.026s
    user    0m0.018s
    sys     0m0.007s
    $ time grep -xvf f2 f1 > /dev/null
    
    real    0m43.197s
    user    0m43.155s
    sys     0m0.040s
    

    sort f1 f2 | uniq -u isn't even a symmetric difference, because it also removes lines that appear more than once within either file.

    comm can also be used with stdin and here strings:

    echo $'a\nb' | comm -23 <(sort) <(sort <<< $'c\nb') # a
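
    To make the difference between the two approaches concrete, here is a small sketch with made-up sample data (the file contents and variable names are assumptions for illustration). f1 contains a duplicated line and f2 contains a line that f1 lacks. Temporary files are used instead of process substitution so that the sketch also runs in a plain POSIX shell.

```shell
tmp=$(mktemp -d)
printf '%s\n' a b b c > "$tmp/f1"
printf '%s\n' c d > "$tmp/f2"

# uniq -u keeps only lines that occur exactly once in the combined, sorted input.
uniq_result=$(LC_ALL=C sort "$tmp/f1" "$tmp/f2" | uniq -u)

# comm -23 keeps the lines found only in the first of two sorted inputs.
LC_ALL=C sort "$tmp/f1" > "$tmp/f1.sorted"
LC_ALL=C sort "$tmp/f2" > "$tmp/f2.sorted"
comm_result=$(LC_ALL=C comm -23 "$tmp/f1.sorted" "$tmp/f2.sorted")

printf 'sort | uniq -u:\n%s\n' "$uniq_result"
printf 'comm -23:\n%s\n' "$comm_result"
rm -rf "$tmp"
```

    Here sort | uniq -u should yield a and d: the duplicated b is lost and d leaks in from f2. comm -23 should yield a, b, b, which is f1 with only the lines of f2 removed.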
    
  • 2020-11-28 02:13

    If you have Ruby (1.9+):

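    #!/usr/bin/env ruby
    # Read the exclusion list one entry per line (a bare split would break lines at any whitespace).
    b = File.readlines("file2").map(&:chomp)
    File.foreach("file1") do |x|
      x = x.chomp
      # Print the line only if it is not in the exclusion list.
      puts x unless b.include?(x)
    end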
    

    This has O(N^2) complexity, because include? scans the whole array for every line. If you care about performance, here's another version:

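    # Read both files as arrays of lines, then subtract one array from the other.
    b = File.readlines("file2").map(&:chomp)
    a = File.readlines("file1").map(&:chomp)
    (a - b).each { |x| puts x }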
    

    Array subtraction uses a hash internally, so the complexity is O(n) (size of a) + O(m) (size of b).

    Here's a little benchmark of the above, courtesy of user576875, but with 100K lines:

    $ for i in $(seq 1 100000); do echo "$i"; done|sort --random-sort > file1
    $ for i in $(seq 1 2 100000); do echo "$i"; done|sort --random-sort > file2
    $ time ruby test.rb > ruby.test
    
    real    0m0.639s
    user    0m0.554s
    sys     0m0.021s
    
    $ time sort file1 file2 | uniq -u > sort.test
    
    real    0m2.311s
    user    0m1.959s
    sys     0m0.040s
    
    $ diff <(sort -n ruby.test) <(sort -n sort.test)
    $
    

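    diff was used to show that there are no differences between the two generated files. Note that sort | uniq -u gives the same result here only because every line of file2 also appears in file1 and neither file contains duplicate lines.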

  • 2020-11-28 02:15

    grep -v -x -f f2 f1 should do the trick.

    Explanation:

    • -v to select non-matching lines
    • -x to match whole lines only
    • -f f2 to get patterns from f2

    You can add -F (or use fgrep, which newer grep releases mark as obsolete) to match the lines of f2 as fixed strings rather than as regex patterns, in case you want to remove the lines in a "what you see is what you get" manner.
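
    The difference between pattern matching and fixed-string matching can be seen in a small sketch. The sample lines and variable names below are made up for illustration. The only entry in f2 is a.c, whose dot matches any character when it is read as a regex.

```shell
tmp=$(mktemp -d)
printf '%s\n' 'a.c' 'abc' 'a.c.d' > "$tmp/f1"
printf '%s\n' 'a.c' > "$tmp/f2"

# Lines of f2 are treated as regexes: "a.c" matches both "a.c" and "abc".
regex_result=$(grep -v -x -f "$tmp/f2" "$tmp/f1")

# Lines of f2 are treated as literal strings: only the exact line "a.c" matches.
fixed_result=$(grep -F -v -x -f "$tmp/f2" "$tmp/f1")

printf 'regex:\n%s\n' "$regex_result"
printf 'fixed:\n%s\n' "$fixed_result"
rm -rf "$tmp"
```

    The regex run should leave only a.c.d, because abc is removed as well. The -F run should leave abc and a.c.d.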

  • 2020-11-28 02:18

    Try comm instead (assuming f1 and f2 are "already sorted")

    comm -2 -3 f1 f2
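
    If the files are not already sorted, one option is to sort them into temporary copies first. The following is a sketch under that assumption, with made-up sample data and file names. Both sort and comm are run with LC_ALL=C so that they agree on the ordering.

```shell
tmp=$(mktemp -d)
printf '%s\n' line3 line1 line4 line2 > "$tmp/f1"
printf '%s\n' line4 line2 > "$tmp/f2"

# comm expects sorted input, so sort both files into temporary copies.
LC_ALL=C sort "$tmp/f1" > "$tmp/f1.sorted"
LC_ALL=C sort "$tmp/f2" > "$tmp/f2.sorted"

# -2 suppresses lines found only in f2, -3 suppresses lines common to both.
remaining=$(LC_ALL=C comm -2 -3 "$tmp/f1.sorted" "$tmp/f2.sorted")
printf '%s\n' "$remaining"
rm -rf "$tmp"
```

    This should print line1 and line3. The output comes out in sorted order, so the original line order of f1 is lost. If that order matters, the awk or grep approaches are a better fit.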
    