Shell: Find Matching Lines Across Many Files

-上瘾入骨i 2021-01-03 02:43

I am trying to use a shell script (well, a "one-liner") to find any common lines between around 50 files. Edit: Note I am looking for a line (lines) that a

4 answers
  •  栀梦 (original poster)
    2021-01-03 03:46

    Old answer, bash (O(n); opens 2 * n files)

    Building on @mjgpy3's answer, you just need a for loop that intersects the running result with each file using comm. Note that comm expects its inputs to be sorted.

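    #!/bin/bash
    # Print the lines common to all files given as arguments.
    
    # mktemp avoids the name collisions that $RANDOM-based paths can cause
    tmp1=$(mktemp) || exit 1
    tmp2=$(mktemp) || exit 1
    trap 'rm -f "$tmp1" "$tmp2"' EXIT
    
    # comm requires sorted input, so sort every file before comparing
    sort "$1" > "$tmp1"
    shift
    for file in "$@"
    do
        # -1 -2 suppresses lines unique to either input, leaving the common ones
        sort "$file" | comm -1 -2 "$tmp1" - > "$tmp2"
        mv "$tmp2" "$tmp1"
    done
    cat "$tmp1"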
    

    Save it as comm.sh, make it executable, and call:

    ./comm.sh *.sp 
    

    assuming all your filenames end with .sp.
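To make the sorted-input requirement concrete, here is a minimal Python sketch of the kind of merge that `comm -1 -2` performs on two sorted inputs. The function name `common_sorted` is made up for this illustration and is not part of either script in this answer; it assumes both lists are already sorted in the same order.

```python
def common_sorted(a, b):
    """Return the lines present in both sorted sequences.

    Both inputs must already be sorted in the same order. Duplicates are
    matched one-to-one, so a line appearing twice in both inputs is
    returned twice.
    """
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            # Same line on both sides: keep it and advance both cursors.
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # a[i] cannot appear later in b, because b is sorted.
            i += 1
        else:
            j += 1
    return result


if __name__ == "__main__":
    first = sorted(["delta", "alpha", "bravo"])
    second = sorted(["bravo", "charlie", "delta"])
    print(common_sorted(first, second))  # ['bravo', 'delta']
```

If either input is unsorted, a merge like this can skip past a matching line without noticing it, which is why the inputs have to be sorted before they are compared.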

    Updated answer, Python (opens each file only once)

    Looking at the other answers, I wanted to give one that opens each file exactly once, uses no temporary files, and copes with duplicated lines (each distinct line is reported once). Additionally, let's process the files in parallel.

    Here it is (Python 3):

    #!/usr/bin/env python3
    import argparse
    import sys
    import multiprocessing
    import os
    
    EOLS = {'native': os.linesep.encode('ascii'), 'unix': b'\n', 'windows': b'\r\n'}
    
    def extract_set(filename):
        # read bytes so no text encoding has to be guessed; drop line terminators
        with open(filename, 'rb') as f:
            return set(line.rstrip(b'\r\n') for line in f)
    
    def find_common_lines(filenames):
        # each worker process opens and reads one file at a time
        with multiprocessing.Pool() as pool:
            line_sets = pool.map(extract_set, filenames)
        return set.intersection(*line_sets)
    
    if __name__ == '__main__':
        # usage info and argument parsing
        parser = argparse.ArgumentParser()
        parser.add_argument("in_files", nargs='+',
                help="find common lines in these files")
        parser.add_argument('--out', type=argparse.FileType('wb'),
                help="the output file (default stdout)")
        parser.add_argument('--eol-style', choices=sorted(EOLS), default='native',
                help="(default: native)")
        args = parser.parse_args()
    
        # actual stuff
        common_lines = find_common_lines(args.in_files)
    
        # write results to output (set order is arbitrary)
        to_print = EOLS[args.eol_style].join(common_lines)
        if args.out is None:
            # write raw bytes; decoding them with stdout's encoding could fail
            sys.stdout.buffer.write(to_print)
        else:
            args.out.write(to_print)
    

    Save it as find_common_lines.py and call:

    python3 ./find_common_lines.py *.sp
    

    Run it with the --help option for more usage information.
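If holding one set per file in memory is a concern, the same intersection can also be folded sequentially. The following is only a sketch of that alternative, not part of the script above: `intersect_streams` and `lines_of` are hypothetical names, and it trades the parallelism for lower memory use and an early exit once nothing is left in common.

```python
import sys


def intersect_streams(streams):
    """Intersect line sets one stream at a time.

    `streams` is an iterable of iterables of bytes lines (for example the
    lines of files opened in binary mode). Only the running intersection
    and the current stream's set are held in memory at once.
    """
    common = None
    for stream in streams:
        lines = {line.rstrip(b"\r\n") for line in stream}
        common = lines if common is None else common & lines
        if not common:
            # Nothing can be common to all inputs any more: stop early.
            break
    return common or set()


def lines_of(path):
    """Yield the raw lines of one file; the file is opened only when read."""
    with open(path, "rb") as f:
        yield from f


if __name__ == "__main__":
    result = intersect_streams(lines_of(p) for p in sys.argv[1:])
    # Sort so the output order is stable between runs.
    sys.stdout.buffer.write(b"".join(line + b"\n" for line in sorted(result)))
```

Because the files are read lazily, any file after the point where the intersection becomes empty is never opened.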
