Shell: Find Matching Lines Across Many Files

-上瘾入骨i 2021-01-03 02:43

I am trying to use a shell script (well, a "one-liner") to find any common lines between around 50 files. Edit: Note I am looking for a line (lines) that a

4 Answers

栀梦 (original poster)

2021-01-03 03:46

old, bash answer (O(n); opens `2 * n` files)

Building on @mjgpy3's answer, you just need a for loop around comm, like this:

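#!/bin/bash
# Print the lines common to all files given as arguments.
# comm expects sorted input, so each file is sorted before it is compared.

tmp1=$(mktemp) || exit 1
tmp2=$(mktemp) || exit 1
trap 'rm -f "$tmp1" "$tmp2"' EXIT

sort "$1" > "$tmp1"
shift
for file in "$@"
do
    # -1 -2 drops lines unique to either input, leaving only the common ones
    sort "$file" | comm -1 -2 "$tmp1" - > "$tmp2"
    mv "$tmp2" "$tmp1"
done
cat "$tmp1"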

Save it as comm.sh, make it executable, and call

./comm.sh *.sp

assuming all your filenames end with .sp.
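The loop's logic can also be sketched in Python without temporary files. This is only a minimal sketch, not part of the original answer: the names `read_lines` and `common_lines_sequential` are hypothetical, and each file is treated as a set of lines, so no sorting is needed.

```python
from functools import reduce


def read_lines(path):
    """Return the set of lines in `path`, with line endings removed."""
    with open(path, "rb") as f:
        return {line.rstrip(b"\r\n") for line in f}


def common_lines_sequential(paths):
    """Fold the files one by one, keeping only lines seen in every file so far.

    Hypothetical helper mirroring the shell loop: the running set plays the
    role of the temporary file, and `&` plays the role of `comm -1 -2`.
    """
    if not paths:
        return set()
    first, *rest = paths
    return reduce(lambda acc, p: acc & read_lines(p), rest, read_lines(first))
```

As in the shell version, the running result can only shrink, so later files are compared against an ever smaller set.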

Updated answer, Python, opens each file only once

Looking at the other answers, I wanted to give one that opens each file only once, uses no temporary files, and supports duplicated lines. As a bonus, it processes the files in parallel.

Here you go (in Python 3):

#!/usr/bin/env python3
import argparse
import sys
import multiprocessing
import os

EOLS = {'native': os.linesep.encode('ascii'), 'unix': b'\n', 'windows': b'\r\n'}

def extract_set(filename):
    with open(filename, 'rb') as f:
        return set(line.rstrip(b'\r\n') for line in f)

def find_common_lines(filenames):
    # one worker process per CPU; the pool is shut down when the block exits
    with multiprocessing.Pool() as pool:
        line_sets = pool.map(extract_set, filenames)
    return set.intersection(*line_sets)

if __name__ == '__main__':
    # usage info and argument parsing
    parser = argparse.ArgumentParser()
    parser.add_argument("in_files", nargs='+', 
            help="find common lines in these files")
    parser.add_argument('--out', type=argparse.FileType('wb'),
            help="the output file (default stdout)")
    parser.add_argument('--eol-style', choices=EOLS.keys(), default='native',
            help="(default: native)")
    args = parser.parse_args()

    # actual stuff
    common_lines = find_common_lines(args.in_files)

    # write results to output
    to_print = EOLS[args.eol_style].join(common_lines)
    if args.out is None:
        # find out stdout's encoding, utf-8 if absent
        encoding = sys.stdout.encoding or 'utf-8'
        sys.stdout.write(to_print.decode(encoding))
    else:
        args.out.write(to_print)

Save it as find_common_lines.py, and call

python3 ./find_common_lines.py *.sp

More usage info with the --help option.
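One thing to keep in mind is that a Python set has no defined order, so the common lines may come out in any order. If the order matters, a variation along the following lines could help. It is a rough sketch rather than the script above: `common_lines_ordered` is a hypothetical name, it runs sequentially instead of in parallel, and it reads the first file as a list so that its order can be kept.

```python
def common_lines_ordered(paths):
    """Return lines present in every file, in the order they first appear in paths[0]."""
    def lines_of(path):
        with open(path, "rb") as f:
            return [line.rstrip(b"\r\n") for line in f]

    if not paths:
        return []
    first, *rest = paths
    # One set per remaining file, for fast membership tests.
    other_sets = [set(lines_of(p)) for p in rest]
    seen = set()
    result = []
    for line in lines_of(first):
        # Keep a line once, and only if every other file contains it too.
        if line not in seen and all(line in s for s in other_sets):
            seen.add(line)
            result.append(line)
    return result
```

Each file is still opened only once; the trade-off is that the first file is held in memory as a list rather than a set.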
