Shell: Find Matching Lines Across Many Files

Asked by -上瘾入骨i on 2021-01-03 02:43 · 4 answers · 970 views

I am trying to use a shell script (well, a "one-liner") to find any common lines between around 50 files. Edit: Note I am looking for a line (lines) that a

4 Answers
  • 2021-01-03 03:23

    Combining these two answers (ans1 and ans2), I think you can get the result you need without sorting the files:

    #!/bin/bash
    ans="matching_lines"

    for file1 in *
    do
        for file2 in *
        do
            if [ "$file1" != "$ans" ] && [ "$file2" != "$ans" ] && [ "$file1" != "$file2" ]; then
                echo "Comparing: $file1 $file2 ..." >> "$ans"
                perl -ne 'print if ($seen{$_} .= @ARGV) =~ /10$/' "$file1" "$file2" >> "$ans"
            fi
        done
    done
    

    Simply save it, give it execution rights (chmod +x compareFiles.sh), and run it. It will take all the files in the current working directory, make an all-vs-all comparison, and leave the result in the "matching_lines" file.
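As an aside, the same pairwise "common lines" check can be done with grep's fixed-string, whole-line matching instead of the perl one-liner. This is a sketch; the file names a.txt and b.txt are illustrative, not from the original post:

```shell
# Lines common to two files via grep: -F fixed strings, -x whole-line
# match, -f read the patterns from the first file.
printf 'alpha\nbeta\ngamma\n' > a.txt
printf 'beta\ngamma\ndelta\n' > b.txt
grep -Fxf a.txt b.txt    # prints: beta, gamma
rm a.txt b.txt
```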

    Things to be improved:

    • Skip directories
    • Avoid comparing each pair of files twice (file1 vs file2 and file2 vs file1).
    • Maybe add the line number next to the matching string

    Hope this helps.

    Best,

    Alan Karpovsky

  • 2021-01-03 03:41

    When I first read this, I thought you were trying to find 'any common lines'. I took this to mean "find duplicate lines". If that is the case, the following should suffice:

    sort *.sp | uniq -d
    
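For a quick sense of what uniq -d reports, here is a minimal sketch with made-up demo files (explicit names rather than *.sp to keep it self-contained):

```shell
# uniq -d prints each line that occurs more than once in its (sorted) input,
# so a line shared by the two files shows up exactly once.
printf 'a\nb\n' > d1.sp
printf 'b\nc\n' > d2.sp
sort d1.sp d2.sp | uniq -d    # prints: b
rm d1.sp d2.sp
```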

    Upon re-reading your question, it seems you are actually trying to find lines that 'appear in all the files'. If that is the case, you will need to know the number of files in your directory:

    find . -type f -name "*.sp" | wc -l
    

    If this returns the number 50, you can then use awk like this:

    WHINY_USERS=1 awk '{ array[$0]++ } END { for (i in array) if (array[i] == 50) print i }' *.sp
    

    You can consolidate this process and write a one-liner like this:

    WHINY_USERS=1 awk -v find="$(find . -type f -name "*.sp" | wc -l)" '{ array[$0]++ } END { for (i in array) if (array[i] == find) print i }' *.sp
    
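Note that WHINY_USERS is an old, undocumented gawk switch that only affects the order in which the array is traversed. A portable sketch drops it and pipes the output through sort instead; the demo files f1.sp and f2.sp below are made up for illustration:

```shell
# Portable variant without gawk-only WHINY_USERS: count occurrences of each
# line across n files, print the ones seen n times, then sort the output.
printf 'x\ny\nz\n' > f1.sp
printf 'y\nz\nq\n' > f2.sp
awk -v n=2 '{ seen[$0]++ } END { for (l in seen) if (seen[l] == n) print l }' f1.sp f2.sp | sort
# prints: y, z
rm f1.sp f2.sp
```

Like the original one-liner, this counts total occurrences rather than distinct files, so a line repeated within a single file can be miscounted.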
  • 2021-01-03 03:46

    Old bash answer (O(n); opens 2 * n files)

    From @mjgpy3's answer, you just have to add a for loop and use comm, like this:

    #!/bin/bash
    # Note: comm requires its inputs to be sorted.

    tmp1="/tmp/tmp1$RANDOM"
    tmp2="/tmp/tmp2$RANDOM"

    cp "$1" "$tmp1"
    shift
    for file in "$@"
    do
        comm -1 -2 "$tmp1" "$file" > "$tmp2"
        mv "$tmp2" "$tmp1"
    done
    cat "$tmp1"
    rm "$tmp1"
    

    Save it as comm.sh, make it executable, and call

    ./comm.sh *.sp 
    

    assuming all your filenames end with .sp.
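Since comm only produces correct results on sorted input (an assumption the loop above makes silently), here is a minimal two-file sketch that sorts first; the file names are illustrative:

```shell
# Sort each file into a temporary, then intersect with comm:
# -12 suppresses lines unique to either file, leaving only common lines.
printf 'c\na\nb\n' > one.sp
printf 'b\nc\nd\n' > two.sp
sort one.sp > one.sorted
sort two.sp > two.sorted
comm -12 one.sorted two.sorted    # prints: b, c
rm one.sp two.sp one.sorted two.sorted
```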

    Updated answer: Python, opens each file only once

    Looking at the other answers, I wanted to give one that opens each file only once, uses no temporary files, and supports duplicated lines. Additionally, it processes the files in parallel.

    Here you go (in python3):

    #!/usr/bin/env python3
    import argparse
    import sys
    import multiprocessing
    import os
    
    EOLS = {'native': os.linesep.encode('ascii'), 'unix': b'\n', 'windows': b'\r\n'}
    
    def extract_set(filename):
        with open(filename, 'rb') as f:
            return set(line.rstrip(b'\r\n') for line in f)
    
    def find_common_lines(filenames):
        pool = multiprocessing.Pool()
        line_sets = pool.map(extract_set, filenames)
        return set.intersection(*line_sets)
    
    if __name__ == '__main__':
        # usage info and argument parsing
        parser = argparse.ArgumentParser()
        parser.add_argument("in_files", nargs='+', 
                help="find common lines in these files")
        parser.add_argument('--out', type=argparse.FileType('wb'),
                help="the output file (default stdout)")
        parser.add_argument('--eol-style', choices=EOLS.keys(), default='native',
                help="(default: native)")
        args = parser.parse_args()
    
        # actual stuff
        common_lines = find_common_lines(args.in_files)
    
        # write results to output
        to_print = EOLS[args.eol_style].join(common_lines)
        if args.out is None:
            # find out stdout's encoding, utf-8 if absent
            encoding = sys.stdout.encoding or 'utf-8'
            sys.stdout.write(to_print.decode(encoding))
        else:
            args.out.write(to_print)
    

    Save it into a find_common_lines.py, and call

    python3 ./find_common_lines.py *.sp
    

    More usage info with the --help option.
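The heart of the script is `set.intersection` applied to one set of lines per file; a minimal sketch of just that idea, with made-up contents standing in for three files:

```python
# Each set stands in for the (de-duplicated) lines of one hypothetical file,
# as produced by extract_set() above.
line_sets = [
    {b'alpha', b'beta', b'gamma'},
    {b'beta', b'gamma', b'delta'},
    {b'gamma', b'beta', b'epsilon'},
]

# Lines present in every file = intersection of all the per-file sets.
common = set.intersection(*line_sets)
print(sorted(common))  # [b'beta', b'gamma']
```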

  • 2021-01-03 03:48

    See this answer. I originally thought a diff sounded like what you were asking for, but that answer seems much more appropriate.
