Question
I have two large files (27k lines and 450k lines). They look sort of like:
File1:
1 2 A 5
3 2 B 7
6 3 C 8
...
File2:
4 2 C 5
7 2 B 7
6 8 B 8
7 7 F 9
...
I want the lines from both files whose 3rd column value appears in both files (note that the lines with A and F were excluded):
OUTPUT:
3 2 B 7
6 3 C 8
4 2 C 5
7 2 B 7
6 8 B 8
What's the best way?
Answer 1:
awk '{print $3}' file1 | sort | uniq > file1col3
awk '{print $3}' file2 | sort | uniq > file2col3
grep -Fx -f file1col3 file2col3 | awk '{print "\\w+ \\w+ " $1 " \\w+"}' > col3regexp
egrep -xh -f col3regexp file1 file2
This grabs all the unique column-3 values in the two files, intersects them (using grep -F), prints a bunch of regular expressions that will match the columns you want, then uses egrep to extract the matching lines from both files.
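With the sample data, col3regexp would end up containing:
\w+ \w+ B \w+
\w+ \w+ C \w+
Note that \w is a GNU grep extension; a strictly POSIX egrep may not support it, in which case [[:alnum:]_]+ can stand in.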
Answer 2:
First we sort the files on the third field:
sort -k 3 file1 > file1.sorted
sort -k 3 file2 > file2.sorted
Then we get the common values of the 3rd field using comm:
comm -12 <(cut -d " " -f 3 file1.sorted | uniq) <(cut -d " " -f 3 file2.sorted | uniq) > common_values.field
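With the sample data, common_values.field would contain:
B
C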
Now we can join each sorted file against the common values:
join -1 3 -o '1.1,1.2,1.3,1.4' file1.sorted common_values.field > file.joined
join -1 3 -o '1.1,1.2,1.3,1.4' file2.sorted common_values.field >> file.joined
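For the sample data, file.joined would come out as:
3 2 B 7
6 3 C 8
7 2 B 7
6 8 B 8
4 2 C 5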
The output is reordered with -o so we get the same field order as in the original files.
Standard Unix tools used: sort, comm, cut, uniq, join.
The <( ) process substitution works in bash (and ksh/zsh); for other shells you can use temp files instead.
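For example, a temp-file equivalent of the comm step (the .field file names here are just placeholders):
cut -d " " -f 3 file1.sorted | uniq > file1.field
cut -d " " -f 3 file2.sorted | uniq > file2.field
comm -12 file1.field file2.field > common_values.field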
Answer 3:
Here's an option using grep, sed and cut.
Extract column 3:
cut -d' ' -f3 file1 > f1c
cut -d' ' -f3 file2 > f2c
Find matching lines in file1 (-x makes the fixed strings match whole lines only, so e.g. B cannot match inside a longer value):
grep -nFxf f2c f1c | cut -d: -f1 | sed 's/$/p/' | sed -n -f - file1 > out
Find matching lines in file2:
grep -nFxf f1c f2c | cut -d: -f1 | sed 's/$/p/' | sed -n -f - file2 >> out
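To see what is happening: with the sample data, the first pipeline generates this sed script for file1 (print lines 2 and 3, i.e. the B and C lines):
2p
3p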
Output:
3 2 B 7
6 3 C 8
4 2 C 5
7 2 B 7
6 8 B 8
Update
If you have asymmetric data files and the smaller one fits into memory, this one-pass awk solution would be pretty efficient:
parse.awk
FNR == NR {          # first file only: index it by column 3
    a[$3] = $0       # remember the file1 line for this key
    p[$3] = 1        # mark the key as not yet printed
    next
}
a[$3]                # print the current file2 line if $3 was seen in file1
p[$3] {              # print the remembered file1 line, once
    print a[$3]
    delete p[$3]
}
Run it like this:
awk -f parse.awk file1 file2
where file1 is the smaller of the two files.
Explanation
- The FNR == NR block reads file1 into two hashes.
- a[$3] prints the file2 line if $3 is a key in a.
- p[$3] prints the remembered file1 line if $3 is a key in p, then deletes the key (so it is only printed once).
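For reference, the same program inlined as a one-liner (equivalent to parse.awk above):
awk 'FNR == NR { a[$3] = $0; p[$3] = 1; next } a[$3]; p[$3] { print a[$3]; delete p[$3] }' file1 file2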
Answer 4:
First obtain the common values from the third column. Then filter the lines from both files that have a matching third column.
If the columns are delimited by a single character, you can use cut to extract one column. For columns that can be separated by an arbitrary amount of whitespace, use awk. One way to obtain the common column-3 values is to extract the columns, sort them, and call comm. Using bash/ksh/zsh process substitutions:
comm -12 <(awk '{print $3}' file1 | sort -u) <(awk '{print $3}' file2 | sort -u)
Now turn these values into grep patterns, and filter:
comm -12 <(awk '{print $3}' file1 | sort -u) <(awk '{print $3}' file2 | sort -u) |
sed -e 's/[][.\|?*+^$]/\\&/g' \
    -e 's/.*/^[^[:space:]]+[[:space:]]+[^[:space:]]+[[:space:]]+&[[:space:]]/' |
grep -E -f - file1 file2
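For the sample data, the sed step emits one pattern per common value, along these lines:
^[^[:space:]]+[[:space:]]+[^[:space:]]+[[:space:]]+B[[:space:]]
^[^[:space:]]+[[:space:]]+[^[:space:]]+[[:space:]]+C[[:space:]]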
The method above should work reasonably well with huge files. But at 500k lines, you don't have huge files. Those files should fit comfortably in memory, and a simple Perl solution will be fine: load both files, extract the column-3 values, and print the matching lines.
perl -n -e '
    push @lines, $_;                 # remember every input line
    $c = (split)[2];
    $seen{$c}{$ARGV} = 1;            # record which file(s) this key appears in
    END {
        foreach (@lines) {
            $c = (split)[2];
            print if keys %{$seen{$c}} == 2;   # key present in both files
        }
    }' file1 file2
Source: https://stackoverflow.com/questions/12443110/intersection-of-files