I have tab delimited text files in which common rows between them are to be found based on columns 1 and 2 as key columns. Sample files:
file1.txt
aba 0 0
aba
Can't you simply use uniq
to search for repeated lines in you files?
Something like:
cat file1.txt file2.txt file3.txt | uniq -d
For your complete scenario, you could use uniq -c
to get the number of repetition for each line, and filter this with grep
.
I think you just need to modify the END
block a little, and the command invocation:
awk -v num_files=${x:-0} '
…
…script as before…
…
END {
if (num_files == 0) num_files = ARGC - 1
for (key in arr) {
if (arr[key] == num_files) {
split(line[key], line_arr, SUBSEP)
for (i = 1; i <= length(line_arr); i++) {
printf "%s\n", line_arr[i]
}
}
}
}
'
Basically, this takes a command line parameter based on $x
, defaulting to 0, and assigning it to the awk
variable num_files
. In the END
block, the code checks for num_files
being zero, and resets it to the number of files passed on the command line. (Interestingly, the value in ARGC
discounts any -v var=value
options and either a command line script or -f script.awk
, so the ARGC-1
term remains correct. The array ARGV
contains awk
(or whatever name you invoked it with) in ARGV[0]
and the files to be processed in ARGV[1]
through ARGV[ARGC-1]
.) The loop then checks for the required number of matches and prints as before. You can change ==
to >=
if you want the 'or more' option.
I observed in a comment:
I'm not clear what you are asking. I took it that your code was working for the example with three files and producing the right answer. I simply suggested how to modify the working code to handle N files and at least M of them sharing an entry. I have just realized, while typing this, that there is a bit more work to do. An entry could be missing from the first file but present in the others and will need to be processed, therefore. It is easy to report all occurrences in every file, or the first occurrence in any file. It is harder to report all occurrences only in the first file with a key.
The response was:
It is perfectly fine to report first occurrence in any file and need not be only from the first file. However, the issue with the suggested modification is, it is producing the same output for different values of
x
.
That's curious: I was able to get sane output from the amended code with different values for the number of files where the key must appear. I used this shell script. The code in the awk
program up to the END
block is the same as in the question; the only change is in the END processing block.
#!/bin/bash
while getopts n: opt
do
case "$opt" in
(n) num_files=$OPTARG;;
(*) echo "Usage: $(basename "$0" .sh) [-n number] file [...]" >&2
exit 1;;
esac
done
shift $(($OPTIND - 1))
awk -v num_files=${num_files:-$#} '
FNR == NR {
arr[$1,$2] = 1
line[$1,$2] = line[$1,$2] (line[$1,$2] ? SUBSEP : "") $0
next
}
FNR == 1 { delete found }
{ if (arr[$1,$2] && ! found[$1,$2]) { arr[$1,$2]++; found[$1,$2] = 1 } }
END {
if (num_files == 0) num_files = ARGC - 1
for (key in arr) {
if (arr[key] == num_files) {
split(line[key], line_arr, SUBSEP)
for (i = 1; i <= length(line_arr); i++) {
printf "%s\n", line_arr[i]
}
}
}
}
' "$@"
Sample runs (data files from question):
$ bash common.sh file?.txt
xxx 0 0
aba 0 0
aba 0 0 1
$ bash common.sh -n 3 file?.txt
xxx 0 0
aba 0 0
aba 0 0 1
$ bash common.sh -n 2 file?.txt
$ bash common.sh -n 1 file?.txt
abc 0 1
abd 1 1
$
That shows different answers depending on the value specified via -n
. Note that this only shows lines that appear in the first file and appear in exactly N files in total. The only key that appears in two files (abc
/1
) does not appear in the first file, so it is not listed by this code which stops paying attention to new keys after the first file is processed.
However, here's a rewrite, using some of the same ideas, but working more thoroughly.
#!/bin/bash
# SO 30428099
# Given that the key for a line is the first two columns, this script
# lists all appearances in all files of a given key if that key appears
# in N different files (where N defaults to the number of files). For
# the benefit of debugging, it includes the file name and line number
# with each line.
usage()
{
echo "Usage: $(basename "$0" .sh) [-n number] file [...]" >&2
exit 1
}
while getopts n: opt
do
case "$opt" in
(n) num_files=$OPTARG;;
(*) usage;;
esac
done
shift $(($OPTIND - 1))
if [ "$#" = 0 ]
then usage
fi
# Record count of each key, regardless of file: keys
# Record count of each key in each file: key_file
# Count of different files containing each key: files
# Accumulate line number, filename, line for each key: lines
awk -v num_files=${num_files:-$#} '
{
keys[$1,$2]++;
if (++key_file[$1,$2,FILENAME] == 1)
files[$1,$2]++
#printf "%s:%d: Key (%s,%s); keys = %d; key_file = %d; files = %d\n",
# FILENAME, FNR, $1, $2, keys[$1,$2], key_file[$1,$2,FILENAME], files[$1,$2];
sep = lines[$1,$2] ? RS : ""
#printf "B: [[\n%s\n]]\n", lines[$1,$2]
lines[$1,$2] = lines[$1,$2] sep FILENAME OFS FNR OFS $0
#printf "A: [[\n%s\n]]\n", lines[$1,$2]
}
END {
#print "END"
for (key in files)
{
#print "Key =", key, "; files =", files[key]
if (files[key] == num_files)
{
#printf "TAG\n%s\nEND\n", lines[key]
print lines[key]
}
}
}
' "$@"
Sample output (given the data files from the question):
$ bash common.sh file?.txt
file1.txt 5 xxx 0 0
file2.txt 4 xxx 0 0
file3.txt 4 xxx 0 0 0 1
file1.txt 1 aba 0 0
file1.txt 2 aba 0 0 1
file2.txt 2 aba 0 0 0 0
file2.txt 3 aba 0 0 0 1
file3.txt 2 aba 0 0
file3.txt 3 aba 0 1 0
$ bash common.sh -n 2 file?.txt
file2.txt 5 abc 1 1
file3.txt 5 abc 1 1
$ bash common.sh -n 1 file?.txt
file1.txt 3 abc 0 1
file3.txt 1 xyx 0 0
file1.txt 4 abd 1 1
file2.txt 1 xyz 0 0
$ bash common.sh -n 3 file?.txt
file1.txt 5 xxx 0 0
file2.txt 4 xxx 0 0
file3.txt 4 xxx 0 0 0 1
file1.txt 1 aba 0 0
file1.txt 2 aba 0 0 1
file2.txt 2 aba 0 0 0 0
file2.txt 3 aba 0 0 0 1
file3.txt 2 aba 0 0
file3.txt 3 aba 0 1 0
$ bash common.sh -n 4 file?.txt
$
You can fettle this to give the output you want (probably missing file name and line number). If you only want the lines from the first file containing a given key, you only add the information to lines
when files[$1,$2] == 1
. You can separate the recorded information with SUBSEP
instead of RS
and OFS
if you prefer.