问题
Related question: https://stackoverflow.com/posts/18164848
The input file input.txt is a tab delimited unicode txt with
a A e f m
b B g h
c C i j
b B k l
I want to match by the first and second column and merge. So I want to get output.txt with
a A e f m
b B g h k l
c C i j
The code has to detect the maximum number of columns in the input. Since it is 5 in this example, "k l" were put from 6th column.
Actually I almost managed to do this using Matlab when they are all numbers. But oh, when they were letters, Matlab was so bad at handling unicode, although I read stackoverflow about how to deal with unicode in Matlab I gave up. So I now turned to python.
Nirk at https://stackoverflow.com/posts/18164848 responded that the following line will do.
awk -F\t '{a=$1 "\t" $2; $1=$2=""; x[a] = x[a] $0} END {for(y in x) print y,x[y]}'
However this code doesn't seem to specify input and output file.
回答1:
awk is pipe-based linux command. To feed input file and get output, you can do like this: awk -F\t '{a=$1 "\t" $2; $1=$2=""; x[a] = x[a] $0} END {for(y in x) print y,x[y]}' < INPUT.TXT > OUTPUT.TXT
However, the awk program above can hardly match what you need "The code has to detect the maximum number of columns in the input. Since it is 5 in this example, "k l" were put from 6th column.".
You can try this python program:
max_value_fields = 0
values = dict()
with file("input.txt") as f:
keys = []
for line in f:
line = line.strip()
fs = line.split('\t')
key = '%s\t%s' % (fs[0], fs[1])
if key not in values:
values[key] = list()
keys.append(key)
values[key].append(fs[2:])
value_fields = len(fs) - 2
if value_fields > max_value_fields:
max_value_fields = value_fields
with file("output.txt", 'w+') as f:
for key in keys:
fields = [key]
for value_list in values[key]:
fields.extend([value for value in value_list])
fields.extend(['' for i in xrange(max_value_fields - len(value_list))])
print >> f, '\t'.join(fields)
回答2:
I would read the data twice: a 1st pass for identifying the maximum number of columns, then merge rows in the 2nd pass. Results are printed (basically in random order) in the END.
awk -f script.awk infile infile > outfile
$ cat script.awk
# BEGIN block: separators
BEGIN { FS = OFS = "\t" }
# Loop #1: detect maximum number of columns
NR == FNR { max = NF > max ? NF : max ; next }
# Loop #2: merge rows
{
k = $1 FS $2
if (k in a) {
for (i = 3; i <= NF; i++) {
a[k] = a[k] OFS $i
}
} else {
NF = max
a[k] = $0
}
}
# END block: Print results
END { for (i in a) print a[i] }
回答3:
Try this:
awk '{x=$1FS$2;$1=$2="";a[x]=a[x]?a[x]FS$0:$0}END{for(x in a) print x,a[x]}' input.txt
来源:https://stackoverflow.com/questions/18168664/awk-or-gawk-to-do-data-matching-and-merging