1st column in inFile contains a string not necessarily present in all inFiles
2nd and 7th columns in each inFile contains the Title# strings
Using AWK, I cannot piece this together correctly. My use of descriptive variables will hopefully help clarify what I'm trying to do. These are components I think I need:
- tab-separated input files:
-F'\t'
- increment the strings in the 1st column, but only add each 'name' once to the '1stColumnNames':
!1stColumnNames[$1]++ { name[++i] = $1 }
- make a new index for each .tsv file to store values for each file to avoid overwriting each column's values:
!r[FILENAME]++ { ++argind }
- store corresponding column values in 2nd and 7th columns for each file:
{ 2ndColumnVals[$1, argind] = $2 } { 7thColumnVals[$1, argind] = $7 }
- print all 1stColumnNames with associated 2ndColumnVals and 7thColumnVals, including their headers 'Title1' 'Title2' 'Title3' etc. :
?????
- index values that were empty for a particular 2ndColumnVals or 7thColumnVals, print as Mtee:
?????
- do this for all .tsv files in the current working directory and ouput a new tsv file:
*.tsv > outFile.tsv
Example Files
inFile1.tsv
Names Title1 Title2
AAAA 1111 123456
BBBBB 1111 123456
CCC 1111 123456
inFile2.tsv
Names Title3 Title4
BBBBB 2222 789456
DDDDD 2222 789456
EEEE 2222 789456
inFile3.tsv
Names Title5 Title6
AAAA 3333 987654
CCC 3333 987654
EEEE 3333 987654
outFile123.tsv
Names Title1 Title2 Title3 Title4 Title5 Title6
AAAA 1111 123456 Mtee Mtee 3333 987654
BBBBB 1111 123456 2222 789456 Mtee Mtee
CCC 1111 123456 Mtee Mtee 3333 987654
DDDDD Mtee Mtee 2222 789456 Mtee Mtee
EEEE Mtee Mtee 2222 789456 3333 987654
Test Script
GNU Awk 4.0.1 is located in /usr/bin/awk , so I made this file and executed it in the same working directory where the 3 input files are located:
named as script1.sh#### Example Usage: script1.sh inFile1.tsv inFile2.tsv inFile3.tsv > outFile123.tsv
awk -F'\t' '
FNR==1 { ++numFiles}
!seen[$1]++ { keys[++numKeys] = $1 }
{ a[$1,numFiles] = $2 FS $3 }
END {
for (keyNr=1; keyNr<=numKeys; keyNr++) {
key = keys[keyNr]
printf "%s", key
for (fileNr=1;fileNr<=numFiles;fileNr++) {
printf "\t%s", ((key,fileNr) in a ? a[key,fileNr] : "Mtee\tMtee")
}
print ""
}
}
' "$@"
Running awk -F script1.awk inFile1.tsv inFile2.tsv inFile3.tsv > outFile123.tsv
prints the follow error messages:
awk: cmd. line:1: inFile1.tsv
awk: cmd. line:1: ^ syntax error
Test Script 2 from konsolebox
works perfectly but I'm trying to understand each line by commenting:
#!/usr/bin/awk -f
#### named as script2.awk
#### Example Usage: awk -f script2.awk inFile1.tsv inFile2.tsv inFile3.tsv > outFile123.tsv
BEGIN { FS = "\t" } #input File Style is tab-delimited
{ sub(/\r/, "") } #remove all carriage return characters
!f[FILENAME]++ { ++indx } #for all files inputted make a single index called indx
!a[$1]++ { keys[i++] = $1 } #the new indx comprises only unique strings in column 1
{ b[$1, indx] = $2 FS $3 } #the 2nd and 3rd column are tab delimited and each pair that corresponds to a string saved in keys gets stored after the 1st column string in matrix b
END {
for (i = 0; i in keys; ++i) { #????
key = keys[i] #????
printf "%s", keys #prints out all strings in the index column 1 stored as keys
for (j = 1; j <= indx; ++j) { #????
v = b[key, j] #????
printf "\t%s", length(v) ? v : "Mtee" FS "Mtee" #print out strings as tab delimited and replace any lengths of 1 char to two Mtee separated by a tab
}
print "" #????
}
}
And here's another awk
:
#!/usr/bin/awk -f
# Set field separator to tab (\t)
BEGIN { FS = "\t" }
# Remove carriage return characters if file is in DOS format (CRLF)
{ sub(/\r/, "") }
# Increment indx by 1 (starts at 0) everytime a new file is processed
!f[FILENAME]++ { ++indx }
# Add a key ($1) to keys array every time it is first encountered
!a[$1]++ { keys[i++] = $1 }
# Store the 2nd and 3rd field to b matrix
{ b[$1, indx] = $2 FS $3 }
# This block runs after all files are processed
END {
# Traverse the keys in order
for (i = 0; i in keys; ++i) {
key = keys[i]
# Print key
printf "%s", key
# Print columns from every file in order
for (j = 1; j <= indx; ++j) {
v = b[key, j]
printf "\t%s", length(v) ? v : "Mtee" FS "Mtee"
}
# End the line with a newline
print ""
}
}
Usage:
awk -f script.awk file1 file2 file3
Output:
Names Title1 Title2 Title3 Title4 Title5 Title6
AAAA 1111 123456 Mtee Mtee 3333 987654
BBBBB 1111 123456 2222 789456 Mtee Mtee
CCC 1111 123456 Mtee Mtee 3333 987654
DDDDD Mtee Mtee 2222 789456 Mtee Mtee
EEEE Mtee Mtee 2222 789456 3333 987654
You need something like this:
Gawk version (for ARGIND plus true 2D arrays in gawk 4.0+):
$ gawk -F'\t' '
!seen[$1]++ { keys[++numKeys] = $1 }
{ a[$1][ARGIND] = $2 FS $3 }
END {
for (keyNr = 1; keyNr <= numKeys; keyNr++) {
key = keys[keyNr]
printf "%s", key
for (fileNr = 1; fileNr <= ARGIND; fileNr++) {
printf "\t%s", (fileNr in a[key] ? a[key][fileNr] : "Mtee\tMtee")
}
print ""
}
}
' file1 file2 file3
Non-gawk version:
awk -F'\t' '
FNR==1 { ++numFiles}
!seen[$1]++ { keys[++numKeys] = $1 }
{ a[$1,numFiles] = $2 FS $3 }
END {
for (keyNr=1; keyNr<=numKeys; keyNr++) {
key = keys[keyNr]
printf "%s", key
for (fileNr=1;fileNr<=numFiles;fileNr++) {
printf "\t%s", ((key,fileNr) in a ? a[key,fileNr] : "Mtee\tMtee")
}
print ""
}
}
' file1 file2 file3
Names Title1 Title2 Title3 Title4 Title5 Title6
AAAA 1111 123456 Mtee Mtee 3333 987654
BBBBB 1111 123456 2222 789456 Mtee Mtee
CCC 1111 123456 Mtee Mtee 3333 987654
DDDDD Mtee Mtee 2222 789456 Mtee Mtee
EEEE Mtee Mtee 2222 789456 3333 987654
来源:https://stackoverflow.com/questions/25046526/combining-certain-columns-of-several-tab-delimited-files-based-on-first-column