bash - combining files into CSVs

Question

I know that you can use paste to combine multiple files into a .csv file if each file holds one column,

i.e. paste -d "," column1.dat column2.dat column3.dat ... > myDat.csv will result in

myDat.csv

column1,   column2,   column3, ...
c1-1,      c2-1,      c3-1,    ...
c1-2,      c2-2,      c3-2,    ...
...        ...        ...

(without the extra whitespace; I just inserted it to make this more readable)

What if I have multiple measurements, instead?

e.g.

file1.dat has format <xvalue> <y1value>

file2.dat has format <xvalue> <y2value>

file3.dat has format <xvalue> <uvalue> <vvalue>

and I ultimately want a csv like

<xvalue>, <y1value>, <y2value>, <empty column>, <uvalue>, <vvalue>

How do I combine the files now?

Edit

Note that although each file is sorted (or can be sorted if it's not), they don't necessarily contain the same xvalues on the same lines.

If a file doesn't have an xvalue that another file does have, its corresponding column entry should be blank.

(Actually, I think dropping the rows for xvalues that aren't present in all files should also work.)

Answer 1:

OK, here is my solution in GNU awk. It tries to be fairly generic and leaves the extra empty column to external tools. It needs GNU awk because it uses true multidimensional arrays (arrays of arrays), but it could probably be adapted to other awks fairly easily.

The program joins the files on the first field of each, which it treats as the key column. If a key from one file is not found in another, it still gets its own output row, and the fields missing for it are printed as empty (notice keys x_3, x_4 and x_5 in the data files below).

First the data files:

$ cat file[123].dat             # 3 files, separated by empty lines for clarity
x_1 y1_1
x_2 y1_2
x_3 y1_3

x_1 y2_1
x_2 y2_2
x_4 y2_4

x_1 u_1 v_1
x_2 u_2 v_2
x_5 u_5 v_5

And the code:

$ cat program.awk
BEGIN { OFS=", "; PROCINFO["sorted_in"]="@ind_str_asc" }  # gawk: visit keys in sorted order
FNR==1 { f++ }                                # counter of files
{
    a[0][$1]=$1                               # reset the key for every record 
    for(i=2;i<=NF;i++)                        # for each non-key element
        a[f][$1]=a[f][$1] $i ( i==NF?"":OFS ) # combine them to array element
}
END {                                         # in the end
    for(i in a[0])                            # go thru every key
        for(j=0;j<=f;j++)                     # and all related array elements
            printf "%s%s", a[j][i], (j==f?ORS:OFS)
}                                             # output them, nonexistent will output empty

Usage and output:

$ awk -f program.awk \
file1.dat \
file2.dat \
<(grep -h . file[123].dat|cut -d\  -f 1|sort|uniq) \
file3.dat 
x_1, y1_1, y2_1, , u_1, v_1
x_2, y1_2, y2_2, , u_2, v_2
x_3, y1_3, , , 
x_4, , y2_4, , 
x_5, , , , u_5, v_5

The empty column after file2.dat is produced by gathering all the keys and feeding them in as one more "file" (using process substitution, <(...)). That input has a key on every line but no value fields, so it contributes one empty field to every row, which keeps the program itself generic:

$ grep -h . file[123].dat|cut -d\  -f 1|sort|uniq
x_1
x_2
x_3
x_4
x_5
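
One thing to watch in the output above: a key that is missing from file3.dat gets a single empty field where two (u and v) are expected, so those rows come out one field short. Below is a minimal sketch of one way to pad per file instead, written for POSIX awk so it does not need arrays of arrays. It is not part of the original answer; the names seen, keys, width and val, and the gap variable (the number of the file after which the extra empty column goes), are my own assumptions.

```shell
#!/bin/sh
# Sketch: full outer join on the first field, padding each file's missing
# values with as many empty fields as that file has value columns.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Sample data copied from the answer above.
printf '%s\n' 'x_1 y1_1' 'x_2 y1_2' 'x_3 y1_3'             > file1.dat
printf '%s\n' 'x_1 y2_1' 'x_2 y2_2' 'x_4 y2_4'             > file2.dat
printf '%s\n' 'x_1 u_1 v_1' 'x_2 u_2 v_2' 'x_5 u_5 v_5'    > file3.dat

merged=$(awk -v OFS=', ' -v gap=2 '
    FNR == 1 { f++ }                                  # f = index of the current input file
    {
        if (!($1 in seen)) {                          # remember keys in first-seen order
            seen[$1] = 1
            keys[++nkeys] = $1
        }
        if (NF - 1 > width[f]) width[f] = NF - 1      # number of value columns in file f
        for (i = 2; i <= NF; i++)
            val[f, $1, i - 1] = $i                    # value column i-1 of file f for this key
    }
    END {
        for (k = 1; k <= nkeys; k++) {
            line = keys[k]
            for (j = 1; j <= f; j++) {
                for (c = 1; c <= width[j]; c++)
                    line = line OFS val[j, keys[k], c]    # unset entries print as empty
                if (j == gap) line = line OFS             # extra empty column after file number gap
            }
            print line
        }
    }
' file1.dat file2.dat file3.dat)

printf '%s\n' "$merged"
```

With this sample data every row should come out with six fields, including the rows for x_3 and x_4. Rows appear in the order the keys were first seen, so pipe the result through sort if you need them ordered.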

Answer 2:

Just use a process substitution?

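# assumes every file lists the same x values on the same lines;
# tr turns the remaining spaces into commas so all fields share one delimiter
paste -d, \
  <(tr ' ' ',' < file1.dat) \
  <(cut -d' ' -f2 file2.dat) \
  /dev/null \
  <(cut -d' ' -f2,3 file3.dat | tr ' ' ',') \
  > myDat.csv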
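
To see why /dev/null yields the empty column, here is a small sketch on made-up, row-aligned sample data. Temporary files (cols1.tmp and so on, names of my choosing) stand in for the process substitutions so that it also runs under a plain POSIX sh, and tr turns the spaces left inside a file's own columns into commas.

```shell
#!/bin/sh
# Sketch: paste by position, with /dev/null supplying an empty column.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Row-aligned sample data: the same x value sits on the same line in every file.
printf '%s\n' 'x_1 y1_1' 'x_2 y1_2'         > file1.dat
printf '%s\n' 'x_1 y2_1' 'x_2 y2_2'         > file2.dat
printf '%s\n' 'x_1 u_1 v_1' 'x_2 u_2 v_2'   > file3.dat

# Temporary files stand in for the process substitutions.
tr ' ' ',' < file1.dat                       > cols1.tmp   # x and y1
cut -d' ' -f2 file2.dat                      > cols2.tmp   # y2 only
cut -d' ' -f2,3 file3.dat | tr ' ' ','       > cols3.tmp   # u and v

# /dev/null has no lines, so paste emits an empty field for it on every row.
paste -d, cols1.tmp cols2.tmp /dev/null cols3.tmp > myDat.csv
cat myDat.csv
```

This should print x_1,y1_1,y2_1,,u_1,v_1 and x_2,y1_2,y2_2,,u_2,v_2. Like any paste-based approach it matches rows by position, so it only fits files whose x values line up.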

Answer 3:

You can use paste to combine all the files, and then use awk to only print the columns you want (including an empty column):

paste file1.dat file2.dat file3.dat | awk -v OFS=', ' '{print $1,$2,$4,"",$6,$7}'

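Notice that columns $3 and $5 are left out of the awk command because they repeat column $1 (i.e. they are all <xvalue>). This only holds if every file lists the same x values on the same lines, since paste matches rows by position.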
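
If the x values do not line up and dropping the rows that are not present in every file is acceptable (as the question's edit suggests), join can match on the key instead of on the line number. This is a different technique from the paste pipeline above, sketched here on the sample data from Answer 1; the output file name joined.csv is just a placeholder.

```shell
#!/bin/sh
# Sketch: inner join on the first field, keeping only x values found in all files.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

printf '%s\n' 'x_1 y1_1' 'x_2 y1_2' 'x_3 y1_3'             > file1.dat
printf '%s\n' 'x_1 y2_1' 'x_2 y2_2' 'x_4 y2_4'             > file2.dat
printf '%s\n' 'x_1 u_1 v_1' 'x_2 u_2 v_2' 'x_5 u_5 v_5'    > file3.dat

# join needs its inputs sorted on the key field; "-" reads the left side from stdin.
# After both joins each line is: x y1 y2 u v, so awk only has to insert the empty column.
join file1.dat file2.dat | join - file3.dat |
    awk -v OFS=', ' '{ print $1, $2, $3, "", $4, $5 }' > joined.csv

cat joined.csv
```

Only x_1 and x_2 occur in all three files, so the result should be the two rows x_1, y1_1, y2_1, , u_1, v_1 and x_2, y1_2, y2_2, , u_2, v_2. If a file is not already sorted on its first column, sort it first.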

Source: https://stackoverflow.com/questions/40373180/bash-combining-files-into-csvs

Tags

bash

csv

text

data-manipulation