Suppose that I have two files, en.csv
and sp.csv
, each containing exactly two comma-separated records:
en.csv
:
1,d
Here is a solution that might or might not work for your data. It approaches the problem by aligning the records within a csv file by line number, i.e. record 2
ends up on line 2
, record 3123
on line number 3123
and so on. Missing records/lines are padded with MISSING
fields, so the input files would be mangled to look like this:
en.csv
:
1,dog,red,car
2,MISSING,MISSING,MISSING
3,cat,white,boat
de.csv
:
1,Hund,Rot,Auto
2,Kaninchen,Grau,Zug
3,MISSING,MISSING,MISSING
sp.csv
:
1,MISSING,MISSING,MISSING
2,conejo,gris,tren
3,gato,blanco,bote
From there it is easy to cut out the columns of interest and just print them side-by-side using paste
.
To achieve this, we sort the input files first and then apply some stupid awk
magic:
join -o auto
does) MISSING
fields until the alignment is correct againMISSING
fields are printed until the maximum is hit.reccut.sh
:
#!/bin/bash
get_max_recnum()
{
awk -F, '{ if ($1 > max) { max = $1 } } END { print max }' "$@"
}
align_by_recnum()
{
sort -t, -k1 "$1" \
| awk -F, -v MAXREC="$2" '
NR==1 { for(x = 1; x < NF; x++) missing = missing ",MISSING" }
{
i = NR
if (NR < $1)
{
while (i < $1)
{
print i++ missing
}
NR+=i
}
}1
END { for(i++; i <= MAXREC; i++) { print i missing } }
'
}
_reccut()
{
local infiles=()
local args=( $@ )
for arg; do
infiles+=( "$2" )
shift 2
done
MAXREC="$(get_max_recnum "${infiles[@]}")" __reccut "${args[@]}"
}
__reccut()
{
local cols="$1"
local infile="$2"
shift 2
if (( $# > 0 )); then
paste -d, \
<(align_by_recnum "${infile}" "${MAXREC}" | cut -d, -f ${cols}) \
<(__reccut "$@")
else
align_by_recnum "${infile}" "${MAXREC}" | cut -d, -f ${cols}
fi
}
_reccut "$@"
$ ./reccut.sh 3 en.csv 2,4 sp.csv 3 de.csv
red,MISSING,MISSING,Rot
MISSING,conejo,tren,Grau
white,gato,bote,MISSING