Size is the difficult part: to merge the files you may need to read the whole lot into memory.
However, for a general solution to the problem in Perl:
#!/usr/bin/env perl
use strict;
use warnings;
use Text::CSV;

# Pass 1: read only the header row of each file, counting how many files
# each column appears in and recording the order columns are first seen.
my %count_of;
my @field_order;
foreach my $file (@ARGV) {
    my $csv = Text::CSV->new( { binary => 1 } );
    open( my $input, "<", $file ) or die "$file: $!";
    my $header_row = $csv->getline($input);
    foreach my $header (@$header_row) {
        if ( not $count_of{$header} ) {
            push( @field_order, $header );
        }
        $count_of{$header}++;
    }
    close($input);
}

# Columns present in every file. Filtering @field_order (rather than
# keys %count_of) keeps the choice of key field the same on every run.
# Diagnostics go to STDERR so they stay out of the CSV output.
my @common_headers = grep { $count_of{$_} >= @ARGV } @field_order;
print STDERR "Common headers:\n", join( "\n", @common_headers ), "\n";

# Pass 2: load every row into memory, keyed on the first common column,
# or on the line number if the files share no column.
my %lookup_row;
my $key_field;
if (@common_headers) { $key_field = shift @common_headers }
foreach my $file (@ARGV) {
    my $csv = Text::CSV->new( { binary => 1 } );
    open( my $input, "<", $file ) or die "$file: $!";
    my @headers = @{ $csv->getline($input) };
    $csv->column_names(@headers);
    while ( my $row_hr = $csv->getline_hr($input) ) {
        my $key = $.;
        if ( defined $key_field ) {
            $key = $row_hr->{$key_field};
        }
        $lookup_row{$key}{$file} = $row_hr;
    }
    close($input);
}

# Output: one combined row per key, columns in first-seen order.
my $csv_out = Text::CSV->new( { binary => 1 } );
$csv_out->print( \*STDOUT, \@field_order );
print "\n";
foreach my $key ( sort keys %lookup_row ) {
    my %combined_row;
    foreach my $file ( sort keys %{ $lookup_row{$key} } ) {
        foreach my $header (@field_order) {
            my $value = $lookup_row{$key}{$file}{$header};

            # Skip missing or empty cells, but keep a literal "0".
            next unless defined $value and length $value;
            if ( not defined $combined_row{$header}
                or $combined_row{$header} ne $value )
            {
                $combined_row{$header} .= $value;
            }
        }
    }
    my @row = @combined_row{@field_order};
    $csv_out->print( \*STDOUT, \@row );
    print "\n";
}
Note that the script writes its output to STDOUT, which is probably not what you want for large files. Text::CSV can print to any file handle, so you can alter the script to open an output file and write to that instead (or, y'know, just redirect with > output.csv).
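As a minimal sketch of writing to a file handle rather than STDOUT: the helper name write_rows, the file name output.csv and the sample rows are all made up for illustration. Setting eol in the constructor makes print() terminate each row, so no separate print "\n" is needed.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Text::CSV;

# Write CSV rows to any file handle (hypothetical helper, not part of
# the script above). eol => "\n" makes print() end each row itself.
sub write_rows {
    my ( $fh, @rows ) = @_;
    my $csv = Text::CSV->new( { binary => 1, eol => "\n" } );
    foreach my $row (@rows) {
        $csv->print( $fh, $row ) or die "CSV write failed";
    }
    return;
}

# Hypothetical output file and rows, for illustration only.
my $out_file = "output.csv";
open( my $out, ">", $out_file ) or die "$out_file: $!";
write_rows( $out, [ "id", "name" ], [ 1, "alice" ] );
close($out) or die "$out_file: $!";
```

In the script above, the equivalent change would be to open such a handle once and pass it to both $csv_out->print calls in place of \*STDOUT.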
You can also configure the delimiter for Text::CSV via sep_char:
my $csv = Text::CSV->new( { binary => 1, sep_char => "\t" } );
I was unclear what your separator was, so have assumed comma (as you refer to csv).
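To illustrate the effect of sep_char, here is a small sketch that parses a single tab-separated line. The helper name parse_tsv_line and the sample data are hypothetical. With a tab as the separator, a comma inside a field is just ordinary data.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Text::CSV;

# Parse one line of tab-separated text into a list of fields
# (hypothetical helper, for illustration only).
sub parse_tsv_line {
    my ($line) = @_;
    my $csv = Text::CSV->new( { binary => 1, sep_char => "\t" } );
    $csv->parse($line) or die "Cannot parse line: " . $csv->error_diag;
    return $csv->fields;
}

# Made-up sample line; the comma in the last field is not a separator.
my @fields = parse_tsv_line("id\tname\tSmith, John");
print join( " | ", @fields ), "\n";    # prints: id | name | Smith, John
```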
The script above will pick out a common field and merge on that, or on line number if none exists.
Note:
This script reads the files into memory and merges them there, joining on a common key and sorting the output by that key (a plain string sort). It's therefore memory greedy, but should 'just work' in a lot of cases. Just specify the filenames:
splice.pl file1.csv file2.csv file3.csv
If there is a common field in these files, it'll join on that and output in key order. If there isn't, it'll use the line number.
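The per-key merge can be sketched on plain hashes, without any CSV parsing. This is only an illustration of roughly the rule the script applies to each column. The helper name merge_rows and the sample rows are made up, and the sketch treats only undefined or empty values as missing.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Combine rows that share a key: a non-empty value is appended to the
# column unless it equals what is already there (hypothetical helper).
sub merge_rows {
    my ( $fields, @rows ) = @_;
    my %combined;
    foreach my $row (@rows) {
        foreach my $field (@$fields) {
            my $value = $row->{$field};
            next unless defined $value and length $value;
            if ( not defined $combined{$field}
                or $combined{$field} ne $value )
            {
                $combined{$field} .= $value;
            }
        }
    }
    return \%combined;
}

# Made-up rows from two files that share the key field "id".
my @fields = qw(id name email);
my $merged = merge_rows(
    \@fields,
    { id => 7, name  => "alice" },
    { id => 7, email => 'alice@example.com' },
);
print join( ",", map { $_ // "" } @{$merged}{@fields} ), "\n";
# prints: 7,alice,alice@example.com
```

Note that when two files disagree on a value for the same key and column, the values are concatenated with no separator, so conflicting data shows up as a run-together cell rather than being dropped.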