Match common IDs between two huge csv files [closed]

前端未结

关注

 3  1859

长发绾君心

相关标签:

3条回答

[愿得一人]

2021-01-29 15:41
Can do this with standard join utility

file1.txt
```
1 a
2 b
3 c
```
file2.txt
```
1 002
2 x90
5 d07
```
join example
```
join -1 1 -2 1 -o 1.1,1.2,2.2 file1.txt file2.txt
```
here join is joining from file1.field1 to file2.field2 and outputting the fields specified with the -o flag

output
```
1 a 002
2 b x90
```
0 讨论(0)
发布评论:

提交评论
- 加载中...
借酒劲吻你

2021-01-29 15:41
Load the information in memory with awk and then print the line if the id matches:
```
$ awk 'FNR==NR {a[$1]=$2; next} ($1 in a) {print $1, $2, a[$1]}' f2 f1
1 a 002
2 b x90
```
Explanation

The basic idea is to do things when reading file1 and other_things when reading file2:
```
awk 'FNR==NR {things; next} {other_things}' file1 file2
```
In our case, things is to store the content of file2 in memory, mapping every id to its value.

Then, it goes through file1 and prints the content of the line, together with the mapped value, if there is a common id.
0 讨论(0)
发布评论:

提交评论
- 加载中...

难免孤独

2021-01-29 15:49

Size is the difficult part, as to merge files you may need to read in the whole lot.

However for a general solution the the problem in perl:

#!/usr/bin/env perl
use strict;
use warnings;

use Text::CSV;

my %count_of;
my @field_order;

foreach my $file (@ARGV) {
    my $csv = Text::CSV->new( { binary => 1 } );
    open( my $input, "<", $file ) or die $!;
    my $header_row = $csv->getline($input);
    foreach my $header (@$header_row) {
        if ( not $count_of{$header} ) {
            push( @field_order, $header );
        }
        $count_of{$header}++;
    }
}

print "Common headers:\n";
my @common_headers = grep { $count_of{$_} >= @ARGV } keys %count_of;
print join( "\n", @common_headers );

my %lookup_row;
my $key_field;
if (@common_headers) { $key_field = shift @common_headers };

foreach my $file (@ARGV) {
    my $csv = Text::CSV->new( { binary => 1 } );
    open( my $input, "<", $file ) or die $!;
    my @headers = @{ $csv->getline($input) };
    $csv->column_names(@headers);
    while ( my $row_hr = $csv->getline_hr($input) ) {
        my $key = $.;
        if ($key_field) {
            $key = $row_hr->{$key_field};
        }
        $lookup_row{$key}{$file} = $row_hr;
    }
    close($input);
}

my $csv_out = Text::CSV->new( { binary => 1 } );
my $header_row = \@field_order;
$csv_out->print( \*STDOUT, $header_row );
print "\n";

foreach my $key ( sort keys %lookup_row ) {
    my %combined_row;
    foreach my $file ( sort keys %{ $lookup_row{$key} } ) {
        foreach my $header (@field_order) {
            if ( $lookup_row{$key}{$file}{$header} ) {
                if (   not defined $combined_row{$header}
                    or not $combined_row{$header} eq
                    $lookup_row{$key}{$file}{$header} )
                {
                    $combined_row{$header}
                        .= $lookup_row{$key}{$file}{$header};
                }
            }
        }
    }
    my @row = @combined_row{@field_order};
    $csv_out->print( \*STDOUT, \@row );
    print "\n";
}

Note that Text::CSV can be altered to redirect output to a file handle rather than STDOUT which is probably not what you want for large files (or y'know, just > output.csv. )

You can also configure the delimiter for Text::CSV via sep_char:

my $csv = Text::CSV -> new ( { binary => 1, sep_char => "\t" } );

I was unclear what your separator was, so have assumed comma (as you refer to csv).

Script above will pick out a common field and merge on that, or line number if none exists.

Note:

This script reads files into memory and merges them there, sorting and joining on a common key. It will sort based on this for output. It's therefore memory greedy, but should 'just work' in a lot of cases. Just specify the filenames splice.pl file1.csv file2.csv file3.csv

If there is a common field in these files, it'll join on those and output in order. If there isn't, it'll use line number.

0 讨论(0)

热议问题