I\'d like to use grep on a text file with -f to match a long list (10,000) of patterns. Turns out that grep doesn\'t like this (who, knew?). After a day, it didn\'t produce
Here is a perl script "match_many.pl" which addresses a very common subset of the "large number of keys vs. large number of records" problem. Keys are accepted one per line from stdin. The two command line parameters are the name of the file to search and the field (white space delimited) which must match a key. This subset of the original problem can be solved quickly since the location of the match (if any) in the record is known ahead of time and the key always corresponds to an entire field in the record. In one typical case it searched 9400265 records with 42899 keys, matching 42401 of the keys and emitting 1831944 records in 41s. The more general case, where the key may appear as a substring in any part of a record, is a more difficult problem that this script does not address. (If keys never include white space and always correspond to an entire word the script could be modified to handle that case by iterating over all fields per record, instead of just testing the one, at the cost of running M times slower, where M is the average field number where the matches are found.)
#!/usr/bin/perl -w
use strict;
use warnings;
my $kcount;
my ($infile,$test_field) = @ARGV;
if(!defined($infile) || "$infile" eq "" || !defined($test_field) || ($test_field <= 0)){
die "syntax: match_many.pl infile field"
}
my %keys; # hash of keys
$test_field--; # external range (1,N) to internal range (0,N-1)
$kcount=0;
while() {
my $line = $_;
chomp($line);
$keys {$line} = 1;
$kcount++
}
print STDERR "keys read: $kcount\n";
my $records = 0;
my $emitted = 0;
open(INFILE, $infile ) or die "Could not open $infile";
while() {
if(substr($_,0,1) =~ /#/){ #skip comment lines
next;
}
my $line = $_;
chomp($line);
$line =~ s/^\s+//;
my @fields = split(/\s+/, $line);
if(exists($keys{$fields[$test_field]})){
print STDOUT "$line\n";
$emitted++;
$keys{$fields[$test_field]}++;
}
$records++;
}
$kcount=0;
while( my( $key, $value ) = each %keys ){
if($value > 1){
$kcount++;
}
}
close(INFILE);
print STDERR "records read: $records, emitted: $emitted; keys matched: $kcount\n";
exit;