问题
I have 2 .vcf files with genomic data and I want to remove lines in the 1st file that are also present in the second file. The code I have so far it seems to iterate only one time, removing the first hit and then stops the search. Any help would be very appreciated since I can not figure out where the problem is. Sorry for any mis-code...
I opted for arrays of arrays instead of hashes because the initial order of the file must be maintained, and I think that this is better achieved with arrays...
Code:
#!/usr/bin/perl
use strict;
use warnings;
## bring files to program
MAIN: {
my ($chrC, $posC, $junkC, $baserefC, $allrestC);
my (@ref_arrayH, @ref_arrayC);
my ($chrH, $posH, $baserefH);
my $entriesH;
# open the het file first
open (HET, $het) or die "Could not open file $het - $!";
while (<HET>) {
if (defined) {
chomp;
if (/^#/) { next; }
if ( /(^.+?)\s(\d+?)\s(.+?)\s([A-Za-z\.]+?)\s([A-Za-z\.]+?)\s(.+?)\s(.+?)\s(.+)/m ) { # is a VCF file
my @line_arrayH = split(/\t/, $_);
push (@ref_arrayH, \@line_arrayH); # the "reference" of each line_array is now in each element of @ref_array
$entriesH = scalar @ref_arrayH; # count the number of entries in the het file
}
}
}
close HET;
# print $entriesH,"\n";
open (CNS, $cns) or die "Could not open file $cns - $!";
foreach my $refH ( @ref_arrayH ) {
$chrH = $refH -> [0];
$posH = $refH -> [1];
$baserefH = $refH -> [3];
foreach my $line (<CNS>) {
chomp $line;
if ($line =~ /^#/) { next; }
if ($line =~ /(^.+?)\s(\d+?)\s(.+?)\s([A-Za-z\.]+?)\s([A-Za-z\.]+?)\s(.+?)\s(.+?)\s(.+)/m ) { # is a VCF file
($chrC, $posC, $junkC, $baserefC, $allrestC) = split(/\t/,$line);
if ( $chrC eq $chrH and $posC == $posH and $baserefC eq $baserefH ) { next }
else { print "$line\n"; }
}
}
}
# close CNS;
}
CNS file:
chrI 1084 . A . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1085 . C . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1086 . A . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1087 . C T 3.55 . DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30 GT:PL:GQ 0/1:31,3,0:4
chrI 1088 . T . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1089 . A . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1090 . C . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1091 . T . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1099 . A . 32.8 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30.2 PL 0
chrI 1100 . G . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1101 . G . 12.3 . DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30.1 PL 18
chrI 1102 . G . 32.9 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30.1 PL 0
chrI 1103 . A . 5.45 . DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30 PL 26
chrI 1104 . C T 3.55 . DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30 GT:PL:GQ 0/1:31,3,0:4
chrI 1105 . T . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
HET file:
chrI 1087 . C T 3.55 . DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30 GT:PL:GQ 0/1:31,3,0:4
chrI 1104 . C T 3.55 . DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30 GT:PL:GQ 0/1:31,3,0:4
I would like the output to be like this
chrI 1084 . A . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1085 . C . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1086 . A . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1088 . T . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1089 . A . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1090 . C . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1091 . T . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1099 . A . 32.8 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30.2 PL 0
chrI 1100 . G . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1101 . G . 12.3 . DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30.1 PL 18
chrI 1102 . G . 32.9 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30.1 PL 0
chrI 1103 . A . 5.45 . DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30 PL 26
chrI 1105 . T . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
but is giving me this instead:
chrI 1084 . A . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1085 . C . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1086 . A . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1088 . T . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1089 . A . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1090 . C . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1091 . T . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1099 . A . 32.8 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30.2 PL 0
chrI 1100 . G . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1101 . G . 12.3 . DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30.1 PL 18
chrI 1102 . G . 32.9 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30.1 PL 0
chrI 1103 . A . 5.45 . DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30 PL 26
chrI 1104 . C T 3.55 . DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30 GT:PL:GQ 0/1:31,3,0:4
chrI 1105 . T . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
Why is this nested loop not working properly? If I want to keep this structure of array-of-arrays, why is only doing the iteration the first time?
Would it be better to change the foreach loop
foreach my $refH ( @ref_arrayH ) {
with a for loop
for (my $i = 0; $i <= $entriesH; $i++) {
?
回答1:
#!/usr/bin/env perl
use strict;
use warnings;
my %seen;
open my $het, '<', 't.het' or die $!;
$seen{ $_ } = undef while <$het>;
close $het or die $!;
open my $cns, '<', 't.cns' or die $!;
while (my $line = <$cns>) {
next if exists $seen{ $line };
print $line;
}
close $cns or die $!;
If you want to match something other than entire lines, extract the field(s) and use it (or their combination) to key into the %seen
hash.
This will use memory in proportion to the number of lines in the HET
file.
To reduce memory usage, you can tie %seen
to a DBM_File on disk.
回答2:
If you are concerned about memory usage you can read both file one line at a time while doing the comparison. The following is an alternative approach.
Note: Because of the way filehandle works we have to reset connection every time we are to read from the file in the second nested loop.
#!/usr/bin/env perl
use strict;
use warnings;
open my $cns, '<', 't.cns' or die $!;
CNS:
while (my $line = <$cns>) { #read one line at a time from t.cns file.
open my $het, '<', 't.het' or die $!;
while (my $reference = <$het>){
if ($line eq $reference) { #test if current t.cns line is equal to any line in t.hex file.
close $het; #reset conection to t.hex file.
next CNS; # skip current t.cns line.
}
}
print $line;
close $het; #reset conection to t.hex file.
}
close $cns or die $!;
来源:https://stackoverflow.com/questions/18239225/removing-lines-in-one-file-that-are-present-in-another-file