Removing lines in one file that are present in another file

北战南征 提交于 2019-12-11 09:05:01

问题


I have 2 .vcf files with genomic data and I want to remove lines in the 1st file that are also present in the second file. The code I have so far it seems to iterate only one time, removing the first hit and then stops the search. Any help would be very appreciated since I can not figure out where the problem is. Sorry for any mis-code...

I opted for arrays of arrays instead of hashes because the initial order of the file must be maintained, and I think that this is better achieved with arrays...

Code:

#!/usr/bin/perl

use strict;
use warnings;

## bring files to program

MAIN: {

my ($chrC, $posC, $junkC, $baserefC, $allrestC);

my (@ref_arrayH, @ref_arrayC);

my ($chrH, $posH, $baserefH);
my $entriesH;

# open the het file first
open (HET, $het) or die "Could not open file $het - $!";
while (<HET>) {
    if (defined) {
        chomp;
        if (/^#/) { next; }
        if ( /(^.+?)\s(\d+?)\s(.+?)\s([A-Za-z\.]+?)\s([A-Za-z\.]+?)\s(.+?)\s(.+?)\s(.+)/m ) {   # is a VCF file
            my @line_arrayH = split(/\t/, $_);              
            push (@ref_arrayH, \@line_arrayH);      # the "reference" of each line_array is now in each element of @ref_array
            $entriesH = scalar @ref_arrayH;             # count the number of entries in the het file
        }
    }
}
close HET;

#   print $entriesH,"\n";

open (CNS, $cns) or die "Could not open file $cns - $!";

foreach my $refH ( @ref_arrayH ) {
    $chrH = $refH -> [0];
    $posH = $refH -> [1];
    $baserefH = $refH -> [3];

    foreach my $line (<CNS>) {
        chomp $line;
        if ($line =~ /^#/) { next; }
        if ($line =~ /(^.+?)\s(\d+?)\s(.+?)\s([A-Za-z\.]+?)\s([A-Za-z\.]+?)\s(.+?)\s(.+?)\s(.+)/m ) {   # is a VCF file
            ($chrC, $posC, $junkC, $baserefC, $allrestC) = split(/\t/,$line);
                if ( $chrC eq $chrH and $posC == $posH and $baserefC eq $baserefH ) { next }
                else { print "$line\n"; }
        }
    }
}
 #  close CNS;

 }

CNS file:

chrI    1084    .   A   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0
chrI    1085    .   C   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0
chrI    1086    .   A   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0
chrI    1087    .   C   T   3.55    .   DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30   GT:PL:GQ    0/1:31,3,0:4
chrI    1088    .   T   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0
chrI    1089    .   A   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0
chrI    1090    .   C   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0
chrI    1091    .   T   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0
chrI    1099    .   A   .   32.8    .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30.2 PL  0
chrI    1100    .   G   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0
chrI    1101    .   G   .   12.3    .   DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30.1 PL  18
chrI    1102    .   G   .   32.9    .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30.1 PL  0
chrI    1103    .   A   .   5.45    .   DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30   PL  26
chrI    1104    .   C   T   3.55    .   DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30   GT:PL:GQ    0/1:31,3,0:4
chrI    1105    .   T   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0

HET file:

chrI    1087    .   C   T   3.55    .   DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30   GT:PL:GQ    0/1:31,3,0:4
chrI    1104    .   C   T   3.55    .   DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30   GT:PL:GQ    0/1:31,3,0:4

I would like the output to be like this

chrI    1084    .   A   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0
chrI    1085    .   C   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0
chrI    1086    .   A   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0
chrI    1088    .   T   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0
chrI    1089    .   A   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0
chrI    1090    .   C   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0
chrI    1091    .   T   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0
chrI    1099    .   A   .   32.8    .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30.2 PL  0
chrI    1100    .   G   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0
chrI    1101    .   G   .   12.3    .   DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30.1 PL  18
chrI    1102    .   G   .   32.9    .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30.1 PL  0
chrI    1103    .   A   .   5.45    .   DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30   PL  26
chrI    1105    .   T   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0

but is giving me this instead:

chrI    1084    .   A   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0
chrI    1085    .   C   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0
chrI    1086    .   A   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0
chrI    1088    .   T   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0
chrI    1089    .   A   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0
chrI    1090    .   C   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0
chrI    1091    .   T   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0
chrI    1099    .   A   .   32.8    .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30.2 PL  0
chrI    1100    .   G   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0
chrI    1101    .   G   .   12.3    .   DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30.1 PL  18
chrI    1102    .   G   .   32.9    .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30.1 PL  0
chrI    1103    .   A   .   5.45    .   DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30   PL  26
chrI    1104    .   C   T   3.55    .   DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30   GT:PL:GQ    0/1:31,3,0:4
chrI    1105    .   T   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0

Why is this nested loop not working properly? If I want to keep this structure of array-of-arrays, why is only doing the iteration the first time?

Would it be better to change the foreach loop

foreach my $refH ( @ref_arrayH ) {

with a for loop

for (my $i = 0; $i <= $entriesH; $i++) {

?


回答1:


#!/usr/bin/env perl

use strict;
use warnings;

my %seen;

open my $het, '<', 't.het' or die $!;
$seen{ $_ } = undef while <$het>;
close $het or die $!;

open my $cns, '<', 't.cns' or die $!;

while (my $line = <$cns>) {
    next if exists $seen{ $line };
    print $line;
}

close $cns or die $!;

If you want to match something other than entire lines, extract the field(s) and use it (or their combination) to key into the %seen hash.

This will use memory in proportion to the number of lines in the HET file.

To reduce memory usage, you can tie %seen to a DBM_File on disk.




回答2:


If you are concerned about memory usage you can read both file one line at a time while doing the comparison. The following is an alternative approach.

Note: Because of the way filehandle works we have to reset connection every time we are to read from the file in the second nested loop.

#!/usr/bin/env perl

use strict;
use warnings;

open my $cns, '<', 't.cns' or die $!;

CNS: 
while (my $line = <$cns>) { #read one line at a time from t.cns file.
    open my $het, '<', 't.het' or die $!;
    while (my $reference = <$het>){
            if ($line eq $reference) { #test if current t.cns line is equal to any line in t.hex file. 
                close $het; #reset conection to t.hex file. 
                next CNS; # skip current t.cns line.            
    }
}   
    print $line;  
    close $het; #reset conection to t.hex file.
}
close $cns or die $!;


来源:https://stackoverflow.com/questions/18239225/removing-lines-in-one-file-that-are-present-in-another-file

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!