Perl. Using until function

问题

I have a simple data file. Each line in the file has four elements. Some lines are filled with no blank entries. Other lines have a first entry and the remaining three are blank, or rather "filled" with a space. It is a tab delimited file.

Example of the input file:

    .
    .
    .
    30  13387412    34.80391242 sSN_FIRST
    30  13387412    34.80391242 sSN5_40
    30.1             
    30.2             
    30.3             
    30.4             
    31  14740248    65.60590089 s32138223_44
    31  14740248    65.60590089 s321382_LAST
    .
    .
    .

To reiterate, the "blanks" in my file actually contain a single space, if this matters.

My overall goal is to "fill in" the second and third column (the fourth column is ignored) throughout the file. In order to do that, I need my script to identify sets of consecutive lines that are blank, plus the line immediately preceding and the line immediately succeeding the set of consecutive blank lines. In the example above, this would be lines 2 - 7. Once I can do that, I can use the information in the flanking lines, which are filled, to help "fill in" the missing entries in the lines in between.

I have been experimenting with the until function, but I'm not succeeding in coupling it with a loop that reads the data line for line. For example, I can read the lines and find the blank lines:

open( my $FILE, "<$mapfile" );
my @file = <$FILE>;
close $FILE;

for ( my $i = 1 ; $i < scalar @file ; $i++ ) 
    {
     my @entries = split( '\t', $file[ $i ] );
     if ( $entries[ 1 ] =~ m/ / ) 
        {
         print $file[ $i ]."\n";
        }
    }

But I am trying to employ the until function, so as to read lines and search for the consecutive set of lines I am looking for ("blank" lines plus the two flanking "full" lines). For example:

until ( $file[ a line ] =~ m/ / && $file[ another line ] =~ m/ / )   
    {
     my linear interpolation here;
    }

Can anyone give me a hint about how to couple a way to read the array and compare lines to find the sets I need across the file?

回答1:

Perhaps the following will be helpful:

use strict;
use warnings;

my ( $last, $oneColumn );

my @file = <DATA>;

for my $line (@file) {
    my @entires = split ' ', $line;

    if ( @entires == 4 ) {
        if ($oneColumn) {
            print $line;    # Succeeding line
            $oneColumn = 0;
        }
        $last = $line;
        next;
    }

    print $last if $last;    # Preceeding line
    undef $last;
    print $line;             # One-column line
    $oneColumn = 1;

}

__DATA__
30  13387412    34.80391242 sSN_FIRST
30  13387412    34.80391242 sSN5_40
30.1             
30.2             
30.3             
30.4             
31  14740248    65.60590089 s32138223_44
31  14740248    65.60590089 s321382_LAST

Output:

30  13387412    34.80391242 sSN5_40
30.1
30.2
30.3
30.4
31  14740248    65.60590089 s32138223_44

A 'full', line should have four elements in @entries, and that's what if ( @entires == 4 ) looks for. If found, it'll print it as the succeeding line only if one-column lines have been printed. Then, it saves the line. Lines are printed outside the if only when the line doesn't have three tabs.

The following, shorter script produces the same output:

use strict;
use warnings;

my @file = <DATA>;

for ( my $i = 1 ; $i < $#file ; $i++ ) {

    if ( $file[$i] =~ /(?:\t\s){3}/ ) {
        print $file[ $i - 1 ];    # Preceeding line

        while ( $file[$i] =~ /(?:\t\s){3}/ and $i < $#file ) {
            print $file[ $i++ ]    # One-column line
        }

        print $file[$i];           # Succeeding line
    }
}

__DATA__
30  13387412    34.80391242 sSN_FIRST
30  13387412    34.80391242 sSN5_40
30.1             
30.2             
30.3             
30.4             
31  14740248    65.60590089 s32138223_44
31  14740248    65.60590089 s321382_LAST

The /(?:\t\s){3}/ matches three consecutive sets of tab and space, which would only be found on a line with just one column. When it finds that pattern, it prints the previous line, then enters a while loop that print the one-column lines until a full line is found or it at the end of the array. Finally, the succeeding line is printed.

回答2:

What you want to implement is a caching algorithm: something that remembers (caches) previous values, and uses them if nothing new appears. You don't even need a regex for this. :)

In addition to caching the old values, you need to cache the lines inbetween. Since you only needed the labels, you only need to hold on to those. Then, when you reach the next full line, you can do your interpolation and emit the results.

Here's how I'd do it. It's a bit more complex than my original example, but the same principle applies: just store the intermediate lines, then emit the results when you reach your terminal.

use strict;
use warnings;
use feature 'say';


# Get start conditions, and cache those numbers.

sub read_block
{
   my $line = <DATA>;
   return 1 unless defined $line; # we're done if nothing more to read

   # Process and store data from the first line in the block.
   chomp $line;
   my ($last_label, $last_num1, $last_num2, $last_label2) = split /\t/, $line;

   # Keep reading lines until we find the end of the block.
   my @label_cache;
   my $found_last = 0;
   my ($label1, $num1, $num2, $label2);
   while (!$found_last)
   {
      $line = <DATA>;
      chomp $line;
      ($label1, $num1, $num2, $label2) = split /\t/, $line;
      if (defined $num1 && defined $num2)
      {
         $found_last = 1; # We have final numbers!  We can interpolate now.
      }
      else
      {
         push @label_cache, $label1; 
      }
   }

   # Begin display.  Show the first line of the block.
   say "$last_label\t$last_num1\t$last_num2\t$last_label2";

   # Calculate the slope for interpolation: (last - first) / difference
   my $slope1 = ($num1 - $last_num1) / (@label_cache + 1);
   my $slope2 = ($num2 - $last_num2) / (@label_cache + 1);
   my $distance = 0;

   # Display each label and the lines inside.
   foreach my $label (@label_cache)
   {
      ++$distance;
      say $label, "\t",
          $slope1 * $distance + $last_num1, "\t",
          $slope2 * $distance + $last_num2;
   }

   # Display the final line in the block.
   say "$label1\t$num1\t$num2\t$label2";

   # Not done yet, so return a 'false' value.
   return 0;
}

# Main part of the script

my $done = 0;
while (! $done)
{
   $done = read_block();
}


__DATA__
a   3   4   end
e
f
g
h
i
k   15  26  start
k   15  26  end
o
p
q
r
s   3   5   start
s   3   5   end
v
w
x
y   14  16  start

emits:

a       3       4       end
e       5       7.66666666666667
f       7       11.3333333333333
g       9       15
h       11      18.6666666666667
i       13      22.3333333333333
k       15      26      start
k       15      26      end
o       12.6    21.8
p       10.2    17.6
q       7.8     13.4
r       5.4     9.2
s       3       5       start
s       3       5       end
v       5.75    7.75
w       8.5     10.5
x       11.25   13.25
y       14      16      start

You could then, of course, do whatever kind of number rounding or formatting that you needed. :)

来源：https://stackoverflow.com/questions/14245511/perl-using-until-function

标签

arrays

perl

loops

compare

conditional-statements