Strategies to handle a file with multiple fixed formats

后端 未结 6 856
北海茫月
北海茫月 2021-01-12 22:40

This question is not Perl-specific, (although the unpack function will most probably figure into my implementation).

I have to deal with files where multipl

相关标签:
6条回答
  • 2021-01-12 22:48

    Toying with an answer to your question, I arrived at an interesting solution with a concise main loop:

    while (<>) {
      given($_) {
        when (@{[ map $pattern{$_}, @expect]}) {}
        default {
          die "$0: line $.: expected " . join("|" => @expect) . "; got\n$_";
        }
      }
    }
    

    As you'll see below, %pattern is a hash of named patterns for the different formats, and given/when against an array of Regex objects performs a short-circuiting search to find the first match.

    From this, you can infer that @expect is a list of names of formats we expect to find on the current line.

    For a while, I was stuck on the case of multiple possible expected formats and how to know format just matched, but then I remembered (?{ code }) in regular expressions:

    This zero-width assertion evaluates any embedded Perl code. It always succeeds, and its code is not interpolated.

    This allows something like a poor man's yacc grammar. For example, the pattern to match and process format 1 is

    fmt1 => qr/^ \*\* DEVICE \s+ (\S+) \s*$
                 (?{ $device->{attr1} = $1;
                     @expect = qw< fmt2 >;
                   })
              /x,
    

    After processing the input from your question, $device contains

    {
      'attr1' => '109523.69142',
      'attr2' => '.981',
      'attr3' => '561A',
      'groups' => [
        {
          'date' => '10/MAY/2010',
          'nnn' => [ '24.15.30', '13.45.03' ],
          'records' => [
            [ '05:03:01', 'AB23X', '15.67', '101325.72', '14', '31.30474',  '13', '0' ],
            [ '05:03:15', 'CR22X', '16.72', '101325.42', '14', '29.16264',  '11', '0' ],
            [ '06:23:51', 'AW41X', '15.67', '101323.9',  '14', '31.264932', '19', '0' ],
          ],
        },
        {
          'date' => '11/MAY/2010',
          'nnn' => [ '24.07.13', '13.44.63' ],
          'records' => [
            [ '15:57:14', 'AB23X', '15.67', '101327.23', '14', '31.30474', '13', '0' ],
            [ '15:59:59', 'CR22X', '16.72', '101331.88', '14', '29.16264', '11', '0' ],
          ],
        }
      ],
    }
    

    I'm amused with the result, but for some reason Larry's advice in perlstyle comes to mind:

    Just because you CAN do something a particular way doesn't mean that you SHOULD do it that way.


    For completeness, a working program demonstrating the result is below.

    #! /usr/bin/perl
    
    use warnings;
    use strict;
    use feature ':5.10';
    use re 'eval';
    
    *ARGV = *DATA;
    
    my $device;
    my $record;
    my @expect = qw/ fmt1 /;
    my %pattern;
    %pattern = (
      fmt1 => qr/^ \*\* DEVICE \s+ (\S+) \s*$
                   (?{ $device->{attr1} = $1;
                       @expect = qw< fmt2 >;
                     })
                /x,
    
      fmt2 => qr/^ \s* (\S+) \s+ (\S+) \s*$
                   (?{ @{$device}{qw< attr2 attr3 >} = ($1,$2);
                       @expect = qw< fmt3 >;
                     })
                /x,
    
      # e.g., 10/MAY/2010    24.15.30,13.45.03
      fmt3 => qr/^ (\d\d\/[A-Z]{3}\/\d{4}) \s+ (\S+) \s*$
                   (?{ my($date,$nnns) = ($1,$2);
                       push @{ $device->{groups} } =>
                         { nnn  => [ split m|,| => $nnns ],
                           date => $date };
                       @expect = qw< fmt4 >;
                     })
                /x,
    
      # e.g., 05:03:01   AB23X  15.67   101325.72
      fmt4 => qr/^ (\d\d:\d\d:\d\d) \s+
                   (\S+) \s+ (\S+) \s+ (\S+)
                   \s*$
                   (?{ push @{ $device->{groups}[-1]{records} } =>
                            [ $1, $2, $3, $4 ];
                       @expect = qw< fmt4 fmt5 >;
                     })
                /x,
    
      # e.g., *           14  31.30474 13        0
      fmt5 => qr/^\* \s+ (\d+) \s+
                  # tricky: possibly no whitespace after 9-char float
                  ((?=\d{1,7}\.\d+)[\d.]{1,9}) \s*
                  (\d+) \s+ (\d+)
                  \s*$
                  (?{ push @{ $device->{groups}[-1]{records}[-1] } =>
                            $1, $2, $3, $4;
                      @expect = qw< fmt4 fmt3 fmt2 >;
                    })
                /x,
    );
    
    while (<>) {
      given($_) {
        when (@{[ map $pattern{$_}, @expect]}) {}
        default {
          die "$0: line $.: expected " . join("|" => @expect) . "; got\n$_";
        }
      }
    }
    
    use Data::Dumper;
    $Data::Dumper::Terse = $Data::Dumper::Indent = 1;
    print Dumper $device;
    
    __DATA__
    **DEVICE 109523.69142
      .981    561A
    10/MAY/2010    24.15.30,13.45.03
    05:03:01   AB23X  15.67   101325.72
    *           14  31.30474 13        0
    05:03:15   CR22X  16.72   101325.42
    *           14  29.16264 11        0
    06:23:51   AW41X  15.67    101323.9
    *           14  31.26493219        0
    11/MAY/2010    24.07.13,13.44.63
    15:57:14   AB23X  15.67   101327.23
    *           14  31.30474 13        0
    15:59:59   CR22X  16.72   101331.88
    *           14  29.16264 11        0
    
    0 讨论(0)
  • 2021-01-12 22:57

    What I used to do in this case--if possible--is have a unique regex for each line. If format #2 follows 1 line of format #1, then you can apply regex #2 right after 1. But for the line following the first #2, you want to try either #2 or #3.

    You could also have an alternation which combines #2 and #3:

    my ( $cap2_1, $cap2_2, $cap3_1, $cap3_2 ) = $line =~ /$regex2|regex3/;
    

    If #4 immediate follows 3, you'll want to apply regex #4 after #3, and regex #5. After that, because it can be either #3 or #4, you might want to repeat either the multiple match or the alternation with #3/#4.

    while ( <> ) {
        given ( $state ) { 
             when ( 1 ) { my ( $device_num )  = m/$regex1/; $state++; }
             when ( 2 ) { my ( $cap1, $cap2 ) = m/$regex2/; $state++; }
             when ( 3 ) { 
                 my ( $cap1, $cap2, $date, $nums ) = m/$regex2|$regex3/;
                 $state += $cap1 ? 1 : 2;
             }
        }
    }
    

    That kind of gives you the gist of what you might want to do. Or see FSA::Rules for a state managing module.

    0 讨论(0)
  • 2021-01-12 23:08

    I would keep an additional state in one or more variables and update it per row. Then you e. g. know if the last line was level 1, or if the last row was format 4 (and you can expect format 5), thus giving more security to your processing.

    0 讨论(0)
  • 2021-01-12 23:10

    This sounds like the sort of thing a state machine is good at. One way to do a state machine in Perl is as an object, where each state is a method. The object gives you a place to store the structure you're building, and any intermediate state you need (like the filehandle you're reading from).

    my $state = 'expect_fmt1';
    while (defined $state) {
      $state = $object->$state();
    }
    ...
    sub expect_fmt1 {
      my $self = shift;
      # read format 1, parse it, store it in object
      return 'expect_fmt2';
    }
    

    Some thoughts on handling the cases where you have to look at the line before deciding what to do with it:

    If the file is small enough, you could slurp it into an arrayref in the object. That makes it easy for a state to examine a line without removing it.

    If the file is too big for easy slurping, you can have a method for reading the next line along with a cache in your object that allows you to put it back:

    my get_line {
      my $self = shift;
      my $cache = $self->{line_cache};
      return shift @$cache if @$cache;
      return $self->{filehandle}->getline;
    }
    my unget_line { my $self = shift; unshift @{ $self->{line_cache} }, @_ }
    

    Or, you could split the states that involve this decision into two states. The first state reads the line, stores it in $self->{current_line}, decides what format it is, and returns the state that parses & stores that format (which gets the line to parse from $self->{current_line}).

    0 讨论(0)
  • 2021-01-12 23:14

    Depending what you want to do with this, it might be a good place to actually write a formal grammar, using Parse::RecDescent, for instance. This will allow you to feed the entire file to the parser, and get a datastructure out of it.

    0 讨论(0)
  • 2021-01-12 23:15

    This is a good question. Two suggestions occur to me.

    (1) The first is simply to reiterate the idea from cjm: an object-based state machine. This is a flexible way to perform complex parsing. I've used its several times and have been happy with the results in most cases.

    (2) The second idea falls under the category of a divide-and-conquer Unix-pipeline to pre-process the data.

    First an observation about your data: if a set of formats always occurs as a pair, it effectively represent a single data format and can be combined without any loss of information. This means that you have only 3 formats: 1+2, 3, and 4+5.

    And that thought leads to the strategy. Write a very simple script or two to pre-process your data -- effectively, a reformatting step to get the data into shape before the real parsing work begins. Here I show the scripts as separate tools. They could be combined, but the general philosophy might suggest that they remain distinct, narrowly defined tools.

    In unbreak_records.pl.

    Omitting the she-bang and use strict/warnings.

    while (<>){
        chomp;
        print /^\*?\s/ ? ' ' : "\n", $_;
    }
    print "\n";
    

    In add_record_types.pl

    while (<>){
        next unless /\S/;
        my $rt = /^\*/ ?   1 :
                 /^..\// ? 2 : 3;
        print $rt, ' ', $_;
    }
    

    At the command line.

    ./unbreak_records.pl orig.dat | ./add_record_types.pl > reformatted.dat
    

    Output:

    1 **DEVICE 109523.69142   .981    561A
    2 10/MAY/2010    24.15.30,13.45.03
    3 05:03:01   AB23X  15.67   101325.72 *           14  31.30474 13        0
    3 05:03:15   CR22X  16.72   101325.42 *           14  29.16264 11        0
    3 06:23:51   AW41X  15.67    101323.9 *           14  31.26493219        0
    2 11/MAY/2010    24.07.13,13.44.63
    3 15:57:14   AB23X  15.67   101327.23 *           14  31.30474 13        0
    3 15:59:59   CR22X  16.72   101331.88 *           14  29.16264 11        0
    

    The rest of the parsing is straightforward. If your data providers modify the format slightly, you simply need to write some different reformatting scripts.

    0 讨论(0)
提交回复
热议问题