Parsing unsorted data from large fixed width text

后端 未结 4 663
时光取名叫无心
时光取名叫无心 2021-01-19 07:09

I am mostly a Matlab user and a Perl n00b. This is my first Perl script.

I have a large fixed width data file that I would like to process into a binary file with a

4条回答
  •  醉话见心
    2021-01-19 07:11

    First off, this piece of code causes the input file to be read once for every param. Which is quite in-efficient.

    foreach $current_param (@param_name) {
        ...
        seek(INFILE,$data_start_pos,0); #Jump to data start
        while ($line = ) { ... }
        ...
    }
    

    Also there is very rarely a reason to use a continue block. This is more style / readability, then a real problem.


    Now on to make it more performant.

    I packed the sections individually, so that I could process a line exactly once. To prevent it from using up tons of RAM, I used File::Temp to store the data until I was ready for it. Then I used File::Copy to append those sections into the binary file.

    This is a quick implementation. If I were to add much more to it, I would split it up more than it is now.

    #!/usr/bin/perl
    
    use strict;
    use warnings;
    use File::Temp 'tempfile';
    use File::Copy 'copy';
    use autodie qw':default copy';
    use 5.10.1;
    
    my $input_filename = shift @ARGV;
    open my $input, '<', $input_filename;
    
    my @param_names;
    my $template = ''; # stop uninitialized warning
    my @field_names;
    my $field_name_line;
    while( <$input> ){
      chomp;
      next if /^\s*$/;
      if( my ($param) = /^\s*(.+?)\s+filter = ALL_VALUES\s*$/ ){
        push @param_names, $param;
      }elsif( /^[\s-]+$/ ){
        my @fields = split /(\s+)/;
        my $pos = 0;
        for my $field (@fields){
          my $length = length $field;
          if( substr($field, 0, 1) eq '-' ){
            $template .= "\@${pos}A$length ";
          }
          $pos += $length;
        }
        last;
      }else{
        $field_name_line = $_;
      }
    }
    
    @field_names = unpack $template, $field_name_line;
    for( @field_names ){
      s(^\s+){};
      $_ = lc $_;
      $_ = 'type' if substr('type', 0, length $_) eq $_;
    }
    
    my %temp_files;
    for my $param ( @param_names ){
      for(qw'time data'){
        my $fh = tempfile 'temp_XXXX', UNLINK => 1;
        binmode $fh, ':raw';
        $temp_files{$param}{$_} = $fh;
      }
    }
    
    my %convert = (
      TXT => sub{ pack 'A*', join "\n", @_ },
      D   => sub{ pack 'd*', @_ },
      UI  => sub{ pack 'L*', @_ },
    );
    
    sub print_time{
      my($param,$time) = @_;
      my $fh = $temp_files{$param}{time};
      print {$fh} $convert{D}->($time);
    }
    
    sub print_data{
      my($param,$format,$data) = @_;
      my $fh = $temp_files{$param}{data};
      print {$fh} $convert{$format}->($data);
    }
    
    my %data_type;
    while( my $line = <$input> ){
      next if $line =~ /^\s*$/;
      my %fields;
      @fields{@field_names} = unpack $template, $line;
    
      print_time( @fields{(qw'name time')} );
      print_data( @fields{(qw'name type value')} );
    
      $data_type{$fields{name}} //= $fields{type};
    }
    close $input;
    
    open my $bin, '>:raw', $input_filename.".bin";
    open my $toc, '>',     $input_filename.".toc";
    
    for my $param( @param_names ){
      my $data_fh = $temp_files{$param}{data};
      my $time_fh = $temp_files{$param}{time};
    
      seek $data_fh, 0, 0;
      seek $time_fh, 0, 0;
    
      my @toc_line = ( $param, $data_type{$param}, 0+sysseek($bin, 0, 1) );
    
      copy( $time_fh, $bin, 8*1024 );
      close $time_fh;
      push @toc_line, sysseek($bin, 0, 1);
    
      copy( $data_fh, $bin, 8*1024 );
      close $data_fh;
      push @toc_line, sysseek($bin, 0, 1);
    
      say {$toc} join ',', @toc_line, '';
    }
    
    close $bin;
    close $toc;
    

提交回复
热议问题