Parsing unsorted data from large fixed width text

后端未结

关注

 4  663

时光取名叫无心 2021-01-19 07:09

I am mostly a Matlab user and a Perl n00b. This is my first Perl script.

I have a large fixed width data file that I would like to process into a binary file with a

4条回答

醉话见心 (楼主)

2021-01-19 07:11

First off, this piece of code causes the input file to be read once for every param. Which is quite in-efficient.

foreach $current_param (@param_name) {
    ...
    seek(INFILE,$data_start_pos,0); #Jump to data start
    while ($line = ) { ... }
    ...
}

Also there is very rarely a reason to use a continue block. This is more style / readability, then a real problem.

Now on to make it more performant.

I packed the sections individually, so that I could process a line exactly once. To prevent it from using up tons of RAM, I used File::Temp to store the data until I was ready for it. Then I used File::Copy to append those sections into the binary file.

This is a quick implementation. If I were to add much more to it, I would split it up more than it is now.

#!/usr/bin/perl

use strict;
use warnings;
use File::Temp 'tempfile';
use File::Copy 'copy';
use autodie qw':default copy';
use 5.10.1;

my $input_filename = shift @ARGV;
open my $input, '<', $input_filename;

my @param_names;
my $template = ''; # stop uninitialized warning
my @field_names;
my $field_name_line;
while( <$input> ){
  chomp;
  next if /^\s*$/;
  if( my ($param) = /^\s*(.+?)\s+filter = ALL_VALUES\s*$/ ){
    push @param_names, $param;
  }elsif( /^[\s-]+$/ ){
    my @fields = split /(\s+)/;
    my $pos = 0;
    for my $field (@fields){
      my $length = length $field;
      if( substr($field, 0, 1) eq '-' ){
        $template .= "\@${pos}A$length ";
      }
      $pos += $length;
    }
    last;
  }else{
    $field_name_line = $_;
  }
}

@field_names = unpack $template, $field_name_line;
for( @field_names ){
  s(^\s+){};
  $_ = lc $_;
  $_ = 'type' if substr('type', 0, length $_) eq $_;
}

my %temp_files;
for my $param ( @param_names ){
  for(qw'time data'){
    my $fh = tempfile 'temp_XXXX', UNLINK => 1;
    binmode $fh, ':raw';
    $temp_files{$param}{$_} = $fh;
  }
}

my %convert = (
  TXT => sub{ pack 'A*', join "\n", @_ },
  D   => sub{ pack 'd*', @_ },
  UI  => sub{ pack 'L*', @_ },
);

sub print_time{
  my($param,$time) = @_;
  my $fh = $temp_files{$param}{time};
  print {$fh} $convert{D}->($time);
}

sub print_data{
  my($param,$format,$data) = @_;
  my $fh = $temp_files{$param}{data};
  print {$fh} $convert{$format}->($data);
}

my %data_type;
while( my $line = <$input> ){
  next if $line =~ /^\s*$/;
  my %fields;
  @fields{@field_names} = unpack $template, $line;

  print_time( @fields{(qw'name time')} );
  print_data( @fields{(qw'name type value')} );

  $data_type{$fields{name}} //= $fields{type};
}
close $input;

open my $bin, '>:raw', $input_filename.".bin";
open my $toc, '>',     $input_filename.".toc";

for my $param( @param_names ){
  my $data_fh = $temp_files{$param}{data};
  my $time_fh = $temp_files{$param}{time};

  seek $data_fh, 0, 0;
  seek $time_fh, 0, 0;

  my @toc_line = ( $param, $data_type{$param}, 0+sysseek($bin, 0, 1) );

  copy( $time_fh, $bin, 8*1024 );
  close $time_fh;
  push @toc_line, sysseek($bin, 0, 1);

  copy( $data_fh, $bin, 8*1024 );
  close $data_fh;
  push @toc_line, sysseek($bin, 0, 1);

  say {$toc} join ',', @toc_line, '';
}

close $bin;
close $toc;

0 讨论(0)

查看其它4个回答