Parsing unsorted data from large fixed width text

时光取名叫无心 2021-01-19 07:09

I am mostly a Matlab user and a Perl n00b. This is my first Perl script.

I have a large fixed width data file that I would like to process into a binary file with a

4 Answers
  • 2021-01-19 07:11

    First off, this piece of code causes the input file to be read once for every param, which is quite inefficient.

    foreach $current_param (@param_name) {
        ...
        seek(INFILE,$data_start_pos,0); #Jump to data start
        while ($line = <INFILE>) { ... }
        ...
    }
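
A minimal sketch of the single-pass alternative, assuming each data line holds whitespace-separated time, name, type and value fields. The sample rows and the `%by_param` hash are made up for illustration and are not part of the original script:

```perl
use strict;
use warnings;

# Hypothetical sample rows standing in for the data section of the file.
my @lines = (
    "1          Param1                 UI  5",
    "2          Param3                 TXT Some Text 1",
    "3          Param1                 UI  10",
);

my %by_param;    # param name => list of [time, value] pairs
for my $line (@lines) {
    # A limit of 4 keeps embedded spaces inside the final (value) field.
    my ($time, $name, $type, $value) = split ' ', $line, 4;
    push @{ $by_param{$name} }, [ $time, $value ];
}

# Every parameter is now available without re-reading the input.
for my $name (sort keys %by_param) {
    printf "%s: %d row(s)\n", $name, scalar @{ $by_param{$name} };
}
```

This version holds every row in memory, so for a very large file it trades the repeated reads for RAM.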
    

    Also, there is very rarely a reason to use a continue block. This is more a matter of style and readability than a real problem.


    Now, on to making it more performant.

    I packed the sections individually, so that I could process a line exactly once. To prevent it from using up tons of RAM, I used File::Temp to store the data until I was ready for it. Then I used File::Copy to append those sections into the binary file.

    This is a quick implementation. If I were to add much more to it, I would split it up more than it is now.

    #!/usr/bin/perl
    
    use strict;
    use warnings;
    use File::Temp 'tempfile';
    use File::Copy 'copy';
    use autodie qw':default copy';
    use 5.10.1;
    
    my $input_filename = shift @ARGV;
    open my $input, '<', $input_filename;
    
    my @param_names;
    my $template = ''; # stop uninitialized warning
    my @field_names;
    my $field_name_line;
    while( <$input> ){
      chomp;
      next if /^\s*$/;
      if( my ($param) = /^\s*(.+?)\s+filter = ALL_VALUES\s*$/ ){
        push @param_names, $param;
      }elsif( /^[\s-]+$/ ){
        my @fields = split /(\s+)/;
        my $pos = 0;
        for my $field (@fields){
          my $length = length $field;
          if( substr($field, 0, 1) eq '-' ){
            $template .= "\@${pos}A$length ";
          }
          $pos += $length;
        }
        last;
      }else{
        $field_name_line = $_;
      }
    }
    
    @field_names = unpack $template, $field_name_line;
    for( @field_names ){
      s(^\s+){};
      $_ = lc $_;
      $_ = 'type' if substr('type', 0, length $_) eq $_;
    }
    
    my %temp_files;
    for my $param ( @param_names ){
      for(qw'time data'){
        my $fh = tempfile 'temp_XXXX', UNLINK => 1;
        binmode $fh, ':raw';
        $temp_files{$param}{$_} = $fh;
      }
    }
    
    my %convert = (
      TXT => sub{ pack 'A*', join "\n", @_ },
      D   => sub{ pack 'd*', @_ },
      UI  => sub{ pack 'L*', @_ },
    );
    
    sub print_time{
      my($param,$time) = @_;
      my $fh = $temp_files{$param}{time};
      print {$fh} $convert{D}->($time);
    }
    
    sub print_data{
      my($param,$format,$data) = @_;
      my $fh = $temp_files{$param}{data};
      print {$fh} $convert{$format}->($data);
    }
    
    my %data_type;
    while( my $line = <$input> ){
      next if $line =~ /^\s*$/;
      my %fields;
      @fields{@field_names} = unpack $template, $line;
    
      print_time( @fields{(qw'name time')} );
      print_data( @fields{(qw'name type value')} );
    
      $data_type{$fields{name}} //= $fields{type};
    }
    close $input;
    
    open my $bin, '>:raw', $input_filename.".bin";
    open my $toc, '>',     $input_filename.".toc";
    
    for my $param( @param_names ){
      my $data_fh = $temp_files{$param}{data};
      my $time_fh = $temp_files{$param}{time};
    
      seek $data_fh, 0, 0;
      seek $time_fh, 0, 0;
    
      my @toc_line = ( $param, $data_type{$param}, 0+sysseek($bin, 0, 1) );
    
      copy( $time_fh, $bin, 8*1024 );
      close $time_fh;
      push @toc_line, sysseek($bin, 0, 1);
    
      copy( $data_fh, $bin, 8*1024 );
      close $data_fh;
      push @toc_line, sysseek($bin, 0, 1);
    
      say {$toc} join ',', @toc_line, '';
    }
    
    close $bin;
    close $toc;
    
  • 2021-01-19 07:27

    First, you should always have 'use strict;' and 'use warnings;' pragmas in your script.

    It seems like you need a simple array (@param_name) for reference, so loading those values would be straightforward as you have it. (Again, adding the above pragmas would start showing you errors, including the $line = =~ s/^\s+//; line!)

    I suggest you read this, to understand how you can load your data file into a Hash of Hashes. Once you've designed the hash, you simply read and load the file data contents, and then iterate through the contents of the hash.

    For example, using time as the key for the hash:

    my %HoH = (
        1 => {
            name  => "Param1",
            ty    => "UI",
            value => "5",
        },
        2 => {
            name  => "Param3",
            ty    => "TXT",
            value => "Some Text 1",
        },
        3 => {
            name  => "Param1",
            ty    => "UI",
            value => "10",
        },
    );
    

    Make sure you close the INFILE after reading in the contents, before you start processing.

    So in the end, you iterate over the hash, and reference the array (instead of the file contents) for your output writes - I would imagine it would be much faster to do this.

    Let me know if you need more info.

    Note: if you go this route, include Data::Dumper. It is a significant help for printing and understanding the data in your hash!
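
As a rough sketch of that iteration step, assuming the hash is keyed by time as in the example above (the `%values_for` hash is a name chosen here for illustration only):

```perl
use strict;
use warnings;
use Data::Dumper;

# Same shape as the hash of hashes shown in the answer.
my %HoH = (
    1 => { name => "Param1", ty => "UI",  value => "5" },
    2 => { name => "Param3", ty => "TXT", value => "Some Text 1" },
    3 => { name => "Param1", ty => "UI",  value => "10" },
);

# Collect the values for each parameter, visiting times in numeric order.
my %values_for;
for my $time (sort { $a <=> $b } keys %HoH) {
    my $row = $HoH{$time};
    push @{ $values_for{ $row->{name} } }, $row->{value};
}

$Data::Dumper::Sortkeys = 1;    # stable key order in the dump
print Dumper(\%values_for);
```

Sorting the keys numerically matters here because Perl hashes have no inherent order, so the time sequence would otherwise be lost.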

  • 2021-01-19 07:27

    It seems to me that embedded spaces can only occur in the last field. That makes using split ' ' feasible for this problem.

    I am assuming you are not interested in the header. In addition, I am assuming you want a vector for each parameter and are not interested in timestamps.

    To use data file names specified on the command line or piped through standard input, replace <DATA> with <>.

    #!/usr/bin/env perl
    
    use strict; use warnings;
    
    my %data;
    
    while (<DATA>) { last if /^-+/ }    # skip header (stops at EOF if no ruler line)
    
    while (my $line = <DATA>) {
        $line =~ s/\s+\z//;
        last unless $line =~ /\S/;
    
        my (undef, $param, undef, $value) = split ' ', $line, 4;
        push @{ $data{ $param } }, $value;
    }
    
    use Data::Dumper;
    print Dumper \%data;
    
    __DATA__
    Param1   filter = ALL_VALUES
    Param2   filter = ALL_VALUES
    Param3   filter = ALL_VALUES
    
    Time                     Name     Ty  Value
    ---------- ---------------------- --- ------------
    1          Param1                 UI  5
    2          Param3                 TXT Some Text 1
    3          Param1                 UI  10
    4          Param2                 D   2.1234
    5          Param1                 UI  15
    6          Param2                 D   3.1234
    7          Param3                 TXT Some Text 2
    

    Output:

    $VAR1 = {
              'Param2' => [
                            '2.1234',
                            '3.1234'
                          ],
              'Param1' => [
                            '5',
                            '10',
                            '15'
                          ],
              'Param3' => [
                            'Some Text 1',
                            'Some Text 2'
                          ]
            };
  • 2021-01-19 07:28

    I modified my code to build a hash as suggested. I have not incorporated the output to binary yet due to time limitations. Plus, I need to figure out how to reference the hash to get the data out and pack it into binary. I don't think that part should be too difficult ... hopefully.

    On an actual data file (~350MB and 2.0 million lines) the following code takes approximately 3 minutes to build the hash. CPU usage was 100% on one of my cores (nil on the other three) and Perl memory usage topped out at around 325MB ... until it dumped millions of lines to the prompt. However, the print Dumper call will be replaced with a binary pack.

    Please let me know if I am making any rookie mistakes.

    #!/usr/bin/perl
    
    use strict;
    use warnings;
    use Data::Dumper;
    
    my $lineArg1 = $ARGV[0];
    open(INFILE, '<', $lineArg1) or die "Cannot open $lineArg1: $!";
    
    my $line;
    my @param_names;
    my @template;
    while ($line = <INFILE>) {
        chomp $line; #Remove New Line
        if ($line =~ s/\s+filter = ALL_VALUES//) { #Find parameters and build a list
           push @param_names, trim($line);
        }
        elsif ($line =~ /^----/) {
            @template = map {'A'.length} $line =~ /(\S+\s*)/g; #Make template for unpack
            $template[-1] = 'A*';
            my $data_start_pos = tell INFILE;
            last; #Reached start of data exit loop
        }
    }
    
    my $size = $#param_names+1;
    my @getType = ((1) x $size);
    my $template = "@template";
    my @lineData;
    my %dataHash;
    my $lineCount = 0;
    while ($line = <INFILE>) {
        if ($lineCount % 100000 == 0){
            print "On Line: ".$lineCount."\n";
        }
        if ($line =~ /^\d/) { 
            chomp($line);
            @lineData = unpack $template, $line;
            my ($inHeader, $headerIndex) = findStr($lineData[1], @param_names);
            if ($inHeader) { 
                push @{$dataHash{$lineData[1]}{time} }, $lineData[0];
                push @{$dataHash{$lineData[1]}{data} }, $lineData[3];
                if ($getType[$headerIndex]){ # Things that only need written once
                    $dataHash{$lineData[1]}{type}  = $lineData[2];
                    $getType[$headerIndex] = 0;
                }
            }
        }  
    $lineCount ++; 
    } # END WHILE <INFILE>
    close(INFILE);
    
    print Dumper \%dataHash;
    
    #WRITE BINARY FILE and TOC FILE
    my %convert = (TXT=>sub{pack 'A*', join "\n", @_}, D=>sub{pack 'd*', @_}, UI=>sub{pack 'L*', @_});
    
    open my $binfile, '>:raw', $lineArg1.'.bin' or die "Cannot open $lineArg1.bin: $!";
    open my $tocfile, '>', $lineArg1.'.toc' or die "Cannot open $lineArg1.toc: $!";
    
    for my $param (@param_names){
        my $data = $dataHash{$param};
        my @toc_line = ($param, $data->{type}, tell $binfile );
        print {$binfile} $convert{D}->(@{$data->{time}});
        push @toc_line, tell $binfile;
        print {$binfile} $convert{$data->{type}}->(@{$data->{data}});
        push @toc_line, tell $binfile;
        print {$tocfile} join(',',@toc_line,''),"\n";
    }
    
    sub trim { #Trim leading and trailing white space
      my (@strings) = @_;
      foreach my $string (@strings) {
        $string =~ s/^\s+//;
        $string =~ s/\s+$//;
        chomp ($string);
      } 
      return wantarray ? @strings : $strings[0];
    } # END SUB
    
    sub findStr { #Return TRUE if string is contained in array.
        my $searchStr = shift;
        my $i = 0;
        foreach ( @_ ) {
            if ($_ eq $searchStr){
                return (1,$i);
            }
        $i ++;
        }
        return (0,-1);
    } # END SUB
    

    The output is as follows:

    $VAR1 = {
              'Param 1' => {
                             'time' => [
                                         '1.1',
                                         '3.2',
                                         '5.3'
                                       ],
                             'type' => 'UI',
                             'data' => [
                                         '5',
                                         '10',
                                         '15'
                                       ]
                           },
              'Param 2' => {
                             'time' => [
                                         '4.5',
                                         '6.121'
                                       ],
                             'type' => 'D',
                             'data' => [
                                         '2.1234',
                                         '3.1234'
                                       ]
                           },
              'Param 3' => {
                             'time' => [
                                         '2.23',
                                         '7.56'
                                       ],
                             'type' => 'TXT',
                             'data' => [
                                         'Some Text 1',
                                         'Some Text 2'
                                       ]
                           }
            };
    

    Here is the output TOC File:

    Param 1,UI,0,24,36,
    Param 2,D,36,52,68,
    Param 3,TXT,68,84,107,
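
A hedged sketch of how those offsets could be used to read one section back, assuming each TOC line is name, type, start offset, end of the time block, end of the data block. The bytes are rebuilt in memory here so the example is self-contained, and the `%template` hash is an illustrative name, not part of the script above:

```perl
use strict;
use warnings;

# Rebuild the bytes of the 'Param 1' section in memory instead of opening the .bin file.
my $bin      = pack('d*', 1.1, 3.2, 5.3) . pack('L*', 5, 10, 15);
my $toc_line = 'Param 1,UI,0,24,36,';

# Unpack templates mirroring the %convert table used for writing.
my %template = ( TXT => 'A*', D => 'd*', UI => 'L*' );

# The trailing comma yields an empty last field, which split discards.
my ($name, $type, $start, $time_end, $data_end) = split /,/, $toc_line;

my @times = unpack 'd*', substr($bin, $start, $time_end - $start);
my @data  = unpack $template{$type}, substr($bin, $time_end, $data_end - $time_end);

# TXT sections were joined with newlines when written, so split them apart again.
@data = split /\n/, $data[0] if $type eq 'TXT';

print "$name ($type): times=@times data=@data\n";
```

Note that the 'd' and 'L' pack formats use the machine's native byte order and sizes, so a reader like this is only reliable on the same kind of machine that wrote the file.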
    

    Thanks everyone for their help so far! This is an excellent resource!

    EDIT: Added Binary & TOC file writing code.
