Perl Merge file

Asked by 前端未结


I have three or more files that I need to merge. The data looks like this:

file 1
0334.45656
0334.45678
0335.67899
file 2
0334.89765
0335.12346
0335.56789
file 3


                      
4 answers

Answer by 北恋, 2021-01-15 07:14

I wouldn't normally suggest this, but standard Unix utilities should be able to handle this just fine:

1. cat the three files together.
2. Use sort to sort the merged output.

However, in Perl you could simply do the following:

#!/usr/bin/perl

use strict;
use warnings;

my @data;
push @data, $_ while (<>);

# Because the numbers are all equal length, alpha sort will work here
print for sort @data;
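
The string sort above relies on every value having the same width. If that cannot be guaranteed, a numeric comparison is probably safer. A minimal sketch, using made-up sample values rather than the question's files:

```perl
use strict;
use warnings;

# Sample values of unequal width (hypothetical, not taken from the question)
my @data = ("334.5", "1000.25", "99.75");

# The default sort compares as strings, so "1000.25" lands before "334.5"
my @by_string = sort @data;

# The <=> operator compares the values as numbers instead
my @by_number = sort { $a <=> $b } @data;

print "string order:  @by_string\n";
print "numeric order: @by_number\n";
```

With zero-padded, fixed-width values like the ones in the question, both orders coincide, which is why the plain sort is enough there.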


However, as we've discussed, the files may be extremely large. In that case it is more efficient, in both memory and speed, to take advantage of the fact that each file is already sorted.

The following solution therefore streams the files: on each pass through the while loop it prints the smallest pending line and then reads the next line from the file that line came from.

#!/usr/bin/perl

# Could name this catsort.pl

use strict;
use warnings;
use autodie;

# Initialize File handles
my @fhs = map {open my $fh, '<', $_; $fh} @ARGV;

# First Line of each file
my @data = map {scalar <$_>} @fhs;

# Loop while a next line exists
while (@data) {
    # Pull out the next entry.
    my $index = (sort {$data[$a] cmp $data[$b]} (0..$#data))[0];

    print $data[$index];

    # Fill In next Data at index.
    if (! defined($data[$index] = readline $fhs[$index])) {
        # End of that File
        splice @fhs, $index, 1;
        splice @data, $index, 1;
    }
}
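
The streaming merge above only produces sorted output if each input is itself sorted. If that is in doubt, it may be worth checking first. The following is a sketch only: `is_sorted` is a hypothetical helper, and the in-memory strings stand in for real files.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if no line read from $fh is smaller than the
# line before it under string comparison (the order the merge relies on).
sub is_sorted {
    my ($fh) = @_;
    my $prev;
    while (defined(my $line = <$fh>)) {
        return 0 if defined $prev && $line lt $prev;
        $prev = $line;
    }
    return 1;
}

# Demonstration with in-memory file handles instead of real files
my $good_text = "0334.45656\n0334.45678\n0335.67899\n";
my $bad_text  = "0335.12346\n0334.89765\n";
open my $good, '<', \$good_text or die $!;
open my $bad,  '<', \$bad_text  or die $!;

print is_sorted($good) ? "sorted\n" : "not sorted\n";
print is_sorted($bad)  ? "sorted\n" : "not sorted\n";
```

The check reads each file once, so for very large inputs it costs one extra pass before the merge.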

                                                                        
                                                        
            
            
              
                
Answer by 长情又很酷, 2021-01-15 07:15

Here is Miller's idea in a more reusable form, wrapped in an iterator:

use strict;
use warnings;

sub get_sort_iterator {
  my @fhs = map {open my $fh, '<', $_ or die $!; $fh} @_;
  my @d;

  return sub {
    for my $i (0 .. $#fhs) {
      # skip this handle if it is exhausted or $d[$i] still holds an unused value
      next if !$fhs[$i] or defined $d[$i];

      # read the next line; on success strip the newline
      if ( defined($d[$i] = readline($fhs[$i])) ) { chomp($d[$i]) }
      # handle is at EOF and no longer needed
      else  { undef $fhs[$i] }
    }
    # compare as numbers; return undef when no data is left
    my ($index) = sort {$d[$a] <=> $d[$b]} grep { defined $d[$_] } 0..$#d
      or return;

    # hand back the value and clear its slot so the next call refills it
    # (delete on array elements is discouraged, so undef the slot instead)
    my $value = $d[$index];
    undef $d[$index];
    return $value;
  };
}

my $iter = get_sort_iterator(@ARGV);
while (defined(my $x = $iter->())) {
  print "$x\n";
}


output

0334.12345
0334.45656
0334.45678
0334.89765
0335.12346
0335.45678
0335.56789
0335.67899
0335.98764

                                                                        
                                                        
            
            
              
                
Answer by 你的背包, 2021-01-15 07:23

Assuming every input file is already in ascending order and contains at least one line, this script merges them in ascending order:

#!/usr/bin/perl

use warnings;
use strict;

use List::Util 'reduce';

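sub min_index {
    # Copy the arguments so the block does not depend on how reduce sets up @_
    my @values = @_;
    reduce { $values[$a] < $values[$b] ? $a : $b } 0 .. $#values;
}

# Open every input file, dying with a message if one cannot be opened
my @fhs = map { open my $fh, '<', $_ or die "Cannot open $_: $!"; $fh } @ARGV;
# Read the first line of each file
my @data = map { scalar <$_> } @fhs;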

while (@data) {
    my $idx = min_index(@data);
    print "$data[$idx]";
    if (! defined($data[$idx] = readline $fhs[$idx])) {
        splice @data, $idx, 1;
        splice @fhs, $idx, 1;
    }
}




Note: this is basically the same as the second script offered by @Miller, but a bit clearer and more concise.
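
The script above assumes every file has at least one line. One possible way to lift that assumption is to drop any handle whose first read returns nothing. This is a sketch under that idea: `merge_sorted` is a hypothetical helper, and the in-memory strings (one of them empty) stand in for real files.

```perl
use strict;
use warnings;

# Hypothetical helper: merge already-sorted handles to $out, skipping any
# handle that yields no lines at all (an empty file).
sub merge_sorted {
    my ($out, @fhs) = @_;

    # Pair each handle with its first line and drop the empty ones
    my @src = grep { defined $_->{line} }
              map  { +{ fh => $_, line => scalar readline($_) } } @fhs;

    while (@src) {
        # Find the index of the smallest pending line (numeric comparison)
        my $min = 0;
        for my $i (1 .. $#src) {
            $min = $i if $src[$i]{line} < $src[$min]{line};
        }
        print {$out} $src[$min]{line};

        # Refill from the same handle, or retire it at end of file
        $src[$min]{line} = readline($src[$min]{fh});
        splice @src, $min, 1 unless defined $src[$min]{line};
    }
}

# Demonstration: the second "file" is empty
my @texts = ("0334.45656\n0335.67899\n", "", "0334.89765\n");
my @fhs   = map { open my $fh, '<', \$_ or die $!; $fh } @texts;
merge_sorted(\*STDOUT, @fhs);
```

Keeping the handle and its pending line in one hash means a single splice retires both, so the two can never get out of step.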
                                                                        
                                                        
            
            
              
                
Answer by 一个人的身影, 2021-01-15 07:36

I suggest this solution, which uses a sorted array of hashes. Each hash corresponds to one input file and contains a file handle (fh), the last line read (line), and the timestamp extracted from that line (timestamp).

The hash at the end of the array always corresponds to the input with the smallest timestamp, so all that is necessary is to repeatedly pop the last element from the array, print its line, read the next line from its file and (if it hasn't reached eof) insert it back into the array in sorted order.

This could produce an appreciable increase in speed over the other answers, which re-sort all the pending lines for each output line, whereas here each output line needs only a single linear scan to find the insertion point.

Note that the program expects the list of input files as parameters on the command line, and sends its merged output to STDOUT. It also assumes that the input files are already sorted.

use strict;
use warnings;
use autodie;

my @data;

for my $file (@ARGV) {
  my $item;
  open $item->{fh}, '<', $file;
  insert_item($item, \@data);
}

while (@data) {
  my $item = pop @data;
  print $item->{line};
  insert_item($item, \@data);
}

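sub insert_item {
  my ($item, $array) = @_;
  return if eof $item->{fh};
  $item->{line} = readline $item->{fh};
  # Capture the whole decimal value; matching only /^(\d+)/ would compare just
  # the integer part and leave lines such as 0334.45656 and 0334.12345 unordered
  ($item->{timestamp}) = $item->{line} =~ /^(\d+(?:\.\d+)?)/;
  # The array is kept in descending order, so skip every larger timestamp
  my $i = 0;
  ++$i while $i < @$array and $item->{timestamp} < $array->[$i]{timestamp};
  splice @$array, $i, 0, $item;
}


output

0334.12345
0334.45656
0334.45678
0334.89765
0335.12346
0335.45678
0335.56789
0335.67899
0335.98764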
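
With many input files, the linear scan in insert_item could in principle be replaced by a binary search for the insertion point. The sketch below assumes the same layout (an array of hashes in descending timestamp order); `insert_pos` is a hypothetical helper and the sample timestamps are made up.

```perl
use strict;
use warnings;

# Hypothetical helper: given an array of hashes kept in DESCENDING timestamp
# order, return the first index whose timestamp is <= $ts. This is the same
# position the linear scan in insert_item would stop at.
sub insert_pos {
    my ($ts, $array) = @_;
    my ($lo, $hi) = (0, scalar @$array);
    while ($lo < $hi) {
        my $mid = int(($lo + $hi) / 2);
        if ($ts < $array->[$mid]{timestamp}) { $lo = $mid + 1 }
        else                                 { $hi = $mid }
    }
    return $lo;
}

# Demonstration with made-up timestamps, largest first
my @data = map { +{ timestamp => $_ } } (335.5, 335.1, 334.8);
my $item = { timestamp => 335.3 };
splice @data, insert_pos($item->{timestamp}, \@data), 0, $item;

print join(' ', map { $_->{timestamp} } @data), "\n";
```

The splice still has to shift elements, so any gain comes from fewer comparisons and is likely to matter only when the number of input files is large.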

                                                                        
                                                        
            
            
              
                