Perl6: What is the best way to deal with very big files?

[愿得一人] 2021-01-01 20:23

Last week I decided to give Perl6 a try and started to reimplement one of my programs. I have to say, Perl6 is so easy for object programming, an aspect very painful

1 Answer
  • 2021-01-01 21:15

    One simple improvement is to use a fixed-width encoding such as latin1 to speed up character decoding, though I'm not sure how much this will help.

    As far as Rakudo's regex/grammar engine is concerned, I've found it to be pretty slow, so it might indeed be necessary to take a more low-level approach.

    I did not do any benchmarking, but what I'd try first is something like this:

    my %seqs = slurp('genome.fa', :enc<latin1>).split('>')[1..*].map: {
        .[0] => .[1..*].join given .split("\n");
    }
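
    To make the resulting data structure concrete, here is the same expression applied to a tiny in-memory sample instead of genome.fa. The sample string and its record names are made up for illustration; the point is only that each header line becomes a key and the remaining lines of the record are concatenated into the value:

    ```raku
    # Hypothetical two-record FASTA sample held in memory instead of genome.fa.
    my $fasta = ">seq1\nACGT\nTTGA\n>seq2\nGGCC\n";

    # Split into records at '>', drop the empty chunk before the first '>',
    # then map each record to header => joined sequence lines.
    my %seqs = $fasta.split('>')[1..*].map: {
        .[0] => .[1..*].join given .split("\n");
    }

    say %seqs<seq1>;   # ACGTTTGA
    say %seqs<seq2>;   # GGCC
    ```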
    

    As the Perl6 standard library is implemented in Perl6 itself, it is sometimes possible to improve performance by just avoiding it, writing code in an imperative style such as this:

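    my %seqs;
    my $data = slurp('genome.fa', :enc<latin1>);
    my $pos = 0;
    loop {
        # find the next header line, or stop when there is none
        $pos = $data.index('>', $pos) // last;
    
        # key: the header text between '>' and the end of that line
        my $ks = $pos + 1;
        my $ke = $data.index("\n", $ks);
    
        # sequence: everything up to the next '>' or the end of the data
        my $ss = $ke + 1;
        my $se = $data.index('>', $ss) // $data.chars;
    
        my @lines;
    
        # collect the sequence lines (assumes every line ends with "\n")
        $pos = $ss;
        while $pos < $se {
            my $end = $data.index("\n", $pos);
            @lines.push($data.substr($pos..^$end));
            $pos = $end + 1
        }
    
        %seqs{$data.substr($ks..^$ke)} = @lines.join;
    }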
    

    However, if the parts of the standard library being used have already seen some performance work, this might actually make things worse. In that case, the next step would be to add low-level type annotations such as str and int, and to replace calls to routines such as .index with NQP builtins such as nqp::index.

    If that's still too slow, you're out of luck and will need to switch languages, e.g. by calling into Perl5 via Inline::Perl5 or into C via NativeCall.
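
    For a feel of what the NativeCall route looks like, here is a minimal sketch that binds C's strlen. It is not a FASTA parser, only an illustration of the binding syntax, and it assumes a Unix-like system where a bare `is native` (no library name) can resolve symbols from the C library already loaded into the process:

    ```raku
    use NativeCall;

    # Declare the C function's signature; the body is just a placeholder.
    # Assumption: libc symbols are visible in the running process,
    # which is typically the case on Linux and macOS.
    sub strlen(Str --> size_t) is native { * }

    say strlen("ACGT");   # 4
    ```

    A real parser written in C would be exposed the same way, with `is native('yourlib')` naming a shared library of your own (that name is a placeholder here).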


    Note that @timotimo has done some performance measurements and written an article about them.

    If my short version is the baseline, the imperative version improves performance by 2.4x.

    He actually managed to squeeze a 3x improvement out of the short version by rewriting it to

    my %seqs = slurp('genome.fa', :enc<latin-1>).split('>').skip(1).map: {
        .head => .skip(1).join given .split("\n").cache;
    }
    

    Finally, rewriting the imperative version using NQP builtins sped things up by a factor of 17. Given potential portability issues, writing such code is generally discouraged, but it may be necessary for now if you really need that level of performance:

    use nqp;
    
    my Mu $seqs := nqp::hash();
    my str $data = slurp('genome.fa', :enc<latin1>);
    my int $pos = 0;
    
    my str @lines;
    
    loop {
        $pos = nqp::index($data, '>', $pos);
    
        last if $pos < 0;
    
        my int $ks = $pos + 1;
        my int $ke = nqp::index($data, "\n", $ks);
    
        my int $ss = $ke + 1;
        my int $se = nqp::index($data, '>', $ss);
    
        if $se < 0 {
            $se = nqp::chars($data);
        }
    
        $pos = $ss;
        my int $end;
    
        while $pos < $se {
            $end = nqp::index($data, "\n", $pos);
            nqp::push_s(@lines, nqp::substr($data, $pos, $end - $pos));
            $pos = $end + 1
        }
    
        nqp::bindkey($seqs, nqp::substr($data, $ks, $ke - $ks), nqp::join("", @lines));
        nqp::setelems(@lines, 0);
    }
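
    Speedup figures like the ones above depend heavily on the Rakudo version and the input, so it is worth timing the variants yourself. The following is a rough sketch of such a harness for the two pure-Perl6 versions. The sub names parse-short and parse-imperative and the synthetic input are invented for this example, and an input this small says little about a real genome file:

    ```raku
    # Synthetic FASTA data held in memory (hypothetical, not a real genome):
    # 200 records with 20 lines of 60 characters each.
    my $data = (1..200).map({ ">seq$_\n" ~ ("ACGT" x 15 ~ "\n") x 20 }).join;

    # The short split/map version, wrapped in a sub.
    sub parse-short(Str $data) {
        my %seqs = $data.split('>')[1..*].map: {
            .[0] => .[1..*].join given .split("\n");
        }
        %seqs
    }

    # The imperative index/substr version, wrapped in a sub.
    sub parse-imperative(Str $data) {
        my %seqs;
        my $pos = 0;
        loop {
            $pos = $data.index('>', $pos) // last;
            my $ks = $pos + 1;
            my $ke = $data.index("\n", $ks);
            my $ss = $ke + 1;
            my $se = $data.index('>', $ss) // $data.chars;
            my @lines;
            $pos = $ss;
            while $pos < $se {
                my $end = $data.index("\n", $pos);
                @lines.push($data.substr($pos..^$end));
                $pos = $end + 1
            }
            %seqs{$data.substr($ks..^$ke)} = @lines.join;
        }
        %seqs
    }

    # Wall-clock timing with `now`; assigning to a hash forces the lazy
    # map to run, so the work is actually included in the measurement.
    my $t0 = now;
    my %a = parse-short($data);
    my $t1 = now;
    my %b = parse-imperative($data);
    my $t2 = now;

    say "short version:      { $t1 - $t0 } s";
    say "imperative version: { $t2 - $t1 } s";
    say "results agree:      ", %a eqv %b;
    ```

    For numbers that mean something, replace the synthetic string with the slurped contents of your actual file and repeat each measurement a few times.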
    