“average length of the sequences in a fasta file”: Can you improve this Erlang code?

无人共我 2021-02-06 12:29

I'm trying to get the mean length of fasta sequences using Erlang. A fasta file looks like this:

>title1
ATGACTAGCTAGCAGCGATCGACCGTCGTACGC
AT         


        
5 answers
  •  既然无缘
    2021-02-06 12:54

    If you need really fast IO, you have to use a bit more trickery than usual.

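    -module(g).
    -export([s/0]).

    %% Entry point: read stdin through a port, print the mean length, stop the VM.
    s()->
      %% Make sure the port's exit arrives as a message in this process.
      process_flag(trap_exit, true),
      %% fd 0/1 = stdin/stdout; each line arrives as a binary of up to 256 bytes.
      %% Longer lines arrive as {noeol, _} chunks, which the loop below does not match.
      P = open_port({fd, 0, 1}, [in, binary, {line, 256}]),
      r(P, 0, 0),
      halt().

    %% C = number of header lines seen, L = total sequence bytes seen.
    r(P, C, L) ->
      receive
        {P, {data, {eol, <<$>:8, _/binary>>}}} ->   % header line: new sequence
          r(P, C+1, L);
        {P, {data, {eol, Line}}} ->                 % sequence line: add its length
          r(P, C, L + byte_size(Line));
        {'EXIT', P, normal} ->                      % end of input: print the mean
          io:format("~p~n",[L/C])
      end.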
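
The counting rule used by the port loop (header lines start a sequence, every other line adds its byte length) can be checked on small inputs without any IO. The following is only a sketch under that reading of the code: the module name `fasta_mean` and the function `mean/1` are hypothetical, it makes no attempt at the IO speed discussed here, and it returns `undefined` for input without headers instead of dividing by zero.

```erlang
-module(fasta_mean).
-export([mean/1]).

%% Mean sequence length of FASTA data held in a binary.
%% Returns undefined when the data contains no header lines.
%% Assumes "\n" line endings; a trailing "\r" would be counted as sequence data.
mean(Data) when is_binary(Data) ->
    Lines = binary:split(Data, <<"\n">>, [global]),
    case lists:foldl(fun count/2, {0, 0}, Lines) of
        {0, _} -> undefined;
        {Count, Total} -> Total / Count
    end.

%% A line starting with ">" opens a new sequence; any other line adds its bytes.
count(<<$>, _/binary>>, {Count, Total}) -> {Count + 1, Total};
count(Line, {Count, Total}) -> {Count, Total + byte_size(Line)}.
```

For example, `fasta_mean:mean(<<">a\nATGA\nAT\n>b\nAT\n">>)` sums 4 + 2 + 2 bytes over two headers.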
    

    This is the fastest IO I know of, but note that it needs -noshell -noinput. Compile just as with erlc +native +"{hipe, [o3]}" g.erl, but with -smp disable:

    erl -smp disable -noinput -mode minimal -boot start_clean -s erl_compile compile_cmdline @cwd /home/hynek/Download @option native @option '{hipe, [o3]}' @files g.erl
    

    and run:

    time erl -smp disable -noshell -mode minimal -boot start_clean -noinput -s g s < uniprot_sprot.fasta
    352.6697028442464
    
    real    0m3.241s
    user    0m3.060s
    sys     0m0.124s
    

    With -smp enabled and native code it takes:

    $ erlc +native +"{hipe, [o3]}" g.erl
    $ time erl -noshell -mode minimal -boot start_clean -noinput -s g s

    Byte code, but with -smp disable (almost on par with native because most of the work is done in the port!):

    $ erlc g.erl
    $ time erl -smp disable -noshell -mode minimal -boot start_clean -noinput -s g s

    Just for completeness, byte code with SMP:

    $ time erl -noshell -mode minimal -boot start_clean -noinput -s g s

    For comparison, sarnold's version gives me a wrong answer and takes longer on the same hardware:

    $ erl -smp disable -noinput -mode minimal -boot start_clean -s erl_compile compile_cmdline @cwd /home/hynek/Download @option native @option '{hipe, [o3]}' @files golf.erl
    ./golf.erl:5: Warning: variable 'Rest' is unused
    $ time erl -smp disable -noshell -mode minimal -s golf test
    359.04679841439776
    
    real    0m17.569s
    user    0m16.749s
    sys     0m0.664s
    

    EDIT: I have looked at the characteristics of uniprot_sprot.fasta and I'm a little bit surprised. It has 3824397 rows and is 232MB. That means the -smp disable version can handle 1.18 million text lines per second (71MB/s of line-oriented IO).
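
These throughput figures can be reproduced from the numbers already quoted, assuming they refer to the 3.241 s wall-clock time of the native -smp disable run:

```latex
\[
\frac{3\,824\,397\ \text{lines}}{3.241\ \text{s}} \approx 1.18 \times 10^{6}\ \text{lines/s},
\qquad
\frac{232\ \text{MB}}{3.241\ \text{s}} \approx 71.6\ \text{MB/s}
\]
```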
