“average length of the sequences in a fasta file”: Can you improve this Erlang code?

无人共我 2021-02-06 12:29

I'm trying to get the mean length of fasta sequences using Erlang. A fasta file looks like this:

>title1
ATGACTAGCTAGCAGCGATCGACCGTCGTACGC
AT         


        
5 Answers
  • 2021-02-06 12:48

    Did you try Elixir (elixir-lang.org), which runs on top of Erlang and has a syntax similar to Ruby? Elixir solves string problems in the following way:

    Elixir strings are UTF8 binaries, with all the raw speed and memory savings that brings. Elixir has a String module with Unicode functionality built-in and is a great example of writing code that writes code. String.Unicode reads various Unicode database dumps such as UnicodeData.txt to dynamically generate Unicode functions for the String module built straight from that data! (http://devintorr.es/blog/2013/01/22/the-excitement-of-elixir/)

    I just wonder whether Elixir would be faster.

  • 2021-02-06 12:52

    The call string:len(string:strip(L)) traverses the list at least twice (I don't know how string:strip is implemented). Instead, you could write a simple function that counts the line length without the spaces:

    stripped_len(L) ->
      stripped_len(L, 0).
    
    stripped_len([$ |L], Len) ->
      stripped_len(L, Len);
    
    stripped_len([_C|L], Len) ->
      stripped_len(L, Len + 1);
    
    stripped_len([], Len) ->
      Len.
    

    The same method can be applied to binaries as well.
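
    As a minimal sketch of that binary variant (the module name fasta_len is made up for illustration, and like the list version it skips only space characters, so a trailing newline would still be counted):

    ```erlang
    -module(fasta_len).
    -export([stripped_len/1]).

    %% Count the bytes of a binary line, skipping space characters.
    %% fasta_len is a hypothetical module name, not part of the original code.
    stripped_len(Bin) when is_binary(Bin) ->
        stripped_len(Bin, 0).

    %% A space: skip it without counting.
    stripped_len(<<$\s, Rest/binary>>, Len) ->
        stripped_len(Rest, Len);
    %% Any other byte: count it.
    stripped_len(<<_C, Rest/binary>>, Len) ->
        stripped_len(Rest, Len + 1);
    %% End of the binary: return the accumulated count.
    stripped_len(<<>>, Len) ->
        Len.
    ```

    Matching on the head of the binary keeps a single pass over the data, just as the list version does.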

  • 2021-02-06 12:54

    If you need really fast IO, then you have to do a little bit more trickery than usual.

    -module(g).
    -export([s/0]).
    s()->
      P = open_port({fd, 0, 1}, [in, binary, {line, 256}]),
      r(P, 0, 0),
      halt().
    r(P, C, L) ->
      receive
        {P, {data, {eol, <<$>:8, _/binary>>}}} ->
          r(P, C+1, L);
        {P, {data, {eol, Line}}} ->
          r(P, C, L + size(Line));
        {'EXIT', P, normal} ->
          io:format("~p~n",[L/C])
      end.
    

    This is the fastest IO I know of, but note that it requires -noshell -noinput. Compile as you would with erlc +native +"{hipe, [o3]}" g.erl, but with -smp disable:

    erl -smp disable -noinput -mode minimal -boot start_clean -s erl_compile compile_cmdline @cwd /home/hynek/Download @option native @option '{hipe, [o3]}' @files g.erl
    

    and run:

    time erl -smp disable -noshell -mode minimal -boot start_clean -noinput -s g s < uniprot_sprot.fasta
    352.6697028442464
    
    real    0m3.241s
    user    0m3.060s
    sys     0m0.124s
    

    With -smp enable and native code it takes:

    $ erlc +native +"{hipe, [o3]}" g.erl
    $ time erl -noshell -mode minimal -boot start_clean -noinput -s g s<uniprot_sprot.fasta
    352.6697028442464
    
    real    0m5.103s
    user    0m4.944s
    sys     0m0.112s
    

    Byte code, but with -smp disable (almost on par with native, because most of the work is done in the port!):

    $ erlc g.erl
    $ time erl -smp disable -noshell -mode minimal -boot start_clean -noinput -s g s<uniprot_sprot.fasta
    352.6697028442464
    
    real    0m3.565s
    user    0m3.436s
    sys     0m0.104s
    

    Just for completeness, byte code with SMP enabled:

    $ time erl -noshell -mode minimal -boot start_clean -noinput -s g s<uniprot_sprot.fasta 
    352.6697028442464
    
    real    0m5.433s
    user    0m5.236s
    sys     0m0.128s
    

    For comparison, sarnold's version gives me a wrong answer and takes longer on the same hardware:

    $ erl -smp disable -noinput -mode minimal -boot start_clean -s erl_compile compile_cmdline @cwd /home/hynek/Download @option native @option '{hipe, [o3]}' @files golf.erl
    ./golf.erl:5: Warning: variable 'Rest' is unused
    $ time erl -smp disable -noshell -mode minimal -s golf test
    359.04679841439776
    
    real    0m17.569s
    user    0m16.749s
    sys     0m0.664s
    

    EDIT: I have looked at the characteristics of uniprot_sprot.fasta and I'm a little bit surprised. It has 3824397 rows and is 232MB. That means the -smp disable version can handle 1.18 million text lines per second (71MB/s in line-oriented IO).

  • 2021-02-06 13:08

    It looks like your big performance problems have been solved by opening the file in raw mode, but here are some more thoughts if you need to optimise that code further.

    Learn and use fprof.

    You're using string:strip/1 primarily to remove the trailing newline. As Erlang values are immutable, you have to make a complete copy of the list (with all the associated memory allocation) just to remove the last character. If you know the file is well formed, just subtract one from your count; otherwise I'd try writing a length function that counts the relevant characters and ignores the irrelevant ones.

    I'm wary of advice that says binaries are better than lists, but given how little processing you do, it's probably the case here. The first steps are to open the file in binary mode and use erlang:size/1 to find the length.

    It won't affect performance (significantly), but the multiplication by 1.0 in Total/(1.0*Sequences) is only necessary in languages with broken division. Erlang division works correctly.
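
    Putting those suggestions together, a rough sketch might look like the following. The module name fasta_mean and the helpers scan/3 and line_len/1 are made up for illustration. It assumes a well-formed file with Unix line endings and at least one header line, and it does not skip spaces inside sequence lines. It uses byte_size/1, which for binaries returns the same value as erlang:size/1.

    ```erlang
    -module(fasta_mean).
    -export([mean_length/1]).

    %% Mean sequence length of the FASTA file at Path.
    %% Hypothetical helper: assumes the file contains at least one header
    %% line, otherwise the division fails with badarith.
    mean_length(Path) ->
        {ok, File} = file:open(Path, [read, raw, binary, read_ahead]),
        try
            {Sequences, Total} = scan(File, 0, 0),
            Total / Sequences
        after
            file:close(File)
        end.

    %% Read line by line, counting header lines and sequence bytes.
    scan(File, Sequences, Total) ->
        case file:read_line(File) of
            {ok, <<">", _/binary>>} ->
                scan(File, Sequences + 1, Total);
            {ok, Line} ->
                scan(File, Sequences, Total + line_len(Line));
            eof ->
                {Sequences, Total}
        end.

    %% Length of a line without its trailing newline, if it has one.
    line_len(Line) ->
        Size = byte_size(Line),
        case Size > 0 andalso binary:last(Line) of
            $\n -> Size - 1;
            _ -> Size
        end.
    ```

    Because / always returns a float in Erlang, Total / Sequences needs no 1.0 * factor. Whether this is actually faster than the list version is something to measure rather than assume.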

  • 2021-02-06 13:09

    I too am learning Erlang, thanks for the fun question.

    I understand working with Erlang strings as lists of characters can be very slow; if you can work with binaries instead you should see some performance gains. I don't know how you would use arbitrary-length strings with binaries, but if you can sort it out, it should help.

    Also, if you don't mind working with a file directly rather than standard_io, perhaps you could speed things along by using file:open(..., [raw, read_ahead]). raw means the file must be on the local node's filesystem, and read_ahead specifies that Erlang should perform file IO with a buffer. (Think of using C's stdio facilities with and without buffering.)

    I'd expect the read_ahead to make the most difference, but everything with Erlang includes the phrase "benchmark before guessing".

    EDIT

    Using file:open("uniprot_sprot.fasta", [read, read_ahead]) gets 1m31s on the full uniprot_sprot.fasta dataset. (Average 359.04679841439776.)

    Using file:open(.., [read, read_ahead]) and file:read_line(S), I get 0m34s.

    Using file:open(.., [read, read_ahead, raw]) and file:read_line(S), I get 0m9s. Yes, nine seconds.

    Here's where I stand now; if you can figure out how to use binaries instead of lists, you might see still more improvement:

    -module(golf).
    -export([test/0]).
    
    %% Empty input: nothing to count.
    line([], {Sequences, Total}) -> {Sequences, Total};
    %% Header line: one more sequence.
    line(">" ++ _Rest, {Sequences, Total}) -> {Sequences + 1, Total};
    %% Sequence line. Note: string:strip/1 removes only spaces, so the
    %% trailing newline returned by file:read_line/1 is still counted.
    line(L, {Sequences, Total}) -> {Sequences, Total + string:len(string:strip(L))}.

    scanLines(S, Sequences, Total) ->
        case file:read_line(S) of
            eof -> {Sequences, Total};
            {error, _} -> {Sequences, Total};
            {ok, Line} ->
                {S2, T2} = line(Line, {Sequences, Total}),
                scanLines(S, S2, T2)
        end.

    test() ->
        case file:open("/home/sarnold/tmp/uniprot_sprot.fasta", [read, read_ahead, raw]) of
            {ok, File} ->
                {Sequences, Total} = scanLines(File, 0, 0),
                io:format("~p\n", [Total / (1.0 * Sequences)]);
            {error, Reason} ->
                %% The arguments to io:format/2 must be passed as a list.
                io:format("~p~n", [Reason])
        end,
        halt().
    