“average length of the sequences in a fasta file”: Can you improve this Erlang code?


I'm trying to get the mean length of FASTA sequences using Erlang. A FASTA file looks like this:

>title1
ATGACTAGCTAGCAGCGATCGACCGTCGTACGC
AT


                      
5 answers

北恋 (2021-02-06 12:48):

Have you tried Elixir (elixir-lang.org), which runs on top of Erlang and has a syntax similar to Ruby? Elixir handles strings in the following way:


  Elixir strings are UTF8 binaries, with all the raw speed and memory
  savings that brings. Elixir has a String module with Unicode
  functionality built-in and is a great example of writing code that
  writes code. String.Unicode reads various Unicode database dumps such
  as UnicodeData.txt to dynamically generate Unicode functions for the
  String module built straight from that data! (http://devintorr.es/blog/2013/01/22/the-excitement-of-elixir/)


I just wonder whether Elixir would be faster.
                                                                        
                                                        
            
            
              
                
感情败类 (2021-02-06 12:52):

The call string:len(string:strip(L)) traverses the list at least twice (I don't know how string:strip is implemented). Instead, you could write a simple function that counts the line length without the spaces:

stripped_len(L) ->
  stripped_len(L, 0).

stripped_len([$ |L], Len) ->
  stripped_len(L, Len);

stripped_len([_C|L], Len) ->
  stripped_len(L, Len + 1);

stripped_len([], Len) ->
  Len.


The same method can be applied to binaries as well.
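
A minimal sketch of that binary variant, assuming each line arrives as a binary (for example from a file opened with the binary option). The module name fasta_len is made up for illustration, and skipping newline characters as well as spaces is an assumption on my part, not something the list version above does:

```erlang
-module(fasta_len).
-export([stripped_len/1]).

%% Count the bytes of a binary line, ignoring spaces and line endings.
stripped_len(Bin) when is_binary(Bin) ->
    stripped_len(Bin, 0).

%% Skip spaces, "\n" and "\r"; count every other byte.
stripped_len(<<$\s, Rest/binary>>, Len) ->
    stripped_len(Rest, Len);
stripped_len(<<$\n, Rest/binary>>, Len) ->
    stripped_len(Rest, Len);
stripped_len(<<$\r, Rest/binary>>, Len) ->
    stripped_len(Rest, Len);
stripped_len(<<_C, Rest/binary>>, Len) ->
    stripped_len(Rest, Len + 1);
stripped_len(<<>>, Len) ->
    Len.
```

For example, fasta_len:stripped_len(<<"AT GC\n">>) counts only the four residue bytes. Matching on the head of the binary keeps the function tail recursive and makes a single pass without building an intermediate copy.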
                                                                        
                                                        
            
            
              
                
既然无缘 (2021-02-06 12:54):

If you need really fast IO, then you have to do a little bit more trickery than usual.

-module(g).
-export([s/0]).
s()->
  P = open_port({fd, 0, 1}, [in, binary, {line, 256}]),
  r(P, 0, 0),
  halt().
r(P, C, L) ->
  receive
    {P, {data, {eol, <<$>:8, _/binary>>}}} ->
      r(P, C+1, L);
    {P, {data, {eol, Line}}} ->
      r(P, C, L + size(Line));
    {'EXIT', P, normal} ->
      io:format("~p~n",[L/C])
  end.


It is the fastest IO I know of, but note that it needs -noshell -noinput.
Compile it as with erlc +native +"{hipe, [o3]}" g.erl, but with -smp disable:

erl -smp disable -noinput -mode minimal -boot start_clean -s erl_compile compile_cmdline @cwd /home/hynek/Download @option native @option '{hipe, [o3]}' @files g.erl


and run:

time erl -smp disable -noshell -mode minimal -boot start_clean -noinput -s g s < uniprot_sprot.fasta
352.6697028442464

real    0m3.241s
user    0m3.060s
sys     0m0.124s


With -smp enable and native code it takes:

$ erlc +native +"{hipe, [o3]}" g.erl
$ time erl -noshell -mode minimal -boot start_clean -noinput -s g s<uniprot_sprot.fasta
352.6697028442464

real    0m5.103s
user    0m4.944s
sys     0m0.112s


Byte code with -smp disable (almost on par with native, because most of the work is done in the port!):

$ erlc g.erl
$ time erl -smp disable -noshell -mode minimal -boot start_clean -noinput -s g s<uniprot_sprot.fasta
352.6697028442464

real    0m3.565s
user    0m3.436s
sys     0m0.104s


Just for completeness, byte code with SMP:

$ time erl -noshell -mode minimal -boot start_clean -noinput -s g s<uniprot_sprot.fasta 
352.6697028442464

real    0m5.433s
user    0m5.236s
sys     0m0.128s


For comparison, sarnold's version gives me a wrong answer and takes longer on the same hardware:

$ erl -smp disable -noinput -mode minimal -boot start_clean -s erl_compile compile_cmdline @cwd /home/hynek/Download @option native @option '{hipe, [o3]}' @files golf.erl
./golf.erl:5: Warning: variable 'Rest' is unused
$ time erl -smp disable -noshell -mode minimal -s golf test
359.04679841439776

real    0m17.569s
user    0m16.749s
sys     0m0.664s


EDIT: I have looked at the characteristics of uniprot_sprot.fasta and I'm a little bit surprised. It has 3824397 lines and is 232MB. That means the -smp disable version handles 1.18 million text lines per second (71MB/s of line-oriented IO).
                                                                        
                                                        
            
            
              
                
天命终不由人 (2021-02-06 13:08):

It looks like your big performance problems have been solved by opening the file in raw mode, but here are some more thoughts if you need to optimise that code further.

Learn and use fprof.

You're using string:strip/1 primarily to remove the trailing newline. Because Erlang values are immutable, it has to make a complete copy of the list (with all the associated memory allocation) just to remove the last character. (Note also that string:strip/1 only removes leading and trailing spaces, so the newline is in fact still counted.) If you know the file is well formed, just subtract one from your count; otherwise I'd try writing a length function that counts the relevant characters and ignores the irrelevant ones.

I'm wary of advice that says binaries are better than lists, but given how little processing you do, it's probably the case here. The first steps are to open the file in binary mode and use erlang:size/1 to find the length.
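
Those first steps might look roughly like the sketch below. It is only an illustration under some assumptions: the module name fasta_mean and the helper residues/1 are invented here, the file is assumed to be well formed, and byte_size/1 is used in place of erlang:size/1 (for binaries they return the same number). Read errors are not handled and would crash with a case_clause.

```erlang
-module(fasta_mean).
-export([mean_length/1]).

%% Mean sequence length of the FASTA file at Path (0.0 if it has no headers).
mean_length(Path) ->
    %% raw + read_ahead gives buffered local IO; binary makes
    %% file:read_line/1 return binaries instead of lists.
    {ok, File} = file:open(Path, [read, raw, binary, read_ahead]),
    try loop(File, 0, 0)
    after file:close(File)
    end.

loop(File, Seqs, Total) ->
    case file:read_line(File) of
        %% A header line starts with ">" and begins a new sequence.
        {ok, <<$>, _/binary>>} -> loop(File, Seqs + 1, Total);
        {ok, Line} -> loop(File, Seqs, Total + residues(Line));
        eof when Seqs > 0 -> Total / Seqs;
        eof -> 0.0
    end.

%% Size of a line without its trailing "\r\n" or "\n", with no copying.
residues(Line) ->
    Size = byte_size(Line),
    if
        Size >= 2, binary_part(Line, Size - 2, 2) =:= <<"\r\n">> -> Size - 2;
        Size >= 1, binary_part(Line, Size - 1, 1) =:= <<"\n">> -> Size - 1;
        true -> Size
    end.
```

Subtracting the line ending from the size avoids building a stripped copy of every line, which is the point made about string:strip/1 above.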

It won't affect performance (significantly), but the multiplication by 1.0 in Total/(1.0*Sequences) is only necessary in languages with broken division.  Erlang division works correctly.
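
To illustrate the division point, here is a tiny sketch (the module name division_demo is made up): the / operator always returns a float, even when both operands are integers, and integer division is a separate operator, div.

```erlang
-module(division_demo).
-export([mean/2]).

%% '/' yields a float for integer arguments, so no 1.0 * trick is needed.
mean(Total, Count) when Count > 0 ->
    Total / Count.
```

So division_demo:mean(7, 2) is 3.5, while 7 div 2 is 3.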
                                                                        
                                                        
            
            
              
                
無奈伤痛 (2021-02-06 13:09):

I too am learning Erlang, thanks for the fun question.

I understand working with Erlang strings as lists of characters can be very slow; if you can work with binaries instead you should see some performance gains. I don't know how you would use arbitrary-length strings with binaries, but if you can sort it out, it should help.

Also, if you don't mind working with a file directly rather than standard_io, perhaps you could speed things along by using file:open(..., [raw, read_ahead]). raw means the file must be on the local node's filesystem, and read_ahead specifies that Erlang should perform file IO with a buffer. (Think of using C's stdio facilities with and without buffering.)

I'd expect the read_ahead to make the most difference, but everything with Erlang includes the phrase "benchmark before guessing".

EDIT

Using file:open("uniprot_sprot.fasta", [read, read_ahead]) gets 1m31s on the full uniprot_sprot.fasta dataset. (Average 359.04679841439776.)

Using file:open(.., [read, read_ahead]) and file:read_line(S), I get 0m34s.

Using file:open(.., [read, read_ahead, raw]) and file:read_line(S), I get 0m9s. Yes, nine seconds.

Here's where I stand now; if you can figure out how to use binaries instead of lists, you might see still more improvement:

-module(golf).
-export([test/0]).

line([],{Sequences,Total}) ->  {Sequences,Total};
line(">" ++ Rest,{Sequences,Total}) -> {Sequences+1,Total};
line(L,{Sequences,Total}) -> {Sequences,Total+string:len(string:strip(L))}.

scanLines(S,Sequences,Total)->
        case file:read_line(S) of
            eof -> {Sequences,Total};
            {error,_} ->{Sequences,Total};
            {ok, Line} -> {S2,T2}=line(Line,{Sequences,Total}), scanLines(S,S2,T2)
        end  .

test()->
    F = file:open("/home/sarnold/tmp/uniprot_sprot.fasta", [read, read_ahead, raw]),
    case F of
    { ok, File } -> 
        {Sequences,Total}=scanLines(File,0,0),
        io:format("~p\n",[Total/(1.0*Sequences)]);
    { error, Reason } ->
            io:format("~p~n", [Reason])
    end,
    halt().

                                                                        
                                                        
            
            
              
                