I'm trying to get the mean length of fasta sequences using Erlang. A fasta file looks like this:
>title1
ATGACTAGCTAGCAGCGATCGACCGTCGTACGC
AT
If you need really fast IO, you have to use a little more trickery than usual.
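-module(g).
-export([s/0]).

%% Entry point: read stdin through a port, print the mean, then stop the VM.
s() ->
    %% fd 0 is stdin; {line, 256} delivers the input one line per message.
    P = open_port({fd, 0, 1}, [in, binary, {line, 256}]),
    r(P, 0, 0),
    halt().

%% C counts title lines, L accumulates the total sequence length.
r(P, C, L) ->
    receive
        {P, {data, {eol, <<$>:8, _/binary>>}}} ->
            r(P, C+1, L);
        {P, {data, {eol, Line}}} ->
            r(P, C, L + size(Line));
        {'EXIT', P, normal} ->
            io:format("~p~n", [L/C])
    end.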
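The receive loop above matches only eol messages. With the {line, 256} option, a line longer than 256 bytes arrives as one or more noeol chunks followed by a final eol chunk, so very long lines would need extra handling. Here is a minimal sketch of that bookkeeping, written as a pure fold over {Flag, Chunk} tuples so it can be tried without a port. The module name fasta_fold and all of its functions are hypothetical helpers, not part of the code above.

```erlang
-module(fasta_fold).
-export([mean/1]).

%% Hypothetical helper: folds over {Flag, Chunk} tuples shaped like the
%% payload of {line, N} port messages. Mode remembers which kind of line
%% an unfinished (noeol) chunk belongs to.
mean(Chunks) ->
    {Count, Total, _Mode} = lists:foldl(fun step/2, {0, 0, start}, Chunks),
    case Count of
        0 -> undefined;            % no title lines, so no mean to report
        _ -> Total / Count
    end.

step({Flag, Chunk}, {C, L, Mode}) ->
    {Kind, C1, L1} = classify(Mode, Chunk, C, L),
    {C1, L1, next_mode(Flag, Kind)}.

%% At the start of a line, a leading ">" marks a title; anything else is sequence.
classify(start, <<$>, _/binary>>, C, L) -> {header, C + 1, L};
classify(start, Chunk, C, L)            -> {seq, C, L + byte_size(Chunk)};
%% Continuation chunks inherit the kind of the line they belong to.
classify(header, _Chunk, C, L)          -> {header, C, L};
classify(seq, Chunk, C, L)              -> {seq, C, L + byte_size(Chunk)}.

%% eol finishes the line; noeol means the next chunk continues it.
next_mode(eol, _Kind)  -> start;
next_mode(noeol, Kind) -> Kind.
```

For the sample file from the question, fasta_fold:mean([{eol, <<">title1">>}, {eol, <<"ATGACTAGCTAGCAGCGATCGACCGTCGTACGC">>}, {eol, <<"AT">>}]) returns 35.0. Returning undefined for input without titles is a design choice of this sketch that avoids the division by zero L/C would hit.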
This is the fastest IO I know of, but note that it needs -noshell -noinput.
Compile it just as erlc +native +"{hipe, [o3]}" g.erl would, but with -smp disable:
erl -smp disable -noinput -mode minimal -boot start_clean -s erl_compile compile_cmdline @cwd /home/hynek/Download @option native @option '{hipe, [o3]}' @files g.erl
and run:
time erl -smp disable -noshell -mode minimal -boot start_clean -noinput -s g s < uniprot_sprot.fasta
352.6697028442464
real 0m3.241s
user 0m3.060s
sys 0m0.124s
With -smp enabled but still native, it takes:
$ erlc +native +"{hipe, [o3]}" g.erl
$ time erl -noshell -mode minimal -boot start_clean -noinput -s g s
Byte code but with -smp disable (almost on par with native because most of the work is done in the port!):
$ erlc g.erl
$ time erl -smp disable -noshell -mode minimal -boot start_clean -noinput -s g s
Just for completeness, byte code with SMP:
$ time erl -noshell -mode minimal -boot start_clean -noinput -s g s
For comparison, sarnold's version gives me the wrong answer and takes longer on the same hardware:
$ erl -smp disable -noinput -mode minimal -boot start_clean -s erl_compile compile_cmdline @cwd /home/hynek/Download @option native @option '{hipe, [o3]}' @files golf.erl
./golf.erl:5: Warning: variable 'Rest' is unused
$ time erl -smp disable -noshell -mode minimal -s golf test
359.04679841439776
real 0m17.569s
user 0m16.749s
sys 0m0.664s
EDIT: I have looked at the characteristics of uniprot_sprot.fasta and I'm a little bit surprised. It has 3824397 rows and is 232MB. That means the -smp disable version can handle 1.18 million text lines per second (71MB/s in line-oriented IO).
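Assuming these rates are taken against the 3.241 s wall-clock ("real") time of the native -smp disable run, the arithmetic works out as:

```latex
\frac{3\,824\,397\ \text{lines}}{3.241\ \text{s}} \approx 1.18 \times 10^{6}\ \text{lines/s},
\qquad
\frac{232\ \text{MB}}{3.241\ \text{s}} \approx 71.6\ \text{MB/s}
```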