问题
I currently want to sort a hudge fasta file (+10**8 lines and sequences) by sequence size. fasta is a clear defined format in biology use to store sequence (genetic or proteic):
>id1
sequence 1 # could be on several line
>id2
sequence 2
...
I have run a tools that give me in tsv format:
the Identifiant, the length, and the position in bytes of the identifiant.
for now what I am doing is to sort this file by the length column then I parse this file and use seek to retrieve the corresponding sequence then append it to a new file.
# this fonction will get the sequence using seek
def get_seq(file, bites):
with open(file) as f_:
f_.seek(bites, 0) # go to the line of interest
line = f_.readline().strip() # this line is the begin of the
#sequence
to_return = "" # init the string which will contains the sequence
while not line.startswith('>') or not line: # while we do not
# encounter another identifiant
to_return += line
line = f_.readline().strip()
return to_return
# simply append to a file the id and the sequence
def write_seq(out_file, id_, sequence):
with open(out_file, 'a') as out_file:
out_file.write('>{}\n{}\n'.format(id_.strip(), sequence))
# main loop will parse the index file and call the function defined below
with open(args.fai) as ref:
indice = 0
for line in ref:
spt = line.split()
id_ = spt[0]
seq = get_seq(args.i, int(spt[2]))
write_seq(out_file=args.out, id_=id_, sequence=seq)
my problems is the following is really slow does it is normal (it takes several days)? Do I have another way to do it? I am a not a pure informaticien so I may miss some point but I was believing to index files and use seek was the fatest way to achive this am I wrong?
回答1:
Seems like opening two files for each sequence is probably contibuting to a lot to the run time. You could pass file handles to your get/write functions rather than file names, but I would suggest using an established fasta parser/indexer like biopython or samtools. Here's an (untested) solution with samtools:
subprocess.call(["samtools", "faidx", args.i])
with open(args.fai) as ref:
for line in ref:
spt = line.split()
id_ = spt[0]
subprocess.call(["samtools", "faidx", args.i, id_, ">>", args.out], shell=True)
回答2:
What about bash and some basic unix commands (csplit is the clue)? I wrote this simple script, but you can customize/improve it. It's not highly optimized and doesn't use index file, but nevertheless may run faster.
csplit -z -f tmp_fasta_file_ $1 '/>/' '{*}'
for file in tmp_fasta_file_*
do
TMP_FASTA_WC=$(wc -l < $file | tr -d ' ')
FASTA_WC+=$(echo "$file $TMP_FASTA_WC\n")
done
for filename in $(echo -e $FASTA_WC | sort -k2 -r -n | awk -F" " '{print $1}')
do
cat "$filename" >> $2
done
rm tmp_fasta_file*
First positional argument is a filepath to your fasta file, second one is a filepath for output, i.e. ./script.sh input.fasta output.fasta
回答3:
Using a modified version of fastq-sort (currently available at https://github.com/blaiseli/fastq-tools), we can convert the file to fastq format using bioawk, sort with the -L
option I added, and convert back to fasta:
cat test.fasta \
| tee >(wc -l > nb_lines_fasta.txt) \
| bioawk -c fastx '{l = length($seq); printf "@"$name"\n"$seq"\n+\n%.*s\n", l, "IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII"}' \
| tee >(wc -l > nb_lines_fastq.txt) \
| fastq-sort -L \
| tee >(wc -l > nb_lines_fastq_sorted.txt) \
| bioawk -c fastx '{print ">"$name"\n"$seq}' \
| tee >(wc -l > nb_lines_fasta_sorted.txt) \
> test_sorted.fasta
The fasta -> fastq conversion step is quite ugly. We need to generate dummy fastq qualities with the same length as the sequence. I found no better way to do it with (bio)awk than this hack based on the "dynamic width" thing mentioned at the end of https://www.gnu.org/software/gawk/manual/html_node/Format-Modifiers.html#Format-Modifiers.
The IIIII...
string should be longer than the longest of the input sequences, otherwise, invalid fastq will be obtained, and when converting back to fasta, bioawk seems to silently skip such invalid reads.
In the above example, I added steps to count the lines. If the line numbers are not coherent, it may be because the IIIII...
string was too short.
The resulting fasta file will have the shorter sequences first.
To get the longest sequences at the top of the file, add the -r
option to fastq-sort
.
Note that fastq-sort
writes intermediate files in /tmp
. If for some reason it is interrupted before erasing them, you may want to clean your /tmp
manually and not wait for the next reboot.
Edit
I actually found a better way to generate dummy qualities of the same length as the sequence: simply using the sequence itself:
cat test.fasta \
| bioawk -c fastx '{print "@"$name"\n"$seq"\n+\n"$seq}' \
| fastq-sort -L \
| bioawk -c fastx '{print ">"$name"\n"$seq}' \
> test_sorted.fasta
This solution is cleaner (and slightly faster), but I keep my original version above because the "dynamic width" feature of printf
and the usage of tee
to check intermediate data length may be interesting to know about.
来源:https://stackoverflow.com/questions/41239521/sort-fasta-by-sequence-size