Remove multiple sequences from fasta file

Question

I have a text file of character sequences in which each record consists of two lines: a header, followed by the sequence itself on the next line. The structure of the file is as follows:

>header1
aaaaaaaaa
>header2
bbbbbbbbbbb
>header3
aaabbbaaaa
[...]
>headerN
aaabbaabaa

In another file I have a list of the headers of the sequences that I would like to remove, like this:

>header1
>header5
>header12
[...]
>header145

The idea is to remove these sequences from the first file, that is, each of these headers plus the line that follows it. I did it using sed as follows:

while read line; do sed -i "/$line/,+1d" first_file.txt; done < second_file.txt

It works, but it takes quite long because sed loads the whole file once per header, and the file is quite big. Any idea how I could speed up this process?

Answer 1:

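Create a sed script with one delete command per header in the second file. The ^ and $ anchors keep >header1 from also matching >header12, and the ,+1 address is a GNU sed extension:

sed 's#.*#/^&$/,+1d#' secondFile.txt > commands.sed

Then apply that script to the first file in a single pass:

sed -f commands.sed firstFile.txt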
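
To see what the generated script contains, here is a minimal sketch on made-up data. The scratch directory and sample records are assumptions for illustration only, and the headers are assumed to contain no regex metacharacters or # characters.

```shell
# Hypothetical sample data in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' '>header1' 'aaaaaaaaa' '>header12' 'bbbbbbbbbbb' \
              '>header2' 'aaabbbaaaa' > "$workdir/firstFile.txt"
printf '%s\n' '>header1' > "$workdir/secondFile.txt"

# Turn every header line into an anchored delete command.
sed 's#.*#/^&$/,+1d#' "$workdir/secondFile.txt" > "$workdir/commands.sed"
cat "$workdir/commands.sed"

# Apply all delete commands in one pass over the data (GNU sed).
result=$(sed -f "$workdir/commands.sed" "$workdir/firstFile.txt")
printf '%s\n' "$result"
```

With the anchors in place, only the >header1 record is dropped and >header12 is left alone.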

Answer 2:

$ awk 'NR==FNR{a[$0];next} $0 in a{c=2} !(c&&c--)' list file
>header2
bbbbbbbbbbb
>header3
aaabbbaaaa
[...]
>headerN
aaabbaabaa

c is the number of lines to skip, starting with the line that just matched. See https://stackoverflow.com/a/17914105/1745001.
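
The countdown can be hard to read in its compact form, so here is the same logic spelled out as a sketch. The sample files are hypothetical, and the separate c && c-- rule is an expanded rewrite of !(c&&c--), not the original one-liner.

```shell
# Hypothetical sample data in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' '>header1' '>header3' > "$workdir/list"
printf '%s\n' '>header1' 'aaaaaaaaa' '>header2' 'bbbbbbbbbbb' \
              '>header3' 'aaabbbaaaa' > "$workdir/file"

kept=$(awk '
    NR == FNR { a[$0]; next }   # first file: remember the headers to drop
    $0 in a   { c = 2 }         # listed header: skip this line and the next one
    c && c--  { next }          # while c is nonzero: decrement it and skip the line
    { print }                   # c is zero: keep the line
' "$workdir/list" "$workdir/file")
printf '%s\n' "$kept"
```

Setting c to 3 instead would skip a header plus two following lines, which is all it takes to adapt this to fixed three-line records.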

Alternatively:

$ awk 'NR==FNR{a[$0];next} /^>/{f=($0 in a ? 1 : 0)} !f' list file
>header2
bbbbbbbbbbb
>header3
aaabbbaaaa
[...]
>headerN
aaabbaabaa

f is whether or not the most recently read >... line was found in the target array a[]. f=($0 in a ? 1 : 0) could be abbreviated to just f=($0 in a) but I prefer the ternary expression for clarity.

The first script relies on you knowing how many lines long each record is, while the second relies on every record starting with >. If both hold, which one you use is a matter of style.
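
The difference shows up as soon as a record has more than one sequence line. A small sketch on invented data (the file names and records are assumptions):

```shell
# Hypothetical data: the record to remove spans two sequence lines.
workdir=$(mktemp -d)
printf '%s\n' '>header1' > "$workdir/list"
printf '%s\n' '>header1' 'aaaaa' 'ccccc' '>header2' 'bbbbb' > "$workdir/file"

# Fixed line count: skips exactly two lines, so the second sequence line survives.
by_count=$(awk 'NR==FNR{a[$0];next} $0 in a{c=2} !(c&&c--)' \
    "$workdir/list" "$workdir/file")

# Header flag: skips everything up to the next line that starts with >.
by_header=$(awk 'NR==FNR{a[$0];next} /^>/{f=($0 in a ? 1 : 0)} !f' \
    "$workdir/list" "$workdir/file")

printf 'by count:\n%s\n\nby header:\n%s\n' "$by_count" "$by_header"
```

Here the count-based script leaves the orphaned line ccccc behind, while the flag-based script removes the whole record.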

Answer 3:

You may use this awk:

awk 'NR == FNR{seen[$0]; next} /^>/{p = !($0 in seen)} p' hdr.txt details.txt

Answer 4:

The question as asked is easy to answer, but the answer will not help you when you handle generic FASTA files. A FASTA file has a sequence header followed by one or more lines that are concatenated to form the sequence. The FASTA format roughly obeys the following rules:

The description line (defline) or header/identifier line, which begins with the greater-than character (>), gives a name and/or a unique identifier for the sequence, and may also contain additional information.

Following the description line is the actual sequence itself in a standard one-letter character string. Anything other than a valid character is ignored (including spaces, tabs, asterisks, etc.).

The sequence can span multiple lines.

A multiple-sequence FASTA file is obtained by concatenating several single-sequence FASTA files into one file, generally leaving an empty line between two consecutive sequences.
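
Because of these rules, one way to keep using the two-line approaches is to first join each record's sequence lines into a single line. The following is only a sketch on made-up data. It joins lines and drops empty separator lines, but does not strip other ignorable characters such as spaces or asterisks.

```shell
# Hypothetical multi-line FASTA with an empty line between records.
workdir=$(mktemp -d)
printf '%s\n' '>header1' 'aaaa' 'cccc' '' '>header2' 'bbbb' > "$workdir/file.fasta"

linear=$(awk '
    /^>/ {                          # a new record starts
        if (seq != "") print seq    # flush the previous sequence, if any
        print                       # print the header unchanged
        seq = ""
        next
    }
    { seq = seq $0 }                # append sequence lines (empty lines add nothing)
    END { if (seq != "") print seq }
' "$workdir/file.fasta")
printf '%s\n' "$linear"
```

After this step every record is exactly a header line followed by one sequence line.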

Most of the methods presented here fail on a multi-FASTA file with multi-line sequences.

The following also works when sequences span multiple lines:

awk '(NR==FNR) { toRemove[$1]; next }
     /^>/ { p=1; for(h in toRemove) if ($0 ~ h) p=0 }
    p' headers.txt file.fasta

This is very similar to the answers of Ed Morton and anubhava, but the difference here is that the file headers.txt may contain only a part of each header. Each entry is applied as a regular expression to the header line, so an entry such as >header1 also removes >header12.
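
If matching on part of the header is too loose, a stricter variant compares only the sequence ID, meaning the first whitespace-delimited token after the >. This is a sketch under that assumption, not the answer's original method. The variable names and sample data are hypothetical.

```shell
# Hypothetical data: headers carry a description after the ID.
workdir=$(mktemp -d)
printf '%s\n' '>header1' > "$workdir/headers.txt"
printf '%s\n' '>header1 sample A' 'aaaa' 'cccc' \
              '>header12 sample B' 'bbbb' 'dddd' > "$workdir/file.fasta"

filtered=$(awk '
    NR == FNR { if (NF) { sub(/^>/, "", $1); ids[$1] } next }   # IDs to remove
    /^>/ {                          # header line: decide for the whole record
        id = $1
        sub(/^>/, "", id)
        keep = !(id in ids)         # exact ID comparison, no regex involved
    }
    keep                            # print header and sequence lines of kept records
' "$workdir/headers.txt" "$workdir/file.fasta")
printf '%s\n' "$filtered"
```

With an exact comparison, header1 is removed together with both of its sequence lines, and header12 is kept.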

Answer 5:

This awk might work for you:

awk 'FNR==NR{a[$0]=1;next}a[$0]{getline;next}1' input2 input1

Answer 6:

One option is to create a long sed expression:

sedcmd=
while IFS= read -r line; do sedcmd+="/^$line\$/,+1d;"; done < second_file.txt
echo "sedcmd:$sedcmd"
sed "$sedcmd" first_file.txt

This will only read the file once. Note that I added the ^ and $ to the sed pattern (so >header1 doesn't match >header123...).

Using a file (as @daniu suggests) might be better if you have thousands of headers, as you risk hitting the maximum command-line length with this method.

Answer 7:

Try GNU sed:

sed -E ':s $!N;s/\n/\|/;ts ;s~.*~/&/\{N;d\}~' second_file.txt| sed -E -f -  first_file.txt

Prepend the time command to both scripts to compare their speed, i.e. time while read line; do ... and time sed -... In my test this finished in less than half the time of the OP's loop.

Source: https://stackoverflow.com/questions/55636069/remove-multiple-sequences-from-fasta-file

Tags: bash, awk, sed, fasta