Question
I have a text file of character sequences in which each record consists of two lines: a header, and the sequence itself on the following line. The structure of the file is as follows:
>header1
aaaaaaaaa
>header2
bbbbbbbbbbb
>header3
aaabbbaaaa
[...]
>headerN
aaabbaabaa
In another file I have a list of headers of sequences that I would like to remove, like this:
>header1
>header5
>header12
[...]
>header145
The idea is to remove these sequences from the first file, i.e. each listed header plus the line that follows it. I did it using sed as follows:
while read line; do sed -i "/$line/,+1d" first_file.txt; done < second_file.txt
It works, but it takes quite a long time because I am loading the whole file with sed once per header, and the file is quite big. Any idea how I could speed up this process?
Answer 1:
Create a script with the delete commands from the second file:
sed 's#\(.*\)#/\1/,+1d#' secondFile.txt > commands.sed
Then apply that script to the first file:
sed -f commands.sed firstFile.txt
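A minimal sketch of the same generate-then-apply idea, assuming GNU sed (the addr,+1 address form is a GNU extension) and hypothetical sample files created just for the demonstration. Unlike the one-liner above, it anchors each pattern with ^ and $ so that >header1 does not also delete >header12. Headers containing / or regex metacharacters would still need escaping.

```shell
# Hypothetical sample data in a throwaway directory.
tmp=$(mktemp -d)
printf '%s\n' '>header1' 'aaaaaaaaa' '>header12' 'bbbbbbbbbbb' '>header3' 'aaabbbaaaa' > "$tmp/first.txt"
printf '%s\n' '>header1' '>header3' > "$tmp/second.txt"

# Turn each header into an anchored delete command, e.g. /^>header1$/,+1d
sed 's#.*#/^&$/,+1d#' "$tmp/second.txt" > "$tmp/commands.sed"

# Apply all delete commands in a single pass over the data file.
sed_result=$(sed -f "$tmp/commands.sed" "$tmp/first.txt")
printf '%s\n' "$sed_result"
rm -rf "$tmp"
```

Note that sed still tests every line against every generated pattern, so with a very long header list a hash lookup (as in awk) is likely to scale better.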
Answer 2:
$ awk 'NR==FNR{a[$0];next} $0 in a{c=2} !(c&&c--)' list file
>header2
bbbbbbbbbbb
>header3
aaabbbaaaa
[...]
>headerN
aaabbaabaa
c is how many lines you want to skip starting at the one that just matched. See https://stackoverflow.com/a/17914105/1745001.
Alternatively:
$ awk 'NR==FNR{a[$0];next} /^>/{f=($0 in a ? 1 : 0)} !f' list file
>header2
bbbbbbbbbbb
>header3
aaabbbaaaa
[...]
>headerN
aaabbaabaa
f is whether or not the most recently read >... line was found in the target array a[]. f=($0 in a ? 1 : 0) could be abbreviated to just f=($0 in a), but I prefer the ternary expression for clarity.
The first script relies on you knowing how many lines long each record is, while the second relies on every record starting with >. If you know both, then which one you use is a style choice.
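To make the difference between the two variants concrete, here is a small sketch that runs both on the same input. The file names and contents are made up for illustration, and every record is exactly two lines, so both scripts should agree.

```shell
# Hypothetical sample data in a throwaway directory.
tmp=$(mktemp -d)
printf '%s\n' '>header1' 'aaaaaaaaa' '>header2' 'bbbbbbbbbbb' '>header3' 'aaabbbaaaa' > "$tmp/file"
printf '%s\n' '>header1' '>header3' > "$tmp/list"

# Counter variant: on a match set c=2, then suppress lines while c counts down to 0.
by_count=$(awk 'NR==FNR{a[$0];next} $0 in a{c=2} !(c&&c--)' "$tmp/list" "$tmp/file")

# Flag variant: f is recomputed on every ">" line and stays in force until the next one.
by_flag=$(awk 'NR==FNR{a[$0];next} /^>/{f=($0 in a)} !f' "$tmp/list" "$tmp/file")

printf '%s\n' "$by_count"
rm -rf "$tmp"
```

Both variables end up holding only the >header2 record. If a sequence were wrapped over several lines, only the flag variant would drop all of them, because the counter variant skips a fixed number of lines.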
Answer 3:
You may use this awk:
awk 'NR == FNR{seen[$0]; next} /^>/{p = !($0 in seen)} p' hdr.txt details.txt
Answer 4:
The question as asked is easy to answer, but the answer will not help you when you handle generic FASTA files. FASTA files have a sequence header followed by one or more lines that are concatenated to represent the sequence. The FASTA file format roughly obeys the following rules:
- The description line (defline) or header/identifier line, which begins with a greater-than character (>), gives a name and/or a unique identifier for the sequence, and may also contain additional information.
- Following the description line is the actual sequence itself as a standard one-letter character string. Anything other than a valid character is ignored (including spaces, tabs, asterisks, etc.).
- The sequence can span multiple lines.
- A multiple-sequence FASTA file is obtained by concatenating several single-sequence FASTA files into one file, generally leaving an empty line between two subsequent sequences.
Most of the methods presented here will fail on a multi-FASTA file with multi-line sequences.
The following also handles multi-line records:
awk '(NR==FNR) { toRemove[$1]; next }
/^>/ { p=1; for(h in toRemove) if ($0 ~ h) p=0 }
p' headers.txt file.fasta
This is very similar to the answers of Ed Morton and anubhava, but the difference here is that the file headers.txt could contain only a part of the header.
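As a hedged illustration of the partial-header idea on a multi-line FASTA file, the sketch below uses made-up file contents and swaps the regex test for a literal substring test with index(), so that regex metacharacters in a header are not interpreted. That substitution is my assumption about what is wanted, not part of the answer above.

```shell
# Hypothetical multi-line FASTA: headers carry descriptions, sequences are wrapped.
tmp=$(mktemp -d)
printf '%s\n' '>seq1 sample A' 'aaaa' 'aaaa' '>seq2 sample B' 'bbbb' 'bbbb' '>seq3 sample C' 'cccc' > "$tmp/file.fasta"
# Only the identifier part of the header is listed for removal.
printf '%s\n' '>seq2' > "$tmp/headers.txt"

kept=$(awk '
  NR == FNR { toRemove[$1]; next }     # first file: collect header fragments
  /^>/ {                               # new record: decide whether to keep it
    p = 1
    for (h in toRemove) if (index($0, h)) p = 0   # literal substring test
  }
  p                                    # print every line of a kept record
' "$tmp/headers.txt" "$tmp/file.fasta")
printf '%s\n' "$kept"
rm -rf "$tmp"
```

Two caveats follow from the substring match: a fragment such as >seq1 would also remove >seq12, and the loop over all fragments runs once per record, so this is slower than an exact hash lookup when the list is long.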
Answer 5:
This awk might work for you:
awk 'FNR==NR{a[$0]=1;next}a[$0]{getline;next}1' input2 input1
Answer 6:
One option is to create a long sed expression:
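# Build one sed program with an anchored delete command per header (bash: += appends)
sedcmd=
while IFS= read -r line; do sedcmd+="/^$line\$/,+1d;"; done < second_file.txt
echo "sedcmd:$sedcmd"
# Quote the program so the shell passes it to sed as a single argument
sed "$sedcmd" first_file.txt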
This reads the file only once. Note that I added ^ and $ to the sed pattern (so that >header1 doesn't match >header123...).
Using a script file (as @daniu suggests) might be better if you have thousands of headers, as you risk hitting the maximum command-line length with this method.
Answer 7:
Try GNU sed:
sed -E ':s $!N;s/\n/\|/;ts ;s~.*~/&/\{N;d\}~' second_file.txt| sed -E -f - first_file.txt
Prepend the time command to both scripts to compare their speed, i.e. time while read line;do... and time sed -....
In my test this finished in less than half the time of the OP's loop.
Source: https://stackoverflow.com/questions/55636069/remove-multiple-sequences-from-fasta-file