问题
I have a data in that always comes in block of four in the following format (called FASTQ):
@SRR018006.2016 GA2:6:1:20:650 length=36
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGN
+SRR018006.2016 GA2:6:1:20:650 length=36
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!+!
@SRR018006.19405469 GA2:6:100:1793:611 length=36
ACCCGCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
+SRR018006.19405469 GA2:6:100:1793:611 length=36
7);;).;);;/;*.2>/@@7;@77<..;)58)5/>/
Is there a simple sed/awk/bash way to convert them into this format (called FASTA):
>SRR018006.2016 GA2:6:1:20:650 length=36
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGN
>SRR018006.19405469 GA2:6:100:1793:611 length=36
ACCCGCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
In principle, we want to extract the first two lines in each block-of-4
and replace @
with >
.
回答1:
This is an old question, and there have been many different solutions offered. Since the accepted answer uses sed but has a glaring problem (which is that it will replace @ with > when the @ sign appears as the first letter of the quality line), I feel compelled to offer a simple sed-based solution that actually works:
sed -n '1~4s/^@/>/p;2~4p'
The only assumption made is that each read occupies exactly 4 lines in the FASTQ file, but that seems pretty safe, in my experience.
The fastq_to_fasta script in the fastx toolkit also works. (It's worth mentioning that you need to specify the -Q33 option to accommodate the now common Phred+33 qual encodings. Which is funny, since it's throwing away the quality data anyway!)
回答2:
sed ain't dead. If we're golfing:
sed '/^@/!d;s//>/;N'
Or, emulating http://www.ringtail.tsl.ac.uk/david-studholme/scripts/fastq2fasta.pl posted by Pierre, which only prints the first word (the id) from the first line and does (some) error handling:
#!/usr/bin/sed -f
# Read a total of four lines
$b error
N;$b error
N;$b error
N
# Parse the lines
/^@\(\([^ ]*\).*\)\(\n[ACGTN]*\)\n+\1\n.*$/{
# Output id and sequence for FASTA format.
s//>\2\3/
b
}
:error
i\
Error parsing input:
q
There seem to be plenty of existing tools for converting these formats; you should probably use these instead of anything posted here (including the above).
回答3:
As detailed in Cock, et al (2009) NAR, many of these solutions are incorrect since "the ‘@’ marker character (ASCII 64) may occur anywhere in the quality string. This means that any parser must not treat a line starting with ‘@’ as indicating the start of the next record, without additionally checking the length of the quality string thus far matches the length of the sequence."
See http://ukpmc.ac.uk/articlerender.cgi?accid=PMC2847217 for details.
回答4:
just awk , no need other tools
# awk '/^@SR/{gsub(/^@/,">",$1);print;getline;print}' file
>SRR018006.2016 GA2:6:1:20:650 length=36
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGN
>SRR018006.19405469 GA2:6:100:1793:611 length=36
ACCCGCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
回答5:
See fastq2fasta.pl in http://www.ringtail.tsl.ac.uk/david-studholme/scripts/
回答6:
I'd write
awk '
NR%4 == 1 {print ">" substr($0, 2)}
NR%4 == 2 {print}
' fastq > fasta
回答7:
This is the fastest I've got, and I stuck it in my .bashrc file:
alias fq2fa="awk '{print \">\" substr(\$0,2);getline;print;getline;getline}'"
It doesn't fail on the infrequent but not impossible quality lines that start with @... but does fail on wrapped FASTQ, if that's even legal (it exists though).
回答8:
Here's the solution to the "skip every other line" part of the problem that I just learned from SO:
while read line
do
# print two lines
echo "$line"
read line_to_print
echo "$line_to_print"
# and skip two lines
read line_to_skip
read line_to_skip
done
If all that needs to be done is change one @
to >
, then I reckon
while read line
do
echo "$line" | sed 's/@/>/'
read line
echo "$line"
read line_to_skip
read line_to_skip
done
will do the job.
回答9:
Something like:
awk 'BEGIN{a=0}{if(a==1){print;a=0}}/^@/{print;a=1}' myFastqFile | sed 's/^@/>/'
should work.
回答10:
I think, with gnu grep this could be done with this:
grep -A 1 "^@" t.txt | grep -v "^--" | sed -e "s/^@/\>/"
回答11:
awk 'BEGIN{P=1}{if(P==1||P==2){gsub(/^[@]/,">");print}; if(P==4)P=0; P++}' data
>SRR018006.2016 GA2:6:1:20:650 length=36
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGN
>SRR018006.19405469 GA2:6:100:1793:611 length=36
ACCCGCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
below
awk '{gsub(/^[@]/,">"); print}' data
where data is your data file. I've received:
>SRR018006.2016 GA2:6:1:20:650 length=36
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGN
+SRR018006.2016 GA2:6:1:20:650 length=36
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!+!
>SRR018006.19405469 GA2:6:100:1793:611 length=36
ACCCGCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
+SRR018006.19405469 GA2:6:100:1793:611 length=36
7);;).;);;/;*.2>/@@7;@77<..;)58)5/>/
回答12:
I know I'm way in the future, but for the benefit of googlers:
You may want to use fastq_to_fasta from the fastx toolkit. It will keep the @ sign, though. It will also remove lines with Ns unless you tell it not to.
回答13:
You might be interested in bioawk, it is an adapted version of awk which is tuned to process fasta files
bioawk -c fastx '{ print ">"$name ORS $seq }' file.fastq
Note: BioAwk is based on Brian Kernighan's awk which is documented in "The AWK Programming Language", by Al Aho, Brian Kernighan, and Peter Weinberger (Addison-Wesley, 1988, ISBN 0-201-07981-X) . I'm not sure if this version is compatible with POSIX.
来源:https://stackoverflow.com/questions/1542306/converting-fastq-to-fasta-with-sed-awk