Question
I have multiple FASTA files, each containing thousands of sequences of varying length. I would like to keep only the first 200 (n) bases of each sequence. How can I do this in Perl?
Answer 1:
It is difficult to tell exactly what you mean without seeing an example, but if you only need the first 200 characters of each line, just use cut:
cut -c1-200 file
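One caveat is worth checking before relying on cut: it truncates every line independently, so it only yields the first n bases when each sequence occupies a single line. A small sketch with n = 5 and a made-up two-record file (the sequences and the temporary file are assumptions for illustration):

```shell
# Made-up input: seq1 is wrapped over two lines, seq2 sits on one line.
demo=$(mktemp)
printf '>seq1\nACGTACGT\nACGTACGT\n>seq2\nTTTTTTTTTT\n' > "$demo"

# cut keeps the first 5 characters of every line, headers included.
trimmed=$(cut -c1-5 "$demo")
printf '%s\n' "$trimmed"
rm -f "$demo"
```

Here seq1 still carries 10 bases (5 from each of its two lines) while seq2 is cut to 5 as intended. Any header longer than n characters would be shortened as well.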
Answer 2:
If a sequence is spread over several physical lines, print only up through its 200th character, counted across those lines. A line starting with a wedge (>) is a header line, which marks the start of a new sequence.
awk '/^>/ { seqlen = 0; print; next }
     seqlen < 200 {
         if (seqlen + length($0) > 200)
             $0 = substr($0, 1, 200 - seqlen);
         seqlen += length($0); print
     }' file.fasta >newfile.fasta
Oh, in Perl?
perl -nle 'if (/^>/) { $seqlen = 0; print; next }
next if ($seqlen >= 200);
$_ = substr($_, 0, 200-$seqlen) if ($seqlen + length($_) > 200);
$seqlen += length($_);
print;' file.fasta >newfile.fasta
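The question mentions several files and a variable length n. A minimal sketch that wraps the awk logic from this answer in a shell function with n as a parameter; `trim_fasta` and the `.trimmed` output suffix are hypothetical names, not part of any existing tool:

```shell
# trim_fasta N FILE: print FILE with each sequence cut to its first N bases.
# (trim_fasta is a hypothetical helper name chosen for this sketch.)
trim_fasta() {
    awk -v n="$1" '
        /^>/ { seqlen = 0; print; next }         # header: reset the per-sequence counter
        seqlen < n {
            if (seqlen + length($0) > n)
                $0 = substr($0, 1, n - seqlen);  # keep only the bases still missing
            seqlen += length($0)
            print
        }' "$2"
}

# Possible usage over many files:
#   for f in *.fasta; do trim_fasta 200 "$f" > "$f.trimmed"; done
```

Like the original one-liner, this keeps the existing line wrapping and simply stops printing once n bases of a sequence have been emitted.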
Answer 3:
Read one whole record at a time by setting the input record separator to '>', join the sequence lines, and keep only the first 200 bases:
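$/ = '>';                       # read one FASTA record (up to the next '>') at a time
<>;                             # discard the empty chunk before the first '>'
while (my $seq = <>) {
    $seq =~ s/>$//;             # drop the trailing record separator
    $seq =~ s/^(.*)//;          # strip the header line, capturing it in $1
    my $id = $1;
    $seq =~ s/\n//g;            # join the wrapped sequence lines
    $seq = substr $seq, 0, 200; # substr is safe for sequences shorter than 200
    print ">$id\n$seq\n";
}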
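For comparison with the line-based awk script in Answer 2, the same record-at-a-time idea can be sketched in awk. This is an illustrative equivalent, not the answerer's code; `trim_fasta_records` is a hypothetical name, and like the Perl version it assumes that '>' never appears inside a header:

```shell
# trim_fasta_records N FILE: hypothetical helper; one output line per sequence.
trim_fasta_records() {
    awk -v n="$1" '
        BEGIN { RS = ">"; FS = "\n" }   # one record per FASTA entry, one field per line
        NR > 1 {                        # record 1 is the empty text before the first ">"
            seq = ""
            for (i = 2; i <= NF; i++)   # field 1 is the header, the rest is sequence
                seq = seq $i
            print ">" $1
            print substr(seq, 1, n)
        }' "$2"
}
```

The design difference from Answer 2 is that each trimmed sequence comes out on a single line, because the wrapped lines are joined before cutting.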
Answer 4:
I recommend that you consider using BioPerl for this sort of thing, because it makes these tasks very easy and you don't have to worry about things like formatting. In the code below, the first argument to the script is your FASTA file and the second argument is a file to hold only the first 200 bases of each sequence.
#!/usr/bin/env perl
use strict;
use warnings;
use Bio::Seq;
use Bio::SeqIO;
my $usage = "$0 infile outfile\n";
my $infile = shift or die $usage;
my $outfile = shift or die $usage;
my $seqin = Bio::SeqIO->new(-file => $infile, -format => 'fasta');
my $seqout = Bio::SeqIO->new(-file => ">$outfile", -format => 'fasta');
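while (my $seq = $seqin->next_seq) {
    # subseq throws if the end exceeds the sequence length, so clamp it
    my $end = $seq->length < 200 ? $seq->length : 200;
    my $first200 = $seq->subseq(1, $end); # 1-based
    my $subseq = Bio::Seq->new(-seq => $first200, -id => $seq->id);
    $seqout->write_seq($subseq);
}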
Answer 5:
Here is how I solved it, in case anyone is interested in another way to do it. I used a tool included in Bio-Linux called fasta_formatter to put each sequence on a single line (-w 0), then trimmed with cut as @sudo_O suggested, and finally wrapped the sequences back to a width of 80 letters.
fasta_formatter -w 0 < FILE | cut -c1-LENGTH | fasta_formatter -w 80 > TRIMMED_FILE
Source: https://stackoverflow.com/questions/16335327/fasta-delete-sequences-after-n-length