fasta: delete sequences after n length

Submitted by 大城市里の小女人 on 2019-12-13 19:33:30

Question


I have multiple FASTA files, each containing thousands of sequences of varying length. I would like to keep only the first n bases (here n = 200) of each sequence. How can I do this in Perl?


Answer 1:


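It is difficult to know exactly what you mean without seeing an example, but if each sequence sits on a single line and you only need the first 200 characters of every line, just use cut. Note that cut treats header lines like any other line, so headers longer than 200 characters are truncated as well: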

cut -c1-200 file
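
If the header lines should stay untouched, the per-line idea can be sketched with awk instead of cut. This is only a minimal sketch: trim_lines is a hypothetical helper name, and it assumes, like the cut command above, that every sequence sits on a single line.

```shell
# Hypothetical helper: trim every sequence line to n characters,
# passing header lines (those starting with ">") through unchanged.
# Assumption: each sequence occupies exactly one line.
trim_lines() {
    awk -v n="$1" '/^>/ { print; next } { print substr($0, 1, n) }'
}

# Small demonstration on two made-up records, keeping 4 bases each
printf '>seq1 a long description\nACGTACGTAC\n>seq2\nACG\n' | trim_lines 4
```

For real data the call would look like trim_lines 200 < file > newfile.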



Answer 2:


If a sequence is wrapped over several physical lines, print its lines only up through the 200th character of the sequence. A line starting with > is a header line, which marks the start of a new sequence.

awk '/^>/{ seqlen=0; print; next; }
    seqlen < 200 { if (seqlen + length($0) > 200)
            $0 = substr($0, 1, 200-seqlen);
        seqlen += length($0); print }' file.fasta >newfile.fasta

Oh, in Perl?

perl -nle 'if (/^>/) { $seqlen = 0; print; next }
    next if ($seqlen >= 200);
    $_ = substr($_, 0, 200-$seqlen) if ($seqlen + length($_) > 200);
    $seqlen += length($_);
    print;' file.fasta >newfile.fasta
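
Since the question mentions several files and a variable length n, the awk version above could be wrapped in a function and run in a loop. This is a sketch under assumptions: trim_fasta, the *.fasta glob, and the trimmed/ output directory are hypothetical names chosen for illustration.

```shell
# Hypothetical wrapper around the awk logic above, with n as a parameter.
# Usage: trim_fasta N < in.fasta > out.fasta
trim_fasta() {
    awk -v n="$1" '
        /^>/ { seqlen = 0; print; next }      # header: start a new sequence
        seqlen < n {
            line = $0
            if (seqlen + length(line) > n)    # this line crosses the limit
                line = substr(line, 1, n - seqlen)
            seqlen += length(line)
            print line
        }'
}

# Trim every *.fasta file in the current directory into trimmed/
# (assumed layout; adjust the glob and directory to your data).
for f in *.fasta; do
    [ -e "$f" ] || continue                   # glob matched nothing
    mkdir -p trimmed
    trim_fasta 200 < "$f" > "trimmed/$f"
done
```

Writing to a separate directory keeps the originals intact and prevents the loop from picking up its own output on a second run.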



Answer 3:


Read the file one record at a time by setting the input record separator to >, join the sequence lines of each record, and keep only the first 200 characters:

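$/ = '>';                          # read one FASTA record per <> call
<>;                                # discard the empty chunk before the first '>'
while (my $seq = <>) {
    $seq =~ s/>$//;                # drop the trailing record separator
    $seq =~ s/^(.*)//;             # strip the header line, capturing it in $1
    my $id = $1;
    $seq =~ s/\n//g;               # join the wrapped sequence lines
    $seq = substr $seq, 0, 200;    # keep the first 200 bases
    print ">$id\n$seq\n";
}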
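
The same record-at-a-time reading can be reused to check a trimmed file. The sketch below reports the length of the longest sequence on standard input, which should not exceed 200 after trimming. max_seq_len is a hypothetical helper name, and like the Perl script above it assumes that > occurs only at the start of header lines.

```shell
# Hypothetical checker: print the length of the longest sequence on stdin.
# Records are split on ">" and fields on newlines, so field 1 is the header
# and the remaining fields are the wrapped sequence lines.
max_seq_len() {
    awk 'BEGIN { RS = ">"; FS = "\n" }
        NR > 1 {                              # record 1 is the empty chunk before the first ">"
            len = 0
            for (i = 2; i <= NF; i++) len += length($i)
            if (len > max) max = len
        }
        END { print max + 0 }'
}

# Demonstration on two made-up records (sequence lengths 6 and 3)
printf '>a\nACGT\nAC\n>b\nACG\n' | max_seq_len
```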



Answer 4:


I recommend that you consider using BioPerl for this sort of thing, because it makes these tasks very easy and you don't have to worry about details such as formatting. In the code below, the first argument to the script is your FASTA file and the second argument is a file that will hold only the first 200 bases of each sequence.

#!/usr/bin/env perl

use strict;
use warnings;
use Bio::Seq;
use Bio::SeqIO;

my $usage = "$0 infile outfile\n";
my $infile = shift or die $usage;
my $outfile = shift or die $usage;

my $seqin = Bio::SeqIO->new(-file => $infile, -format => 'fasta');
my $seqout = Bio::SeqIO->new(-file => ">$outfile", -format => 'fasta');

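while (my $seq = $seqin->next_seq) {
    # subseq() throws if the end coordinate exceeds the sequence length,
    # so clamp it for sequences shorter than 200 bases
    my $end = $seq->length < 200 ? $seq->length : 200;
    my $first200 = $seq->subseq(1, $end); # coordinates are 1-based
    my $subseq = Bio::Seq->new(-seq => $first200, -id => $seq->id);
    $seqout->write_seq($subseq);
}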



Answer 5:


Here's how I solved it, in case anyone is interested in trying another way. I used fasta_formatter, a tool included in Bio-Linux, to put each sequence on a single line (-w 0), then trimmed with cut as @sudo_O suggested, and finally wrapped the sequences back to a width of 80 letters.

fasta_formatter -w 0 < FILE | cut -c1-LENGTH | fasta_formatter -w 80 > TRIMMED_FILE


Source: https://stackoverflow.com/questions/16335327/fasta-delete-sequences-after-n-length
