Question
I don't know if this is just a quirk with Strawberry Perl, but I can't seem to get this script to run. I just need to take a FASTA file and reverse every sequence in it.
-The problem-
I have a multifasta file:
>seq1
ABCDEFG
>seq2
HIJKLMN
and the expected output is:
>REVseq1
GFEDCBA
>REVseq2
NMLKJIH
The script is here:
$NUM_COL = 80; ## set the column width of output file
$infile = shift; ## grab input sequence file name from command line
$outfile = "test1.txt"; ## name output file, prepend with “REV”
open (my $IN, $infile);
open (my $OUT, '>', $outfile);
$/ = undef; ## allow entire input sequence file to be read into memory
my $text = <$IN>; ## read input sequence file into memory
print $text; ## output sequence file into new decoy sequence file
my @proteins = split (/>/, $text); ## put all input sequences into an array
for my $protein (@proteins) { ## evaluate each input sequence individually
$protein =~ s/(^.*)\n//m; ## match and remove the first descriptive line of
## the FASTA-formatted protein
my $name = $1; ## remember the name of the input sequence
print $OUT ">REV$name\n"; ## prepend with #REV#; a # will help make the
## protein stand out in a list
$protein =~ s/\n//gm; ## remove newline characters from sequence
$protein = reverse($protein); ## reverse the sequence
while (length ($protein) > $NUM_C0L) { ## loop to print sequence with set number of cols
$protein =~ s/(.{$NUM_C0L})//;
my $line = $1;
print $OUT "$line\n";
}
print $OUT "$protein\n"; ## print last portion of reversed protein
}
close ($IN);
close ($OUT);
print "done\n";
Answer 1:
This will do as you ask.
It builds a hash %fasta out of the FASTA file, keeping an array @keys to keep the sequences in order, and then prints out each element of the hash.
Each line of the sequence is reversed using reverse before it is added to the hash, and using unshift adds the lines of the sequence in reverse order.
The program expects the input file as a parameter on the command line, and prints the result to STDOUT, which may be redirected on the command line.
use strict;
use warnings 'all';

my (%fasta, @keys);

{
    my $key;
    while ( <> ) {
        chomp;
        if ( s/^>\K/REV/ ) {    # header line: insert REV after the >
            $key = $_;
            push @keys, $key;
        }
        elsif ( $key ) {
            # reverse the line and put it in front of the lines seen so far
            unshift @{ $fasta{$key} }, scalar reverse;
        }
    }
}

for my $key ( @keys ) {
    print $key, "\n";
    print "$_\n" for @{ $fasta{$key} };
}
output
>REVseq1
GFEDCBA
>REVseq2
NMLKJIH
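The scalar in front of reverse matters here. In list context reverse returns its arguments in reverse order, while in scalar context it reverses the characters of a string, and of $_ when it is called with no arguments. A minimal sketch of the difference, using a made-up sample string and variable names that are not part of the program above:

```perl
use strict;
use warnings;

my $line = 'ABCDEFG';    # sample data, assumed for illustration

# List context: reverse returns the argument list in reverse order,
# so a single string comes back unchanged.
my @list_result = reverse $line;

# Scalar context: reverse concatenates its arguments and reverses
# the characters of the result.
my $scalar_result = scalar reverse $line;

# With no arguments in scalar context, reverse works on $_.
my $from_topic;
for ($line) {
    $from_topic = scalar reverse;
}

print "$list_result[0]\n";    # ABCDEFG
print "$scalar_result\n";     # GFEDCBA
print "$from_topic\n";        # GFEDCBA
```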
Update
If you prefer to rewrap the sequence so that short lines are at the end, then you just need to rewrite the code that dumps the hash.
This alternative uses the length of the longest line in the original file as the limit, and rewraps the reversed sequence to the same length. It's clear that it would be simple to specify an explicit length instead of calculating it.
You will need to add use List::Util 'max' at the top of the program.
my $len = max map length, map @$_, values %fasta;

for my $key ( @keys ) {
    print $key, "\n";
    my $seq = join '', @{ $fasta{$key} };
    print "$_\n" for $seq =~ /.{1,$len}/g;
}
Given the original data, the output is identical to that of the solution above. I used this as input:
>seq1
ABCDEFGHI
JKLMNOPQRST
UVWXYZ
>seq2
HIJKLMN
OPQRSTU
VWXY
with this result. All lines have been wrapped to eleven characters, the length of the longest line (JKLMNOPQRST) in the original data
>REVseq1
ZYXWVUTSRQP
ONMLKJIHGFE
DCBA
>REVseq2
YXWVUTSRQPO
NMLKJIH
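To use an explicit width rather than the length of the longest input line, the max calculation can be replaced by a constant. A minimal sketch, assuming a hypothetical helper named wrap_sequence and a width of 11 chosen only to match the example above:

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the answer's program): split a
# sequence string into pieces of at most $width characters. Only the
# last piece may be shorter. Call it in list context.
sub wrap_sequence {
    my ($seq, $width) = @_;
    return $seq =~ /.{1,$width}/g;
}

my $len      = 11;    # explicit width, assumed for illustration
my $reversed = scalar reverse 'ABCDEFGHIJKLMNOPQRSTUVWXYZ';

print "$_\n" for wrap_sequence($reversed, $len);
```

Because the pattern is .{1,$len}, a sequence shorter than the width comes back as a single line.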
Answer 2:
I don't know if this is just for a class that uses toy datasets or for actual research FASTA files, which can be gigabytes in size. If the latter, it would make sense not to keep the whole data set in memory, as both your program and Borodin's do, but to read it one sequence at a time, print that out reversed, and forget about it. The following code does that and also deals with FASTA files that may have asterisks as sequence-end markers, as long as the header lines start with > and not ;.
#!/usr/bin/perl
use strict;
use warnings;

my $COL_WIDTH = 80;
my $sequence = '';
my $seq_label;

# Print one record: the label, then the reversed sequence wrapped at $COL_WIDTH
sub print_reverse {
    my $seq_label = shift;
    my $sequence = reverse shift;
    return unless $sequence;
    print "$seq_label\n";
    for (my $i = 0; $i < length($sequence); $i += $COL_WIDTH) {
        print substr($sequence, $i, $COL_WIDTH), "\n";
    }
}

while (my $line = <>) {
    chomp $line;
    if ($line =~ s/^>/>REV/) {
        # new header: flush the previous record and start the next one
        print_reverse($seq_label, $sequence);
        $seq_label = $line;
        $sequence = '';
        next;
    }
    # drop a trailing sequence-end asterisk
    $line = substr($line, 0, -1) if substr($line, -1) eq '*';
    $sequence .= $line;
}
print_reverse($seq_label, $sequence);    # flush the last record
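Another way to read one sequence at a time is to set the input record separator $/ to '>', so that each read returns a whole FASTA record. This is a different technique from the line-by-line loop above, shown only as a sketch. The helper name reverse_fasta and the in-memory sample data are assumptions, and the sketch would split a record wrongly if a header line itself contained a > character.

```perl
use strict;
use warnings;

# Hypothetical helper: read FASTA records from $in one at a time and
# write each one reversed to $out, wrapped at $width columns.
sub reverse_fasta {
    my ($in, $out, $width) = @_;
    local $/ = '>';                     # each read ends at the next '>'
    while (my $record = <$in>) {
        chomp $record;                  # strip the trailing '>' separator
        next unless length $record;     # skip the empty chunk before the first header
        my ($label, @lines) = split /\r?\n/, $record;
        my $seq = join '', @lines;
        $seq =~ s/\*$//;                # drop an optional sequence-end asterisk
        next unless length $seq;
        $seq = reverse $seq;            # scalar assignment, so characters are reversed
        print {$out} ">REV$label\n";
        print {$out} "$_\n" for $seq =~ /.{1,$width}/g;
    }
}

# Demo on in-memory filehandles; a real run could pass \*STDIN and \*STDOUT.
my $input = ">seq1\nABCDEFG\n>seq2\nHIJKLMN*\n";
open my $in,  '<', \$input     or die "cannot open input: $!";
open my $out, '>', \my $result or die "cannot open output: $!";
reverse_fasta($in, $out, 80);
close $out;
print $result;
```

Only one record is held in memory at a time, so memory use depends on the longest sequence rather than on the size of the file.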
Source: https://stackoverflow.com/questions/38775912/write-a-perl-script-that-takes-in-a-fasta-and-reverses-all-the-sequences-withou