Write a Perl script that takes in a fasta and reverses all the sequences (without BioPerl)?

Question

I don't know if this is just a quirk of Strawberry Perl, but I can't seem to get this script to run. I just need to take a FASTA file and reverse every sequence in it.

-The problem-

I have a multifasta file:

>seq1
ABCDEFG
>seq2
HIJKLMN

and the expected output is:

>REVseq1
GFEDCBA
>REVseq2
NMLKJIH

The script is here:

$NUM_COL = 80; ## set the column width of output file
$infile = shift; ## grab input sequence file name from command line
$outfile = "test1.txt"; ## name output file, prepend with “REV”
open (my $IN, $infile);
open (my $OUT, '>', $outfile);
$/ = undef; ## allow entire input sequence file to be read into memory
my $text = <$IN>; ## read input sequence file into memory
print $text; ## output sequence file into new decoy sequence file
my @proteins = split (/>/, $text); ## put all input sequences into an array


for my $protein (@proteins) { ## evaluate each input sequence individually
    $protein =~ s/(^.*)\n//m; ## match and remove the first descriptive line of
    ## the FASTA-formatted protein
    my $name = $1; ## remember the name of the input sequence
    print $OUT ">REV$name\n"; ## prepend with #REV#; a # will help make the
    ## protein stand out in a list
    $protein =~ s/\n//gm; ## remove newline characters from sequence
    $protein = reverse($protein); ## reverse the sequence

    while (length ($protein) > $NUM_C0L) { ## loop to print sequence with set number of cols

    $protein =~ s/(.{$NUM_C0L})//;
    my $line = $1;
    print $OUT "$line\n";
    }
    print $OUT "$protein\n"; ## print last portion of reversed protein
}

close ($IN);
close ($OUT);
print "done\n";

Answer 1:

This will do as you ask.

It builds a hash %fasta out of the FASTA file, with the rewritten header lines as keys, and uses an array @keys to keep the sequences in their original order. It then prints out each element of the hash.

Each line of a sequence is reversed using reverse (in scalar context) before it is added to the hash. Adding the lines with unshift rather than push stores them in reverse order, so the last line of the original sequence comes first.

The program expects the input file as a parameter on the command line and prints the result to STDOUT, which may be redirected on the command line.

use strict;
use warnings 'all';

my (%fasta, @keys);

{
    my $key;

    while ( <> ) {

        chomp;

        if ( s/^>\K/REV/ ) {
            $key = $_;
            push @keys, $key;
        }
        elsif ( $key ) {
            unshift @{ $fasta{$key} }, scalar reverse;
        }
    }
}

for my $key ( @keys ) {
    print $key, "\n";
    print "$_\n" for @{ $fasta{$key} };
}

output

>REVseq1
GFEDCBA
>REVseq2
NMLKJIH
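
The role of scalar in the program above can be shown in isolation. This is only a minimal sketch, not part of the original answer: reverse_chars and reverse_record are hypothetical helper names introduced for illustration. In list context reverse reverses the order of a list, while in scalar context it reverses the characters of a string, which is why the answer writes scalar reverse.

```perl
use strict;
use warnings;

# Hypothetical helper: reverse the characters of one line.
# Forcing scalar context makes reverse act on the string itself.
sub reverse_chars {
    my ($line) = @_;
    return scalar reverse $line;
}

# Hypothetical helper mimicking the answer's loop: reverse each line and
# unshift it, so the last input line ends up first in the result.
sub reverse_record {
    my @lines = @_;
    my @out;
    unshift @out, reverse_chars($_) for @lines;
    return @out;
}

my @list_ctx   = reverse 'ABC';           # list context: the one-element list is unchanged
my $scalar_ctx = reverse_chars('ABC');    # scalar context: characters are reversed

print "@list_ctx $scalar_ctx\n";
print join('|', reverse_record('ABCDEFGHI', 'JKLMNOPQRST', 'UVWXYZ')), "\n";
```

Because the lines are stored in reverse order without rewrapping, a short final line in the input becomes a short first line in the output.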

Update

If you prefer to rewrap the sequence so that short lines are at the end, then you just need to rewrite the code that dumps the hash.

This alternative uses the length of the longest line in the original file as the limit, and rewraps the reversed sequence to the same length. It's clear that it would be simple to specify an explicit length instead of calculating it.

You will need to add use List::Util 'max'; at the top of the program.

my $len = max map length, map @$_, values %fasta;

for my $key ( @keys ) {
    print $key, "\n";
    my $seq = join '', @{ $fasta{$key} };
    print "$_\n" for $seq =~ /.{1,$len}/g;
}

Given the original data, the output is identical to that of the solution above. I used this as input:

>seq1
ABCDEFGHI
JKLMNOPQRST
UVWXYZ
>seq2
HIJKLMN
OPQRSTU
VWXY

This is the result. All lines have been wrapped to eleven characters, the length of the longest line (JKLMNOPQRST) in the original data.

>REVseq1
ZYXWVUTSRQP
ONMLKJIHGFE
DCBA
>REVseq2
YXWVUTSRQPO
NMLKJIH
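
The update mentions that an explicit length could be specified instead of calculating it. A minimal sketch of that variation follows; wrap_sequence is a hypothetical helper name that does not appear in the original answer, and the width check is an added assumption about what counts as valid input.

```perl
use strict;
use warnings;

# Hypothetical helper: split a sequence string into lines of at most
# $width characters. The final line holds whatever is left over.
# Call it in list context to get the list of lines.
sub wrap_sequence {
    my ($seq, $width) = @_;
    die "width must be a positive integer\n"
        unless defined $width && $width =~ /^[1-9][0-9]*$/;
    return $seq =~ /.{1,$width}/g;
}

# Rewrap the reversed seq1 data from the example at an explicit width of 11.
print "$_\n" for wrap_sequence('ZYXWVUTSRQPONMLKJIHGFEDCBA', 11);
```

In the loop that dumps the hash, a call such as wrap_sequence($seq, 80) could stand in for the match against $len, in which case List::Util would no longer be needed.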

Answer 2:

I don't know whether this is for a class that uses toy datasets or for actual research FASTA files, which can be gigabytes in size. If the latter, it makes sense not to keep the whole data set in memory, as both your program and Borodin's do, but to read one sequence at a time, print it out reversed, and then forget about it. The following code does that. It also handles FASTA files that use asterisks as sequence-end markers, as long as the header lines start with > and not ;.

#!/usr/bin/perl
use strict;
use warnings;

my $COL_WIDTH = 80;

my $sequence = '';
my $seq_label;

sub print_reverse {
    my $seq_label = shift;
    my $sequence = reverse shift;
    return unless $sequence;
    print "$seq_label\n";
    for(my $i=0; $i<length($sequence); $i += $COL_WIDTH) {
        print substr($sequence, $i, $COL_WIDTH), "\n";
    }
}

while(my $line = <>) {
    chomp $line;
    if($line =~ s/^>/>REV/) {
        print_reverse($seq_label, $sequence);
        $seq_label = $line;
        $sequence = '';
        next;
    }
    $line = substr($line, 0, -1) if substr($line, -1) eq '*';
    $sequence .= $line;
}
print_reverse($seq_label, $sequence);
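
The question's script wrote to a named output file rather than STDOUT. As a rough sketch of how the same streaming approach could do that, the loop can be wrapped in a subroutine that takes explicit filehandles. The name reverse_fasta, the two-argument command line, and the stripping of a trailing carriage return are assumptions made for illustration, not part of the original answer.

```perl
use strict;
use warnings;

# Hypothetical wrapper: read FASTA records from $in and write each one,
# reversed and wrapped at $width columns, to $out. Only one record is
# held in memory at a time.
sub reverse_fasta {
    my ($in, $out, $width) = @_;
    my ($label, $seq) = (undef, '');

    # Print one finished record; do nothing if no header or no sequence was seen.
    my $flush = sub {
        return unless defined $label && length $seq;
        my $rev = reverse $seq;            # scalar context: reverses the string
        print {$out} "$label\n";
        for (my $i = 0; $i < length($rev); $i += $width) {
            print {$out} substr($rev, $i, $width), "\n";
        }
    };

    while (my $line = <$in>) {
        $line =~ s/\r?\n\z//;              # assumption: also tolerate CRLF line endings
        if ($line =~ s/^>/>REV/) {         # a header line starts a new record
            $flush->();
            ($label, $seq) = ($line, '');
            next;
        }
        $line =~ s/\*\z//;                 # drop a trailing sequence-end asterisk
        $seq .= $line;
    }
    $flush->();
    return;
}

# Assumed usage: perl script.pl input.fasta output.fasta (placeholder names)
if (@ARGV == 2) {
    open my $in,  '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    open my $out, '>', $ARGV[1] or die "Cannot open $ARGV[1]: $!";
    reverse_fasta($in, $out, 80);
    close $out or die "Cannot close $ARGV[1]: $!";
}
```

Taking filehandles as parameters also means the subroutine can be exercised with in-memory strings, which is convenient for checking it against the small example from the question.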

Source: https://stackoverflow.com/questions/38775912/write-a-perl-script-that-takes-in-a-fasta-and-reverses-all-the-sequences-withou

Tags

perl

bioinformatics

fasta