Creating a hash of arrays for DNA sequences, Perl

问题

I have a hash called %id2seq that contains strings of DNA sequences that are referenced by the key $id. I want to be able to manipulate the DNA sequences by using a position within the string as a reference. For example, if my DNA sequence was ACGTG, my $id would be Sequence 1, my $id2seq{'Sequence 1'} would be ACGTG, and my "theoretical" $id2seq{'Sequence 1'}[3] would be G. I am attempting to create a hash of arrays to do this, but I'm getting a weird output (see below output). I'm pretty sure that it's just my formatting Any input is helpful, and I appreciate in advance.

Here is a snippet of the input file:

>Sequence 1
TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCT
>Sequence 2
CCCACGCAGCCGCCCTCCTCCCCGGTCACTGACTGGTCCTG
>Sequence 3
TCGACCCTCTGGAACCTATCAGGGACCACAGTCAGCCAGGCAAG

Here is a snippet of my attempt at the moment. (I have a hash table that accesses a file with the DNA sequences commented out):

use strict;
use warnings;

print "Please enter the filename of the fasta sequence data: ";
my $filename1 = <STDIN>;

#Remove newline from file
chomp $filename1;

#Open the file and store each dna seq in hash
my %id2seq = ();
my $id = '';
open (FILE, '<', $filename1) or die "Cannot open $filename1.",$!;
my $dna;
while (<FILE>)
{
    if($_ =~ /^>(.+)/)
    {
        $id = $1;
    }
    else
    {
        ## $id2seq{$id} = $_; used to create hash table
        @seqs = split '', $_;
        $id2seq{$id} = [ @seqs ];
    }
}
close FILE;
foreach $id (keys %id2seq)
{
    print "$id2seq{$id}[@seqs]\n\n";
}

Output

Use of unitialized value in concatenation (.) or string at line 37.


T

G

A

T

T

回答1:

@seqs contains characters from the last sequence. $id2seq{$id}[@seqs] actually means $id2seq{$id}[N] where N is the length of the last sequence. So you print only one character from each sequence and get a warning if that sequence is shorter than the last one.

If you print only for debugging it is easier with:

use Data::Dumper;
print Dumper(\%id2seq);

Otherwise you have to iterate over $id2seq{$id} yourself in a nested loop.

回答2:

This line is incorrect:

print "$id2seq{$id}[@seqs]\n\n";

$id2seq{$id} is an array ref, so the correct way to print it would be

print "@{ $id2seq{$id} }\n\n";

A complete example would be:

#!/usr/bin/perl
use warnings;
use strict;

my $current_id;
my %id2seq;
while (<DATA>) {
    chomp;
    if (/^>(.+)/) {
        $current_id = $1;
    } else {
        $id2seq{$current_id} = [ split(//) ];
    }
}

print "@{ $_ }\n" foreach (values %id2seq);

exit 0;

__DATA__
>Sequence 1
TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCT
>Sequence 2
CCCACGCAGCCGCCCTCCTCCCCGGTCACTGACTGGTCCTG
>Sequence 3
TCGACCCTCTGGAACCTATCAGGGACCACAGTCAGCCAGGCAAG

Test run:

$ perl dummy.pl
T C G A C C C T C T G G A A C C T A T C A G G G A C C A C A G T C A G C C A G G C A A G
C C C A C G C A G C C G C C C T C C T C C C C G G T C A C T G A C T G G T C C T G
T C A G A A C C A G T T A T A A A T T T A T C A T T T C C T T C T C C A C T C C T

回答3:

You need to print

$id2seq{$id}[3]\n\n";

To get the fourth value. Also, you never defined @seqs with 'my' so strict and warnings is complaining, thus the 'Use of unitialized value in concatenation (.) or string at line 37.'. Either remove warnings/strict or define @seqs

来源：https://stackoverflow.com/questions/55316872/creating-a-hash-of-arrays-for-dna-sequences-perl

标签

arrays

perl

hash

dna-sequence