Question
I have a hash called %id2seq that contains DNA sequence strings keyed by $id. I want to be able to manipulate the DNA sequences by using a position within the string as a reference. For example, if my DNA sequence was ACGTG, my $id would be Sequence 1, my $id2seq{'Sequence 1'} would be ACGTG, and my "theoretical" $id2seq{'Sequence 1'}[3] would be T (counting from index 0).
I am attempting to create a hash of arrays to do this, but I'm getting weird output (see the output below). I'm pretty sure that it's just my formatting. Any input is helpful, and I appreciate it in advance.
Here is a snippet of the input file:
>Sequence 1
TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCT
>Sequence 2
CCCACGCAGCCGCCCTCCTCCCCGGTCACTGACTGGTCCTG
>Sequence 3
TCGACCCTCTGGAACCTATCAGGGACCACAGTCAGCCAGGCAAG
Here is a snippet of my current attempt. (The earlier version, which stored each DNA sequence as a plain string in the hash, is commented out.)
use strict;
use warnings;
print "Please enter the filename of the fasta sequence data: ";
my $filename1 = <STDIN>;
#Remove newline from file
chomp $filename1;
#Open the file and store each dna seq in hash
my %id2seq = ();
my $id = '';
open (FILE, '<', $filename1) or die "Cannot open $filename1.",$!;
my $dna;
while (<FILE>)
{
    if($_ =~ /^>(.+)/)
    {
        $id = $1;
    }
    else
    {
        ## $id2seq{$id} = $_; used to create hash table
        @seqs = split '', $_;
        $id2seq{$id} = [ @seqs ];
    }
}
close FILE;
foreach $id (keys %id2seq)
{
    print "$id2seq{$id}[@seqs]\n\n";
}
Output
Use of uninitialized value in concatenation (.) or string at line 37.
T
G
A
T
T
Answer 1:
@seqs contains the characters from the last sequence that was read. $id2seq{$id}[@seqs] actually means $id2seq{$id}[N], where N is the number of elements in that last sequence. So you print only one character from each sequence, and you get a warning for any sequence that has N or fewer elements, because index N is then past its end.
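To make that concrete, here is a minimal sketch of the scalar-context behaviour. The names @last, $seq and $picked are made up for illustration and are not from the question; @last stands in for @seqs after the read loop has finished.

```perl
use strict;
use warnings;

# Hypothetical data: @last plays the role of @seqs after the loop.
my @last = split //, 'ACG';          # 3 elements
my $seq  = [ split //, 'TTGCA' ];    # array ref, like $id2seq{$id}

# An array used as a subscript is evaluated in scalar context,
# so [@last] behaves like [ scalar @last ], which is [3] here.
my $picked = $seq->[@last];
print "$picked\n";                   # the element at index 3 of TTGCA
```

If $seq had three or fewer elements, index 3 would be out of range and the print would trigger the "uninitialized value" warning.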
If you are printing only for debugging, it is easier with:
use Data::Dumper;
print Dumper(\%id2seq);
Otherwise you have to iterate over $id2seq{$id} yourself in a nested loop.
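A nested loop along those lines might look like the sketch below. The sample data and the tab-separated output format are assumptions for illustration; only the shape of %id2seq (id mapped to an array ref of bases) follows the question.

```perl
use strict;
use warnings;

# Hypothetical sample data in the same shape as %id2seq.
my %id2seq = (
    'Sequence 1' => [ split //, 'ACGTG' ],
    'Sequence 2' => [ split //, 'TTGCA' ],
);

# Build one "id <tab> position <tab> base" line per element.
my @lines;
for my $id (sort keys %id2seq) {          # outer loop: each sequence id
    my $bases = $id2seq{$id};             # array ref for this id
    for my $pos (0 .. $#{$bases}) {       # inner loop: each valid index
        push @lines, "$id\t$pos\t$bases->[$pos]";
    }
}
print "$_\n" for @lines;
```

Looping from 0 to $#{$bases} keeps every index inside the array, so no sequence can trigger the out-of-range warning.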
Answer 2:
This line is incorrect:
print "$id2seq{$id}[@seqs]\n\n";
$id2seq{$id} is an array ref, so the correct way to print it would be
print "@{ $id2seq{$id} }\n\n";
A complete example would be:
#!/usr/bin/perl
use warnings;
use strict;
my $current_id;
my %id2seq;
while (<DATA>) {
    chomp;
    if (/^>(.+)/) {
        $current_id = $1;
    } else {
        $id2seq{$current_id} = [ split(//) ];
    }
}
print "@{ $_ }\n" foreach (values %id2seq);
exit 0;
__DATA__
>Sequence 1
TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCT
>Sequence 2
CCCACGCAGCCGCCCTCCTCCCCGGTCACTGACTGGTCCTG
>Sequence 3
TCGACCCTCTGGAACCTATCAGGGACCACAGTCAGCCAGGCAAG
Test run:
$ perl dummy.pl
T C G A C C C T C T G G A A C C T A T C A G G G A C C A C A G T C A G C C A G G C A A G
C C C A C G C A G C C G C C C T C C T C C C C G G T C A C T G A C T G G T C C T G
T C A G A A C C A G T T A T A A A T T T A T C A T T T C C T T C T C C A C T C C T
Answer 3:
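You need to print
print "$id2seq{$id}[3]\n\n";
to get the fourth value. Also, you never declared @seqs with 'my'. Under use strict that is a compile-time error (Global symbol "@seqs" requires explicit package name), not the cause of the 'Use of uninitialized value in concatenation (.) or string at line 37.' warning, which comes from indexing past the end of an array. Declare it with my @seqs rather than removing warnings or strict.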
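As a rough sketch of position-based access with @seqs properly declared, using the ACGTG example from the question (the variables $fourth and $joined, and the replacement base 'A', are made up for illustration):

```perl
use strict;
use warnings;

my %id2seq;
my @seqs = split //, 'ACGTG';         # declared with my, so strict is satisfied
$id2seq{'Sequence 1'} = [ @seqs ];    # store a copy as an array ref

my $fourth = $id2seq{'Sequence 1'}[3];    # index 3 is the fourth base
$id2seq{'Sequence 1'}[3] = 'A';           # overwrite that position in place

# Join the bases back into a string to inspect the edited sequence.
my $joined = join '', @{ $id2seq{'Sequence 1'} };
print "$fourth $joined\n";
```

Because [ @seqs ] copies the list, later changes to @seqs do not affect the stored sequence.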
Source: https://stackoverflow.com/questions/55316872/creating-a-hash-of-arrays-for-dna-sequences-perl