Question
This is a question regarding a more efficient code design:
Assume three aligned DNA sequences (seq1, seq2 and seq3; they are each strings) that represent two genes (gene1 and gene2). Start and stop positions of these genes are known relative to the aligned DNA sequences.
# Input
align = {"seq1":"ATGCATGC", # In seq1, gene1 and gene2 are of equal length
"seq2":"AT----GC",
"seq3":"A--CA--C"}
annos = {"seq1":{"gene1":[0,3], "gene2":[4,7]},
"seq2":{"gene1":[0,3], "gene2":[4,7]},
"seq3":{"gene1":[0,3], "gene2":[4,7]}}
I wish to remove the gaps (i.e., dashes) from the alignment and maintain the relative association of the start and stop positions of the genes.
# Desired output
align = {"seq1":"ATGCATGC",
"seq2":"ATGC",
"seq3":"ACAC"}
annos = {"seq1":{"gene1":[0,3], "gene2":[4,7]},
"seq2":{"gene1":[0,1], "gene2":[2,3]},
"seq3":{"gene1":[0,1], "gene2":[2,3]}}
Obtaining the desired output is less trivial than it may seem. Below I wrote some (line-numbered) pseudocode for this problem, but surely there is a more elegant design.
1  measure length of any aligned gene  # take any seq, since all seqs aligned
2  list_lengths = list of gene lengths  # order is important
3  for seq in alignment
4      outseq = ""
5      for each num in range(0, length(seq))  # weird for-loop is intentional
6          if seq[num] == "-"
7              current_gene = gene whose start/stop positions include num
8              subtract 1 from length of current_gene
9              subtract 1 from lengths of all genes following current_gene in list_lengths
10         else
11             append seq[num] to outseq
12     append outseq to new variable
13     convert gene lengths into start/stop positions and append ordered to new variable
Can anyone give me suggestions/examples for a shorter, more direct code design?
Answer 1:
This answer handles your updated annos dictionary from the comment on cdlane's answer. That answer leaves the annos dictionary with the invalid index [2,1] for seq2 gene2. My proposed solution removes the gene entry from the dictionary if the sequence consists entirely of gaps in that region. Also note that if a gene contains only one letter in the final align, then anno[geneX] will have equal start and stop indices (see seq3 gene1 with the annos from your comment).
align = {"seq1":"ATGCATGC",
"seq2":"AT----GC",
"seq3":"A--CA--C"}
annos = {"seq1":{"gene1":[0,3], "gene2":[4,7]},
"seq2":{"gene1":[0,3], "gene2":[4,7]},
"seq3":{"gene1":[0,3], "gene2":[4,7]}}
annos3 = {"seq1":{"gene1":[0,2], "gene2":[3,4], "gene3":[5,7]},
"seq2":{"gene1":[0,2], "gene2":[3,4], "gene3":[5,7]},
"seq3":{"gene1":[0,2], "gene2":[3,4], "gene3":[5,7]}}
import re

for name, anno in annos.items():
    # keep the gapped sequence: the annotation indices refer to it
    gapped = align[name]
    # indices of the gaps, found using re
    removed = [m.start(0) for m in re.finditer(r'-', gapped)]
    # remove the gaps from the align dictionary
    align[name] = re.sub(r'-', '', gapped)
    build_dna = ''
    # iterate over a copy, because entries may be deleted inside the loop
    for gene, inds in list(anno.items()):
        start_ind = len(build_dna)+1
        segment = gapped[inds[0]:inds[1]+1]
        # generator to count the '-' removed from this gene
        num_gaps = sum(1 for i in removed if i >= inds[0] and i <= inds[1])
        # build the de-gapped string
        build_dna += segment.replace("-", "")
        end_ind = len(build_dna)
        if num_gaps == len(segment):  # gene is all gaps
            del annos[name][gene]  # remove the gene entry
            continue
        # update the values in the annos dictionary
        annos[name][gene][0] = start_ind-1
        annos[name][gene][1] = end_ind-1
Results:
In [3]: annos
Out[3]: {'seq1': {'gene1': [0, 3], 'gene2': [4, 7]},
'seq2': {'gene1': [0, 1], 'gene2': [2, 3]},
'seq3': {'gene1': [0, 1], 'gene2': [2, 3]}}
Results for the three-gene annos3 above (substitute it for the annos variable and rerun on the original, gapped align):
In [5]: annos3
Out[5]: {'seq1': {'gene1': [0, 2], 'gene2': [3, 4], 'gene3': [5, 7]},
'seq2': {'gene1': [0, 1], 'gene3': [2, 3]},
'seq3': {'gene1': [0, 0], 'gene2': [1, 2], 'gene3': [3, 3]}}
Answer 2:
The following matches the output of the example program for both test cases:
align = {"seq1":"ATGCATGC",
"seq2":"AT----GC",
"seq3":"A--CA--C"}
annos = {"seq1":{"gene1":[0,3], "gene2":[4,7]},
"seq2":{"gene1":[0,3], "gene2":[4,7]},
"seq3":{"gene1":[0,3], "gene2":[4,7]}}
(START, STOP) = (0, 1)

for alignment, sequence in align.items():
    new_sequence = ''
    gap = 0  # number of gaps removed so far in this sequence
    for position, codon in enumerate(sequence):
        if '-' == codon:
            # position - gap is where this gap sits in the de-gapped sequence
            for gene in annos[alignment].values():
                if gene[START] > (position - gap):
                    gene[START] -= 1
                if gene[STOP] >= (position - gap):
                    gene[STOP] -= 1
            gap += 1
        else:
            new_sequence += codon
    align[alignment] = new_sequence
The result of running this:
python3 -i test.py
>>> align
{'seq2': 'ATGC', 'seq1': 'ATGCATGC', 'seq3': 'ACAC'}
>>>
>>> annos
{'seq1': {'gene1': [0, 3], 'gene2': [4, 7]}, 'seq2': {'gene1': [0, 1], 'gene2': [2, 3]}, 'seq3': {'gene1': [0, 1], 'gene2': [2, 3]}}
>>>
I hope this is still elegant, direct, short and Pythonic enough!
Answer 3:
My own solution to the above question is neither elegant nor Pythonic, but arrives at the desired output. Any recommendations for improvement are highly welcome!
import collections
import operator

# measure length of any aligned gene  # take any seq, since all seqs aligned
align_len = len(next(iter(align.values())))
# initialize output
align_out, annos_out = {}, {}
# loop through annos
for seqname, anno in annos.items():
    # operate on ordered sequence lengths instead of ranges
    ordseqlens = collections.OrderedDict()
    # generate ordered sequence lengths
    for k, v in sorted(anno.items(), key=operator.itemgetter(1)):
        ordseqlens[k] = v[1]-v[0]+1
    # start (and later append to) sequence output
    align_out[seqname] = ""
    # generate R-style for-loop
    for pos in range(0, len(align[seqname])):
        if align[seqname][pos] == "-":
            try:
                current_gene = next(key for key, a in anno.items() if a[0] <= pos <= a[1])
            except StopIteration:
                print("No annotation provided for position", pos, "in sequence", seqname)
                continue  # no gene to shorten at this position
            # subtract 1 from length of current_gene
            ordseqlens[current_gene] = ordseqlens[current_gene]-1
        else:
            # append nucleotide unless a gap
            align_out[seqname] += align[seqname][pos]
    # convert modified ordered sequence lengths back into start/stop positions
    summ = 0
    tmp_dict = {}
    for k, v in ordseqlens.items():
        tmp_dict[k] = [summ, v+summ-1]
        summ = v+summ
    # save start/stop positions to correct annos
    annos_out[seqname] = tmp_dict
The output of this code is:
>>> align_out
{'seq3': 'ACAC',
'seq2': 'ATGC',
'seq1': 'ATGCATGC'}
>>> annos_out
{'seq3': {'gene1': [0, 1], 'gene2': [2, 3]},
'seq2': {'gene1': [0, 1], 'gene2': [2, 3]},
'seq1': {'gene1': [0, 3], 'gene2': [4, 7]}}
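A quick way to sanity-check output like this is to compare what each gene's new indices select in the de-gapped sequence with the de-gapped slice of the original gene. The following is only a sketch, assuming inclusive [start, stop] indices as in the question; the function name find_mismatches is made up for illustration and is not part of any answer here.

```python
def find_mismatches(align_in, annos_in, align_out, annos_out, gap="-"):
    """Return (sequence, gene) pairs whose new indices select the wrong bases.

    Hypothetical helper; assumes inclusive [start, stop] indices.
    """
    mismatches = []
    for name, genes in annos_out.items():
        for gene, (start, stop) in genes.items():
            old_start, old_stop = annos_in[name][gene]
            # what the gene should look like once its gaps are removed
            expected = align_in[name][old_start:old_stop + 1].replace(gap, "")
            # what the new indices actually select (stop is inclusive)
            actual = align_out[name][start:stop + 1]
            if expected != actual:
                mismatches.append((name, gene))
    return mismatches


# Input and output data copied from the question
align = {"seq1": "ATGCATGC", "seq2": "AT----GC", "seq3": "A--CA--C"}
annos = {name: {"gene1": [0, 3], "gene2": [4, 7]} for name in align}
align_out = {"seq1": "ATGCATGC", "seq2": "ATGC", "seq3": "ACAC"}
annos_out = {"seq1": {"gene1": [0, 3], "gene2": [4, 7]},
             "seq2": {"gene1": [0, 1], "gene2": [2, 3]},
             "seq3": {"gene1": [0, 1], "gene2": [2, 3]}}

print(find_mismatches(align, annos, align_out, annos_out))  # prints []
```

An empty list means every gene still covers the same bases after de-gapping. Genes that were dropped from the output are simply not checked.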
Answer 4:
So, I think that the approach of trying to break each sequence up into genes and then remove the dashes is resulting in a lot of unnecessary book-keeping. Instead, it might be easier to look at the dashes directly and then update all of the indices based on their relative positions. Here's a function I wrote that appears to be operating correctly:
from copy import copy, deepcopy

def rewriteGenes(align, annos):
    alignments = copy(align)
    # annos holds nested dicts and lists, so a shallow copy would still
    # let the updates below modify the caller's data
    annotations = deepcopy(annos)
    for sequence, alignment in alignments.items():
        while alignment.find('-') > -1:
            index = alignment.find('-')
            for gene, (start, end) in annotations[sequence].items():
                if index < start:
                    annotations[sequence][gene][0] -= 1
                if index <= end:
                    annotations[sequence][gene][1] -= 1
            # drop the dash that was just accounted for
            alignment = alignment[:index] + alignment[index+1:]
        alignments[sequence] = alignment
    return (alignments, annotations)
This iterates over the dashes in each alignment and updates the gene indices as they are removed.
Note that this produces a gene with indices [2,1] for the following test case:
align = {"seq1":"ATGCATGC",
"seq2":"AT----GC",
"seq3":"A--CA--C"}
annos = {"seq1":{"gene1":[0,2], "gene2":[3,4], "gene3":[5,7]},
"seq2":{"gene1":[0,2], "gene2":[3,4], "gene3":[5,7]},
"seq3":{"gene1":[0,2], "gene2":[3,4], "gene3":[5,7]}}
This is necessary because the way your indices are set up does not otherwise allow for empty genes. For example, the indices [2,2] would denote a sequence of length 1 starting at index 2.
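If a [2,1]-style entry for an empty gene is not wanted, another option is to count, for every position, how many non-gap characters precede it, and then map each start and stop through that table in one step. The following is only a sketch under the question's assumption of inclusive [start, stop] indices. The function name degap and the list kept are made up for illustration, and dropping genes that consist only of gaps is a design choice borrowed from the first answer above, not something the question requires.

```python
from itertools import accumulate


def degap(align, annos, gap="-"):
    """Return de-gapped copies of align and annos (hypothetical helper).

    Indices are inclusive [start, stop]; all-gap genes are dropped.
    """
    new_align, new_annos = {}, {}
    for name, seq in align.items():
        # kept[i] = number of non-gap characters in seq[:i]
        kept = [0] + list(accumulate(int(ch != gap) for ch in seq))
        new_align[name] = seq.replace(gap, "")
        new_annos[name] = {}
        for gene, (start, stop) in annos[name].items():
            new_start = kept[start]
            new_stop = kept[stop + 1] - 1
            # new_stop < new_start means the gene contained only gaps
            if new_stop >= new_start:
                new_annos[name][gene] = [new_start, new_stop]
    return new_align, new_annos


# Three-gene test case from above
align = {"seq1": "ATGCATGC", "seq2": "AT----GC", "seq3": "A--CA--C"}
annos3 = {name: {"gene1": [0, 2], "gene2": [3, 4], "gene3": [5, 7]}
          for name in align}

new_align, new_annos = degap(align, annos3)
print(new_align["seq2"], new_annos["seq2"])
# prints: ATGC {'gene1': [0, 1], 'gene3': [2, 3]}
```

The inputs are left untouched, and each sequence is scanned once instead of once per dash. Whether to drop empty genes or keep them with some marker depends on what the downstream code expects.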
Source: https://stackoverflow.com/questions/34816513/improving-code-design-of-dna-alignment-degapping