问题
I am trying to write a python code that finds restriction enzyme sites within a sequence of DNA. Restriction enzymes cut at specific DNA sequences, however some are not so strict, for example XmnI cuts this sequence:
GAANNNNTTC
Where N can be any nucleotide (A, C, G, or T). If my math is right thats 4^4 = 256 unique sequences that it can cut. I want to make a list of these 256 short sequences, then check each one against a (longer) input DNA sequence. However, I'm having a hard time generating the 256 sequences. Here's what I have so far:
cutsequencequery = "GAANNNNTTC"
Nseq = ["A", "C", "G", "T"]
querylist = []
if "N" in cutsequencequery:
Nlist = [cutsequencequery.replace("N", t) for t in Nseq]
for j in list(Nlist):
querylist.append(j)
for i in querylist:
print(i)
print(len(querylist))
and here is the output:
GAAAAAATTC
GAACCCCTTC
GAAGGGGTTC
GAATTTTTTC
4
So it's switching each N to either A, C, G, and T, but I think I need another loop (or 3?) to generate all 256 combinations. Is there an efficient way to do this that I'm not seeing?
回答1:
Maybe you should take a look into python's itertools library, which include product
which creates an iterable with every combination of iterables, therefore:
from itertools import product
cutsequencequery = "GAANNNNTTC"
nseq = ["A", "C", "G", "T"]
size = cutsequencequery.count('N')
possibilities = product(*[nseq for i in range(size)])
# = ('A', 'A', 'A', 'A'), ... , ('T', 'T', 'T', 'T')
# len(list(possibilities)) = 256 = 4^4, as expected
s = set()
for n in possibilities:
print(''.join(n)) # = 'AAAA', ..., 'TTTT'
new_sequence = cutsequencequery.replace('N' * size, ''.join(n))
s.add(new_sequence)
print(new_sequence) # = 'GAAAAAATTC', ..., 'GAATTTTTTC'
print(len(s)) # 256 unique sequences
来源:https://stackoverflow.com/questions/64435037/python-switching-multiple-positions-in-string-each-to-multiple-letters