Question
I'm currently working on CS50. I tried to count the STRs in the DNA sequence files, but my code always overcounts.
For example: how many times does 'AGATC' repeat consecutively in the DNA file?
This code is only an attempt to work out how to count those repeated sequences accurately.
import csv
import re
from sys import argv, exit

def main():
    if len(argv) != 3:
        print("Usage: python dna.py data.csv sequence.txt")
        exit(1)
    with open(argv[1]) as csv_file, open(argv[2]) as dna_file:
        reader = csv.reader(csv_file)
        #for row in reader:
        #    print(row)
        str_sequences = next(reader)[1:]
        dna = dna_file.read()
        for i in range(len(dna)):
            count = len(re.findall(str_sequences[0], dna)) # str_sequences[0] is 'AGATC'
        print(count)

main()
Result for DNA file 11 (AGATC):
$ python dna.py databases/large.csv sequences/11.txt
52
The result is supposed to be 43. For small.csv the count is accurate, but for large.csv it always overcounts. I later realized that my code counts every match of AGATC in the DNA file. The task, however, is to count only the copies that repeat consecutively and to ignore the same sequence if it shows up again later.
{AGATCAGATCAGATCAGATC(T)TTTTAGATC}
So, how do I stop counting once the sequence hits the (T), so that the AGATC that comes after it is not counted? What should I change in my code, especially in the re.findall() call? Some people suggested using substrings; how would I do that? Or can I just use a regex like I did?
Please include code if you can. Sorry for my bad English.
Answer 1:
The for loop is wrong: it repeats the same re.findall call len(dna) times, and each call counts every occurrence of the sequence in the file, whether or not the occurrences are consecutive. I think you want to loop over str_sequences instead.
Something like:
seq_list = []
for STR in str_sequences:
    groups = re.findall(rf'(?:{STR})+', dna)
    if len(groups) == 0:
        seq_list.append('0')
    else:
        seq_list.append(str(max(map(lambda x: len(x)//len(STR), groups))))
print(seq_list)
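To see the difference on the example string from the question, here is a small self-contained sketch of the same regex idea. The function name longest_run and the sample variable are made up for illustration, and re.escape is an added precaution so the STR is matched literally; none of these come from the original answer.

```python
import re

def longest_run(dna, str_seq):
    """Return the largest number of back-to-back repeats of str_seq in dna."""
    # (?:...)+ matches one or more consecutive copies of the STR as one run
    runs = re.findall(rf'(?:{re.escape(str_seq)})+', dna)
    # A run's length divided by the STR length is its repeat count;
    # default=0 covers the case where the STR never appears
    return max((len(run) // len(str_seq) for run in runs), default=0)

# The example from the question: four copies, five T's, then one more copy
sample = 'AGATCAGATCAGATCAGATCTTTTTAGATC'
print(len(re.findall('AGATC', sample)))  # 5: every occurrence, consecutive or not
print(longest_run(sample, 'AGATC'))      # 4: longest consecutive run only
```

The plain re.findall count includes the stray AGATC after the T's, which is the kind of overcounting described in the question.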
Also, there are many posts about this problem. You may want to look at some of them to finish your program.
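As for the substring approach the question asks about, one possible regex-free sketch uses only string slicing. The function name longest_run_substring is hypothetical, and this is just one way to do it under the assumption that the STR is a non-empty string.

```python
def longest_run_substring(dna, str_seq):
    """Longest run of consecutive str_seq copies, using only slicing."""
    n = len(str_seq)
    if n == 0:
        return 0  # guard: an empty STR would otherwise loop forever
    best = 0
    for start in range(len(dna)):
        count = 0
        pos = start
        # Keep stepping forward one STR length while the next slice matches
        while dna[pos:pos + n] == str_seq:
            count += 1
            pos += n
        best = max(best, count)
    return best

print(longest_run_substring('AGATCAGATCAGATCAGATCTTTTTAGATC', 'AGATC'))  # 4
```

The inner loop stops at the first slice that is not the STR, so a copy that shows up again later only starts a new, separate run.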
Source: https://stackoverflow.com/questions/64125727/counting-repeated-str-in-dna-pset6-cs50