Question
The code below extracts short subsequences from every sequence using a window size of 4. How can I shift the window by a step size of 2 and extract 4 base pairs each time?
Example code
from Bio import SeqIO
with open("testA_out.fasta","w") as f:
    for seq_record in SeqIO.parse("testA.fasta", "fasta"):
        i = 0
        while ((i+4) < len(seq_record.seq)) :
            f.write(">" + str(seq_record.id) + "\n")
            f.write(str(seq_record.seq[i:i+4]) + "\n")
            i += 2
Example Input of testA.fasta
>human1
ACCCGATTT
Example Output of testA_out
>human1
ACCC
>human1
CCGA
>human1
GATT
The problem with this output is that one T is left out, and I would like to include it as well. How can I produce the output below? I also want a reverse extraction, so that base pairs left out when extracting from start to end are still included. Can anyone help?
Expected output
>human1
ACCC
>human1
CCGA
>human1
GATT
>human1
ATTT
>human1
CGAT
>human1
CCCG
Answer 1:
You can use a for loop with range, using the third (step) parameter of range. This is a bit cleaner than using a while loop. If the data cannot be divided evenly by the chunk size, the last chunk will be smaller.
data = "ACCCGATTT"
step = 2
chunk = 4
for i in range(0, len(data) - step, step):
    print(data[i:i+chunk])
Output is
ACCC
CCGA
GATT
TTT
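If the shorter trailing chunk is not wanted, one option is to stop the range at the last start position where a full chunk still fits. A minimal sketch, reusing the data, step and chunk names from above (full_chunks is an illustrative name, not part of the answer):

```python
data = "ACCCGATTT"
step = 2
chunk = 4

# The last start index that still yields a full chunk is len(data) - chunk,
# so the range stops one past it.
full_chunks = [data[i:i + chunk] for i in range(0, len(data) - chunk + 1, step)]

for piece in full_chunks:
    print(piece)
```

With these values the loop visits start positions 0, 2 and 4, so the partial TTT chunk is no longer produced.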
Answer 2:
For any window size and any step size:
fasta='ACCCGATTT'
windowSize=4
step=1
i=0
while (i+windowSize)<=len(fasta):
    currentWindow=fasta[i:i+windowSize]
    print(currentWindow)
    i+=step
Output with windowSize=4, step=2:
ACCC
CCGA
GATT
Output with windowSize=4, step=1:
ACCC
CCCG
CCGA
CGAT
GATT
ATTT
The last output contains exactly the same windows as the "Expected output", just in a different order.
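To tie this back to the FASTA output in the question, the loop can be wrapped in a small generator and paired with a header line per window. This is only a sketch: sliding_windows and fasta_lines are hypothetical helper names, and it works on plain strings, so with Biopython you would presumably pass seq_record.id and str(seq_record.seq) from the question's loop.

```python
def sliding_windows(sequence, window_size, step):
    """Yield every full-length window of `sequence`, advancing by `step`."""
    for start in range(0, len(sequence) - window_size + 1, step):
        yield sequence[start:start + window_size]


def fasta_lines(record_id, sequence, window_size, step):
    """Return FASTA-style lines: one header line followed by one window, repeated."""
    lines = []
    for window in sliding_windows(sequence, window_size, step):
        lines.append(">" + record_id)
        lines.append(window)
    return lines


# Example with the record from the question (window size 4, step 2).
print("\n".join(fasta_lines("human1", "ACCCGATTT", 4, 2)))
```

A sequence shorter than the window yields no lines at all, because the range is empty in that case.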
Answer 3:
Your particular example can be solved by moving to a step size of 1 instead. But your question seems to be asking, "how do I repeat with the same window size from the end of the sequence if there are not enough characters in the sequence?" So an example where this would make a difference might be AAAATTT with a window size of 6 and a step size of 2, where you want AAAATT from the "forward" direction, and AAATTT from the "reverse" direction, but no other subsequences.
Obviously, running the code once in the forward direction and once in the backwards direction would do that, but it introduces repetition, which is usually not a good thing. However, you can refactor the problem so that you divide the step into pairs of steps.
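For a sequence of length x with a window of w and a step of y, the windows counted back from the end are shifted by r = (x-w)%y relative to the forward ones, so you can divide y into r and y-r and just move forward with these pairwise steps. (Skip the first member of the pair when r == 0, since both directions then give the same windows.)
I'm posting just the string handling functions, as none of this is at all specific to gene sequences.
seq = "AAAATTT"
window = 6
step = 2
length = len(seq)
# Offset between forward-aligned and reverse-aligned window starts.
modulo = (length - window) % step
# +1 so that a window starting exactly at length-window is included.
for i in range(0, length - window + 1, step):
    if modulo > 0:
        print(seq[i:i+window])
    print(seq[i+modulo:i+modulo+window])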
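For comparison, here is a sketch of the simpler two-pass route mentioned earlier (once forward, once backward), rather than the paired-step refactoring. It avoids duplicate windows by tracking start positions. The function name forward_and_reverse_windows is an illustrative choice, and the ordering (all forward windows first, then the extra reverse ones) is an assumption based on the question's expected output.

```python
def forward_and_reverse_windows(seq, window, step):
    """Return full windows stepped from the start, then extra ones stepped from the end."""
    last_start = len(seq) - window
    if last_start < 0:
        # The sequence is shorter than the window, so nothing fits.
        return []
    # Start positions when walking forward from index 0.
    forward_starts = list(range(0, last_start + 1, step))
    # Start positions when walking backward from the last possible window.
    reverse_starts = range(last_start, -1, -step)
    seen = set(forward_starts)
    # Keep only reverse positions that the forward pass did not already cover.
    extra_starts = [s for s in reverse_starts if s not in seen]
    return [seq[s:s + window] for s in forward_starts + extra_starts]


for piece in forward_and_reverse_windows("ACCCGATTT", 4, 2):
    print(piece)
```

For ACCCGATTT with a window of 4 and a step of 2 this gives ACCC, CCGA, GATT, ATTT, CGAT, CCCG, and for AAAATTT with a window of 6 and a step of 2 it gives AAAATT and AAATTT.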
Source: https://stackoverflow.com/questions/30797796/how-to-extract-short-sequence-using-window-with-specific-step-size