biopython | 易学教程

How do I pass Biopython SeqIO.convert() over multiple files in a directory?

Question: I’m writing a Python script (version 2.7) that will convert every input file (.nexus format) within the specified directory into .fasta format. The Biopython function SeqIO.convert handles the conversion perfectly for individually specified files, but when I try to automate the process over a directory using os.walk I’m unable to correctly pass the pathname of each input file to SeqIO.convert. Where am I going wrong? Do I need to use join() from the os.path module and pass the full path names on to
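One possible approach to the directory question, offered as a sketch rather than as the accepted answer: os.walk yields bare file names, so each name has to be joined to its directory before it is handed to SeqIO.convert. The helper names `iter_conversion_pairs` and `convert_directory` are hypothetical, the `.nexus` extension filter is an assumption, and Biopython is imported inside the second function only so that the path logic can be checked without it installed.

```python
import os


def iter_conversion_pairs(root, in_ext=".nexus", out_ext=".fasta"):
    """Yield (input_path, output_path) for every matching file under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            base, ext = os.path.splitext(name)
            if ext.lower() == in_ext:
                # os.walk gives bare file names; join them to dirpath
                in_path = os.path.join(dirpath, name)
                out_path = os.path.join(dirpath, base + out_ext)
                yield in_path, out_path


def convert_directory(root):
    """Convert every .nexus file under root to FASTA; return records written."""
    # Imported here so the helper above has no Biopython dependency.
    from Bio import SeqIO

    total = 0
    for in_path, out_path in iter_conversion_pairs(root):
        # SeqIO.convert returns the number of records it converted
        total += SeqIO.convert(in_path, "nexus", out_path, "fasta")
    return total
```

Calling `convert_directory("some_directory")` would write each .fasta file next to its .nexus source; the directory name here is a placeholder.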

How to extract short sequence using window with specific step size?

Question: The code below extracts a short sequence from every sequence with a window size of 4. How can I shift the window by a step size of 2 and extract 4 base pairs?

Example code:

    from Bio import SeqIO

    with open("testA_out.fasta", "w") as f:
        for seq_record in SeqIO.parse("testA.fasta", "fasta"):
            i = 0
            while ((i+4) < len(seq_record.seq)):
                f.write(">" + str(seq_record.id) + "\n")
                f.write(str(seq_record.seq[i:i+4]) + "\n")
                i += 2

Example input (testA.fasta):

    >human1
    ACCCGATTT

Example output (testA_out):

    >human1
    ACCC
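The windowing itself can be sketched independently of Biopython. The function names `sliding_windows` and `write_windows` are hypothetical, and the sketch assumes that only full-length windows are wanted. Note that a loop condition of `(i + 4) < len(...)` would skip a window that ends exactly on the last base, which is why the bound below is `len(sequence) - size + 1`.

```python
def sliding_windows(sequence, size=4, step=2):
    """Return every full-length window of `size`, advancing by `step`."""
    # The upper bound len - size + 1 keeps a window that ends exactly at
    # the last base; partial windows at the end are dropped.
    return [sequence[i:i + size]
            for i in range(0, len(sequence) - size + 1, step)]


def write_windows(records, handle, size=4, step=2):
    """Write each window as its own FASTA entry under the record's id."""
    for record_id, sequence in records:
        for window in sliding_windows(sequence, size, step):
            handle.write(">%s\n%s\n" % (record_id, window))


print(sliding_windows("ACCCGATTT"))  # ['ACCC', 'CCGA', 'GATT']
```

With Biopython, `records` could be built as `((r.id, str(r.seq)) for r in SeqIO.parse("testA.fasta", "fasta"))`.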

Biopython: How to avoid particular amino acid sequences from a protein so as to plot Ramachandran plot?

Question: I have written a Python script to plot the Ramachandran plot of the ubiquitin protein. I am using Biopython and working with PDB files. My script is as below:

    import Bio.PDB
    import numpy as np
    import matplotlib as mpl
    import matplotlib.pyplot as plt

    phi_psi = ([0,0])
    phi_psi = np.array(phi_psi)
    pdb1 ='/home/devanandt/Documents/VMD/1UBQ.pdb'

    for model in Bio.PDB.PDBParser().get_structure('1UBQ',pdb1) :
        for chain in model :
            polypeptides = Bio.PDB.PPBuilder().build_peptides(chain)
            for poly

Extract sequences from a FASTA file based on entries in a separate file

Question: I have two files. File 1: a FASTA file with gene sequences, formatted like this example:

    >PITG_00002 | Phytophthora infestans T30-4 conserved hypothetical protein (426 nt)
    ATGCATCGCTCGGGTTCCGCACGGAAAGCCCAAGGTCTGGGATTACGGGGTGGTGGTCGG
    TTACACTTGGAATAACCTCGCAAATTCAGAATCTCTACAGGCTACGTTCGCGGATGGAAC
    >PITG_00003 | Phytophthora infestans T30-4 protein kinase (297 nt)
    ATGACGGCTGGGGTCGGTACGCCCTACTGGATCGCACCGGAGATTCTTGAAGGCAAACGG
    TACACTGAGCAAGCGGATATTTACTCGTTCGGAGTGGTTTTATCCGAGCTGGACACGTGC
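The excerpt cuts off before the second file is described, so the following is only a sketch under a stated assumption: that the second file lists one identifier per line, matching the first token of each FASTA header (for example `PITG_00002`). The names `read_fasta` and `select_by_id` are hypothetical, and the small pure-Python reader stands in for `SeqIO.parse` so the example has no dependencies.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def select_by_id(fasta_lines, id_lines):
    """Keep records whose first header token appears in id_lines.

    Assumes one identifier per line in id_lines (an assumption; the
    original excerpt does not show that file's format).
    """
    wanted = {line.strip() for line in id_lines if line.strip()}
    selected = []
    for header, sequence in read_fasta(fasta_lines):
        record_id = (header.split() or [""])[0]
        if record_id in wanted:
            selected.append((header, sequence))
    return selected
```

Both arguments can be open file handles, since iterating over a file yields its lines.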

Remove duplicated fasta sequence (bash or biopython method)

Question: Hello, I have a FASTA file such as:

    >sequence1_CP [seq virus]
    MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNL
    DITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE
    >sequence2 [virus]
    MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNL
    DITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE
    >sequence3
    MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNL
    DITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE
    >sequence4_CP hypothetical protein [another virus]
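The excerpt is truncated before it says which copy of a duplicated sequence should survive, so this sketch assumes the simplest rule: keep the first record seen for each distinct sequence. The function name `drop_duplicate_sequences` is hypothetical, and records are plain (header, sequence) pairs; with Biopython they could be built from `SeqIO.parse` as `(r.description, str(r.seq))`.

```python
def drop_duplicate_sequences(records):
    """Return records with repeated sequences removed, keeping the first.

    Assumption: "first occurrence wins"; the original question may have
    wanted a different rule for choosing which header to keep.
    """
    seen = set()
    unique = []
    for header, sequence in records:
        key = sequence.upper()  # compare case-insensitively
        if key not in seen:
            seen.add(key)
            unique.append((header, sequence))
    return unique


records = [
    ("sequence1_CP [seq virus]", "MQCKSG"),
    ("sequence2 [virus]", "MQCKSG"),
    ("sequence3", "mqcksg"),
    ("other", "DITKKF"),
]
print(drop_duplicate_sequences(records))
# [('sequence1_CP [seq virus]', 'MQCKSG'), ('other', 'DITKKF')]
```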

How to find an open reading frame in Python

Question: I am using Python and a regular expression to find an ORF (open reading frame). Find a substring of a string that is composed ONLY of the letters ATGC (no spaces or new lines) that starts with ATG, ends with TAG, TAA or TGA, and should consider the sequence from the first character, then the second and then the third: Seq= "CCTCAGCGAGGACAGCAAGGGACTAGCCAGGAGGGAGAACAGAAACTCCAGAACATCTTGGAAATAGCTCCCAGAAAAGC AAGCAGCCAACCAGGCAGGTTCTGTCCCTTTCACTCACTGGCCCAAGGCGCCACATCTCCCTCCAGAAAAGACACCATGA
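A minimal regular-expression sketch of the search described in the question, not the answer from the original thread. It assumes an ORF runs from an ATG to the first in-frame stop codon, and it reports the reading frame as the start position modulo 3 (0 for the first character, 1 for the second, 2 for the third). The names `ORF_PATTERN` and `find_orfs` are hypothetical.

```python
import re

# The lookahead lets overlapping candidates in all three frames be reported.
# The lazy codon group stops at the first in-frame TAG, TAA or TGA.
ORF_PATTERN = re.compile(r"(?=(ATG(?:[ATGC]{3})*?(?:TAG|TAA|TGA)))")


def find_orfs(sequence):
    """Return (start, frame, orf) for every ATG that reaches an in-frame stop."""
    cleaned = "".join(sequence.split()).upper()  # drop spaces and newlines
    return [(m.start(), m.start() % 3, m.group(1))
            for m in ORF_PATTERN.finditer(cleaned)]


for start, frame, orf in find_orfs("CCATGAAATAGGATGCCCTGA"):
    print(start, frame, orf)
# 2 2 ATGAAATAG
# 12 0 ATGCCCTGA
```

An ATG nested inside a longer ORF in the same frame is reported as a separate, shorter candidate ending at the same stop codon; filtering for the longest ORF per stop is left out of this sketch.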

Why are installing a package and installing a module not the same in Python?

Question: I want to install the Biopython module, so I used the command sudo apt-get install python-biopython. That installs the package. Now if I type import Bio in Python, the interpreter cannot find the module and raises ImportError: no module named Bio. Doesn't installing the package imply installing the module?

Answer 1: "I'm using 3.4.0" Since you are using Python 3.4, it won't work, because the Debian package you install via apt-get only installs the Python 2.x version. To install the Python 3 version, I

Muscle alignment in python

Question: I have a problem with printing the output of a MUSCLE alignment in Python. My code is:

    from Bio.Align.Applications import MuscleCommandline
    from StringIO import StringIO
    from Bio import AlignIO

    def align_v1 (Fasta):
        muscle_cline = MuscleCommandline(input="hiv_protease_sequences_w_wt.fasta")
        stdout, stderr = muscle_cline()
        MultipleSeqAlignment = AlignIO.read(StringIO(stdout), "fasta")
        print MultipleSeqAlignment

Any help?

Answer 1: It would be nice to know what error you received, but the following

Reading an entire directory of .pdb files using BioPython

Question: I was recently tasked with writing a program in Python to find atoms within 2 angstroms of every metal in a protein from a .pdb (Protein Data Bank) file. This is the script I wrote for it.

    from Bio.PDB import *
    parser = PDBParser(PERMISSIVE=True)

    def print_coordinates(list):
        neighborList = list
        for y in neighborList:
            print " ", y.get_coord()

    structure_id = '5m6n'
    fileName = '5m6n.pdb'
    structure = parser.get_structure(structure_id, fileName)
    atomList = Selection.unfold_entities(structure, 'A

Filtering a FASTA file based on sequence with BioPython

Question: I have a FASTA file. From that file, I need to get only the sequences containing GTACAGTAGG and CAACGGTTTTGCC at the end and/or start of the sequence and put them in a new FASTA file. So here's an example:

    >m121012_054644_42133_c100390582550000001523038311021245_s1_p0/7/2516_3269
    ***GTACAGTAGG***GTACACACAGAACGCGACAAGGCCAGGCGCTGGAGGAACTCCAGCAGCTAGATGCAAGCGACTA
    TCAGAGCGTTGGGTCCAGAACGAAGAACAGTCACTCAAGACTGCTTT***CAACGGTTTTGCC***
    >m121012_054644_42133_c100390582550000001523038311021245_s1_p0/7
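The "and/or" in the question leaves the exact rule open, so this sketch makes it a parameter: by default a sequence passes if either motif sits at its start or its end, and `require_both=True` demands a motif at both ends. The function name `has_flanking_motif` is hypothetical, and the sketch assumes exact matches only (no mismatches, no reverse complements).

```python
MOTIFS = ("GTACAGTAGG", "CAACGGTTTTGCC")


def has_flanking_motif(sequence, motifs=MOTIFS, require_both=False):
    """Check whether any motif sits at the start and/or end of sequence."""
    seq = sequence.upper()
    # str.startswith / str.endswith accept a tuple of alternatives
    at_start = seq.startswith(tuple(motifs))
    at_end = seq.endswith(tuple(motifs))
    if require_both:
        return at_start and at_end
    return at_start or at_end


print(has_flanking_motif("GTACAGTAGGTTTCAACGGTTTTGCC", require_both=True))  # True
print(has_flanking_motif("AAGTACAGTAGGAA"))  # False
```

With Biopython, the predicate could be applied to `str(record.seq)` for each record from `SeqIO.parse`, and the surviving records passed to `SeqIO.write` with the "fasta" format.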