biopython

How do I pass Biopython SeqIO.convert() over multiple files in a directory?

。_饼干妹妹 posted on 2019-12-23 05:20:38
Question: I’m writing a Python 2.7 script that converts every input file (.nexus format) in a specified directory to .fasta format. Biopython's SeqIO.convert function handles the conversion perfectly for individually specified files, but when I try to automate the process over a directory using os.walk, I’m unable to pass the pathname of each input file to SeqIO.convert correctly. Where am I going wrong? Do I need to use join() from the os.path module and pass the full path names on to
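
One likely cause is that os.walk yields bare file names, so each name has to be joined with its directory path before it is handed to SeqIO.convert. The following is a minimal sketch under that assumption, written for Python 3; the helper names iter_conversion_paths and convert_directory are hypothetical, and the Biopython import sits inside the second function so the path helper can be tried without Biopython installed.

```python
import os


def iter_conversion_paths(root, in_ext=".nexus", out_ext=".fasta"):
    """Yield (input_path, output_path) pairs for every matching file under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(in_ext):
                # os.walk gives names relative to dirpath, so build the full path.
                in_path = os.path.join(dirpath, name)
                out_path = os.path.splitext(in_path)[0] + out_ext
                yield in_path, out_path


def convert_directory(root):
    """Convert every .nexus file under root to .fasta (requires Biopython)."""
    from Bio import SeqIO  # imported lazily; assumed to be installed

    for in_path, out_path in iter_conversion_paths(root):
        SeqIO.convert(in_path, "nexus", out_path, "fasta")
```

Calling convert_directory("some_directory") would write each .fasta file next to its .nexus source; the directory name here is only a placeholder.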

How to extract short sequence using window with specific step size?

穿精又带淫゛_ posted on 2019-12-23 02:29:52
Question: The code below extracts short sequences from every sequence using a window size of 4. How can I shift the window by a step size of 2 and extract 4 base pairs each time?

Example code:

    from Bio import SeqIO

    with open("testA_out.fasta", "w") as f:
        for seq_record in SeqIO.parse("testA.fasta", "fasta"):
            i = 0
            while (i + 4) < len(seq_record.seq):
                f.write(">" + str(seq_record.id) + "\n")
                f.write(str(seq_record.seq[i:i+4]) + "\n")
                i += 2

Example input (testA.fasta):

    >human1
    ACCCGATTT

Example output (testA_out):

    >human1
    ACCC
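
The windowing step can be isolated from the file handling. The sketch below is one possible approach, not taken from the original thread: a hypothetical generator, sliding_windows, that uses range with a step argument. It works on plain strings, so a Biopython sequence would be passed as str(seq_record.seq). Note that a strict `<` comparison like the one in the question skips a window that ends exactly at the end of the sequence, whereas this version includes it.

```python
def sliding_windows(seq, size=4, step=2):
    """Yield every full-length window of `size` characters, advancing by `step`."""
    # The range stops at len(seq) - size so the last complete window is included.
    for start in range(0, len(seq) - size + 1, step):
        yield seq[start:start + size]


windows = list(sliding_windows("ACCCGATTT"))
print(windows)  # ['ACCC', 'CCGA', 'GATT']
```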

Biopython: How to avoid particular amino acid sequences from a protein so as to plot Ramachandran plot?

陌路散爱 posted on 2019-12-22 09:26:05
Question: I have written a Python script to plot the Ramachandran plot of the ubiquitin protein. I am using Biopython and working with PDB files. My script is below:

    import Bio.PDB
    import numpy as np
    import matplotlib as mpl
    import matplotlib.pyplot as plt

    phi_psi = ([0,0])
    phi_psi = np.array(phi_psi)
    pdb1 ='/home/devanandt/Documents/VMD/1UBQ.pdb'
    for model in Bio.PDB.PDBParser().get_structure('1UBQ',pdb1) :
        for chain in model :
            polypeptides = Bio.PDB.PPBuilder().build_peptides(chain)
            for poly

Extract sequences from a FASTA file based on entries in a separate file

怎甘沉沦 posted on 2019-12-18 18:28:24
Question: I have two files. File 1 is a FASTA file with gene sequences, formatted like this example:

    >PITG_00002 | Phytophthora infestans T30-4 conserved hypothetical protein (426 nt)
    ATGCATCGCTCGGGTTCCGCACGGAAAGCCCAAGGTCTGGGATTACGGGGTGGTGGTCGG
    TTACACTTGGAATAACCTCGCAAATTCAGAATCTCTACAGGCTACGTTCGCGGATGGAAC
    >PITG_00003 | Phytophthora infestans T30-4 protein kinase (297 nt)
    ATGACGGCTGGGGTCGGTACGCCCTACTGGATCGCACCGGAGATTCTTGAAGGCAAACGG
    TACACTGAGCAAGCGGATATTTACTCGTTCGGAGTGGTTTTATCCGAGCTGGACACGTGC
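
The layout of the second file is not shown in this excerpt, so the sketch below assumes it lists one record ID per line (for example PITG_00003). Under that assumption, a streaming filter can keep only the FASTA records whose ID, taken as the first token after ">", appears in the list. The function names are hypothetical, and the code uses only the standard library.

```python
def filter_fasta_by_ids(fasta_lines, wanted_ids):
    """Yield the FASTA lines of records whose ID is in `wanted_ids`."""
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            # The record ID is the first whitespace-delimited token after '>'.
            fields = line[1:].split()
            keep = bool(fields) and fields[0] in wanted_ids
        if keep:
            yield line


def extract_records(fasta_path, ids_path, out_path):
    """Write the records named in ids_path (assumed: one ID per line) to out_path."""
    with open(ids_path) as handle:
        wanted = {line.strip() for line in handle if line.strip()}
    with open(fasta_path) as src, open(out_path, "w") as dst:
        dst.writelines(filter_fasta_by_ids(src, wanted))
```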

Remove duplicated FASTA sequences (bash or Biopython method)

岁酱吖の posted on 2019-12-18 09:52:00
Question: Hello, I have a FASTA file such as:

    >sequence1_CP [seq virus]
    MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNL
    DITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE
    >sequence2 [virus]
    MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNL
    DITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE
    >sequence3
    MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNL
    DITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE
    >sequence4_CP hypothetical protein [another virus]
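
The excerpt stops before stating which copy of a duplicated sequence should survive, so the sketch below assumes the simplest rule: keep the first record seen for each distinct sequence. Both helper names are hypothetical, and the small parser stands in for SeqIO.parse to keep the example free of dependencies.

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines of one record are concatenated into a single string.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def drop_duplicate_sequences(records):
    """Yield records in order, skipping any whose sequence was already seen."""
    seen = set()
    for header, seq in records:
        if seq not in seen:
            seen.add(seq)
            yield header, seq
```

A different rule, such as preferring headers that contain a particular tag, would need a first pass that groups records by sequence before choosing one per group.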

How to find an open reading frame in Python

久未见 posted on 2019-12-18 05:43:40
Question: I am using Python and a regular expression to find an ORF (open reading frame). I need to find a substring of a string composed ONLY of the letters ATGC (no spaces or newlines) that starts with ATG and ends with TAG, TAA, or TGA. The search should read the sequence starting from the first character, then from the second, and then from the third:

    Seq= "CCTCAGCGAGGACAGCAAGGGACTAGCCAGGAGGGAGAACAGAAACTCCAGAACATCTTGGAAATAGCTCCCAGAAAAGC
    AAGCAGCCAACCAGGCAGGTTCTGTCCCTTTCACTCACTGGCCCAAGGCGCCACATCTCCCTCCAGAAAAGACACCATGA
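
One way to read this requirement is to scan each of the three forward frames codon by codon and, wherever a codon is ATG, match lazily up to the first in-frame stop codon. The sketch below follows that reading; find_orfs is a hypothetical name, the sequence is assumed to be uppercase with line breaks already removed, and only the forward strand is searched.

```python
import re

# ATG, then as few whole codons as possible, then the first in-frame stop codon.
ORF_PATTERN = re.compile(r"ATG(?:[ATGC]{3})*?(?:TAG|TAA|TGA)")


def find_orfs(seq):
    """Return (frame, start_index, orf) tuples for the three forward frames."""
    orfs = []
    for frame in range(3):
        pos = frame
        while pos + 3 <= len(seq):
            # match() anchors at pos, so the ORF stays in this reading frame.
            match = ORF_PATTERN.match(seq, pos)
            if match:
                orfs.append((frame + 1, pos, match.group()))
                pos = match.end()  # ORF length is a multiple of 3
            else:
                pos += 3
    return orfs


print(find_orfs("CATGCCCTGAA"))  # [(2, 1, 'ATGCCCTGA')]
```

Frames are numbered 1 to 3 and start_index is zero-based. An ATG with no downstream in-frame stop codon is skipped rather than reported.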

Why is installing a package not the same as installing a module in Python?

梦想与她 posted on 2019-12-13 19:32:58
Question: I want to install the Biopython module, so I used the command sudo apt-get install python-biopython. That installs the package. Now, if I type import Bio in Python, the interpreter cannot find the module and raises ImportError: no module named Bio. Doesn't installing the package imply installing the module?

Answer 1: "I'm using 3.4.0". Since you are using Python 3.4, it won't work, because the Debian package you install via apt-get only installs the Python 2.x version. To install the Python 3 version, I
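
A mismatch like this can be confirmed from inside the interpreter that raises the ImportError. The sketch below (Python 3 only; describe_module is a hypothetical helper) reports which interpreter is running and whether it can locate a given top-level module, without importing it.

```python
import importlib.util
import sys


def describe_module(name):
    """Report the running interpreter and whether it can locate module `name`."""
    spec = importlib.util.find_spec(name)  # None if the module is not on sys.path
    return {
        "python": sys.version.split()[0],
        "executable": sys.executable,
        "module": name,
        "found": spec is not None,
        "location": spec.origin if spec is not None else None,
    }


print(describe_module("Bio"))
```

If "found" is False under python3 while the same module imports under Python 2, the package was installed for a different interpreter.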

Muscle alignment in python

老子叫甜甜 posted on 2019-12-13 11:20:25
Question: I have a problem printing the output of a MUSCLE alignment in Python. My code is:

    from Bio.Align.Applications import MuscleCommandline
    from StringIO import StringIO
    from Bio import AlignIO

    def align_v1 (Fasta):
        muscle_cline = MuscleCommandline(input="hiv_protease_sequences_w_wt.fasta")
        stdout, stderr = muscle_cline()
        MultipleSeqAlignment = AlignIO.read(StringIO(stdout), "fasta")
        print MultipleSeqAlignment

Any help?

Answer 1: It would be nice to know what error you received, but the following

Reading an entire directory of .pdb files using BioPython

时光怂恿深爱的人放手 posted on 2019-12-13 07:30:39
Question: I was recently tasked with writing a Python program to find the atoms within 2 angstroms of every metal in a protein from a .pdb (Protein Data Bank) file. This is the script I wrote for it:

    from Bio.PDB import *

    parser = PDBParser(PERMISSIVE=True)

    def print_coordinates(list):
        neighborList = list
        for y in neighborList:
            print " ", y.get_coord()

    structure_id = '5m6n'
    fileName = '5m6n.pdb'
    structure = parser.get_structure(structure_id, fileName)
    atomList = Selection.unfold_entities(structure, 'A

Filtering a FASTA file based on sequence with BioPython

痞子三分冷 posted on 2019-12-12 16:30:21
Question: I have a FASTA file. From that file, I need to get only the sequences containing GTACAGTAGG and CAACGGTTTTGCC at the end and/or start of the sequence and put them in a new FASTA file. Here's an example:

    >m121012_054644_42133_c100390582550000001523038311021245_s1_p0/7/2516_3269
    ***GTACAGTAGG***GTACACACAGAACGCGACAAGGCCAGGCGCTGGAGGAACTCCAGCAGCTAGATGCAAGCGACTA
    TCAGAGCGTTGGGTCCAGAACGAAGAACAGTCACTCAAGACTGCTTT***CAACGGTTTTGCC***
    >m121012_054644_42133_c100390582550000001523038311021245_s1_p0/7
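
The phrase "at the end and/or start" leaves the exact rule open, so the sketch below makes it a parameter. It assumes, following the example record, that GTACAGTAGG is expected at the start and CAACGGTTTTGCC at the end, and that any emphasis markers such as *** are not part of the real sequence. The function name has_flanking_motif is hypothetical, and the test operates on plain strings, so a Biopython record would be passed as str(record.seq).

```python
START_MOTIF = "GTACAGTAGG"
END_MOTIF = "CAACGGTTTTGCC"


def has_flanking_motif(seq, start=START_MOTIF, end=END_MOTIF, require_both=False):
    """Return True if seq begins with `start` and/or finishes with `end`."""
    at_start = seq.startswith(start)
    at_end = seq.endswith(end)
    # require_both=True keeps only sequences flanked on both sides.
    return (at_start and at_end) if require_both else (at_start or at_end)


reads = ["GTACAGTAGGAAAACAACGGTTTTGCC", "GTACAGTAGGAAAA", "AAAA"]
kept = [read for read in reads if has_flanking_motif(read)]
print(kept)  # ['GTACAGTAGGAAAACAACGGTTTTGCC', 'GTACAGTAGGAAAA']
```

With Biopython installed, the same predicate could filter the records from SeqIO.parse before they are passed to SeqIO.write.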