biopython

SeqIO.parse on a fasta.gz

…衆ロ難τιáo~ submitted on 2019-12-10 01:58:51
Question: New to coding. New to Python/Biopython; this is my first question online, ever. How do I open a compressed fasta.gz file to extract info and perform calculations in my function? Here is a simplified example of what I'm trying to do (I've tried different ways), and what the error is. The gzip command I'm using doesn't seem to work.
    with gzip.open("practicezip.fasta.gz", "r") as handle:
        for record in SeqIO.parse(handle, "fasta"):
            print(record.id)
    Traceback (most recent call last): File "<ipython
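
The traceback above is cut off, so the cause can only be guessed. Assuming Python 3, a common culprit is that `gzip.open(..., "r")` returns bytes, whereas a FASTA parser expects text. A minimal sketch of opening the file in text mode (`"rt"`) follows. The helper names `iter_fasta_ids` and `print_ids_with_biopython` are made up for illustration, and the second one assumes Biopython is installed.

```python
import gzip


def iter_fasta_ids(path):
    """Yield the identifier of each record in a gzip-compressed FASTA file.

    Hypothetical stdlib-only helper: it only shows that "rt" yields str lines.
    """
    # "rt" decompresses and decodes to str; plain "r" would yield bytes.
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                # The ID is the first whitespace-delimited token after ">".
                yield fields[0] if fields else ""


def print_ids_with_biopython(path):
    """Same idea with SeqIO.parse (assumes Biopython is installed)."""
    # Imported here so the helper above stays usable without Biopython.
    from Bio import SeqIO

    with gzip.open(path, "rt") as handle:
        for record in SeqIO.parse(handle, "fasta"):
            print(record.id)
```

With the file from the question, `print_ids_with_biopython("practicezip.fasta.gz")` would be the call to try.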

Is there a function that can calculate a score for aligned sequences given the alignment parameters?

我与影子孤独终老i submitted on 2019-12-09 06:20:40
Question: I am trying to score already-aligned sequences. Let's say
    seq1 = 'PAVKDLGAEG-ASDKGT--SHVVY----------TI-QLASTFE'
    seq2 = 'PAVEDLGATG-ANDKGT--LYNIYARNTEGHPRSTV-QLGSTFE'
with the given parameters
    substitution matrix: blosum62
    gap open penalty: -5
    gap extension penalty: -1
I did look through the Biopython cookbook, but all I could find is the substitution matrix blosum62. I feel that someone must have already implemented this kind of library. So can anyone suggest any libraries or shortest code that can
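
One way to score a fixed alignment is to walk its columns and add a substitution score or a gap penalty for each. The sketch below rests on two assumptions the question does not spell out: the first position of a gap costs the open penalty and every further position costs the extension penalty, and columns where both sequences have a gap (as in the example, which looks like it came from a multiple alignment) are ignored. `score_alignment` and `blosum62_scorer` are illustrative names. The latter assumes a Biopython version that ships `Bio.Align.substitution_matrices`.

```python
def score_alignment(seq1, seq2, sub_score, gap_open=-5, gap_extend=-1):
    """Score two already-aligned, equal-length sequences column by column.

    sub_score is any callable (a, b) -> number, e.g. a BLOSUM62 lookup.
    """
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    gap_in = None  # which sequence has the currently open gap: 1, 2 or None
    for a, b in zip(seq1, seq2):
        if a == "-" and b == "-":
            continue  # assumption: double-gap columns contribute nothing
        if a == "-" or b == "-":
            side = 1 if a == "-" else 2
            # Continuing a gap in the same sequence extends it; otherwise open.
            total += gap_extend if gap_in == side else gap_open
            gap_in = side
        else:
            total += sub_score(a, b)
            gap_in = None
    return total


def blosum62_scorer():
    """Return a BLOSUM62 lookup callable (assumes Biopython is installed)."""
    from Bio.Align import substitution_matrices

    matrix = substitution_matrices.load("BLOSUM62")
    return lambda a, b: matrix[a, b]
```

With Biopython available, `score_alignment(seq1, seq2, blosum62_scorer())` applies the parameters from the question. Other tools may count the first gap position differently, so scores are only comparable under the same convention.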

Biopython SeqIO to Pandas Dataframe

不问归期 submitted on 2019-12-09 04:55:14
Question: I have a FASTA file that can easily be parsed by SeqIO.parse. I am interested in extracting sequence IDs and sequence lengths. I used these lines to do it, but I feel it's way too heavy (two iterations, conversions, etc.):
    from Bio import SeqIO
    import pandas as pd
    # parse sequence fasta file
    identifiers = [seq_record.id for seq_record in SeqIO.parse("sequence.fasta", "fasta")]
    lengths = [len(seq_record.seq) for seq_record in SeqIO.parse("sequence.fasta", "fasta")]
    #converting lists to
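
The two list comprehensions above each parse the file once. A single pass that collects `(id, length)` pairs avoids the second read. This is a sketch only: the function names and the column names `identifier` and `length` are invented here, and `fasta_to_dataframe` assumes both Biopython and pandas are installed.

```python
def id_length_rows(records):
    """Collect (id, sequence length) pairs in one pass over the records."""
    return [(rec.id, len(rec.seq)) for rec in records]


def fasta_to_dataframe(path):
    """Build a two-column DataFrame from a FASTA file in a single parse."""
    # Imported here so id_length_rows works without these packages.
    import pandas as pd
    from Bio import SeqIO

    rows = id_length_rows(SeqIO.parse(path, "fasta"))
    return pd.DataFrame(rows, columns=["identifier", "length"])
```

`fasta_to_dataframe("sequence.fasta")` would then replace both comprehensions and the list-to-DataFrame conversion.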

(BioPython) How do I stop MemoryError: Out of Memory exception?

我是研究僧i submitted on 2019-12-08 09:41:42
Question: I have a program where I take a pair of very large multiple sequence files (>77,000 sequences each, averaging about 1000 bp long), calculate the alignment score between each paired individual element, and write that number into an output file (which I will load into an Excel file later). My code works for small multiple sequence files, but my large master file throws the following traceback after analyzing the 16th pair.
    Traceback (most recent call last): File "C:\Users\Harry\Documents
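
The traceback is truncated, so the actual source of the MemoryError is unknown. It could just as well be the alignment call itself. Independently of that, the pairing step can be streamed so that only one pair of records is in memory at a time and each score is written as soon as it is computed. In the sketch below, `write_pair_scores` and the CSV column names are hypothetical, and `score_fn` stands in for whatever alignment scoring the program uses.

```python
import csv


def write_pair_scores(records_a, records_b, score_fn, out_handle):
    """Score records pairwise from two iterators and write rows immediately.

    Nothing is accumulated in a list, so memory use does not grow with the
    number of pairs. Returns how many pairs were written.
    """
    writer = csv.writer(out_handle)
    writer.writerow(["id_a", "id_b", "score"])
    count = 0
    # zip pulls one record from each iterator per step and stops at the
    # shorter input.
    for rec_a, rec_b in zip(records_a, records_b):
        writer.writerow([rec_a.id, rec_b.id, score_fn(rec_a.seq, rec_b.seq)])
        count += 1
    return count
```

Passing two `SeqIO.parse(...)` iterators directly, rather than lists built from them, keeps the reading side lazy as well.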

Update Biopython in Ipython shell

限于喜欢 submitted on 2019-12-08 09:29:00
Question: Python newbie here. Today I updated Biopython to v1.70. I use the Spyder/IPython shell for most of my work. If it makes any sense, the version of Biopython seems updated in the Python console (Spyder) and in the Windows command terminal Python installation, but the IPython console is still showing the older version. Should that be updated separately, and if yes, how? Thanks for answers.
Answer 1: For Anaconda on Windows, using the Command Prompt (what you get when typing cmd or clicking on Command Prompt in the
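
When two consoles report different versions of the same package, they are often running different interpreters or environments. That is an assumption about this case, not something the excerpt confirms. A small check that can be pasted into each console shows which interpreter is running and where it would import the package from. `describe_install` is a made-up helper name.

```python
import importlib.util
import sys


def describe_install(module_name="Bio"):
    """Report the running interpreter and the file a module would load from."""
    spec = importlib.util.find_spec(module_name)
    return {
        "interpreter": sys.executable,
        # None means this interpreter cannot find the module at all.
        "module_path": spec.origin if spec is not None else None,
    }
```

If the two consoles print different `interpreter` or `module_path` values, the update went into one environment only.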

List and RestrictionType from Biopython

ぐ巨炮叔叔 submitted on 2019-12-08 06:52:45
Question: I am experiencing some problems using the Bio.Restriction methods, and I am not sure if it is due to Python, Biopython or my poor understanding of Python. When I try to create a RestrictionBatch following the cookbook, I want to use enzymes from a dictionary (read from files). The cookbook says: "You can initiate a restriction batch by passing it a list of enzymes or enzymes name as argument." The Python documentation for dict.keys says: "Return a copy of the dictionary's list of keys". So I tried

urllib2.HTTPError Python

坚强是说给别人听的谎言 submitted on 2019-12-08 05:30:05
Question: I have a file with GI numbers and would like to get FASTA sequences from NCBI.
    from Bio import Entrez
    import time
    Entrez.email ="eigtw59tyjrt403@gmail.com"
    f = open("C:\\bioinformatics\\gilist.txt")
    for line in iter(f):
        handle = Entrez.efetch(db="nucleotide", id=line, retmode="xml")
        records = Entrez.read(handle)
        print ">GI "+line.rstrip()+" "+records[0]["GBSeq_primary-accession"]+" "+records[0]["GBSeq_definition"]+"\n"+records[0]["GBSeq_sequence"]
        time.sleep(1) # to make sure not many
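
The excerpt does not show the HTTP error itself, so the following is a guess at likely contributors rather than a diagnosis. Each `line` is passed to `efetch` with its trailing newline (and possibly blank lines), and one request is made per ID. A sketch that cleans the IDs, sends them in batches, and asks for FASTA text directly is shown below. The helper names, the batch size and the pause are illustrative choices, and `fetch_fasta` assumes Biopython and network access.

```python
import time


def clean_ids(lines):
    """Strip whitespace and newlines from each line and drop empty ones."""
    return [line.strip() for line in lines if line.strip()]


def batched(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_fasta(ids, email, batch_size=100, pause=1.0):
    """Fetch nucleotide records as FASTA text, one request per batch."""
    # Imported here so the helpers above run without Biopython.
    from Bio import Entrez

    Entrez.email = email
    chunks = []
    for batch in batched(ids, batch_size):
        handle = Entrez.efetch(db="nucleotide", id=",".join(batch),
                               rettype="fasta", retmode="text")
        chunks.append(handle.read())
        handle.close()
        time.sleep(pause)  # stay polite between requests
    return "".join(chunks)
```

Typical use would be `fetch_fasta(clean_ids(open(path)), email="you@example.org")`, where the address is a placeholder for your own.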

The new RefSeq release from NCBI is compatible with Bio.Entrez.Parser?

随声附和 submitted on 2019-12-08 03:16:41
Question: I'm new to Python and especially to Biopython. I'm trying to take some information from an XML file with Entrez.efetch and then read it. Last week this script worked well:
    handle = Entrez.efetch(db="Protein", id="YP_008872780.1", retmode="xml")
    records = Entrez.read(handle)
But now I'm getting an error:
    Bio.Entrez.Parser.ValidationError: Failed to find tag 'GBSeq_xrefs' in the DTD. To skip all tags that are not represented in the DTD, please call Bio.Entrez.read or Bio.Entrez.parse with
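
`Bio.Entrez.read` accepts a `validate` keyword, and passing `validate=False` tells the parser to skip tags missing from its bundled DTD instead of raising. A minimal sketch follows. The wrapper name `read_entrez_protein` and its parameters are made up for illustration, and running it needs Biopython and network access.

```python
def read_entrez_protein(accession, email, validate=False):
    """Fetch one protein record as XML and parse it, tolerating unknown tags."""
    from Bio import Entrez

    Entrez.email = email
    handle = Entrez.efetch(db="protein", id=accession, retmode="xml")
    try:
        # validate=False skips tags that are not represented in the DTD.
        return Entrez.read(handle, validate=validate)
    finally:
        handle.close()
```

Skipping validation means any new tags are dropped from the parsed result, so upgrading Biopython to a release whose DTDs include them is the longer-term option.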

How to download complete genome sequences with Biopython Entrez.esearch

ⅰ亾dé卋堺 submitted on 2019-12-07 12:53:37
Question: I have to download only complete genome sequences from NCBI (GenBank (full) format). I am interested in 'complete genome', not 'whole genome'. My script:
    from Bio import Entrez
    Entrez.email = "asiakXX@wp.pl"
    gatunek='Escherichia[ORGN]'
    handle = Entrez.esearch(db='nucleotide', term=gatunek, property='complete genome' )#title='complete genome[title]')
    result = Entrez.read(handle)
As a result I get only small fragments of genomes, with a size of about 484 bp:
    LOCUS NZ_KE350773 484 bp DNA linear CON
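
Entrez search filters are normally written into the `term` string with field tags rather than passed as separate keyword arguments. One heuristic is to require the phrase in the record title, sketched below. Whether a title filter captures exactly the intended set of complete genomes is an assumption to verify against the results. `build_term` and `search_ids` are illustrative names, and `search_ids` needs Biopython and network access.

```python
def build_term(organism, title_phrase="complete genome"):
    """Compose an Entrez query restricting by organism and title phrase."""
    return '{}[ORGN] AND "{}"[Title]'.format(organism, title_phrase)


def search_ids(term, email, retmax=20):
    """Run esearch on the nucleotide database and return the matching IDs."""
    from Bio import Entrez

    Entrez.email = email
    handle = Entrez.esearch(db="nucleotide", term=term, retmax=retmax)
    result = Entrez.read(handle)
    handle.close()
    return list(result["IdList"])
```

`build_term("Escherichia")` gives `Escherichia[ORGN] AND "complete genome"[Title]`. The returned IDs can then be passed to `Entrez.efetch` with `rettype="gbwithparts"` and `retmode="text"` to get full GenBank records.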

Python find longest ORF in DNA sequence

荒凉一梦 submitted on 2019-12-06 15:33:33
Question: Can someone show me a straightforward solution for how to calculate the longest open reading frame (ORF) in a DNA sequence? ATG is the start codon (i.e., the beginning of an ORF) and TAG, TGA, and TAA are stop codons (i.e., the end of an ORF). Here's some code that produces errors (and uses an external module called Biopython):
    import sys
    from Bio import SeqIO
    currentCid = ''
    buffer = []
    for record in SeqIO.parse(open(sys.argv[1]),"fasta"):
        cid = str(record.description).split('.')[0][1:]
        if
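
Using the definition given in the question (ATG start, first in-frame TAG/TGA/TAA stop), the longest ORF can be found with plain string handling, without Biopython. The sketch below scans the three forward frames and, optionally, the reverse complement. It assumes an ORF must end in a stop codon, so an ATG that runs off the end of the sequence is ignored. All function names are illustrative.

```python
STOP_CODONS = {"TAG", "TGA", "TAA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def longest_orf_forward(seq):
    """Longest ATG...stop stretch over the three forward reading frames."""
    seq = seq.upper()
    best = ""
    for frame in range(3):
        start = None  # index of the first ATG since the last stop codon
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                orf = seq[start:i + 3]  # include the stop codon
                if len(orf) > len(best):
                    best = orf
                start = None
    return best


def longest_orf(seq, both_strands=True):
    """Longest ORF on the given strand and, optionally, its reverse complement."""
    candidates = [longest_orf_forward(seq)]
    if both_strands:
        candidates.append(longest_orf_forward(reverse_complement(seq.upper())))
    return max(candidates, key=len)
```

For records read with SeqIO, `longest_orf(str(record.seq))` applies this to each sequence. An empty string means no complete ORF was found.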