biopython | 易学教程

SeqIO.parse on a fasta.gz

Question: New to coding, and new to Python/Biopython; this is my first question online, ever. How do I open a compressed fasta.gz file to extract information and perform calculations in my function? Here is a simplified example of what I'm trying to do (I've tried different ways), and the error I get. The gzip command I'm using doesn't seem to work.

    with gzip.open("practicezip.fasta.gz", "r") as handle:
        for record in SeqIO.parse(handle, "fasta"):
            print(record.id)

Traceback (most recent call last): File "<ipython
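The traceback above is cut off, so the cause is an assumption, but a common one in Python 3 is that gzip.open with mode "r" yields bytes while a FASTA parser expects text. A minimal sketch of the text-mode alternative follows; it uses only the standard library and a hypothetical helper, read_fasta_ids, with made-up demo contents. A handle opened the same way with "rt" can be passed to SeqIO.parse(handle, "fasta").

```python
import gzip
import os
import tempfile


def read_fasta_ids(path):
    """Return the IDs (first word of each '>' header) from a gzipped FASTA file."""
    # "rt" opens the compressed file in text mode, so each line is str, not bytes.
    with gzip.open(path, "rt") as handle:
        return [
            line[1:].split()[0]
            for line in handle
            if line.startswith(">") and line[1:].strip()
        ]


# Build a tiny demo file (hypothetical contents) and read it back.
demo_path = os.path.join(tempfile.mkdtemp(), "practicezip.fasta.gz")
with gzip.open(demo_path, "wt") as out:
    out.write(">seq1 first record\nACGT\n>seq2 second record\nGGCC\n")

print(read_fasta_ids(demo_path))
```

The only change relative to the question's snippet is the mode string; the rest of the loop can stay as it is.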

Is there a function that can calculate a score for aligned sequences given the alignment parameters?

Question: I am trying to score sequences that are already aligned. Let's say

    seq1 = 'PAVKDLGAEG-ASDKGT--SHVVY----------TI-QLASTFE'
    seq2 = 'PAVEDLGATG-ANDKGT--LYNIYARNTEGHPRSTV-QLGSTFE'

with the given parameters: substitution matrix: blosum62; gap open penalty: -5; gap extension penalty: -1. I looked through the Biopython cookbook, but all I could find was the blosum62 substitution matrix, and I feel that someone must have already implemented this kind of library. So can anyone suggest any libraries or shortest code that can

Biopython SeqIO to Pandas Dataframe

Question: I have a FASTA file that can easily be parsed by SeqIO.parse. I am interested in extracting sequence IDs and sequence lengths. I used these lines to do it, but I feel it's way too heavy (two iterations, conversions, etc.):

    from Bio import SeqIO
    import pandas as pd

    # parse sequence fasta file
    identifiers = [seq_record.id for seq_record in SeqIO.parse("sequence.fasta", "fasta")]
    lengths = [len(seq_record.seq) for seq_record in SeqIO.parse("sequence.fasta", "fasta")]

    #converting lists to
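The double iteration can be avoided by collecting both values in one pass. The sketch below is only an illustration: id_length_rows is a hypothetical helper that accepts any iterable of objects with .id and .seq attributes (which is what SeqIO.parse yields), and FakeRecord is a stand-in so the example runs without Biopython or a FASTA file.

```python
from collections import namedtuple


def id_length_rows(records):
    """Collect (id, sequence length) pairs in a single pass over the records."""
    return [(record.id, len(record.seq)) for record in records]


# Stand-in records for illustration; real input would come from
# SeqIO.parse("sequence.fasta", "fasta").
FakeRecord = namedtuple("FakeRecord", ["id", "seq"])
demo_records = [FakeRecord("seqA", "ACGTAC"), FakeRecord("seqB", "GG")]

rows = id_length_rows(demo_records)
print(rows)
```

With pandas available, the resulting list of tuples can be handed to pd.DataFrame(rows, columns=["id", "length"]), so the file is read only once.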

(BioPython) How do I stop MemoryError: Out of Memory exception?

Question: I have a program where I take a pair of very large multiple-sequence files (>77,000 sequences each, averaging about 1000 bp long), calculate the alignment score between each pair of corresponding elements, and write that number into an output file (which I will load into an Excel file later). My code works for small multiple-sequence files, but my large master file throws the following traceback after analyzing the 16th pair.

Traceback (most recent call last): File "C:\Users\Harry\Documents

Update Biopython in Ipython shell

Question: Python newbie here. Today I updated Biopython to v1.70. I use the Spyder/IPython shell for most of my work. If it makes any sense, the version of Biopython seems to be updated in the Python console (Spyder) and in the Windows command terminal Python installation, but the IPython console is still showing the older version. Should that be updated separately, and if yes, how? Thanks for answers.

Answer 1: For Anaconda on Windows, using the Command Prompt (what you get when typing cmd or clicking on Command Prompt in the
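When two consoles report different versions, a quick diagnostic is to check which interpreter each console is running and which installed version that interpreter sees. The sketch below is an assumption about how one might check this, not part of the truncated answer; describe_environment is a hypothetical helper built on the standard library's importlib.metadata (Python 3.8 or later).

```python
import sys
from importlib import metadata


def describe_environment(package="biopython"):
    """Report the running interpreter and the package version it can see."""
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        version = None  # the package is not installed for this interpreter
    return {"executable": sys.executable, "package": package, "version": version}


print(describe_environment())
```

Running the same snippet in each console shows whether they point at different Python installations, in which case the package has to be updated in each one.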

List and RestrictionType from Biopython

Question: I am experiencing some problems using the Bio.Restriction methods, and I am not sure whether they are due to Python, Biopython, or my poor understanding of Python. I want to create a RestrictionBatch following the cookbook, using enzymes from a dictionary (read from files). The cookbook says: "You can initiate a restriction batch by passing it a list of enzymes or enzymes name as argument." The Python documentation for dict.keys says: "Return a copy of the dictionary's list of keys." So I tried

urllib2.HTTPError Python

Question: I have a file with GI numbers and would like to get FASTA sequences from NCBI.

    from Bio import Entrez
    import time

    Entrez.email ="eigtw59tyjrt403@gmail.com"
    f = open("C:\\bioinformatics\\gilist.txt")
    for line in iter(f):
        handle = Entrez.efetch(db="nucleotide", id=line, retmode="xml")
        records = Entrez.read(handle)
        print ">GI "+line.rstrip()+" "+records[0]["GBSeq_primary-accession"]+" "+records[0]["GBSeq_definition"]+"\n"+records[0]["GBSeq_sequence"]
        time.sleep(1) # to make sure not many

The new RefSeq release from NCBI is compatible with Bio.Entrez.Parser?

Question: I'm new to Python and especially to Biopython. I'm trying to fetch some information as XML with Entrez.efetch and then read it. Last week this script worked well:

    handle = Entrez.efetch(db="Protein", id="YP_008872780.1", retmode="xml")
    records = Entrez.read(handle)

But now I'm getting an error:

    Bio.Entrez.Parser.ValidationError: Failed to find tag 'GBSeq_xrefs' in the DTD. To skip all tags that are not represented in the DTD, please call Bio.Entrez.read or Bio.Entrez.parse with

how to download complete genome sequence in biopython entrez.esearch

Question: I have to download only complete genome sequences from NCBI (GenBank (full) format). I am interested in 'complete genome', not 'whole genome'. My script:

    from Bio import Entrez
    Entrez.email = "asiakXX@wp.pl"
    gatunek='Escherichia[ORGN]'
    handle = Entrez.esearch(db='nucleotide', term=gatunek, property='complete genome' )#title='complete genome[title]')
    result = Entrez.read(handle)

As a result I get only small fragments of genomes, about 484 bp in size: LOCUS NZ_KE350773 484 bp DNA linear CON

Python find longest ORF in DNA sequence

Question: Can someone show me a straightforward solution for how to calculate the longest open reading frame (ORF) in a DNA sequence? ATG is the start codon (i.e., the beginning of an ORF) and TAG, TGA, and TAA are stop codons (i.e., the end of an ORF). Here's some code that produces errors (and uses an external module called BioPython):

    import sys
    from Bio import SeqIO

    currentCid = ''
    buffer = []
    for record in SeqIO.parse(open(sys.argv[1]),"fasta"):
        cid = str(record.description).split('.')[0][1:]
        if
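A plain-Python sketch of one possible approach, using the start and stop codons named in the question. It rests on several assumptions: only the three forward reading frames are scanned (no reverse complement), an ORF runs from an ATG to the first in-frame stop codon and includes that stop, and an ATG found inside an open ORF does not start a new one. The function name longest_orf is hypothetical.

```python
STOP_CODONS = {"TAG", "TGA", "TAA"}


def longest_orf(dna):
    """Return the longest ATG...stop stretch found in the three forward frames."""
    dna = dna.upper()
    best = ""
    for frame in range(3):
        start = None  # index of the ATG that opened the current ORF, if any
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                orf = dna[start:i + 3]  # include the stop codon
                if len(orf) > len(best):
                    best = orf
                start = None
    return best


print(longest_orf("CCATGAAATAGATGCCCGGGTTTTAA"))
```

The function takes a plain string, so with Biopython it could be applied to str(record.seq) for each record returned by SeqIO.parse.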