Question
I have a file with GI numbers and would like to get the corresponding FASTA sequences from NCBI.
from Bio import Entrez
import time

Entrez.email = "eigtw59tyjrt403@gmail.com"

f = open("C:\\bioinformatics\\gilist.txt")
for line in iter(f):
    handle = Entrez.efetch(db="nucleotide", id=line, retmode="xml")
    records = Entrez.read(handle)
    print ">GI "+line.rstrip()+" "+records[0]["GBSeq_primary-accession"]+" "+records[0]["GBSeq_definition"]+"\n"+records[0]["GBSeq_sequence"]
    time.sleep(1)  # to make sure not many requests go per second to ncbi
f.close()
The script runs fine, but after a few sequences I suddenly get this error message:
Traceback (most recent call last):
File "C:/Users/Ankur/PycharmProjects/ncbiseq/getncbiSeq.py", line 7, in <module>
handle = Entrez.efetch(db="nucleotide", id=line, retmode="xml")
File "C:\Python27\lib\site-packages\Bio\Entrez\__init__.py", line 139, in efetch
return _open(cgi, variables)
File "C:\Python27\lib\site-packages\Bio\Entrez\__init__.py", line 455, in _open
raise exception
urllib2.HTTPError: HTTP Error 500: Internal Server Error
Of course I can use http://www.ncbi.nlm.nih.gov/sites/batchentrez, but I am trying to build a pipeline and would like something automated.
How can I prevent NCBI from "kicking me out"?
Answer 1:
I'm not familiar with the NCBI API, but my guess is that you're violating some kind of rate-limiting rule (even with "sleep(1)"), so your earlier requests work, but after a few requests the server sees that you're hitting it too frequently and blocks you. This is a problem for you because your code has no error handling.
I'd recommend wrapping the data fetch in a try/except block so your script waits longer and then tries again if it runs into trouble. If all else fails, write the id that caused the error to a file and continue (in case the id is somehow the culprit, perhaps causing the Entrez library to generate a bad URL).
Try changing your code to something like this (untested):
from urllib2 import HTTPError
from Bio import Entrez
import time

def get_record(_id):
    handle = Entrez.efetch(db="nucleotide", id=_id, retmode="xml")
    records = Entrez.read(handle)
    print ">GI "+_id.rstrip()+" "+records[0]["GBSeq_primary-accession"]+" "+records[0]["GBSeq_definition"]+"\n"+records[0]["GBSeq_sequence"]
    time.sleep(1)  # to make sure not many requests go per second to ncbi

Entrez.email = "eigtw59tyjrt403@gmail.com"

f = open("C:\\bioinformatics\\gilist.txt")
for id in iter(f):
    try:
        get_record(id)
    except HTTPError:
        print "Error fetching", id
        time.sleep(5)  # we have angered the API! Try waiting longer?
        try:
            get_record(id)
        except:
            # give up on this id: log it and move on
            with open('error_records.bad', 'a') as bad_f:
                bad_f.write(str(id).strip() + '\n')
            continue
f.close()
Answer 2:
There is a workaround: batch your ids per efetch request. You could split your list into batches of 200 (gut feeling says this is an OK batch size) and send each batch to efetch in a single request.
First, this is much, much faster than sending 200 individual queries. Second, it also effectively complies with the "3 queries per second" rule, because the processing time per request is longer than 0.33 seconds, but not too long.
However, you do need a mechanism to catch the "bad apples". NCBI returns no results at all if even one of your 200 ids is bad; in other words, it returns results only if all 200 ids are valid.
When a batch contains a bad apple, I iterate through its ids one by one and skip the bad ones. This "bad apple" scenario also tells you not to make the batch too big: the larger the batch, the higher the chance it contains a bad apple (so you fall back to one-by-one fetching more often), and the more individual ids you then have to iterate through.
I use the following code to download the CAZy proteins and it works well:
import urllib2

prefix = "http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&rettype=fasta&id="
id_per_request = 200

def getSeq(id_list):
    # id_list ends with a trailing comma, so drop the last character
    url = prefix + id_list[:len(id_list)-1]
    temp_content = ""
    try:
        temp_content += urllib2.urlopen(url).read()
    ### if there is a bad apple, try one by one
    except:
        for id in id_list[:len(id_list)-1].split(","):
            url = prefix + id
            #print url
            try:
                temp_content += urllib2.urlopen(url).read()
            except:
                #print id
                pass
    return temp_content

content = ""
counter = 0
id_list = ""

# define your accession numbers first, here it is just an example!!
accs = ["ADL19140.1","ABW01768.1","CCQ33656.1"]

for acc in accs:
    id_list += acc + ","
    counter += 1
    if counter == id_per_request:
        counter = 0
        content += getSeq(id_list)
        id_list = ""

if id_list != "":
    content += getSeq(id_list)
    id_list = ""

print content
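For comparison, the same batching idea can also be written with Bio.Entrez, which the question already uses, instead of raw urllib2: Entrez.efetch accepts a comma-separated list of ids in a single request. The following is only a minimal sketch under that assumption; the helper name fetch_batch and the output file sequences.fasta are made up for the example, the input path and batch size are taken from the question and the answer above, and the "bad apple" fallback is left out.
from Bio import Entrez
import time

Entrez.email = "eigtw59tyjrt403@gmail.com"

def fetch_batch(id_batch):
    # one efetch call covers the whole batch; ids are joined with commas
    handle = Entrez.efetch(db="nucleotide", id=",".join(id_batch),
                           rettype="fasta", retmode="text")
    data = handle.read()
    handle.close()
    return data

with open("C:\\bioinformatics\\gilist.txt") as f:
    ids = [line.strip() for line in f if line.strip()]

out_handle = open("sequences.fasta", "w")  # output name is just an example
batch_size = 200
for start in range(0, len(ids), batch_size):
    out_handle.write(fetch_batch(ids[start:start + batch_size]))
    time.sleep(1)  # keep the request rate well below NCBI's limit
out_handle.close()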
Answer 3:
It's a "normal" Entrez API temporary failure, which can occur even if you've applied all the Entrez API rules. The Biopython documentation explains a way to handle it in this section.
Sometimes you will get intermittent errors from Entrez (HTTPError 5XX); we use a try-except-pause-retry block to address this. For example:
# This assumes you have already run a search as shown above,
# and set the variables count, webenv, query_key
try:
    from urllib.error import HTTPError  # for Python 3
except ImportError:
    from urllib2 import HTTPError  # for Python 2

batch_size = 3
out_handle = open("orchid_rpl16.fasta", "w")
for start in range(0, count, batch_size):
    end = min(count, start + batch_size)
    print("Going to download record %i to %i" % (start + 1, end))
    attempt = 0
    while attempt < 3:
        attempt += 1
        try:
            fetch_handle = Entrez.efetch(db="nucleotide",
                                         rettype="fasta", retmode="text",
                                         retstart=start, retmax=batch_size,
                                         webenv=webenv, query_key=query_key,
                                         idtype="acc")
        except HTTPError as err:
            if 500 <= err.code <= 599:
                print("Received error from server %s" % err)
                print("Attempt %i of 3" % attempt)
                time.sleep(15)
            else:
                raise
        else:
            break  # fetch succeeded, no need to retry
    data = fetch_handle.read()
    fetch_handle.close()
    out_handle.write(data)
out_handle.close()
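Note that the snippet above relies on variables (count, webenv, query_key) produced by an earlier history-based search that is not shown in this post. Here is a minimal sketch of how they are typically obtained with Entrez.esearch and usehistory="y"; the orchid search term is just the placeholder query used in the Biopython tutorial.
from Bio import Entrez

Entrez.email = "eigtw59tyjrt403@gmail.com"

# usehistory="y" keeps the result set on the NCBI server so that
# efetch can page through it later via WebEnv/QueryKey
search_handle = Entrez.esearch(db="nucleotide",
                               term="Cypripedioideae[Orgn] AND rpl16[Gene]",
                               usehistory="y")
search_results = Entrez.read(search_handle)
search_handle.close()

count = int(search_results["Count"])
webenv = search_results["WebEnv"]
query_key = search_results["QueryKey"]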
So you don't need to feel guilty about this error; you just have to catch it.
Source: https://stackoverflow.com/questions/14827131/urllib2-httperror-python