Using search terms with Biopython to return accession numbers

Submitted by 无人久伴 on 2019-12-24 17:53:21

Question


I am trying to use Biopython (Entrez) with search terms that will return the accession number (and not the GI*).

Here is a tiny excerpt of my code:

from Bio import Entrez

Entrez.email = 'myemailaddress'
search_phrase = '(Escherichia coli[organism]) AND (complete genome[keyword])'
handle = Entrez.esearch(db='nuccore', term=search_phrase, retmax=100, rettype='acc', retmode='text')
result = Entrez.read(handle)
handle.close()
gi_numbers = result['IdList']
print(gi_numbers)

['745369752', '910228862', '187736741', '802098270', '802098269', '802098267', '387610477', '544579032', '544574430', '215485161', '749295052', '387823261', '387605479', '641687520', '641682562', '594009615', '557270520', '313848522', '309700213', '284919779', '215263233', '544345556', '544340954', '144661', '51773702', '202957457', '202957451', '172051323']

I am sure I can convert from GI to accession, but it would be nice to avoid the additional step. What slice of magic am I missing?

Thank you in advance.

*especially since NCBI is phasing out GI numbers


Answer 1:


Looking through the docs for esearch on NCBI's website, there are only two rettypes available: uilist, the default, which returns the XML you're getting currently (Entrez.read() parses it into a dict), and count, which just displays the Count value. Count is also present in the default output (look at the complete contents of result), but I'm unclear on its exact meaning, as it doesn't match the number of items in IdList...

At any rate, Entrez.esearch() will take any value of rettype and retmode you like, but it only returns the uilist or count in XML or JSON mode - no accession IDs, no nothin'.
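As far as the E-utilities documentation describes it, Count is the total number of records matching the query, while IdList holds only the slice selected by retstart and retmax, so the two can differ. A minimal sketch for comparing them on a parsed result follows. The helper name summarize_esearch and the example values are hypothetical, and the dict only imitates the shape of what Entrez.read() returns:

```python
def summarize_esearch(result):
    """Compare the reported hit count with the number of IDs actually returned."""
    total = int(result["Count"])      # parsed values arrive as strings
    returned = len(result["IdList"])
    return {"total": total, "returned": returned, "truncated": returned < total}


# Hypothetical parsed response, shaped like the output of Entrez.read()
example = {
    "Count": "250",
    "RetMax": "3",
    "RetStart": "0",
    "IdList": ["745369752", "910228862", "187736741"],
}
print(summarize_esearch(example))
# prints {'total': 250, 'returned': 3, 'truncated': True}
```

If truncated comes back True, raising retmax or paging with retstart should bring in the remaining IDs.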

Entrez.efetch() will pass you back all sorts of cool stuff, depending on which DB you're querying. The downside, of course, is that you need to query by one or more IDs, not by a search string, so in order to get your accession IDs you'd need to run two queries:

from Bio import Entrez

Entrez.email = 'myemailaddress'
# Step 1: search by term to get the matching IDs
search_phrase = "(Escherichia coli[organism]) AND (complete genome[keyword])"
handle = Entrez.esearch(db="nuccore", term=search_phrase, retmax=100)
result = Entrez.read(handle)
handle.close()
# Step 2: fetch the accession for each ID, one per line of plain text
fetch_handle = Entrez.efetch(db="nuccore", id=result["IdList"], rettype="acc", retmode="text")
acc_ids = [line.strip() for line in fetch_handle]
fetch_handle.close()
print(acc_ids)

gives

['HF572917.2', 'NZ_HF572917.1', 'NC_010558.1', 'NZ_HG941720.1', 'NZ_HG941719.1', 'NZ_HG941718.1', 'NC_017633.1', 'NC_022371.1', 'NC_022370.1', 'NC_011601.1', 'NZ_HG738867.1', 'NC_012892.2', 'NC_017626.1', 'HG941719.1', 'HG941718.1', 'HG941720.1', 'HG738867.1', 'AM946981.2', 'FN649414.1', 'FN554766.1', 'FM180568.1', 'HG428756.1', 'HG428755.1', 'M37402.1', 'AJ304858.2', 'FM206294.1', 'FM206293.1', 'AM886293.1']
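For longer ID lists, the same search-then-fetch pattern can be wrapped in a small helper that sends the IDs in batches. This is only a sketch under assumptions: the names batched, parse_accessions, fetch_accessions and entrez_fetch are hypothetical, the batch size of 200 is an arbitrary choice rather than an NCBI recommendation, and entrez_fetch assumes the efetch handle yields text lines as in the code above. The fetch step is passed in as a callable so the batching logic can be tried offline:

```python
def batched(ids, size):
    """Yield successive slices of at most `size` IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def parse_accessions(lines):
    """Strip whitespace from each line and drop empty lines."""
    return [line.strip() for line in lines if line.strip()]


def fetch_accessions(ids, fetch, batch_size=200):
    """Collect accessions for `ids`, calling `fetch(batch)` once per batch.

    `fetch` is any callable that returns an iterable of text lines.
    """
    accessions = []
    for batch in batched(list(ids), batch_size):
        accessions.extend(parse_accessions(fetch(batch)))
    return accessions


def entrez_fetch(batch):
    """Fetch one batch through Bio.Entrez (needs network access and Entrez.email)."""
    from Bio import Entrez  # imported lazily so the helpers above run without Biopython
    handle = Entrez.efetch(db="nuccore", id=batch, rettype="acc", retmode="text")
    try:
        return list(handle)
    finally:
        handle.close()


# Offline demonstration with a stand-in fetcher and made-up IDs
def fake_fetch(batch):
    return ["ACC_%s.1\n" % gi for gi in batch] + ["\n"]


print(fetch_accessions(["1", "2", "3"], fake_fetch, batch_size=2))
# prints ['ACC_1.1', 'ACC_2.1', 'ACC_3.1']
```

With a real search result, the call would be fetch_accessions(result["IdList"], entrez_fetch).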

So, I'm not terribly sure if I answered your question satisfactorily, but unfortunately I think the answer is "There is no magic."



Source: https://stackoverflow.com/questions/37059616/using-search-terms-with-biopython-to-return-accession-numbers
