Question
I was wondering if there is a way to get protein sequences from UniProt protein IDs. I checked a few online tools, but they only return one sequence at a time and I have 5536 values. Is there a package in Biopython to do this?
Answer 1:
All sequences from UniProt can be accessed at "http://www.uniprot.org/uniprot/" + UniprotID + ".fasta". You can obtain any sequence with:
import requests as r
from Bio import SeqIO
from io import StringIO
cID='P04637'
baseUrl="http://www.uniprot.org/uniprot/"
currentUrl=baseUrl+cID+".fasta"
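# GET (not POST) is the appropriate method for downloading a resource
response = r.get(currentUrl)
response.raise_for_status()  # fail early on an HTTP error
cData=response.text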
Seq=StringIO(cData)
pSeq=list(SeqIO.parse(Seq,'fasta'))
cID here is a single entry; for a list of IDs, loop over it and repeat the request for each one. If you loop through a big list, add a delay between downloads so as not to saturate the server. Hope it helps.
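The delay between downloads mentioned above could look roughly like the sketch below. It uses only the standard library instead of requests and Biopython, and the names fetch_fasta, parse_fasta and fetch_many, as well as the one-second default delay, are illustrative assumptions rather than part of the original answer. The download function is passed in as a parameter so the loop can be tried without network access.

```python
import time
import urllib.request

BASE_URL = "http://www.uniprot.org/uniprot/"  # base URL from the answer above


def fetch_fasta(uniprot_id):
    """Download the FASTA text for one UniProt ID (needs network access)."""
    with urllib.request.urlopen(BASE_URL + uniprot_id + ".fasta") as response:
        return response.read().decode("utf-8")


def parse_fasta(text):
    """Return (header, sequence) for the first record in a FASTA string."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("not a FASTA record")
    sequence_lines = []
    for line in lines[1:]:
        if line.startswith(">"):  # stop at the next record, if any
            break
        sequence_lines.append(line)
    return lines[0][1:], "".join(sequence_lines)


def fetch_many(ids, fetch=fetch_fasta, delay=1.0):
    """Fetch each ID in turn, sleeping `delay` seconds between requests."""
    sequences = {}
    for position, uniprot_id in enumerate(ids):
        if position > 0:
            time.sleep(delay)  # assumed pause; adjust to the server's policy
        _header, sequence = parse_fasta(fetch(uniprot_id))
        sequences[uniprot_id] = sequence
    return sequences
```

Calling fetch_many(['P04637']) would contact the server; passing your own fetch function lets you check the loop offline first.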
Answer 2:
One of the fastest and easiest ways to fetch many sequences from UniProt in Python is to use the pyfaidx package. It is a simple but well-tested tool, built upon a well-known algorithm from SAMtools. It is also citable in academic publications.
Simply download a FASTA file with all sequences (or just a chosen subset) from https://www.uniprot.org/downloads, unpack the file if needed, install pyfaidx (e.g. with pip install pyfaidx --user or bioconda) and load the sequences with the Fasta constructor:
from pyfaidx import Fasta
sequences = Fasta('uniprot_sprot.fasta')
The first load may take a while, but afterwards all operations will be very fast. Now sequences is a dict-like object, so you can access the entry you need with:
p53 = sequences['sp|P04637|P53_HUMAN']
print(p53)
which shows the sequence:
MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGPDEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKGEPHHELPPGSTKRALPNNTSSSPQPKKKPLDGEYFTLQIRGRERFEMFRELNEALELKDAQAGKEPGGSRAHSSHLKSKKGQSTSRHKKLMFKTEGPDSD
This sequence object is, however, much more than a string: it provides many handy utility functions and attributes (long_name, unpadded_len, slices with start, end, complement() and reverse(), and so on; see the documentation for more).
If you want to access the sequence by the UniprotID instead of the full identifier from fasta file, use:
def extract_id(header):
    return header.split('|')[1]
sequences = Fasta('uniprot_sprot.fasta', key_function=extract_id)
print(sequences['P04637'])
PS. Just one caveat - watch out for 1-based indexing.
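If installing pyfaidx is not an option, the same lookup by accession can be sketched in plain Python. This is not how pyfaidx works internally (it builds an index of the file); the sketch simply streams the downloaded FASTA file once and keeps only the wanted records. The function name select_sequences is hypothetical, and extract_id mirrors the helper above.

```python
def extract_id(header):
    """Take the accession from a header such as 'sp|P04637|P53_HUMAN ...'."""
    return header.split('|')[1]


def select_sequences(fasta_lines, wanted_ids):
    """Return {accession: sequence} for the wanted accessions only."""
    wanted = set(wanted_ids)
    found = {}
    current = None  # accession of the record being read, if it is wanted
    for line in fasta_lines:
        line = line.strip()
        if line.startswith('>'):
            accession = extract_id(line[1:])
            current = accession if accession in wanted else None
            if current is not None:
                found[current] = []
        elif current is not None and line:
            found[current].append(line)
    return {accession: ''.join(parts) for accession, parts in found.items()}
```

Any iterable of lines works as input, so an open file handle for uniprot_sprot.fasta can be passed directly. IDs that do not occur in the file are simply absent from the result, which makes missing entries easy to spot.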
Answer 3:
You can probably iterate over your list of values, calling the required method from the library each time.
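Iterating over the list could be sketched as below. The helper names read_ids and fetch_all are hypothetical, and fetch_one stands for whichever download function you choose. The sketch only shows the loop itself, with failures collected so that one bad ID does not stop the remaining ones.

```python
def read_ids(text):
    """Split raw text into unique IDs, keeping the first-seen order."""
    seen = set()
    ids = []
    for token in text.split():
        token = token.strip(',;').upper()
        if token and token not in seen:
            seen.add(token)
            ids.append(token)
    return ids


def fetch_all(ids, fetch_one):
    """Call fetch_one(uniprot_id) for every ID; return (results, failures)."""
    results = {}
    failures = {}
    for uniprot_id in ids:
        try:
            results[uniprot_id] = fetch_one(uniprot_id)
        except Exception as error:  # broad on purpose: record and move on
            failures[uniprot_id] = str(error)
    return results, failures
```

With several thousand IDs, the failures dictionary gives you a list to retry or inspect by hand once the run has finished.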
Answer 4:
Try the code below to get the protein sequences for all IDs given in the query parameter:
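# Python 3 version (urllib2 was split into urllib.request and urllib.parse)
import urllib.parse
import urllib.request
# Endpoint as given in this answer; UniProt has since reorganised its
# ID-mapping service, so this URL may no longer respond
url = 'https://www.uniprot.org/uploadlists/'
params = {
    'from':'ACC+ID',
    'to':'ACC',
    'format':'txt',
    'query':'P13368 P20806 Q9UM73 P97793 Q17192'
}
# POST data must be bytes in Python 3
data = urllib.parse.urlencode(params).encode('utf-8')
request = urllib.request.Request(url, data)
contact = "" # contact email address, used by the server admins for debugging
request.add_header('User-Agent', 'Python %s' % contact)
response = urllib.request.urlopen(request)
page = response.read().decode('utf-8')
print(page)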
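For several thousand IDs it may be safer to send the query in batches rather than as one long string. The sketch below only builds the space-separated query strings; the function name build_queries and the batch size of 500 are assumptions, and the server's actual limits are not stated in the answer.

```python
def build_queries(ids, batch_size=500):
    """Yield space-separated query strings with at most batch_size IDs each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(ids), batch_size):
        yield ' '.join(ids[start:start + batch_size])
```

Each yielded string can then be used as the 'query' value in params, with one request per batch.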
Answer 5:
You can also get sequences from the SwissProt/UniProt database through the NCBI Entrez server. One way to fetch files from NCBI Entrez and read the sequences is the Python package biotite:
>>> import biotite.database.entrez as entrez
>>> import biotite.sequence as seq
>>> import biotite.sequence.io.fasta as fasta
>>> # Find UIDs for SwissProt/UniProt entries
>>> query = entrez.SimpleQuery("Avidin", "Protein Name") \
... & entrez.SimpleQuery("Gallus gallus", "Organism") \
... & entrez.SimpleQuery("srcdb_swiss-prot", "Properties")
>>> print(query)
((Avidin[Protein Name]) AND ("Gallus gallus"[Organism])) AND (srcdb_swiss-prot[Properties])
>>> uids = entrez.search(query, db_name="protein")
>>> print(uids)
['158515411']
>>> # Download FASTA file containing the sequence(s)
>>> # from NCBI Entrez database
>>> file_name = entrez.fetch_single_file(
... uids, "avidin.fa", db_name="protein", ret_type="fasta"
... )
>>> # Read file
>>> fasta_file = fasta.FastaFile()
>>> fasta_file.read(file_name)
>>> print(fasta_file)
>sp|P02701.3|AVID_CHICK RecName: Full=Avidin; Flags: Precursor
MVHATSPLLLLLLLSLALVAPGLSARKCSLTGKWTNDLGSNMTIGAVNSRGEFTGTYITAVTATSNEIKE
SPLHGTQNTINKRTQPTFGFTVNWKFSESTTVFTGQCFIDRNGKEVLKTMWLLRSSVNDIGDDWKATRVG
INIFTRLRTQKE
>>> # Convert first sequence in file to 'ProteinSequence' object
>>> # (named 'sequence' so the imported 'seq' module is not shadowed)
>>> sequence = fasta.get_sequence(fasta_file)
>>> print(sequence)
MVHATSPLLLLLLLSLALVAPGLSARKCSLTGKWTNDLGSNMTIGAVNSRGEFTGTYITAVTATSNEIKESPLHGTQNTINKRTQPTFGFTVNWKFSESTTVFTGQCFIDRNGKEVLKTMWLLRSSVNDIGDDWKATRVGINIFTRLRTQKE
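The header line in the output above carries the accession together with a version suffix (P02701.3). If you need to match the downloaded records back to your list of UniProt IDs, a small helper along these lines might do. The function name is hypothetical, and the header layout is assumed to be the 'sp|ACCESSION.VERSION|ENTRY_NAME' form shown above.

```python
def parse_entrez_header(header):
    """Split 'sp|P02701.3|AVID_CHICK ...' into (accession, version, name)."""
    fields = header.lstrip('>').split('|')
    if len(fields) < 3:
        raise ValueError("unexpected header layout: %r" % header)
    # partition keeps working when there is no '.VERSION' suffix
    accession, _, version = fields[1].partition('.')
    entry_name = fields[2].split()[0]
    return accession, (int(version) if version else None), entry_name
```

The version is returned as None when the header has no suffix, so the same helper also accepts headers in the plain UniProt style.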
Source: https://stackoverflow.com/questions/52569622/protein-sequence-from-uniprot-protein-id-python