Protein sequence from uniprot protein id python

问题

I was wondering if there is way to get the sequence of proteins from uniprot protein ids. I did check few online softwares but they allow to get one sequence at a time but I have 5536 vlues. Is there any package in biopython to do this?

回答1:

All the sequences from uniprot can be accesed from "http://www.uniprot.org/uniprot/" + UniprotID +.fasta. You can obtain any sequence with

import requests as r
from Bio import SeqIO
from io import StringIO

cID='P04637'

baseUrl="http://www.uniprot.org/uniprot/"
currentUrl=baseUrl+cID+".fasta"
response = r.post(currentUrl)
cData=''.join(response.text)

Seq=StringIO(cData)
pSeq=list(SeqIO.parse(Seq,'fasta'))

cID can be a list or a single entry, if you loop trough a bug list just add a delay between downloads, trying not to saturate the server. Hope it helps

回答2:

One of the fastest and easiest way to fetch many sequences from UniProt in Python is to use pyfaidx package. It is a simple, but well-tested tool, build upon well-known algorithm from SAMtools. It is also citable for academic publications.

Simply download fasta file with all sequences (or just a chosen subset) from https://www.uniprot.org/downloads, unpack the file if needed, install pyfadix (e.g. with pip install pyfaidx --user or bioconda) and load the sequences with Fasta constructor:

from pyfaidx import Fasta
sequences = Fasta('uniprot_sprot.fasta')

The first loading may take a while, but afterwards all operations will be very fast. Now sequences is a dict-like object, so you can access the entry you need with:

p53 = sequences['sp|P04637|P53_HUMAN']
print(p53)

which shows the sequence:

MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGPDEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKGEPHHELPPGSTKRALPNNTSSSPQPKKKPLDGEYFTLQIRGRERFEMFRELNEALELKDAQAGKEPGGSRAHSSHLKSKKGQSTSRHKKLMFKTEGPDSD

This sequence object is however much more than a string - it provides many handy utility functions and attributes (long_name, unpadded_len, slices with with: start, end, complement() and reverse() and so on - see the documentation for more).

If you want to access the sequence by the UniprotID instead of the full identifier from fasta file, use:

def extract_id(header):
    return header.split('|')[1]

sequences = Fasta('uniprot_sprot.fasta', key_function=extract_id)
print(sequences['P04637'])

PS. Just one caveat - watch out for 1-based indexing.

回答3:

You can probably iterate over your list of values, calling the required method from the library each time.

回答4:

Try the below code to get all the protein sequence provided in the query parameter

import urllib,urllib2
url = 'https://www.uniprot.org/uploadlists/'
params = {
    'from':'ACC+ID',
    'to':'ACC',
    'format':'txt',
    'query':'P13368 P20806 Q9UM73 P97793 Q17192'
}
data = urllib.urlencode(params)
request = urllib2.Request(url, data)
contact = "" # contact email address debug
request.add_header('User-Agent', 'Python %s' % contact)
response = urllib2.urlopen(request)
page = response.read()
print page

回答5:

You can get the sequences from the SwissProt/UniProt database also from the NCBI Entrez server. A way to fetch files from NCBI Entrez and read the sequences is the Python package biotite:

>>> import biotite.database.entrez as entrez
>>> import biotite.sequence as seq
>>> import biotite.sequence.io.fasta as fasta
>>> # Find UIDs for SwissProt/UniProt entries
>>> query =   entrez.SimpleQuery("Avidin", "Protein Name") \
...         & entrez.SimpleQuery("Gallus gallus", "Organism") \
...         & entrez.SimpleQuery("srcdb_swiss-prot", "Properties")
>>> print(query)
((Avidin[Protein Name]) AND ("Gallus gallus"[Organism])) AND (srcdb_swiss-prot[Properties])
>>> uids = entrez.search(query, db_name="protein")
>>> print(uids)
['158515411']
>>> # Download FASTA file containing the sequence(s)
>>> # from NCBI Entrez database
>>> file_name = entrez.fetch_single_file(
...     uids, "avidin.fa", db_name="protein", ret_type="fasta"
... )
>>> # Read file
>>> fasta_file = fasta.FastaFile()
>>> fasta_file.read(file_name)
>>> print(fasta_file)
>sp|P02701.3|AVID_CHICK RecName: Full=Avidin; Flags: Precursor
MVHATSPLLLLLLLSLALVAPGLSARKCSLTGKWTNDLGSNMTIGAVNSRGEFTGTYITAVTATSNEIKE
SPLHGTQNTINKRTQPTFGFTVNWKFSESTTVFTGQCFIDRNGKEVLKTMWLLRSSVNDIGDDWKATRVG
INIFTRLRTQKE
>>> # Convert first sequence in file to 'ProteinSequence' object
>>> seq = fasta.get_sequence(fasta_file)
>>> print(seq)
MVHATSPLLLLLLLSLALVAPGLSARKCSLTGKWTNDLGSNMTIGAVNSRGEFTGTYITAVTATSNEIKESPLHGTQNTINKRTQPTFGFTVNWKFSESTTVFTGQCFIDRNGKEVLKTMWLLRSSVNDIGDDWKATRVGINIFTRLRTQKE

来源：https://stackoverflow.com/questions/52569622/protein-sequence-from-uniprot-protein-id-python

标签

python

bioinformatics

biopython