I've inherited a dataset of RNA-seq output data from Canis lupus (dog). The gene identifiers are in Ensembl format; specifically, they look like this: ENSCAFT00000001452.3.
Here is a step-by-step example:
Load the biomaRt library.
library(biomaRt)
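As query input we have Canis lupus familiaris Ensembl transcript IDs (note that they are not Ensembl gene IDs: the ENSCAFT prefix marks transcripts, whereas gene IDs start with ENSCAFG). We also need to strip the trailing dot and digit(s), which give the version number of the stable ID; the ensembl_transcript_id filter used below expects unversioned IDs.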
tx <- c("ENSCAFT00000001452.3", "ENSCAFT00000001656.3")
tx <- gsub("\\.\\d+$", "", tx)
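If the versioned IDs are needed again later (for example as row names of a count matrix), it can help to keep them next to the stripped IDs instead of overwriting them. A minimal sketch in base R, assuming every ID carries a version suffix; `tx_versioned` and `id_map` are hypothetical names, not part of biomaRt:

```r
# Versioned transcript IDs as they appear in the dataset
tx_versioned <- c("ENSCAFT00000001452.3", "ENSCAFT00000001656.3")

# Keep the original ID, the stable ID and the version side by side.
# Assumption: every ID ends in ".<digits>"; an ID without a version
# would give NA (with a warning) in the version column.
id_map <- data.frame(
  tx_versioned = tx_versioned,
  tx           = sub("\\.[0-9]+$", "", tx_versioned),
  version      = as.integer(sub("^.*\\.([0-9]+)$", "\\1", tx_versioned)),
  stringsAsFactors = FALSE
)

id_map
```

The `tx` column can then be passed as `values` in the query, while `tx_versioned` stays available for joining back to the expression data.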
We now query the database for the Ensembl transcript IDs in tx:
ensembl <- useEnsembl(biomart = "ensembl", dataset = "cfamiliaris_gene_ensembl")
res <- getBM(
    attributes = c("ensembl_gene_id", "ensembl_transcript_id", "external_gene_name", "description"),
    filters = "ensembl_transcript_id",
    values = tx,
    mart = ensembl)
res
#ensembl_gene_id ensembl_transcript_id external_gene_name
#1 ENSCAFG00000000934 ENSCAFT00000001452 COL14A1
#2 ENSCAFG00000001086 ENSCAFT00000001656 MYC
# description
#1 collagen type XIV alpha 1 chain [Source:VGNC Symbol;Acc:VGNC:51768]
#2 MYC proto-oncogene, bHLH transcription factor [Source:VGNC Symbol;Acc:VGNC:43527]
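As far as I know, getBM() does not guarantee that rows come back in the order of the input, and IDs without a match are simply absent from the result. The sketch below joins the annotation back onto the original versioned IDs with a left join so nothing is lost. To keep it runnable offline, `res` is a hand-typed copy of the output above; the third input ID is made up to show the no-match case, and `query` and `annotated` are hypothetical names:

```r
# Hand-typed copy of the getBM() result shown above (description column omitted)
res <- data.frame(
  ensembl_gene_id       = c("ENSCAFG00000000934", "ENSCAFG00000001086"),
  ensembl_transcript_id = c("ENSCAFT00000001452", "ENSCAFT00000001656"),
  external_gene_name    = c("COL14A1", "MYC"),
  stringsAsFactors = FALSE
)

# Input IDs in dataset order; the last one is invented and has no match
tx_versioned <- c("ENSCAFT00000001656.3",
                  "ENSCAFT00000001452.3",
                  "ENSCAFT99999999999.1")
query <- data.frame(
  tx_versioned          = tx_versioned,
  ensembl_transcript_id = sub("\\.[0-9]+$", "", tx_versioned),
  stringsAsFactors = FALSE
)

# Left join: keep every input ID, fill NA where biomaRt returned nothing
annotated <- merge(query, res, by = "ensembl_transcript_id", all.x = TRUE)

# merge() may reorder rows, so restore the original input order explicitly
annotated <- annotated[match(query$tx_versioned, annotated$tx_versioned), ]
rownames(annotated) <- NULL

annotated
```

Rows with NA in `ensembl_gene_id` are then the transcripts that the current Ensembl release no longer knows under that stable ID.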
Note that you can get a data.frame of all attributes for a particular mart with listAttributes(ensembl).
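The attribute table is long, so in practice it is easier to filter it by keyword. A minimal sketch, assuming the table has `name` and `description` columns as returned by listAttributes(); `find_attributes` is a hypothetical helper, and the small `attrs` table is a hand-typed stand-in so the example runs without a network connection:

```r
# Hypothetical helper: keep rows whose name or description matches a pattern
find_attributes <- function(attrs, pattern) {
  hit <- grepl(pattern, attrs$name, ignore.case = TRUE) |
    grepl(pattern, attrs$description, ignore.case = TRUE)
  attrs[hit, , drop = FALSE]
}

# Hand-typed stand-in for a few rows of listAttributes(ensembl)
attrs <- data.frame(
  name        = c("ensembl_gene_id", "ensembl_transcript_id",
                  "external_gene_name", "description"),
  description = c("Gene stable ID", "Transcript stable ID",
                  "Gene name", "Gene description"),
  stringsAsFactors = FALSE
)

find_attributes(attrs, "transcript")

# With a live connection, the same call would be:
# find_attributes(listAttributes(ensembl), "transcript")
```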
In addition to the link @GordonShumway gives in the comment above, another good (and succinct) introduction to biomaRt can be found on the Ensembl website.