Hi, I am using the XML package in R to scrape HTML pages. The page of interest is http://www.ncbi.nlm.nih.gov/protein/225903367?report=fasta, and on that page there is a sequence I need to extract.
@brucezepplin, I feel your frustration. @Mathias Muller, I worked with what you wrote and ran the following:
test <- "http://www.ncbi.nlm.nih.gov/protein/225903367?report=fasta"
doc <- htmlTreeParse(test, asText = TRUE, useInternalNodes = TRUE)
xpathSApply(doc, "//div[@id = 'viewercontent1']", xmlValue)
xpathSApply(doc, "//div[@id = 'viewercontent1']//span[@id = 'gi_225903367_1']", xmlValue)
xpathSApply(doc, "//div[@id = 'viewercontent1']/gi/span", xmlValue)
First, when I looked at "doc" it only showed a couple of header lines, not the full page. The first xpath returned list(), so at least it was functioning; the next two returned NULL. There is a <pre> before the desired span nodes, as well as a >gi. In short, this is not an answer, but perhaps it will make it easier for someone else to provide a solution.
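One thing worth flagging in the snippet above: asText = TRUE tells htmlTreeParse to treat its first argument as the HTML text itself, so the URL string gets parsed as a tiny document, which would explain why "doc" showed only a couple of header lines. A minimal offline sketch of what asText = TRUE actually does (the HTML fragment here is an invented stand-in for the real NCBI page):

```r
library(XML)

# Invented HTML fragment standing in for the real page's structure
html <- '<html><body><div id="viewercontent1">
<pre><span id="gi_225903367_1" class="ff_line">MYSFNTLRLY</span></pre>
</div></body></html>'

# asText = TRUE: the first argument IS the document text, not a URL.
# For a live URL you would fetch the page first (e.g. RCurl::getURL)
# and then parse the returned text the same way.
doc <- htmlTreeParse(html, asText = TRUE, useInternalNodes = TRUE)

# Now the div can be located and its text extracted
xpathSApply(doc, "//div[@id = 'viewercontent1']", xmlValue)
```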
If you go to this URL, ncbi.nlm.nih.gov/protein/225903367?report=fasta, you will see a sequence of letters starting with "MYS", and it's that sequence that I need.
Finally I think I understood what you need. The content you are looking for is in the following span:
<span id="gi_225903367_1" class="ff_line">
MYSFNTLRLYLWETIVFFSLAASKEAEAARSAPKPMSPSDFLDKLMGRTS…
</span>
You find it with an XPath expression like:
"//span[@id = 'gi_225903367_1']"
Note: this is the correct expression to retrieve a span element with the id attribute value "gi_225903367_1". I cannot comment on whether you are applying XPath correctly in your R code.
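That XPath expression can be checked offline against the span shown above; a minimal sketch using the XML package (the surrounding HTML wrapper is invented for the test):

```r
library(XML)

# The span from the answer, wrapped in minimal invented HTML
snippet <- '<html><body>
<span id="gi_225903367_1" class="ff_line">MYSFNTLRLYLWETIVFFSLAASK</span>
</body></html>'

doc <- htmlParse(snippet, asText = TRUE)

# The id predicate selects exactly that span; xmlValue returns its text
xpathSApply(doc, "//span[@id = 'gi_225903367_1']", xmlValue)
```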
This gets the list, although I don't know if it's 100% correct as I don't work with FASTA files. It seems like lapply(dat, cat) might need to be called on the dat result below.
> library(RCurl)
> library(XML)
> url <- getURL("http://www.ncbi.nlm.nih.gov/protein/225903367?report=fasta")
> dat <- readHTMLList(url)
> length(dat)
# [1] 39
> object.size(dat)
# 42704 bytes
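readHTMLList returns a list of character vectors, one per HTML list on the page. A sketch of the lapply(dat, cat) idea mentioned above, using a made-up stand-in for dat (the real call needs network access and returns 39 elements):

```r
# Made-up stand-in for the readHTMLList result
dat <- list(c("Protein", "FASTA"), c("MYSFNTLRLY", "LWETIVFFSL"))

# Print each list's entries one per line, as lapply(dat, cat) roughly does
lapply(dat, function(x) cat(x, sep = "\n"))

# Or flatten everything into a single character vector for inspection
unlist(dat)
```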
The whole list is not very big, so I'd recommend bringing the whole list into R. Then you have all the relevant data, and you don't have to spend the whole day trying to regex an HTML document. It looks like the unexpected symbol might be triggered because you wrote //*, and that * may need escaping, possibly as //[*].
Edit: that error you got was due to double quotation marks nested inside other double quotation marks. In R the XPath should be quoted as "//*[@id='viewercontent1']/pre" (single quotes around the attribute value, inside the double-quoted string).
Yes, XML can be fussy, but it's generally because (1) it's the internet, and (2) the parser expects certain things to be in the HTML code and sometimes they aren't there. My professor wrote both RCurl and XML, and he recommends going to RCurl::getURL to fetch the document when XML::readHTMLTable or any of the other read* functions have trouble.
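The fetch-then-parse pattern described above looks roughly like this (a sketch, assuming network access to the NCBI page; as noted elsewhere in the thread, the sequence itself may still be absent from the static HTML):

```r
library(RCurl)
library(XML)

url <- "http://www.ncbi.nlm.nih.gov/protein/225903367?report=fasta"

# Fetch the raw document with RCurl first, then hand the text to the
# XML parser, instead of letting a read* function do the download itself
txt <- getURL(url)
doc <- htmlParse(txt, asText = TRUE)
```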
The issues you're having with the output are not strange: an empty result is exactly what those functions return when the XPath matches nothing. The problem is that the page is created dynamically using JavaScript, and the sequence is not present in the HTML returned to R.
The CRAN package "rentrez" provides an interface to EUtils, which is the programmatic way to query Entrez:
library(rentrez)
entrez_fetch(db="protein", id="225903367", rettype="fasta")
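entrez_fetch returns the whole FASTA record as a single character string. A minimal sketch of stripping the header line to get the bare sequence, shown on a truncated made-up record rather than the live result:

```r
# Truncated, made-up stand-in for what entrez_fetch returns
fasta <- ">gi|225903367| example FASTA header\nMYSFNTLRLY\nLWETIVFFSL\n"

# Split into lines, drop the ">" header, and join the sequence lines
lines <- strsplit(fasta, "\n")[[1]]
sequence <- paste(lines[-1], collapse = "")
sequence
```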