Hi, I am using the XML package in R to scrape HTML pages. The page of interest is http://www.ncbi.nlm.nih.gov/protein/225903367?report=fasta, and on that page there is a sequence I need to extract.
@brucezepplin, I feel your frustration. @Mathias Muller, I worked with what you wrote and ran the following:
test <- "http://www.ncbi.nlm.nih.gov/protein/225903367?report=fasta"
doc <- htmlTreeParse(test, asText = TRUE, useInternalNodes = TRUE)
xpathSApply(doc, "//div[@id = 'viewercontent1']", xmlValue)
xpathSApply(doc, "//div[@id = 'viewercontent1']//span[@id = 'gi_225903367_1']", xmlValue)
xpathSApply(doc, "//div[@id = 'viewercontent1']/gi/span", xmlValue)
First, when I looked at "doc" it only showed a couple of header lines, not the full page. The first xpath returned list(), so at least it was functioning; the next two returned NULL. There is a <pre> before the desired span nodes, as well as a >gi. In short, this is not an answer, but perhaps it will make it easier for someone else to provide a solution.
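One thing worth flagging in the snippet above: asText = TRUE tells htmlTreeParse to treat its first argument as the HTML text itself, so the URL string gets parsed as a tiny document, which would explain why "doc" showed only a couple of header lines. A minimal offline sketch of what asText = TRUE actually does (the HTML fragment here is an invented stand-in for the real NCBI page):

```r
library(XML)

# Invented HTML fragment standing in for the real page's structure
html <- '<html><body><div id="viewercontent1">
<pre><span id="gi_225903367_1" class="ff_line">MYSFNTLRLY</span></pre>
</div></body></html>'

# asText = TRUE: the first argument IS the document text, not a URL.
# For a live URL you would fetch the page first (e.g. RCurl::getURL)
# and then parse the returned text the same way.
doc <- htmlTreeParse(html, asText = TRUE, useInternalNodes = TRUE)

# Now the div can be located and its text extracted
xpathSApply(doc, "//div[@id = 'viewercontent1']", xmlValue)
```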
If you go to this URL, ncbi.nlm.nih.gov/protein/225903367?report=fasta, you will see a sequence of letters starting with "MYS", and it's that sequence that I need.
Finally I think I understood what you need. The content you are looking for is in the following span:
<span id="gi_225903367_1" class="ff_line">
MYSFNTLRLYLWETIVFFSLAASKEAEAARSAPKPMSPSDFLDKLMGRTS…
</span>
You find it with an XPath expression like:
"//span[@id = 'gi_225903367_1']"
Note: this is the correct expression to retrieve a span element with the id attribute value "gi_225903367_1". I cannot comment on whether you are applying XPath correctly in your R code.
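That XPath expression can be checked offline against the span shown above; a minimal sketch using the XML package (the surrounding HTML wrapper is invented for the test):

```r
library(XML)

# The span from the answer, wrapped in minimal invented HTML
snippet <- '<html><body>
<span id="gi_225903367_1" class="ff_line">MYSFNTLRLYLWETIVFFSLAASK</span>
</body></html>'

doc <- htmlParse(snippet, asText = TRUE)

# The id predicate selects exactly that span; xmlValue returns its text
xpathSApply(doc, "//span[@id = 'gi_225903367_1']", xmlValue)
```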
This gets the list, although I don't know if it's 100% correct as I don't work with FASTA files. It seems like lapply(dat, cat) might need to be called on the dat result below.
> library(RCurl)
> library(XML)
> url <- getURL("http://www.ncbi.nlm.nih.gov/protein/225903367?report=fasta")
> dat <- readHTMLList(url)
> length(dat)
# [1] 39
> object.size(dat)
# 42704 bytes
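readHTMLList returns a list of character vectors, one per HTML list on the page. A sketch of the lapply(dat, cat) idea mentioned above, using a made-up stand-in for dat (the real call needs network access and returns 39 elements):

```r
# Made-up stand-in for the readHTMLList result
dat <- list(c("Protein", "FASTA"), c("MYSFNTLRLY", "LWETIVFFSL"))

# Print each list's entries one per line, as lapply(dat, cat) roughly does
lapply(dat, function(x) cat(x, sep = "\n"))

# Or flatten everything into a single character vector for inspection
unlist(dat)
```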
The whole list is not very big, so I'd recommend bringing the whole list into R. Then you have all the relevant data, and you don't have to spend the whole day trying to regex an HTML document. It looks like the unexpected symbol might be triggered because you wrote //*, and that * may need escaping, possibly as //[*].
Edit: that error you got was due to double quotation marks nested inside other double quotation marks. In R the XPath should be quoted as "//*[@id='viewercontent1']/pre" (single quotes around the attribute value, inside the double-quoted string).
Yes, XML can be fussy, but it's generally because (1) it's the internet, and (2) the parser expects certain things to be in the HTML code and sometimes they aren't there. My professor wrote both RCurl and XML, and he recommends going to RCurl::getURL to fetch the document when XML::readHTMLTable or any of the other read* functions have trouble.
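The fetch-then-parse pattern described above looks roughly like this (a sketch, assuming network access to the NCBI page; as noted elsewhere in the thread, the sequence itself may still be absent from the static HTML):

```r
library(RCurl)
library(XML)

url <- "http://www.ncbi.nlm.nih.gov/protein/225903367?report=fasta"

# Fetch the raw document with RCurl first, then hand the text to the
# XML parser, instead of letting a read* function do the download itself
txt <- getURL(url)
doc <- htmlParse(txt, asText = TRUE)
```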
The issues you're having with the output are not strange: an empty result is exactly what those functions return when the XPath matches nothing. The problem is that the page is created dynamically using JavaScript, and the sequence is not present in the HTML returned to R.
The CRAN package "rentrez" provides an interface to EUtils, which is the programmatic way to query Entrez:
library(rentrez)
entrez_fetch(db="protein", id="225903367", rettype="fasta")
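entrez_fetch returns the whole FASTA record as a single character string. A minimal sketch of stripping the header line to get the bare sequence, shown on a truncated made-up record rather than the live result:

```r
# Truncated, made-up stand-in for what entrez_fetch returns
fasta <- ">gi|225903367| example FASTA header\nMYSFNTLRLY\nLWETIVFFSL\n"

# Split into lines, drop the ">" header, and join the sequence lines
lines <- strsplit(fasta, "\n")[[1]]
sequence <- paste(lines[-1], collapse = "")
sequence
```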