R not accepting xpath query

不知归路 2021-01-28 12:51

Hi, I am using the XML package in R to scrape HTML pages. The page of interest is http://www.ncbi.nlm.nih.gov/protein/225903367?report=fasta and on that page there is a sequence

4 Answers
  • 2021-01-28 13:31

    @brucezepplin, I feel your frustration. @Mathias Muller, I worked with what you wrote and ran the following:

    library(XML)

    test <- "http://www.ncbi.nlm.nih.gov/protein/225903367?report=fasta"
    # Note: asText = TRUE makes the parser treat `test` as markup, not as a URL to fetch.
    doc <- htmlTreeParse(test, asText = TRUE, useInternalNodes = TRUE)
    xpathSApply(doc, "//div[@id = 'viewercontent1']", xmlValue)
    xpathSApply(doc, "//div[@id = 'viewercontent1']//span[@id = 'gi_225903367_1']", xmlValue)
    xpathSApply(doc, "//div[@id = 'viewercontent1']/gi/span", xmlValue)
    

    First, when I looked at "doc", it showed only a couple of header lines, not the full page.

    The first XPath query returned list(), so at least it was functioning; the next two returned NULL. In the page source there is a <pre> before the desired span nodes, as well as a >gi.

    In short, this is not an answer but perhaps will make it easier for someone else to provide a solution.
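
    One possible explanation for the nearly empty "doc", offered as a sketch rather than a confirmed diagnosis: with asText = TRUE the XML package treats its first argument as the markup itself instead of a file name or URL, so the URL string is what gets parsed. The snippet below uses a placeholder URL (example.org is an assumption, not the real page) and needs no network access.

```r
library(XML)

# Placeholder URL, used only as a string; nothing is downloaded.
u <- "http://example.org/some/page"

# asText = TRUE: the string itself is parsed as HTML content.
doc <- htmlTreeParse(u, asText = TRUE, useInternalNodes = TRUE)

# The parsed "page" contains little more than the URL text.
page_text <- xmlValue(xmlRoot(doc))
print(page_text)

# No div exists in that tiny document, so the query matches nothing.
hits <- xpathSApply(doc, "//div[@id = 'viewercontent1']", xmlValue)
print(length(hits))
```

    Dropping asText = TRUE, or fetching the page text first and passing that in, should let the parser see the real page, although the sequence may still be absent if the site fills it in with JavaScript.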

  • 2021-01-28 13:39

    If you go to this URL ncbi.nlm.nih.gov/protein/225903367?report=fasta you will see a sequence of letters starting with "MYS" and it's that sequence that I need.

    I think I finally understand what you need. The content you are looking for is in the following span:

    <span id="gi_225903367_1" class="ff_line">
        MYSFNTLRLYLWETIVFFSLAASKEAEAARSAPKPMSPSDFLDKLMGRTS…
    </span>
    

    You find it with an XPath expression like:

    "//span[@id = 'gi_225903367_1']"
    

    Note: This is the correct expression to retrieve a span element with the id attribute value "gi_225903367_1". I cannot comment on whether you are applying XPath correctly in your R code.
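
    For what it's worth, here is a minimal sketch of how that expression could be applied with the XML package in R. The HTML string is a shortened stand-in modelled on the span above (the surrounding div and pre are assumptions, and the sequence is cut short), so it only checks the expression, not the live page.

```r
library(XML)

# Stand-in markup modelled on the span shown above; not the real page.
html <- paste0(
  "<html><body><div id='viewercontent1'><pre>",
  "<span id='gi_225903367_1' class='ff_line'>MYSFNTLRLYLWETIVFFSL</span>",
  "</pre></div></body></html>"
)

# Parse the string as HTML content (asText = TRUE is correct here,
# because `html` really is markup and not a URL).
doc <- htmlParse(html, asText = TRUE)

# Apply the XPath expression and extract the text of the matching span.
seq_text <- xpathSApply(doc, "//span[@id = 'gi_225903367_1']", xmlValue)
print(seq_text)
```

    If the same query returns nothing on the downloaded page, the expression is probably not at fault; the span may simply be missing from the HTML that R receives.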

  • 2021-01-28 13:43

    This gets the list, although I don't know if it's 100% correct, as I don't work with FASTA files. It seems like lapply(dat, cat) might need to be called on the dat result below.

    > library(RCurl)
    > library(XML)
    > url <- getURL("http://www.ncbi.nlm.nih.gov/protein/225903367?report=fasta")
    > dat <- readHTMLList(url)
    > length(dat)
    # [1] 39
    > object.size(dat)
    # 42704 bytes
    

    The whole list is not very big, so I'd recommend bringing the whole list into R. Then you have all the relevant data, and you don't have to spend the whole day trying to regex an HTML document. At first I thought the unexpected symbol might be triggered by the * in //*, but that is not the cause (see the edit below).

    Edit: the error you got was due to double quotation marks inside other double quotation marks. In R the expression should be quoted as "//*[@id='viewercontent1']/pre", with single quotes around the attribute value.
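
    To illustrate the quoting point with base R only (a sketch; the variable names are made up): single quotes inside a double-quoted string are fine, while unescaped double quotes end the string early and leave R with code it cannot parse.

```r
# Works: single quotes nested inside double quotes.
xp_ok <- "//*[@id='viewercontent1']/pre"

# Also works: inner double quotes escaped with backslashes.
xp_escaped <- "//*[@id=\"viewercontent1\"]/pre"

# Fails to parse: the inner double quotes close the string too early,
# so R sees a bare symbol right after a string.
bad_code <- 'xp_bad <- "//*[@id="viewercontent1"]/pre"'
result <- tryCatch(parse(text = bad_code), error = function(e) "parse error")
print(result)
```

    Either of the first two forms can be passed to xpathSApply as the query string.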

    Yes, XML can be fussy, but it's generally because (1) it's the internet, and (2) the parser expects certain things to be in the HTML code and sometimes they are not. My professor wrote both RCurl and XML, and he recommends fetching the document with RCurl::getURL when XML::readHTMLTable or any of the other read* functions have trouble.

    These issues you're having with the output are not strange. They are an empty result, which is as expected from the functions that assign attributes.

  • 2021-01-28 13:53

    The problem is that the page is created dynamically using JavaScript, and the sequence is not present in the HTML returned to R.

    The CRAN package "rentrez" provides an interface to eutils, which is the programmatic way to query Entrez:

    library(rentrez)
    entrez_fetch(db="protein", id="225903367", rettype="fasta")
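
    As far as I know, entrez_fetch returns the record as a single character string. A rough base-R sketch for splitting such a string into header and sequence follows; the FASTA text here is a made-up toy record, not the real output, and parse_fasta is a hypothetical helper.

```r
# Toy FASTA text standing in for the string entrez_fetch would return.
fasta_text <- ">toy_id hypothetical protein\nMYSFNT\nLRLYLW\n\n"

# Hypothetical helper: split a single-record FASTA string into its parts.
parse_fasta <- function(txt) {
  lines <- strsplit(txt, "\n", fixed = TRUE)[[1]]
  lines <- lines[nzchar(lines)]                 # drop empty lines
  list(
    header   = sub("^>", "", lines[1]),         # first line without ">"
    sequence = paste(lines[-1], collapse = "")  # remaining lines joined
  )
}

rec <- parse_fasta(fasta_text)
print(rec$header)
print(rec$sequence)
```

    This assumes one record per string; a multi-record result would need to be split on the ">" lines first.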
    