Question
I want to extract some data from the GEO website. The URL is http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM410750, and I want to get the "disease state" of the patient. I used the command
readLines("http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM410750")
to import the HTML; the information I need is on line 288. Could someone help me? Thank you very much. I will appreciate it.
Answer 1:
Usually when questions like this are asked, some effort needs to be shown, so next time please state the exact problem along with what you have already attempted. To get you started, here is an example using the XML package, applying XPath together with strsplit to grab the desired result.
library(XML)
doc <- htmlParse("http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM410750")
x <- xpathSApply(doc,
                 "//td[@style='text-align: justify']/text()[preceding-sibling::br][1]",
                 function(X) strsplit(xmlValue(X), ": ")[[1]][2])
x
# [1] "Uninfected"
Answer 2:
It might be worth the effort to take a look at the CRAN Task View: Web Technologies and Services. There is a host of packages there that let you read data from web pages in ways far superior to readLines. To be truly successful at scraping data from the web, you really need to be familiar with things like web sessions and XPath or CSS selectors.
Even with that knowledge, you often still need regular expressions to extract the data you want, since many web pages have truly horrible formatting when you try to use them as a data source.
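As a minimal sketch of that regex approach, you could stay with the asker's original readLines() call and pull the value out with grep and sub. This assumes the page still contains a line with the literal text "disease state: "; matching on that phrase is safer than relying on a fixed line number like 288, which will break whenever NCBI changes the page.

```r
# Fetch the raw HTML, find the line mentioning "disease state",
# and extract the word following the colon with a back-reference.
lines <- readLines("http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM410750")
hit   <- grep("disease state:", lines, value = TRUE)[1]
state <- sub(".*disease state: ([[:alnum:]]+).*", "\\1", hit)
state
```

This is fragile in the same way any line-oriented regex scraping is; the packages below handle the HTML structure properly.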
There's a newer package (not mentioned on that page, since it's not on CRAN yet) called rvest that combines several of those packages into one and makes work like this much easier than before. You can use it together with the stringr package to get the data you need. The following code is potentially quite fragile for your use case (it depends heavily on how the <td> containing the disease state is formatted). If you can work through what the XPath is doing and what the regex is extracting, you should be able to tailor it to your needs. Note that it also uses magrittr-style piping.
library(rvest)
library(stringr)
pg <- html("http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM410750")

disease <- pg %>%
  html_nodes(xpath = "//td[text()[contains(., 'disease state')]]/br[1]/following-sibling::text()[1]") %>%
  html_text()

state <- str_match(disease, "disease state: ([[:alnum:]]+)")[, 2]
state
## [1] "Uninfected"
Source: https://stackoverflow.com/questions/26209043/how-can-i-extract-data-from-html-file-using-r