Question
Although the ChemSpider SOAP web API (e.g. accessed from R via the SSOAP package) allows one to retrieve the chemical structure of a given compound, it does not allow one to retrieve experimentally measured physicochemical properties such as boiling points, nor the listed synonyms.
For example, http://www.chemspider.com/Chemical-Structure.733.html gives a list of Synonyms and of Experimental data under Properties (you may have to register first to see this info), which I would like to retrieve in R.
I got some way by doing
library(httr)
library(XML)
csid <- "733"  # ChemSpider ID of glycerin
url  <- paste0("http://www.chemspider.com/Chemical-Structure.", csid, ".html")
webp <- GET(url)
doc  <- htmlParse(content(webp, as = "text"), encoding = "UTF-8")  # htmlParse needs the page text, not the response object
but then I would like to retrieve and parse the sections with chemical properties that follow
<div class="tab-content" id="epiTab"> and
<div class="tab-content" id="acdLabsTab">
and also fetch all the synonyms given in each
<p class="syn" xmlns:cs="http://www.chemspider.com" xmlns:msxsl="urn:schemas-microsoft-com:xslt">
What would be the most elegant way of doing this, e.g. using xpathSApply (as opposed to a simple strsplit / gsub job)?
cheers, Tom
Answer 1:
Web scraping is always fraught. For one thing, you have no guarantee that the provider will not change their formatting at some point in the future. For another, the current formats are anything but standardized. Avoiding this was the whole point of SOAP and XML web services.
Having said all that, this should get you started:
library(XML)
# load and parse the document
csid <- "733" # chemspider ID of glycerin
url <- paste0("http://www.chemspider.com/Chemical-Structure.",csid,".html")
doc <- htmlTreeParse(url,useInternal=T)
The data in the epi tab are actually in a text block (e.g. <pre>...</pre>), so the best we can do with XPath is to grab that text. From there you still need some kind of regex solution to parse out the parameters. The example below deals with MP, BP, and VP.
# parse epiTab
epiTab <- xmlValue(getNodeSet(doc, '//div[@id="epiTab"]/pre')[[1]])
epiTab <- unlist(strsplit(epiTab, "\n"))
params <- c(MP = "Melting Pt (deg C):",
            BP = "Boiling Pt (deg C):",
            VP = "VP(mm Hg,25 deg C):")
prop <- sapply(params, function(x) {
  z <- epiTab[grep(x, epiTab, fixed = TRUE)]
  r <- unlist(regexpr(": \\d+\\.*\\d+E*\\+*\\-*\\d*", z))
  return(as.numeric(substr(z, r + 3, r + attr(r, "match.length") - 1)))
})
prop
# MP BP VP
# 1.9440e+01 2.3065e+02 7.9800e-05
The data in the acdLabs tab are actually in an HTML table, so we can navigate to the appropriate node and use readHTMLTable(...) to put that into a data frame. The data frame still needs some tweaking though.
# parse acdLabsTab
acdLabsTab <- getNodeSet(doc,'//div[@id="acdLabsTab"]/div/div')[[1]]
acdLabs <- readHTMLTable(acdLabsTab)
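What that tweaking looks like depends on what readHTMLTable(...) actually returns here; the following is only a sketch that assumes a two-column property/value table, and the column names are made up for illustration:

# sketch only: tidy up the acdLabs result (assumes a two-column property/value layout)
if (is.list(acdLabs) && !is.data.frame(acdLabs)) acdLabs <- acdLabs[[1]]   # take the first table if a list is returned
colnames(acdLabs) <- c("Property", "Value")                                # hypothetical column names
acdLabs[] <- lapply(acdLabs, function(col) trimws(as.character(col)))      # drop stray whitespace
head(acdLabs)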
Finally, the synonyms tab is a real nightmare. There is a baseline set of synonyms, and also a "more..." link which exposes an additional (more obscure) set. The code below just grabs the baseline set.
# synonyms tab
synNodes <- getNodeSet(doc,'//div[@id="synonymsTab"]/div/div/div/p[@class="syn"]')
synonyms <- sapply(synNodes,function(x)xmlValue(getNodeSet(x,"./strong")[[1]]))
synonyms
# [1] "1,2,3-Propanetriol" "Bulbold" "Cristal" "Glicerol" "Glyceol" "Glycerin" "Glycerin"
# [8] "glycerine" "glycerol" "Glycérol"
Answer 2:
Instead of parsing the ChemSpider web page, it's much better and easier to use the REST API: http://parts.chemspider.com/JSON.ashx
So, in order to get the list of synonyms and the predicted and experimental properties for the compound with ID 733, request: http://parts.chemspider.com/JSON.ashx?op=GetRecordsAsCompounds&csids[0]=733&serfilter=Compound[PredictedProperties|ExperimentalProperties|Synonyms]
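From R that request is just a GET plus a JSON parse. A minimal sketch using httr and jsonlite (jsonlite is my choice here, and the shape of the returned JSON is not documented, so inspect the result before relying on any particular field):

library(httr)
library(jsonlite)

# query the JSON endpoint for compound 733; httr URL-encodes the query parameters
resp <- GET("http://parts.chemspider.com/JSON.ashx",
            query = list(op         = "GetRecordsAsCompounds",
                         `csids[0]` = "733",
                         serfilter  = "Compound[PredictedProperties|ExperimentalProperties|Synonyms]"))
stop_for_status(resp)                        # fail loudly on HTTP errors
rec <- fromJSON(content(resp, as = "text"))  # parse the JSON body into R lists/data frames
str(rec, max.level = 2)                      # inspect the structure before extracting fields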
Source: https://stackoverflow.com/questions/21713278/scraping-experimentally-measured-physicochemical-properties-and-synonyms-from-ch