I'm working with XML files from clinicaltrials.gov, which have a structure like this:
...
...
Here is an example:
ns <- getNodeSet(xml, '//clinical_results/outcome_list/outcome/analysis_list/analysis/method')
element_cnt <- length(ns)
strings <- paste(sapply(ns, xmlValue), collapse = "|")
This code puts the subset of nodes that correspond to <location> in a clinical trial record into a data frame:
library(XML)
clinicalTrialUrl <- "http://clinicaltrials.gov/ct2/show/NCT01480479?resultsxml=true"
xmlDoc <- xmlParse(clinicalTrialUrl, useInternalNodes = TRUE)
locations <- xmlToDataFrame(getNodeSet(xmlDoc,"//location"))
In this case there are 221 locations. However, xmlToDataFrame assumes a flat structure and lumps subnodes together: for example, everything under <facility> gets concatenated into a single string. I can go into the subnodes and put them into a data frame one by one.
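Going into the subnodes one by one could look roughly like the sketch below. It runs against a small inline document rather than the real record, and the element names (facility/name, facility/address/city, status) and the helper first_value are assumptions made for illustration, so adjust the paths to the actual file.

```r
library(XML)

# Small inline document standing in for the real trial record.
# The element names here are assumptions, not taken from the original post.
xml_text <- "<clinical_study>
  <location>
    <facility>
      <name>Site A</name>
      <address><city>Boston</city><country>United States</country></address>
    </facility>
    <status>Recruiting</status>
  </location>
  <location>
    <facility>
      <name>Site B</name>
      <address><city>Lyon</city><country>France</country></address>
    </facility>
  </location>
</clinical_study>"

doc <- xmlParse(xml_text, asText = TRUE)
location_nodes <- getNodeSet(doc, "//location")

# Hypothetical helper: text of the first node matching a relative XPath,
# or NA when the location has no such subnode.
first_value <- function(node, path) {
  hits <- xpathSApply(node, path, xmlValue)
  if (length(hits) == 0) NA_character_ else hits[[1]]
}

# One column per subnode, one row per <location>.
locations <- data.frame(
  name    = sapply(location_nodes, first_value, path = "./facility/name"),
  city    = sapply(location_nodes, first_value, path = "./facility/address/city"),
  country = sapply(location_nodes, first_value, path = "./facility/address/country"),
  status  = sapply(location_nodes, first_value, path = "./status"),
  stringsAsFactors = FALSE
)
```

Evaluating each path relative to its own <location> node keeps the rows aligned even when a location is missing a field, which a single document-wide query per column would not guarantee.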
Why not use xpathSApply again to retrieve the locations, as you already did for the titles?
xpathSApply(xml_doc, "//clinical_study/location", xmlValue)
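As a rough illustration of what that call returns, here is a sketch on a toy document (the element names under <location> are assumed, not taken from the real record). xmlValue on a whole <location> node still runs all descendant text together, whereas pointing the XPath at a leaf node keeps the fields apart.

```r
library(XML)

# Toy stand-in for a trial record; the element names are assumptions.
toy <- xmlParse(
  paste0(
    "<clinical_study>",
    "<location><facility><name>Site A</name>",
    "<address><city>Boston</city></address></facility></location>",
    "<location><facility><name>Site B</name>",
    "<address><city>Lyon</city></address></facility></location>",
    "</clinical_study>"
  ),
  asText = TRUE
)

# One string per <location>: all descendant text is concatenated.
whole <- xpathSApply(toy, "//clinical_study/location", xmlValue)

# One string per leaf node: a narrower path returns just that field.
cities <- xpathSApply(toy, "//clinical_study/location/facility/address/city", xmlValue)
```

With this toy input, whole holds "Site ABoston" and "Site BLyon", while cities holds "Boston" and "Lyon".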