I'm trying to convert an XML file to a data frame, but the format seems to be off. I've looked at different tutorials and, while I've been moderately successful at getting
We provide two approaches to parsing the XML. The first (performing a triple iteration over experiment/sample/test) would likely run faster but the second (using a single loop over the test nodes and at each test node reaching back up through the tree to grab its ancestors) has simpler code.
1) Using Lines, defined in Note 2 at the end, we implement a triple xpathApply/xpathSApply iteration over the experiment/sample/test nodes. e, s and t represent the current experiment, sample and test node, respectively.
library(XML)
doc <- xmlTreeParse(Lines, asText = TRUE, useInternalNodes = TRUE)

# build one data frame per experiment node and stack them with rbind
do.call("rbind", xpathApply(doc, "//experiment", function(e) {
  data.frame(experiment = xmlAttrs(e)[["name"]],
             technician = xmlValue(e[["technician"]]),
             location = xmlValue(e[["location"]]),
             temp = xmlValue(e[["temp"]]),
             runtype = xmlValue(e[["runtype"]]),
             # one row per test, gathered across all samples of this experiment
             t(do.call(cbind, xpathApply(e, "sample", function(s) {
               sample <- xmlAttrs(s)[["id"]]
               xpathSApply(s, "test", function(t) {
                 c(sample = sample,
                   test = xmlAttrs(t)[["name"]],
                   order = xmlAttrs(t)[["order"]],
                   code = xmlValue(t[["code"]]),
                   validuntil = xmlValue(t[["validuntil"]]),
                   # the three meas children are picked by position
                   baseline = xmlValue(t["meas"][[1]]),
                   std = xmlValue(t["meas"][[2]]),
                   data = xmlValue(t["meas"][[3]]),
                   calc = xmlValue(t[["calc"]]),
                   result = xmlValue(t[["result"]])
                 )
               })
             }))),
             date = xmlAttrs(e)[["date"]],
             time = xmlAttrs(e)[["time"]]
  )
}))
giving:
  experiment technician location temp   runtype  sample   test order
1     abc123     "John"     "CO" 21.3 "routine"    2323 laslum     3
2     abc123     "John"     "CO" 21.3 "routine"    2323    atr     1
3     abc123     "John"     "CO" 21.3 "routine" 8979237  absat     2
         code validuntil baseline    std    data   calc    result     date
1   "LL18179"  "2016/08"   0.3248 5.4389  6.5980 1.2131      "OK" 20150731
2 "ATR150607"  "2017/05"   0.0673 4.9721 10.3851 2.0886 "Warning" 20150731
3   "AA09453"  "2016/03"   0.0117 5.6012  1.1431 0.2041    "FAIL" 20150731
    time
1 113322
2 113322
3 113322
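The code above picks the three meas children by position (first, second, third), which relies on every test node listing them in the same order. If that cannot be guaranteed, one alternative is to select each meas by its name attribute with a relative XPath expression. A minimal sketch, assuming a cut-down test node with the meas children deliberately shuffled; meas_value is a hypothetical helper, not part of the XML package:

```r
library(XML)

# A trimmed-down stand-in for one <test> node; the meas order is shuffled on purpose.
xml <- '<test name="laslum" order="3">
  <meas name="std">5.4389</meas>
  <meas name="baseline">0.3248</meas>
  <meas name="data">6.5980</meas>
</test>'
doc <- xmlTreeParse(xml, asText = TRUE, useInternalNodes = TRUE)
t <- xmlRoot(doc)

# meas_value is a hypothetical helper: it returns the text of the <meas>
# child whose name attribute matches, regardless of its position.
meas_value <- function(node, name) {
  xpathSApply(node, sprintf("meas[@name='%s']", name), xmlValue)
}

meas_value(t, "baseline")
meas_value(t, "std")
```

In either approach, a call such as meas_value(t, "baseline") could stand in for xmlValue(t["meas"][[1]]).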
2) This is an alternate approach in which we loop only over the test nodes and then reach upward into the parent and grandparent to get the corresponding sample and experiment info.
library(XML)
doc <- xmlTreeParse(Lines, asText = TRUE, useInternalNodes = TRUE)

do.call("rbind", xpathApply(doc, "//test", function(t) { # t is test node
  s <- xmlParent(t) # s is sample node
  e <- xmlParent(s) # e is experiment node
  data.frame(experiment = xmlAttrs(e)[["name"]],
             technician = xmlValue(e[["technician"]]),
             location = xmlValue(e[["location"]]),
             temp = xmlValue(e[["temp"]]),
             runtype = xmlValue(e[["runtype"]]),
             sample = xmlAttrs(s)[["id"]],
             test = xmlAttrs(t)[["name"]],
             order = xmlAttrs(t)[["order"]],
             code = xmlValue(t[["code"]]),
             validuntil = xmlValue(t[["validuntil"]]),
             baseline = xmlValue(t["meas"][[1]]),
             std = xmlValue(t["meas"][[2]]),
             data = xmlValue(t["meas"][[3]]),
             calc = xmlValue(t[["calc"]]),
             result = xmlValue(t[["result"]]),
             date = xmlAttrs(e)[["date"]],
             time = xmlAttrs(e)[["time"]]
  )
}))
giving:
  experiment technician location temp   runtype  sample   test order
1     abc123     "John"     "CO" 21.3 "routine"    2323 laslum     3
2     abc123     "John"     "CO" 21.3 "routine"    2323    atr     1
3     abc123     "John"     "CO" 21.3 "routine" 8979237  absat     2
         code validuntil baseline    std    data   calc    result     date
1   "LL18179"  "2016/08"   0.3248 5.4389  6.5980 1.2131      "OK" 20150731
2 "ATR150607"  "2017/05"   0.0673 4.9721 10.3851 2.0886 "Warning" 20150731
3   "AA09453"  "2016/03"   0.0117 5.6012  1.1431 0.2041    "FAIL" 20150731
    time
1 113322
2 113322
3 113322
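Both approaches return every column as text (character, or factor under older R defaults), and the double quotes shown around values such as "John" are part of the XML text itself. A minimal sketch of one possible clean-up in base R, using a small hand-made stand-in for a few of the columns above (the data frame df is illustrative only, not the full result):

```r
# A small stand-in for part of the result; every value is a character string.
df <- data.frame(
  technician = c('"John"', '"John"', '"John"'),
  temp       = c("21.3", "21.3", "21.3"),
  sample     = c("2323", "2323", "8979237"),
  baseline   = c("0.3248", "0.0673", "0.0117"),
  result     = c('"OK"', '"Warning"', '"FAIL"'),
  stringsAsFactors = FALSE
)

# Remove the literal double quotes that came from the XML text.
df[] <- lapply(df, function(x) gsub('"', "", x, fixed = TRUE))

# Let R guess a type for each column; as.is = TRUE keeps strings as character.
df[] <- lapply(df, type.convert, as.is = TRUE)

str(df)
```

Whether automatic conversion is appropriate depends on the data; identifiers such as sample may be better left as character.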
Note 1:
As an aside, if you read the input XML file, SEWL.xml, into Excel, it will do a reasonable job of putting it into a tabular format, although some further processing would be needed to get it into precisely the form in the question.
Note 2:
The input Lines as an R object is:
Lines <- '<?xml version="1.0" encoding="UTF-8"?>
<experiment name="abc123" date="20150731" time="113322">
<technician>"John"</technician>
<location>"CO"</location>
<temp scale="celsius">21.3</temp>
<runtype>"routine"</runtype>
<sample id="2323">
<test name="laslum" order="3">
<code>"LL18179"</code>
<validuntil>"2016/08"</validuntil>
<meas name="baseline">0.3248</meas>
<meas name="std">5.4389</meas>
<meas name="data">6.5980</meas>
<calc>1.2131</calc>
<result>"OK"</result>
</test>
<test name="atr" order="1">
<code>"ATR150607"</code>
<validuntil>"2017/05"</validuntil>
<meas name="baseline">0.0673</meas>
<meas name="std">4.9721</meas>
<meas name="data">10.3851</meas>
<calc>2.0886</calc>
<result>"Warning"</result>
</test>
</sample>
<sample id="8979237">
<test name="absat" order="2">
<code>"AA09453"</code>
<validuntil>"2016/03"</validuntil>
<meas name="baseline">0.0117</meas>
<meas name="std">5.6012</meas>
<meas name="data">1.1431</meas>
<calc>0.2041</calc>
<result>"FAIL"</result>
</test>
</sample>
</experiment>'
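If the XML is stored in a file such as SEWL.xml rather than in a string, xmlTreeParse can be given the file path instead, dropping asText = TRUE. A minimal sketch, using a temporary file with a tiny hypothetical document as a stand-in for the real path:

```r
library(XML)

# Write a tiny stand-in document to a temporary file (hypothetical content,
# far smaller than the real input).
path <- tempfile(fileext = ".xml")
writeLines('<experiment name="abc123" date="20150731" time="113322">
  <technician>"John"</technician>
</experiment>', path)

# A path to an existing file is read from disk when asText is not set.
doc <- xmlTreeParse(path, useInternalNodes = TRUE)
e <- xmlRoot(doc)

xmlAttrs(e)[["name"]]
xmlValue(e[["technician"]])
```

With doc created this way, the rest of either approach should run unchanged.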