Transforming data from xml into R dataframe

清酒与你 2020-12-06 08:10

I'm trying to convert an XML file to a data frame, but the format seems to be off. I've looked at different tutorials and, while I've been moderately successful at getting

1 answer
  •  有刺的猬
    2020-12-06 09:05

    We provide two approaches to parsing the XML. The first (a triple iteration over the experiment/sample/test nodes) would likely run faster, but the second (a single loop over the test nodes that, at each test node, reaches back up the tree to grab its ancestors) has simpler code.
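
    The upward navigation behind the second approach can be tried in isolation first. The following is only a minimal sketch, assuming the XML package is installed; the tiny document and its names (e1, s1, t1 and so on) are made up for illustration and are not the question's data.

    ```r
    library(XML)  # assumption: the XML package is installed

    # Hypothetical three-level document: experiment > sample > test
    tiny <- '<experiment name="e1">
      <sample id="s1"><test name="t1"/><test name="t2"/></sample>
      <sample id="s2"><test name="t3"/></sample>
    </experiment>'
    doc <- xmlParse(tiny, asText = TRUE)

    # One data frame row per test node; parent and grandparent supply the rest
    rows <- xpathApply(doc, "//test", function(node) {
      s <- xmlParent(node)  # enclosing sample node
      e <- xmlParent(s)     # enclosing experiment node
      data.frame(experiment = xmlGetAttr(e, "name"),
                 sample     = xmlGetAttr(s, "id"),
                 test       = xmlGetAttr(node, "name"),
                 stringsAsFactors = FALSE)
    })
    res <- do.call(rbind, rows)
    print(res)
    ```

    The result has one row per test node, with the sample and experiment values repeated as needed, which is the same shape the full solutions below produce.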

    1) Using Lines from Note 2 at the end, we implement a triple xpathApply/xpathSApply iteration over the experiment/sample/test nodes. e, s and t represent the current experiment, sample and test node, respectively.

    library(XML)
    doc <- xmlTreeParse(Lines, asText = TRUE, useInternalNodes = TRUE)
    
    do.call("rbind", xpathApply(doc, "//experiment", function(e) {
      data.frame(experiment = xmlAttrs(e)[["name"]],
           technician = xmlValue(e[["technician"]]),
           location = xmlValue(e[["location"]]),
           temp = xmlValue(e[["temp"]]),
           runtype = xmlValue(e[["runtype"]]),
           t(do.call(cbind, xpathApply(e, "sample", function(s) {
                sample <- xmlAttrs(s)[["id"]]
                xpathSApply(s, "test", function(t) {
                       c(sample = sample,
                            test = xmlAttrs(t)[["name"]],
                            order = xmlAttrs(t)[["order"]],
                            code = xmlValue(t[["code"]]),
                            validuntil = xmlValue(t[["validuntil"]]),
                            baseline = xmlValue(t["meas"][[1]]),
                            std = xmlValue(t["meas"][[2]]),
                            data = xmlValue(t["meas"][[3]]),
                            calc = xmlValue(t[["calc"]]),
                            result = xmlValue(t[["result"]])
                 )})}))),
           date = xmlAttrs(e)[["date"]],
           time = xmlAttrs(e)[["time"]]
    )}))
    

    giving:

      experiment technician location temp   runtype  sample   test order
    1     abc123     "John"     "CO" 21.3 "routine"    2323 laslum     3
    2     abc123     "John"     "CO" 21.3 "routine"    2323    atr     1
    3     abc123     "John"     "CO" 21.3 "routine" 8979237  absat     2
             code validuntil baseline    std    data   calc    result     date
    1   "LL18179"  "2016/08"   0.3248 5.4389  6.5980 1.2131      "OK" 20150731
    2 "ATR150607"  "2017/05"   0.0673 4.9721 10.3851 2.0886 "Warning" 20150731
    3   "AA09453"  "2016/03"   0.0117 5.6012  1.1431 0.2041    "FAIL" 20150731
        time
    1 113322
    2 113322
    3 113322
    

    2) This is an alternative approach in which we loop only over the test nodes and then reach upward into the parent and grandparent to get the corresponding sample and experiment info.

    library(XML)
    doc <- xmlTreeParse(Lines, asText = TRUE, useInternalNodes = TRUE)
    
    do.call("rbind", xpathApply(doc, "//test", function(t) { # t is test node
            s <- xmlParent(t) # s is sample node
            e <- xmlParent(s) # e is experiment node
            data.frame(experiment = xmlAttrs(e)[["name"]],
              technician = xmlValue(e[["technician"]]),
              location = xmlValue(e[["location"]]),
              temp = xmlValue(e[["temp"]]),
              runtype = xmlValue(e[["runtype"]]),
              sample = xmlAttrs(s)[["id"]],
              test = xmlAttrs(t)[["name"]],
              order = xmlAttrs(t)[["order"]],
              code = xmlValue(t[["code"]]),
              validuntil = xmlValue(t[["validuntil"]]),
              baseline = xmlValue(t["meas"][[1]]),
              std = xmlValue(t["meas"][[2]]),
              data = xmlValue(t["meas"][[3]]),
              calc = xmlValue(t[["calc"]]),
              result = xmlValue(t[["result"]]),
              date = xmlAttrs(e)[["date"]],
              time = xmlAttrs(e)[["time"]]
           )
    }))
    

    giving:

      experiment technician location temp   runtype  sample   test order
    1     abc123     "John"     "CO" 21.3 "routine"    2323 laslum     3
    2     abc123     "John"     "CO" 21.3 "routine"    2323    atr     1
    3     abc123     "John"     "CO" 21.3 "routine" 8979237  absat     2
             code validuntil baseline    std    data   calc    result     date
    1   "LL18179"  "2016/08"   0.3248 5.4389  6.5980 1.2131      "OK" 20150731
    2 "ATR150607"  "2017/05"   0.0673 4.9721 10.3851 2.0886 "Warning" 20150731
    3   "AA09453"  "2016/03"   0.0117 5.6012  1.1431 0.2041    "FAIL" 20150731
        time
    1 113322
    2 113322
    3 113322
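
    In the printed results the text fields still carry the literal double quotes from the XML, and xmlValue() returns every value as text. A possible clean-up step is sketched below in base R. It is an assumption-laden illustration, not part of the original solution: res is a hand-built stand-in holding a few columns copied from the output above, and the UTC time zone is assumed because the source does not state one.

    ```r
    # Hypothetical stand-in for part of the result; all values stored as text
    res <- data.frame(
      technician = c('"John"', '"John"'),
      temp       = c("21.3", "21.3"),
      baseline   = c("0.3248", "0.0673"),
      result     = c('"OK"', '"Warning"'),
      date       = c("20150731", "20150731"),
      time       = c("113322", "113322"),
      stringsAsFactors = FALSE
    )

    # Build a timestamp first, while date and time are still character strings
    # (tz = "UTC" is an assumption)
    res$datetime <- as.POSIXct(paste(res$date, res$time),
                               format = "%Y%m%d %H%M%S", tz = "UTC")

    # Strip the leading and trailing double quote from each character value
    is_chr <- vapply(res, is.character, logical(1))
    res[is_chr] <- lapply(res[is_chr], function(x) gsub('^"|"$', "", x))

    # Let R pick numeric/integer types; as.is = TRUE keeps text as character
    res[is_chr] <- lapply(res[is_chr], type.convert, as.is = TRUE)

    str(res)
    ```

    The timestamp is built before type.convert() runs because converting a time such as 093322 to an integer would drop its leading zero.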
    

    Note 1:

    As an aside, if you read the input XML file, SEWL.xml, into Excel, it will do a reasonable job of putting it into a tabular format, although some further processing would be needed to get it precisely into the form in the question.

    Note 2:

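    The input Lines as an R object is as follows (tags reconstructed from the element and attribute names used in the code and output above):

    Lines <- '
    <experiment name="abc123" date="20150731" time="113322">
        <technician>"John"</technician>
        <location>"CO"</location>
        <temp>21.3</temp>
        <runtype>"routine"</runtype>
        <sample id="2323">
            <test name="laslum" order="3">
                <code>"LL18179"</code>
                <validuntil>"2016/08"</validuntil>
                <meas>0.3248</meas>
                <meas>5.4389</meas>
                <meas>6.5980</meas>
                <calc>1.2131</calc>
                <result>"OK"</result>
            </test>
            <test name="atr" order="1">
                <code>"ATR150607"</code>
                <validuntil>"2017/05"</validuntil>
                <meas>0.0673</meas>
                <meas>4.9721</meas>
                <meas>10.3851</meas>
                <calc>2.0886</calc>
                <result>"Warning"</result>
            </test>
        </sample>
        <sample id="8979237">
            <test name="absat" order="2">
                <code>"AA09453"</code>
                <validuntil>"2016/03"</validuntil>
                <meas>0.0117</meas>
                <meas>5.6012</meas>
                <meas>1.1431</meas>
                <calc>0.2041</calc>
                <result>"FAIL"</result>
            </test>
        </sample>
    </experiment>'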
    
