How to convert a portion of an XML file into a data frame (properly)?

后悔当初 2021-01-01 03:44

I am trying to extract information from an XML file from ClinicalTrials.gov. The file is organized in the following way:


  ...
  

        
2 Answers
  • 2021-01-01 04:08

    You could flatten the XML first.

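    library(XML)
    
    # Recursively collapse a node into a flat named list of its leaf values;
    # each value is named after the element that directly contains it.
    flatten_xml <- function(x) {
      if (length(xmlChildren(x)) == 0) structure(list(xmlValue(x)), .Names = xmlName(xmlParent(x)))
      else Reduce(append, lapply(xmlChildren(x), flatten_xml))
    }
    
    # xmlDoc is the parsed document (e.g. the result of xmlParse()).
    # Build one single-row data frame per <location> node.
    dfs <- lapply(getNodeSet(xmlDoc, "//location"), function(x) data.frame(flatten_xml(x)))
    # Pad each data frame with the columns it lacks (as NA), then stack them.
    allnames <- unique(c(lapply(dfs, colnames), recursive = TRUE))
    df <- do.call(rbind, lapply(dfs, function(df) { df[, setdiff(allnames, colnames(df))] <- NA; df }))
    head(df)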
    
     #          city      state   zip       country     status          last_name        phone                    email               last_name.1
     # 1  Birmingham    Alabama 35294 United States Recruiting Louis B Nabors, MD 205-934-1813          bnabors@uab.edu        Louis B Nabors, MD
     # 2      Mobile    Alabama 36604 United States Recruiting Melanie Alford, RN 251-445-9649     malford@usouthal.edu    Pamela Francisco, CCRP
     # 3     Phoenix    Arizona 85013 United States Recruiting     Lynn Ashby, MD 602-406-6262           LASHBY@CHW.EDU            Lynn Ashby, MD
     # 4      Tucson    Arizona 85724 United States Recruiting         Jamie Holt 520-626-6800 jholt1@email.arizona.edu Baldassarre Stea, MD, PhD
     # 5 Little Rock   Arkansas 72205 United States Recruiting   Wilma Brooks, RN 501-686-8530       ALEubanks@uams.edu       Amanda Eubanks, APN
     # 6    Berkeley California 94704 United States  Withdrawn               <NA>         <NA>                     <NA>                      <NA>
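
    The `last_name.1` column in the output above most likely comes from `data.frame()` itself: when two leaves share a name, its default `check.names = TRUE` makes the names unique by appending a suffix. A minimal base-R sketch of that behaviour follows; the list is written by hand for illustration (values copied from row 2 of the output) and is only assumed to resemble what `flatten_xml` returns.

    ```r
    # Hypothetical flat list with a repeated leaf name, as flatten_xml() would
    # produce if one <location> held two <last_name> elements (an assumption).
    leaves <- list(
      city = "Mobile",
      last_name = "Melanie Alford, RN",
      last_name = "Pamela Francisco, CCRP"
    )

    # With check.names = TRUE (the default), data.frame() de-duplicates column
    # names, so the second last_name receives a numeric suffix.
    row <- data.frame(leaves)
    print(names(row))
    ```

    If the suffix is not informative enough, renaming the columns after the `rbind` step is one option.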
    
  • 2021-01-01 04:17

    This answer converts the XML to a list, unlists each location section, transposes the section, converts the section to a data.table, and then uses rbindlist to merge all of the individual locations into one table. The fill=T argument matches the elements by name, and fills in missing element values with NA.

    library(XML)
    library(data.table)
    
    clinicalTrialUrl <- "http://clinicaltrials.gov/ct2/show/NCT01480479?resultsxml=true"
    xmlDoc <- xmlParse(clinicalTrialUrl, useInternalNodes = TRUE)
    
    # Convert every node matching `path` into one row of a data.table.
    xmlToDT <- function(doc, path) {
      rbindlist(
        lapply(getNodeSet(doc, path),
               # unlist() flattens the nested list; t() turns it into a one-row matrix
               function(x) data.table(t(unlist(xmlToList(x))))
        ), fill = TRUE)
    }
    
    locationDT <- xmlToDT(xmlDoc, "//location")
    locationDT[1:6]
    ##                                                                       facility.name facility.address.city facility.address.state facility.address.zip
    ## 1:                                                                "HYGEIA" Hospital               Marousi     District of Attica               151 23
    ## 2: Allina Health, Abbott Northwestern Hospital, John Nasseff Neuroscience Institute           Minneapolis              Minnesota                55407
    ## 3:                  Amrita Institute of Medical Sciences and Research Centre, Kochi                 Kochi                 Kerala              682 026
    ## 4:                                                      Anne Arundel Medical Center             Annapolis               Maryland                21401
    ## 5:                                                              Atlanta Cancer Care               Atlanta                Georgia                30005
    ## 6:                                                                    Austin Health            Heidelberg               Victoria                 3084
    ##    facility.address.country
    ## 1:                   Greece
    ## 2:            United States
    ## 3:                    India
    ## 4:            United States
    ## 5:            United States
    ## 6:                Australia
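
    To try the name-based fill without hitting the network, the same steps can be run on a small inline document. This is only a sketch: the XML string is made up for illustration (its element names merely mirror the column names shown above), and it is written without whitespace between tags to keep the parse result simple.

    ```r
    library(XML)
    library(data.table)

    # Made-up sample: the second <location> has no <zip> and no <status>.
    xml_text <- paste0(
      "<study>",
      "<location><facility><name>Site A</name>",
      "<address><city>Birmingham</city><zip>35294</zip></address></facility>",
      "<status>Recruiting</status></location>",
      "<location><facility><name>Site B</name>",
      "<address><city>Berkeley</city></address></facility></location>",
      "</study>"
    )
    doc <- xmlParse(xml_text, asText = TRUE)

    # One single-row data.table per <location>; unlist() joins nested names
    # with dots, e.g. facility.address.city.
    rows <- lapply(getNodeSet(doc, "//location"), function(node) {
      data.table(t(unlist(xmlToList(node))))
    })

    # fill = TRUE aligns columns by name and inserts NA where a row lacks one.
    dt <- rbindlist(rows, fill = TRUE)
    print(dt)
    ```

    Every column comes back as character, so numeric fields would still need an explicit conversion afterwards.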
    