Parsing large XML to dataframe in R


Question


I have large XML files that I want to turn into dataframes for further processing within R and other programs. This is all being done in macOS.

Each monthly XML is around 1 GB, has roughly 150k records, and contains 191 different variables. In the end I might not need all 191 variables, but I'd like to keep them and decide later.

The XML files can be accessed here (scroll to the bottom for the monthly zips; once uncompressed, look at the "dming" XMLs).

I've made some progress, but processing the larger files takes too long (see below).

The XML looks like this:

<ROOT>
 <ROWSET_DUASDIA>
  <ROW_DUASDIA NUM="1">
   <variable1>value</variable1>
   ...
   <variable191>value</variable191>
  </ROW_DUASDIA>
  ...
  <ROW_DUASDIA NUM="150236">
   <variable1>value</variable1>
   ...
   <variable191>value</variable191>
  </ROW_DUASDIA>
 </ROWSET_DUASDIA>
</ROOT>

I hope that's clear enough. This is my first time working with an XML.

I've looked at many answers here and in fact managed to get the data into a dataframe using a smaller sample (a daily XML instead of the monthly ones) and xml2. Here's what I did:

library(xml2) 

raw <- read_xml(filename)

# Find all records
dua <- xml_find_all(raw,"//ROW_DUASDIA")

# Create empty dataframe
dualen <- length(dua)
varlen <- length(xml_children(dua[[1]]))
df <- data.frame(matrix(NA,nrow=dualen,ncol=varlen))

# For loop to enter the data for each record in each row
for (j in 1:dualen) {
  df[j, ] <- xml_text(xml_children(dua[[j]]),trim=TRUE)
}

# Name columns
colnames(df) <- c(names(as_list(dua[[1]])))

I imagine that's fairly rudimentary but I'm also pretty new to R.

Anyway, this works fine with daily data (4-5k records), but it's probably too inefficient for 150k records; in fact, I waited a couple of hours and it hadn't finished. Granted, I would only need to run this code once a month, but I would like to improve it nonetheless.

I also tried to turn the elements for all records into a list using the as_list function within xml2 so I could continue with plyr (a rough sketch of that attempt is below), but this also took too long.
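Roughly, that attempt looked like this (a sketch only; it assumes plyr is installed and that every variable element holds plain text):

library(xml2)
library(plyr)

raw <- read_xml(filename)
dua <- xml_find_all(raw, "//ROW_DUASDIA")

# as_list() turns each record into a nested list; ldply() then flattens each
# record into a one-row data frame and row-binds them (filling missing
# variables with NA), which is what ends up being slow at 150k records
rec_list <- as_list(dua)
df <- ldply(rec_list, function(rec) {
  as.data.frame(lapply(rec, unlist), stringsAsFactors = FALSE)
})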

Thanks in advance.


Answer 1:


While there is no guarantee of better performance on larger XML files, the ("old school") XML package maintains a compact data-frame handler, xmlToDataFrame, for flat XML files like yours. Any node that is missing from one record but present in sibling records results in NA for the corresponding field.

library(XML)

doc <- xmlParse("/path/to/file.xml")
df <- xmlToDataFrame(doc, nodes=getNodeSet(doc, "//ROW_DUASDIA"))
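One follow-up worth noting: xmlParse keeps the whole DOM in memory at the C level, so once the data frame has been built it can help to release the document explicitly (assuming your version of the XML package exports free(); otherwise simply dropping the reference and garbage-collecting is enough):

# Free the C-level DOM and drop the R reference to reclaim memory sooner
free(doc)
rm(doc)
gc()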

Should the large monthly XMLs pose memory challenges, you can even download the daily zips, unzip the needed XML, and parse each one into a data frame. As an example, the code below extracts the December 2018 daily data into a list of data frames to be row-bound at the end. The process also adds a DDate field. The method is wrapped in tryCatch to handle missing days in the sequence and other URL or zip issues.

dec_urls <- paste0(1201:1231)      # "1201" ... "1231", i.e. MMDD strings for December 2018
temp_zip <- "/path/to/temp.zip"
xml_folder <- "/path/to/xml/folder"

xml_process <- function(dt) {      
  tryCatch({
    # DOWNLOAD ZIP TO URL
    url <- paste0("ftp://ftp.aduanas.gub.uy/DUA%20Diarios%20XML/2018/dd2018", dt,".zip")
    file <- paste0(xml_folder, "/dding2018", dt, ".xml")

    download.file(url, temp_zip)
    unzip(temp_zip, files=paste0("dding2018", dt, ".xml"), exdir=xml_folder)
    unlink(temp_zip)           # DESTROY TEMP ZIP

    # PARSE XML TO DATA FRAME
    doc <- xmlParse(file)        
    df <- transform(xmlToDataFrame(doc, nodes=getNodeSet(doc, "//ROW_DUASDIA")),
                    DDate = as.Date(paste0("2018", dt), format="%Y%m%d", origin="1970-01-01"))
    unlink(file)               # DESTROY TEMP XML

    # RETURN XML DF
    return(df)
  }, error = function(e) NA)      
}

# BUILD LIST OF DATA FRAMES
dec_df_list <- lapply(dec_urls, xml_process)

# FILTER OUT NON-DATA-FRAME RESULTS (THE NAs RETURNED BY tryCatch ON FAILURE)
dec_df_list <- Filter(is.data.frame, dec_df_list)

# ROW BIND TO FINAL SINGLE DATA FRAME
dec_final_df <- do.call(rbind, dec_df_list)
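Note that base rbind stops with an error if the daily data frames do not all share exactly the same columns. A more tolerant bind (a sketch, assuming the data.table package is available) pads any missing columns with NA:

# ROW BIND THAT TOLERATES DIFFERING COLUMN SETS ACROSS DAYS
library(data.table)
dec_final_df <- rbindlist(dec_df_list, fill=TRUE)   # fill=TRUE pads absent columns with NA
setDF(dec_final_df)                                 # back to a plain data.frame if preferred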



Answer 2:


Here is a solution that processes the entire document at once, as opposed to reading each of the 150,000 records in a loop. This should provide a significant performance boost.

This version can also handle cases where the number of variables per record is different.

library(xml2)
doc <- read_xml('<ROOT>
  <ROWSET_DUASDIA>
    <ROW_DUASDIA NUM="1">
      <variable1>value1</variable1>
      <variable191>value2</variable191>
    </ROW_DUASDIA>
    <ROW_DUASDIA NUM="150236">
      <variable1>value3</variable1>
      <variable2>value_new</variable2>
      <variable191>value4</variable191>
    </ROW_DUASDIA>
  </ROWSET_DUASDIA>
</ROOT>')

#find all of the nodes/records
nodes <- xml_find_all(doc, ".//ROW_DUASDIA")

#find the record NUM and the number of variables under each record
nodenum <- xml_attr(nodes, "NUM")
nodeslength <- xml_length(nodes)

#find the variable names and values
nodenames <- xml_name(xml_children(nodes))
nodevalues <- trimws(xml_text(xml_children(nodes)))

#create dataframe
df <- data.frame(NUM = rep(nodenum, times = nodeslength),
                 variable = nodenames, values = nodevalues,
                 stringsAsFactors = FALSE)

#dataframe is in long format
#use dcast from reshape2, or spread from tidyr, to convert it to wide format
#      NUM    variable    values
# 1      1   variable1    value1
# 2      1 variable191    value2
# 3 150236   variable1    value3
# 4 150236   variable2 value_new
# 5 150236 variable191    value4

#Convert to wide format
library(tidyr)
spread(df, variable, values)
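Applied to one of the real monthly files, the same pattern would look roughly like this (a sketch; filename stands in for the path to an unzipped "dming" XML, and pivot_wider() is the successor to spread() in tidyr >= 1.0.0):

library(xml2)
library(tidyr)

doc   <- read_xml(filename)
nodes <- xml_find_all(doc, ".//ROW_DUASDIA")

#long format: one row per (record, variable) pair
df_long <- data.frame(NUM = rep(xml_attr(nodes, "NUM"), times = xml_length(nodes)),
                      variable = xml_name(xml_children(nodes)),
                      values = trimws(xml_text(xml_children(nodes))),
                      stringsAsFactors = FALSE)

#wide format: one row per record, one column per variable
df_wide <- pivot_wider(df_long, names_from = variable, values_from = values)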


Source: https://stackoverflow.com/questions/53949273/parsing-large-xml-to-dataframe-in-r
