Question
SDMX (Statistical Data and Metadata Exchange) is an XML-based standard for exchanging statistical data. It uses files called Data Structure Definitions (DSD) to convey the structure of a dataset. Among other things, the DSD contains a Codelists node that comprises Codelist items, each of which is the parent of Code items carrying an id attribute and a Name child element. I am currently trying to parse the Codelists of a DSD file requested from Eurostat's REST interface into a list of data frames in R using the following code:
library(XML)
library(RCurl)
# REST resource for the DSD of nama_gdp_c:
# download it, parse the XML and set the root node
file <- "http://ec.europa.eu/eurostat/SDMX/diss-web/rest/datastructure/ESTAT/DSD_nama_gdp_c"
content <- getURL(file, httpheader = list('User-Agent' = 'R-Agent'))
root <- xmlRoot(xmlInternalTreeParse(content, useInternalNodes = TRUE))
# get Nodeset of Codelists and its length
nodes <- getNodeSet(root,"//str:Codelist")
nn <- length(nodes)
# Create nested List of all Codes and Names
codelistAll <- lapply(seq(nn),function(i){
xpathSApply(root,paste0("//str:Codelist[",i,"]/str:Code"),xmlGetAttr, "id")
})
namelistAll <- lapply(seq(nn),function(i){
xpathSApply(root,paste0("//str:Codelist[",i,"]/str:Code/com:Name"),xmlValue)
})
# Create a list of dataframes from the nested lists
alldfList <-lapply(seq(nn),function(i) data.frame(codes=codelistAll[[i]],names=namelistAll[[i]]))
# Name the list items like the nodes
names(alldfList) <- sapply(nodes, xmlGetAttr,"id")
This yields alldfList, the list of data frames I was looking for.
> str(alldfList)
List of 6
$ CL_FREQ :'data.frame': 6 obs. of 2 variables:
..$ codes: Factor w/ 6 levels "A","D","H","M",..: 2 6 5 1 4 3
..$ names: Factor w/ 6 levels "Annual","Daily",..: 2 6 4 1 3 5
$ CL_GEO :'data.frame': 49 obs. of 2 variables:
..$ codes: Factor w/ 49 levels "AT","BA","BE",..: 22 21 20 10 16 15 14 13 12 11 ...
..$ names: Factor w/ 49 levels "Austria","Belgium",..: 19 18 17 16 15 14 13 12 11 10 ...
While this does the job, I have the feeling that there must be a more straightforward syntax to achieve this. In particular, the use of paste0 and the final assignment of names seem awkward. I have been reading through the documentation of the XML package and I suspect the answer involves some operation on xmlChildren, but I cannot wrap my head around how to actually do it. Does anyone have a suggestion for a canonical way of doing this operation? Any suggestion would be greatly appreciated.
Answer 1:
You can get the data.frames directly from nodes by running a relative XPath query on each Codelist node, but you need to supply the namespace explicitly:
ns <- c(str="http://www.sdmx.org/resources/sdmxml/schemas/v2_1/structure")
alldfList <- lapply(nodes, function(x) {
  data.frame(
    codes = xpathSApply(x, ".//str:Code", xmlGetAttr, "id", namespaces = ns),
    names = xpathSApply(x, ".//str:Code", xmlValue, namespaces = ns)
  )
})
names(alldfList) <- sapply(nodes, xmlGetAttr,"id")
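To try the per-node approach without hitting the Eurostat service, here is a minimal sketch that runs on a small inline XML fragment. The fragment is hypothetical and only imitates the shape of a DSD; the com namespace URI is an assumption modelled on the str one above. It also queries com:Name directly rather than calling xmlValue on the Code node, since xmlValue on Code concatenates the text of all its children, which matters if a Code carries more than one child.

```r
library(XML)

# Hypothetical miniature DSD fragment, for illustration only
xml_text <- '
<mes:Structure xmlns:mes="http://www.sdmx.org/resources/sdmxml/schemas/v2_1/message"
               xmlns:str="http://www.sdmx.org/resources/sdmxml/schemas/v2_1/structure"
               xmlns:com="http://www.sdmx.org/resources/sdmxml/schemas/v2_1/common">
  <mes:Structures>
    <str:Codelists>
      <str:Codelist id="CL_FREQ">
        <str:Code id="A"><com:Name>Annual</com:Name></str:Code>
        <str:Code id="M"><com:Name>Monthly</com:Name></str:Code>
      </str:Codelist>
      <str:Codelist id="CL_GEO">
        <str:Code id="AT"><com:Name>Austria</com:Name></str:Code>
      </str:Codelist>
    </str:Codelists>
  </mes:Structures>
</mes:Structure>'

# Prefix-to-URI mapping used by every XPath query below
# (the com URI is an assumption; it only has to match the fragment above)
ns <- c(str = "http://www.sdmx.org/resources/sdmxml/schemas/v2_1/structure",
        com = "http://www.sdmx.org/resources/sdmxml/schemas/v2_1/common")

doc   <- xmlParse(xml_text, asText = TRUE)
nodes <- getNodeSet(doc, "//str:Codelist", namespaces = ns)

# One data frame per Codelist node, using XPath relative to that node
alldfList <- lapply(nodes, function(x) {
  data.frame(
    codes = xpathSApply(x, "./str:Code", xmlGetAttr, "id", namespaces = ns),
    names = xpathSApply(x, "./str:Code/com:Name", xmlValue, namespaces = ns),
    stringsAsFactors = FALSE
  )
})

# Label each data frame with the id of the Codelist it came from
alldfList <- setNames(alldfList, sapply(nodes, xmlGetAttr, "id"))
str(alldfList)
```

Passing stringsAsFactors = FALSE keeps the codes and names as character vectors instead of factors, which is usually more convenient for lookups. Drop it if factors are what you want.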
Answer 2:
As you are trying to read SDMX-ML files in R, you can try the rsdmx package, which is hosted on GitHub. The package is available for download from CRAN, and the latest version allows you to read Data Structure Definitions (DSDs) and their components, including Codelists, Concepts and KeyFamilies.
Alternatively, you can easily install it from GitHub using the following:
library(devtools)
# current devtools expects the repository as "user/repo"
install_github("opensdmx/rsdmx")
Taking your example for Codelists, you can easily coerce an SDMX codelist to a data.frame as follows:
require(rsdmx)
file <- "http://ec.europa.eu/eurostat/SDMX/diss-web/rest/datastructure/ESTAT/DSD_nama_gdp_c"
sdmx <- readSDMX(file)
# get the ids of all codelists
codelists <- sapply(sdmx@codelists, function(x) x@id)
# get one specific codelist as a data.frame
codelist <- as.data.frame(sdmx, codelistId = "CL_GEO")
head(codelist)
The same can be done for SDMX Concepts / ConceptSchemes, complete Data Structure Definitions (DSDs) and, of course, SDMX datasets. More examples are available on the rsdmx wiki.
Hope this helps!
Source: https://stackoverflow.com/questions/24929109/more-direct-way-to-create-list-of-dataframes-from-xml-file