I have a large number of XML files that I need to process. To do that, I want to read the files and save the resulting list of objects to disk.
xml2 objects hold external pointers, which become invalid when you serialize them naively (for example with saveRDS()). The package provides the functions xml_serialize() and xml_unserialize() to handle this for you. Unfortunately, the API is slightly cumbersome because base::serialize() and base::unserialize() assume an open connection.
library(xml2)
x <- read_xml("<foo>
<bar>text <baz id = 'a' /></bar>
<bar>2</bar>
<baz id = 'b' />
</foo>")
# Helper that writes an object to a temporary file and reads it back
roundtrip <- function(obj) {
  tf <- tempfile()
  con <- file(tf, "wb")
  on.exit(unlink(tf))
  xml_serialize(obj, con)
  close(con)
  con <- file(tf, "rb")
  on.exit(close(con), add = TRUE)
  xml_unserialize(con)
}
x
#> {xml_document}
#> <foo>
#> [1] <bar>text <baz id="a"/></bar>
#> [2] <bar>2</bar>
#> [3] <baz id="b"/>
(y <- roundtrip(x))
#> {xml_document}
#> <foo>
#> [1] <bar>text <baz id="a"/></bar>
#> [2] <bar>2</bar>
#> [3] <baz id="b"/>
identical(x, y)
#> [1] FALSE
all.equal(x, y)
#> [1] TRUE
xml_children(y)
#> {xml_nodeset (3)}
#> [1] <bar>text <baz id="a"/></bar>
#> [2] <bar>2</bar>
#> [3] <baz id="b"/>
as_list(y)
#> $bar
#> $bar[[1]]
#> [1] "text "
#>
#> $bar$baz
#> list()
#> attr(,"id")
#> [1] "a"
#>
#>
#> $bar
#> $bar[[1]]
#> [1] "2"
#>
#>
#> $baz
#> list()
#> attr(,"id")
#> [1] "b"
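For a whole list of documents, one alternative to serializing through a connection is to store each document as its XML text and re-parse it on load. The following is a minimal sketch of that swapped-in approach, not the xml_serialize() route shown above. The helper names save_xml_list() and load_xml_list() are hypothetical (they are not part of xml2), and the sketch assumes the documents are small enough that re-parsing them is acceptable.

```r
library(xml2)

# Hypothetical helpers (not part of xml2): keep each document as its
# XML text, which survives saveRDS()/readRDS(), and re-parse on load.
save_xml_list <- function(docs, path) {
  saveRDS(lapply(docs, as.character), path)
}

load_xml_list <- function(path) {
  lapply(readRDS(path), read_xml)
}

# Two small example documents standing in for the parsed files
docs <- list(
  a = read_xml("<foo><bar>1</bar></foo>"),
  b = read_xml("<foo><bar>2</bar></foo>")
)

rds <- tempfile(fileext = ".rds")
save_xml_list(docs, rds)
restored <- load_xml_list(rds)

# The restored objects are ordinary xml_document objects again
restored_text <- xml_text(xml_find_first(restored$b, "//bar"))
```

The trade-off is that every document is parsed again when it is loaded, but the saved file contains only plain character data and no external pointers.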
Also, regarding the second part of your question: I would seriously consider using XPath expressions to extract the desired data, even if that means rewriting some code.
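As a rough illustration of what that could look like, here is a small sketch that runs XPath queries against the example document from above. The specific expressions are only assumptions about what you might want to extract; your real documents will need their own paths.

```r
library(xml2)

x <- read_xml("<foo>
<bar>text <baz id = 'a' /></bar>
<bar>2</bar>
<baz id = 'b' />
</foo>")

# All <baz> elements anywhere in the document, then their id attributes
baz_ids <- xml_attr(xml_find_all(x, "//baz"), "id")

# Text of the second <bar> child of the root element
second_bar <- xml_text(xml_find_first(x, "/foo/bar[2]"))
```

xml_find_all() returns a nodeset, so functions such as xml_attr() and xml_text() work on all matches at once, without converting the document with as_list() first.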