Question
I have a bunch of XML files and an R script that reads their content into a data frame. However, I have now received files that I wanted to parse as usual, but something in their namespace definition prevents me from selecting their values with the usual XPath expressions.
The XML files look like this:
xml_nons.xml
<?xml version="1.0" encoding="UTF-8"?>
<XML>
<Node>
<Name>Name 1</Name>
<Title>Title 1</Title>
<Date>2015</Date>
</Node>
</XML>
And the other:
xml_ns.xml
<?xml version="1.0" encoding="UTF-8"?>
<XML xmlns="http://www.nonexistingsite.com">
<Node>
<Name>Name 2</Name>
<Title>Title 2</Title>
<Date>2014</Date>
</Node>
</XML>
The URL that xmlns points to doesn't exist.
The R code I use is like this:
library(XML)
xmlfiles <- list.files(path = ".",
                       pattern = "*.xml$",
                       full.names = TRUE,
                       recursive = TRUE)
n <- length(xmlfiles)
dat <- vector("list", n)
for (i in 1:n) {
  doc <- xmlTreeParse(xmlfiles[i], useInternalNodes = TRUE)
  nodes <- getNodeSet(doc, "//XML")
  x <- lapply(nodes, function(x) {
    data.frame(
      Filename = xmlfiles[i],
      Name = xpathSApply(x, ".//Node/Name", xmlValue),
      Title = xpathSApply(x, ".//Node/Title", xmlValue),
      Date = xpathSApply(x, ".//Node/Date", xmlValue)
    )
  })
  dat[[i]] <- do.call("rbind", x)
}
xml <- do.call("rbind", dat)
xml
However, the result contains only the file without a namespace:
Filename Name Title Date
./xml_nons.xml Name 1 Title 1 2015
If I remove the namespace declaration from the second file, I get the correct result:
Filename Name Title Date
./xml_nons_1.xml Name 1 Title 1 2015
./xml_ns_1.xml Name 2 Title 2 2014
Of course I could use an XSL transformation to remove the namespaces from the original XML files, but I would like a solution that works within R. Is there some way to tell R to simply ignore the namespace declaration?
Answer 1:
I think there is no easy way to ignore the namespaces. The best way is to learn to live with them. This answer uses the newer xml2 package, but the same approach applies to the XML package.
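For the XML package used in the question, living with the namespace might look like the following sketch. The XML is inlined as a string so the example is self-contained, and the prefix d is an arbitrary choice made here for illustration; only the namespace URI has to match the one in the file.

```r
library(XML)

# The namespaced document from the question, inlined for illustration.
xml_string <- '<?xml version="1.0" encoding="UTF-8"?>
<XML xmlns="http://www.nonexistingsite.com">
<Node>
<Name>Name 2</Name>
<Title>Title 2</Title>
<Date>2014</Date>
</Node>
</XML>'

doc <- xmlParse(xml_string, asText = TRUE)

# Map a prefix of our own choosing ("d") to the document's default namespace URI.
ns <- c(d = "http://www.nonexistingsite.com")

# Every element step in the XPath carries the prefix.
nodes <- getNodeSet(doc, "//d:Node", namespaces = ns)

res <- do.call(rbind, lapply(nodes, function(node) {
  data.frame(
    Name  = xpathSApply(node, "./d:Name",  xmlValue, namespaces = ns),
    Title = xpathSApply(node, "./d:Title", xmlValue, namespaces = ns),
    Date  = xpathSApply(node, "./d:Date",  xmlValue, namespaces = ns)
  )
}))
res
```

Files without a namespace would still need the unprefixed paths, so a loop over mixed files would have to choose between the two forms per file.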
With xml2, read the file and inspect its namespaces:
library(xml2)
fname <- 'myfile.xml'
doc <- read_xml(fname)
# peek at the namespaces
xml_ns(doc)
The first default namespace is given the prefix d1. If your XPath does not find what you want, a namespace mismatch is the most likely cause.
xpath <- "//d1:FormDef"
ns <- xml_find_all(doc,xpath, xml_ns(doc))
ns
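Applied to the namespaced file from the question, the same pattern could look like this sketch. The XML is inlined as a string to keep the example self-contained; with a real file you would pass its path to read_xml() instead.

```r
library(xml2)

# The namespaced document from the question, inlined for illustration.
xml_string <- '<?xml version="1.0" encoding="UTF-8"?>
<XML xmlns="http://www.nonexistingsite.com">
<Node>
<Name>Name 2</Name>
<Title>Title 2</Title>
<Date>2014</Date>
</Node>
</XML>'

doc <- read_xml(xml_string)

# xml_ns() reports the default namespace under the prefix d1.
ns <- xml_ns(doc)

# Without a prefix the XPath matches nothing ...
unprefixed <- xml_find_all(doc, "//Node/Name")

# ... with the d1 prefix on every step it finds the element.
prefixed <- xml_find_all(doc, "//d1:Node/d1:Name", ns)
name_values <- xml_text(prefixed)

length(unprefixed)
name_values
```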
Also, you have to add the prefix to every element in the path. To save typing, you can do:
library(stringr)
xpath <- "/ODM/Study"
(xpath <- str_replace_all(xpath, '/', '/d1:'))
# [1] "/d1:ODM/d1:Study"
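One caveat with replacing every slash is that a path containing // (such as the .//Node/Name paths in the question) would end up with a prefix between the two slashes. A slightly more careful variant, sketched below in base R, only inserts the prefix where one or more slashes are followed by the start of an element name. The function name add_ns_prefix is hypothetical, and the sketch does not handle a first step without a leading slash, attributes, axes, or functions such as text().

```r
# Hypothetical helper: prefix each element step that follows one or more slashes.
add_ns_prefix <- function(xpath, prefix = "d1") {
  # (/+) captures "/" or "//"; ([A-Za-z_]) captures the first character of the name.
  gsub("(/+)([A-Za-z_])", paste0("\\1", prefix, ":\\2"), xpath)
}

add_ns_prefix("/ODM/Study")
# [1] "/d1:ODM/d1:Study"
add_ns_prefix(".//Node/Name")
# [1] ".//d1:Node/d1:Name"
```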
Source: https://stackoverflow.com/questions/29170161/parsing-xml-in-r-incorrect-namespaces