Parsing XML in R: Incorrect namespaces

我的梦境 submitted on 2019-12-07 02:19:53

Question


I have a bunch of XML files and an R script that reads their content into a data frame. However, I have now received some files that I wanted to parse as usual, but something in their namespace definition prevents me from selecting their values with my usual XPath expressions.

XML files are like this:

xml_nons.xml

<?xml version="1.0" encoding="UTF-8"?>
<XML>
   <Node>
      <Name>Name 1</Name>
      <Title>Title 1</Title>
      <Date>2015</Date>
   </Node>
</XML>

And the other:

xml_ns.xml

<?xml version="1.0" encoding="UTF-8"?>
<XML xmlns="http://www.nonexistingsite.com">
   <Node>
      <Name>Name 2</Name>
      <Title>Title 2</Title>
      <Date>2014</Date>
   </Node>
</XML>

The URL that xmlns points to does not exist.

The R code I use is like this:

library(XML)

# pattern is a regular expression, so the dot must be escaped
xmlfiles <- list.files(path = ".",
                       pattern = "\\.xml$",
                       full.names = TRUE,
                       recursive = TRUE)

n <- length(xmlfiles)
dat <- vector("list", n)

for (i in seq_len(n)) {
  doc <- xmlTreeParse(xmlfiles[i], useInternalNodes = TRUE)
  nodes <- getNodeSet(doc, "//XML")
  # one data frame per matched root node
  x <- lapply(nodes, function(x) {
    data.frame(
      Filename = xmlfiles[i],
      Name = xpathSApply(x, ".//Node/Name", xmlValue),
      Title = xpathSApply(x, ".//Node/Title", xmlValue),
      Date = xpathSApply(x, ".//Node/Date", xmlValue)
    )
  })
  dat[[i]] <- do.call("rbind", x)
}

xml <- do.call("rbind", dat)
xml

However, the result contains only the file without a namespace:

Filename            Name    Title    Date
./xml_nons.xml      Name 1  Title 1  2015

If I remove the namespace declaration from the second file, I get the correct result:

Filename            Name    Title    Date
./xml_nons_1.xml    Name 1  Title 1  2015
./xml_ns_1.xml      Name 2  Title 2  2014

Of course I could use an XSL stylesheet to remove those namespaces from the original XML files, but I would like a solution that works within R. Is there some way to tell R to simply ignore the namespace declaration?


Answer 1:


I think there is no easy way to ignore the namespaces. The best way is to learn to live with them. This answer uses the newer xml2 package, but the same approach applies to the XML package.
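For the XML package used in the question, a minimal sketch of the same idea might look like the following. It assumes the namespaced file from the question (inlined here as a string so the example is self-contained) and binds its URI to a prefix, `d`, which is an arbitrary name chosen for this illustration. Only the URI has to match the one declared in the file.

```r
library(XML)

# Inline copy of xml_ns.xml from the question
xml_string <- '<XML xmlns="http://www.nonexistingsite.com">
   <Node>
      <Name>Name 2</Name>
      <Title>Title 2</Title>
      <Date>2014</Date>
   </Node>
</XML>'

doc <- xmlParse(xml_string, asText = TRUE)

# "d" is an arbitrary prefix picked for this sketch; the URI must match the file
ns <- c(d = "http://www.nonexistingsite.com")

# Without a prefix the elements are not found, because they live in the namespace
unprefixed <- xpathSApply(doc, "//Node/Name", xmlValue)

# With the prefix on every step of the path, the value is returned
prefixed <- xpathSApply(doc, "//d:Node/d:Name", xmlValue, namespaces = ns)
print(prefixed)
```

The `namespaces` argument is also accepted by `getNodeSet`, so the loop in the question could be adapted in the same way.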

Use

library(xml2)
fname <- 'myfile.xml'
doc <- read_xml(fname)
# peek at the namespaces
xml_ns(doc)

The first default namespace is assigned the prefix d1. If your XPath does not find what you want, a namespace mismatch is the most likely cause.

xpath <-  "//d1:FormDef"
ns <- xml_find_all(doc,xpath, xml_ns(doc))
ns
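Applied to the namespaced file from the question, the same steps could be sketched as follows. The XML is inlined as a string so the example runs on its own; the variable names are illustrative and not part of the original answer.

```r
library(xml2)

# Inline copy of xml_ns.xml from the question
xml_string <- '<XML xmlns="http://www.nonexistingsite.com">
   <Node>
      <Name>Name 2</Name>
      <Title>Title 2</Title>
      <Date>2014</Date>
   </Node>
</XML>'

doc <- read_xml(xml_string)

# The default namespace is listed under the generated prefix d1
ns <- xml_ns(doc)
print(names(ns))

# An unprefixed path matches nothing in a namespaced document
plain <- xml_find_all(doc, "//Node/Name")

# Prefixing each step with d1: selects the element
prefixed <- xml_find_all(doc, "//d1:Node/d1:Name", ns)
print(xml_text(prefixed))
```

The empty result from the unprefixed query is the same symptom the question describes: the path is valid, but it refers to elements in no namespace, and the document has none of those.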

Also, you have to add the prefix to every element in the path. To save typing, you can do

> library(stringr)
> xpath <- "/ODM/Study"
> (xpath <- str_replace_all(xpath, '/', '/d1:'))
[1] "/d1:ODM/d1:Study"
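Putting this together for the two files in the question, one possible sketch is shown below. The helpers `qualify` and `read_one` are hypothetical names introduced here, and the two files are written to a temporary directory so the example is self-contained. The prefix is added only when the document actually declares a default namespace, so files with and without a namespace go through the same code. Note that the simple substitution only suits paths built from single slashes; a path containing `//` would need different handling.

```r
library(xml2)

# Hypothetical helper: add "d1:" to each step only if the document has a default namespace
qualify <- function(xpath, doc) {
  if ("d1" %in% names(xml_ns(doc))) {
    gsub("/", "/d1:", xpath, fixed = TRUE)
  } else {
    xpath
  }
}

# Hypothetical helper: read one file into a data frame with the question's columns
read_one <- function(path) {
  doc <- read_xml(path)
  field <- function(name) {
    xml_text(xml_find_all(doc, qualify(paste0("/XML/Node/", name), doc)))
  }
  data.frame(Filename = basename(path),
             Name = field("Name"),
             Title = field("Title"),
             Date = field("Date"),
             stringsAsFactors = FALSE)
}

# Recreate the two files from the question in a temporary directory
dir <- tempfile("xmldemo")
dir.create(dir)
writeLines(c("<XML>", "  <Node>", "    <Name>Name 1</Name>",
             "    <Title>Title 1</Title>", "    <Date>2015</Date>",
             "  </Node>", "</XML>"),
           file.path(dir, "xml_nons.xml"))
writeLines(c('<XML xmlns="http://www.nonexistingsite.com">', "  <Node>",
             "    <Name>Name 2</Name>", "    <Title>Title 2</Title>",
             "    <Date>2014</Date>", "  </Node>", "</XML>"),
           file.path(dir, "xml_ns.xml"))

files <- list.files(dir, pattern = "\\.xml$", full.names = TRUE)
result <- do.call(rbind, lapply(files, read_one))
print(result)
```

Building the list with `lapply` and binding once at the end also avoids assigning into a preallocated list inside a loop.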


Source: https://stackoverflow.com/questions/29170161/parsing-xml-in-r-incorrect-namespaces
