Load XML to Dataframe in R with parent node attributes

Submitted by ☆樱花仙子☆ on 2019-11-27 08:44:24

Question


I have an XML file (a TEI-encoded play) that I want to process into a data.frame in R, where every row of the data.frame contains one line of the play, the line number, the speaker of that line, the scene number, and the scene type. The body of the XML file looks like this (but longer):

<text>
<body>
<div1 type="scene" n="1">
    <sp who="fau">
        <l n="30">Settle thy studies, Faustus, and begin</l>
        <l n="31">To sound the depth of that thou wilt profess;</l>
        <l n="32">Having commenced, be a divine in show,</l>
    </sp>
    <sp who="eang">
        <l n="105">Go forward, Faustus, in that famous art,</l>
    </sp>
</div1>
<div1 type="scene" n="2">
    <sp who="sch1">
        <l n="NA">I wonder what's become of Faustus, that was wont to make our schools ring with sic probo.</l>
    </sp>
    <sp who="sch2">
        <l n="NA">That shall we know, for see here comes his boy.</l>
    </sp>
    <sp who="sch1">
        <l n="NA">How now sirrah, where's thy master?</l>
    </sp>
    <sp who="wag">
        <l n="NA">God in heaven knows.</l>
    </sp>   
</div1>
</body>
</text>

The problem seems similar to questions posed here and here, but my XML file is structured slightly differently, so neither has given me a working solution. I've managed to do this:

library(XML)
doc <- xmlTreeParse("data/faustus_sample.xml", useInternalNodes=TRUE)

bodyToDF <- function(x){
  scenenum <- xmlGetAttr(x, "n")
  scenetype <- xmlGetAttr(x, "type")
  attributes <- sapply(xmlChildren(x, omitNodeTypes = "XMLInternalTextNode"), xmlAttrs)
  linecontent <- sapply(xmlChildren(x), xmlValue)
  data.frame(scenenum = scenenum, scenetype = scenetype, attributes = attributes, linecontent = linecontent, stringsAsFactors = FALSE)
}

res <- xpathApply(doc, '//div1', bodyToDF)
temp.df <- do.call(rbind, res)

This returns a data.frame with 'scene number', 'scene type', and 'speaker' intact, but I can't work out how to break it down to each line (and get the associated line number).

I tried importing the file as a list (via xmlToList), but this gave me an incredibly messy list of lists of lists, and it also resulted in a lot of different errors if I attempted to use for loops to access the different elements (terrible idea, I know!).

Ideally, I'm looking for a solution that will work on the full file in all its messiness and also work for other, similarly structured XML files.

I've just started using R and am totally at a loss. Any assistance you can provide will be hugely appreciated.

Thanks for your help!

EDIT: a copy of the full xml file is available here.


Answer 1:


Add a nested xpathApply over the sp elements inside each div1, so that every speech returns one row per line:

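bodyToDF <- function(x){
  # Scene-level attributes, read from the <div1> node
  scenenum <- xmlGetAttr(x, "n")
  scenetype <- xmlGetAttr(x, "type")
  # One data.frame per <sp> (speech), with one row per <l> (line)
  sp <- xpathApply(x, 'sp', function(sp) {
    # NA when a speech has no "who" attribute
    who <- xmlGetAttr(sp, "who", default = NA)
    # NA when a line has no "n" attribute
    line_num <- xpathSApply(sp, 'l', function(l) xmlGetAttr(l, "n", default = NA))
    linecontent <- xpathSApply(sp, 'l', function(l) xmlValue(l))
    data.frame(scenenum, scenetype, who, line_num, linecontent, stringsAsFactors = FALSE)
  })
  do.call(rbind, sp)
}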

res <- xpathApply(doc, '//div1', bodyToDF)
temp.df <- do.call(rbind, res)

First 4 columns:

# > temp.df[,1:4]
#   scenenum scenetype  who line_num
# 1        1     scene  fau       30
# 2        1     scene  fau       31
# 3        1     scene  fau       32
# 4        1     scene eang      105
# 5        2     scene sch1       NA
# 6        2     scene sch2       NA
# 7        2     scene sch1       NA
# 8        2     scene  wag       NA
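
The same result can also be reached from the other direction: select every l node and read the speaker and scene attributes from its parents. The following is only a sketch under assumptions, not part of the original answer. The inline xml_text sample, the names doc2 and lines.df, and the helper lineToRow are made up for illustration, and it assumes every l sits directly inside an sp that sits directly inside a div1.

```r
library(XML)

# Small inline sample mirroring the structure shown in the question
xml_text <- '<text><body>
<div1 type="scene" n="1">
  <sp who="fau">
    <l n="30">Settle thy studies, Faustus, and begin</l>
    <l n="31">To sound the depth of that thou wilt profess;</l>
  </sp>
  <sp who="eang">
    <l n="105">Go forward, Faustus, in that famous art,</l>
  </sp>
</div1>
<div1 type="scene" n="2">
  <sp who="wag">
    <l n="NA">God in heaven knows.</l>
  </sp>
</div1>
</body></text>'

doc2 <- xmlParse(xml_text, asText = TRUE)

# Every line that is a direct child of a speech inside a scene
line_nodes <- getNodeSet(doc2, "//div1/sp/l")

# Hypothetical helper: build one row from an <l> node by walking up to its parents
lineToRow <- function(l) {
  sp <- xmlParent(l)     # the enclosing <sp>
  div <- xmlParent(sp)   # the enclosing <div1>
  data.frame(
    scenenum = xmlGetAttr(div, "n", default = NA),
    scenetype = xmlGetAttr(div, "type", default = NA),
    who = xmlGetAttr(sp, "who", default = NA),
    line_num = xmlGetAttr(l, "n", default = NA),
    linecontent = xmlValue(l),
    stringsAsFactors = FALSE
  )
}

lines.df <- do.call(rbind, lapply(line_nodes, lineToRow))
```

Because the XPath only matches l nodes that are direct children of sp, lines nested more deeply in the full file would be skipped by this sketch and would need a different path.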


Source: https://stackoverflow.com/questions/28826389/load-xml-to-dataframe-in-r-with-parent-node-attributes
