Question
I have an XML file (a TEI-encoded play) that I want to process into a data.frame in R, where every row of the data.frame contains one line of the play, the line number, the speaker of that line, the scene number, and the scene type. The body of the XML file looks like this (but longer):
<text>
  <body>
    <div1 type="scene" n="1">
      <sp who="fau">
        <l n="30">Settle thy studies, Faustus, and begin</l>
        <l n="31">To sound the depth of that thou wilt profess;</l>
        <l n="32">Having commenced, be a divine in show,</l>
      </sp>
      <sp who="eang">
        <l n="105">Go forward, Faustus, in that famous art,</l>
      </sp>
    </div1>
    <div1 type="scene" n="2">
      <sp who="sch1">
        <l n="NA">I wonder what's become of Faustus, that was wont to make our schools ring with sic probo.</l>
      </sp>
      <sp who="sch2">
        <l n="NA">That shall we know, for see here comes his boy.</l>
      </sp>
      <sp who="sch1">
        <l n="NA">How now sirrah, where's thy master?</l>
      </sp>
      <sp who="wag">
        <l n="NA">God in heaven knows.</l>
      </sp>
    </div1>
  </body>
</text>
The problem seems similar to questions posed here and here, but my XML file is structured slightly differently, so neither has given me a working solution. I've managed to do this:
library(XML)
doc <- xmlTreeParse("data/faustus_sample.xml", useInternalNodes=TRUE)
bodyToDF <- function(x){
  scenenum <- xmlGetAttr(x, "n")
  scenetype <- xmlGetAttr(x, "type")
  attributes <- sapply(xmlChildren(x, omitNodeTypes = "XMLInternalTextNode"), xmlAttrs)
  linecontent <- sapply(xmlChildren(x), xmlValue)
  data.frame(scenenum = scenenum, scenetype = scenetype, attributes = attributes, linecontent = linecontent, stringsAsFactors = FALSE)
}
res <- xpathApply(doc, '//div1', bodyToDF)
temp.df <- do.call(rbind, res)
This returns a data.frame with 'scene number', 'scene type', and 'speaker' intact, but I can't work out how to break it down to each line (and get the associated line number).
I tried importing the file as a list (via xmlToList), but this gave me an incredibly messy list of lists of lists, and it also resulted in a lot of different errors if I attempted to use for loops to access the different elements (terrible idea, I know!).
Ideally, I'm looking for a solution that will work on the full file in all its messiness and also work for other, similarly structured XML files.
I've just started using R and am totally at a loss. Any assistance you can provide will be hugely appreciated.
Thanks for your help!
EDIT: a copy of the full xml file is available here.
Answer 1:
Add a nested xpathApply over the sp elements inside each div1, so that every speech is expanded into one row per line:
bodyToDF <- function(x){
  scenenum <- xmlGetAttr(x, "n")
  scenetype <- xmlGetAttr(x, "type")
  # Build one data.frame per <sp> (speech) in this scene
  sp <- xpathApply(x, 'sp', function(sp) {
    who <- xmlGetAttr(sp, "who")
    if(is.null(who))
      who <- NA
    # Line number and text of every <l> in this speech; NA if "n" is missing
    line_num <- xpathSApply(sp, 'l', function(l) { xmlGetAttr(l, "n", NA) })
    linecontent <- xpathSApply(sp, 'l', xmlValue)
    data.frame(scenenum, scenetype, who, line_num, linecontent, stringsAsFactors = FALSE)
  })
  do.call(rbind, sp)
}
# doc is the parsed document from the question
res <- xpathApply(doc, '//div1', bodyToDF)
temp.df <- do.call(rbind, res)
First 4 columns:
# > temp.df[,1:4]
#   scenenum scenetype  who line_num
# 1        1     scene  fau       30
# 2        1     scene  fau       31
# 3        1     scene  fau       32
# 4        1     scene eang      105
# 5        2     scene sch1       NA
# 6        2     scene sch2       NA
# 7        2     scene sch1       NA
# 8        2     scene  wag       NA
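The same table can also be built from the bottom up: select every l node directly and read the speaker and scene attributes from its parent and grandparent. The following is only a sketch of that alternative, not part of the original answer. It assumes every l sits directly inside an sp that sits directly inside a div1, as in the sample; the helper name lineToRow and the inline sample_xml string are made up for illustration.

```r
library(XML)

# Shortened copy of the sample from the question, parsed from a string
# (assumption: the real file has the same div1 > sp > l nesting).
sample_xml <- '<text><body>
<div1 type="scene" n="1">
<sp who="fau">
<l n="30">Settle thy studies, Faustus, and begin</l>
<l n="31">To sound the depth of that thou wilt profess;</l>
</sp>
<sp who="eang">
<l n="105">Go forward, Faustus, in that famous art,</l>
</sp>
</div1>
<div1 type="scene" n="2">
<sp who="wag">
<l n="NA">God in heaven knows.</l>
</sp>
</div1>
</body></text>'
doc <- xmlParse(sample_xml, asText = TRUE)

# Hypothetical helper: build one row from a single <l> node.
lineToRow <- function(l) {
  sp  <- xmlParent(l)   # enclosing <sp>
  div <- xmlParent(sp)  # enclosing <div1>
  data.frame(
    scenenum    = xmlGetAttr(div, "n", NA_character_),
    scenetype   = xmlGetAttr(div, "type", NA_character_),
    who         = xmlGetAttr(sp, "who", NA_character_),
    line_num    = xmlGetAttr(l, "n", NA_character_),
    linecontent = xmlValue(l),
    stringsAsFactors = FALSE
  )
}

# One row per <l>, in document order
lines.df <- do.call(rbind, xpathApply(doc, "//div1/sp/l", lineToRow))
```

Because the XPath expression only matches l nodes nested as div1/sp/l, lines that appear elsewhere in the full file (for example outside any sp) would be skipped rather than raise an error, so it is worth comparing nrow(lines.df) against the number of l nodes in the document.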
Source: https://stackoverflow.com/questions/28826389/load-xml-to-dataframe-in-r-with-parent-node-attributes