Traverse multiple XML documents to find particular attributes using R

梦毁少年i 2021-01-26 16:19

I have a series of several thousand URLs that link to prescription drug labels and am trying to figure out how many have a patient package insert. I am attempting to do this by

2 answers

抹茶落季 (original poster)

2021-01-26 16:44

You can use something like the following:

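library(XML)  # provides htmlParse, xpathApply and xmlAttrs
# Parse every URL into an HTML document
xData <- lapply(Data$urls, htmlParse)
# For each document, return the attributes of every node whose title mentions the insert
ppiData <- lapply(xData, FUN = xpathApply, path = "/descendant-or-self::*[contains(@title, 'Patient Package Insert')]", fun = xmlAttrs)
ppiData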

[[1]]
[[1]][[1]]
                   title                     href                    class 
"Patient Package Insert"            "#nlm42230-3"            "nlmlinktrue" 


[[2]]
[[2]][[1]]
                   title                     href                    class 
"Patient Package Insert"            "#nlm42230-3"           "nlmlinkfalse" 


[[3]]
[[3]][[1]]
                   title                     href                    class 
"Patient Package Insert"            "#nlm42230-3"           "nlmlinkfalse"

For this simple example, you could then convert the result to a data frame:

ppiData <- lapply(ppiData, function(x){data.frame(as.list(x[[1]]))})
ppiData <- do.call(rbind, ppiData)

> ppiData
                   title        href        class
1 Patient Package Insert #nlm42230-3  nlmlinktrue
2 Patient Package Insert #nlm42230-3 nlmlinkfalse
3 Patient Package Insert #nlm42230-3 nlmlinkfalse

With your real data set, the second step may be a bit more involved, because a single document can contain multiple matching entries, among other complications.
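
As a rough sketch of how that more involved case might be handled, the code below counts matches per document instead of taking only the first one, so documents with several entries or none are both covered. The three HTML strings in `pages` are made-up stand-ins for the downloaded labels, not real data, and the names `pages`, `n_matches`, `has_ppi` and `ppi_df` are only illustrative. It also assumes every matching node carries both a `title` and an `href` attribute.

```r
library(XML)

# Hypothetical stand-ins for the parsed label pages (not real data)
pages <- c(
  '<html><body><a title="Patient Package Insert" href="#a" class="x">PPI</a></body></html>',
  '<html><body><p>No insert here</p></body></html>',
  '<html><body><a title="Patient Package Insert" href="#b">one</a><a title="Patient Package Insert" href="#c">two</a></body></html>'
)

# asText = TRUE parses the strings themselves rather than treating them as file names
docs <- lapply(pages, htmlParse, asText = TRUE)

# One list element per document, each holding zero or more attribute vectors
matches <- lapply(docs, xpathApply,
                  path = "//*[contains(@title, 'Patient Package Insert')]",
                  fun = xmlAttrs)

# Number of matching nodes per document, and whether each document has any
n_matches <- lengths(matches)
has_ppi <- n_matches > 0

# How many documents contain at least one patient package insert link
sum(has_ppi)

# One row per match, keeping the index of the document it came from
# (assumes each match has both a title and an href attribute)
rows <- lapply(seq_along(matches), function(i) {
  lapply(matches[[i]], function(a) {
    data.frame(doc = i, title = a[["title"]], href = a[["href"]])
  })
})
ppi_df <- do.call(rbind, unlist(rows, recursive = FALSE))
ppi_df
```

With real URLs, `docs` would come from the `lapply(Data$urls, htmlParse)` step above instead of from inline strings; the counting and row-building steps would stay the same.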
