I have a series of several thousand URLs that link to prescription drug labels and am trying to figure out how many have a patient package insert. I am attempting to do this by
You can use something like the following:
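library(XML)  # provides htmlParse(), xpathApply() and xmlAttrs()
# Parse each URL, then collect the attributes of every node whose title mentions the insert.
xData <- lapply(Data$urls, htmlParse)
ppiData <- lapply(xData, FUN = xpathApply, path = "/descendant-or-self::*[contains(@title, 'Patient Package Insert')]", fun = xmlAttrs)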
ppiData
[[1]]
[[1]][[1]]
title href class
"Patient Package Insert" "#nlm42230-3" "nlmlinktrue"
[[2]]
[[2]][[1]]
title href class
"Patient Package Insert" "#nlm42230-3" "nlmlinkfalse"
[[3]]
[[3]][[1]]
title href class
"Patient Package Insert" "#nlm42230-3" "nlmlinkfalse"
For this simple example you could convert the result to a data frame:
ppiData <- lapply(ppiData, function(x){data.frame(as.list(x[[1]]))})
ppiData <- do.call(rbind, ppiData)
> ppiData
title href class
1 Patient Package Insert #nlm42230-3 nlmlinktrue
2 Patient Package Insert #nlm42230-3 nlmlinkfalse
3 Patient Package Insert #nlm42230-3 nlmlinkfalse
With your real data set, the second step may be a bit more involved, for example because a page can contain multiple matching entries.
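A minimal sketch of what that more involved second step might look like, assuming a page can contain zero, one, or several matching links. The three HTML strings are made-up stand-ins for downloaded labels, and `pages`, `n_matches`, `has_ppi` and `ppi_table` are names chosen only for this illustration.

```r
library(XML)

# Made-up stand-ins for three downloaded label pages (not real data).
pages <- c(
  '<html><body><a title="Patient Package Insert" href="#a" class="nlmlinktrue">Patient Package Insert</a></body></html>',
  '<html><body><a title="Description" href="#b" class="nlmlinktrue">Description</a></body></html>',
  '<html><body><a title="Patient Package Insert" href="#c" class="nlmlinkfalse">Patient Package Insert</a><a title="Patient Package Insert" href="#d" class="nlmlinktrue">Patient Package Insert</a></body></html>'
)

docs <- lapply(pages, htmlParse, asText = TRUE)

# For each page, collect the attributes of every matching node (possibly none).
matches <- lapply(docs, xpathApply,
                  path = "//*[contains(@title, 'Patient Package Insert')]",
                  fun = xmlAttrs)

# Number of matches per page, and whether a page has at least one.
n_matches <- lengths(matches)
has_ppi <- n_matches > 0

# One row per match, remembering which page it came from.
# Pages without a match are skipped, so rbind never sees an empty entry.
ppi_table <- do.call(rbind, lapply(which(has_ppi), function(i) {
  attrs <- do.call(rbind, matches[[i]])  # character matrix, one row per match
  data.frame(page = i, attrs, stringsAsFactors = FALSE)
}))
```

Here `sum(has_ppi)` gives the number of pages with at least one matching link, and `ppi_table` keeps every match so duplicates on a page can be inspected afterwards.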
If you look at the parsed HTML that is returned, rather than just the grep-style match, you find:
$body$div$div$fieldset$div$ul$li$a$.attrs
title href class
"Patient Package Insert" "#nlm42230-3" "nlmlinkfalse"
... but the item above it has a class value of "nlmlinktrue". So you may need to go through all the (unfortunately not uniquely named) $body$div$div$fieldset$div$ul$li$a$.text nodes to find the "Patient Package Insert" item, and then check the class value in its $body$div$div$fieldset$div$ul$li$a$.attrs.
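One possible way to avoid walking the nested list is to ask for the class attribute directly with XPath. This is only a sketch, assuming the XML package; the HTML fragment is invented and loosely modeled on the output above, and `ppi_class` is a name chosen for illustration.

```r
library(XML)

# Invented fragment: two links in a list, the second being the PPI link.
html <- '<html><body><ul>
  <li><a title="Description" href="#other" class="nlmlinktrue">Description</a></li>
  <li><a title="Patient Package Insert" href="#nlm42230-3" class="nlmlinkfalse">Patient Package Insert</a></li>
</ul></body></html>'

doc <- htmlParse(html, asText = TRUE)

# Select <a> nodes by their text and return each node's class attribute.
ppi_class <- xpathSApply(doc,
                         "//a[normalize-space(text()) = 'Patient Package Insert']",
                         xmlGetAttr, "class")
```

Matching on the link text rather than on position means the result does not depend on which list item happens to come first.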
When I do that by hand on the third item I get:
Data$insert[[3]]$body[14]$div[12]$div[2]$fieldset[3]$div[2]$ul[27]$li[2]
$a
$a$text
[1] "Patient Package Insert"
$a$.attrs
title href class
"Patient Package Insert" "#nlm42230-3" "nlmlinkfalse"
To do it by hand, you can grep at each level to find the next node that contains "Package Insert":
> head(sapply(Data$insert[[3]], FUN = grep, pattern = "Package Insert"))
$head
integer(0)
$body
[1] 14
$.attrs
integer(0)
> head(sapply(Data$insert[[3]]$body[14], FUN = grep, pattern = "Package Insert"))
div
12
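The level-by-level grep above could in principle be automated with a small recursive function. The following is a sketch in base R under that assumption: `find_paths` is a hypothetical helper, and `page` is a made-up nested list standing in for one parsed page, not the real Data$insert structure.

```r
# Hypothetical helper: return the index path to every character leaf
# in a nested list that matches `pattern`.
find_paths <- function(x, pattern, path = integer(0)) {
  if (is.list(x)) {
    hits <- list()
    for (i in seq_along(x)) {
      hits <- c(hits, find_paths(x[[i]], pattern, c(path, i)))
    }
    return(hits)
  }
  if (is.character(x) && any(grepl(pattern, x))) list(path) else list()
}

# Made-up nested list standing in for one parsed page.
page <- list(
  head = list(title = "Label"),
  body = list(
    div = list(a = list(text = "Description",
                        .attrs = c(title = "Description", class = "nlmlinktrue"))),
    div = list(a = list(text = "Patient Package Insert",
                        .attrs = c(title = "Patient Package Insert", class = "nlmlinkfalse")))
  )
)

paths <- find_paths(page, "Package Insert")

# Drop the last index of the first hit to step up to the enclosing <a> entry,
# then read its class attribute.
a_node <- page[[head(paths[[1]], -1)]]
ppi_class <- a_node$.attrs[["class"]]
```

Each element of `paths` is an integer vector that can be passed to `[[` for recursive indexing, which replaces the manual sequence of grep calls.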