Traverse multiple XML documents to find particular attributes using R

梦毁少年i 2021-01-26 16:19

I have a series of several thousand URLs that link to prescription drug labels and am trying to figure out how many have a patient package insert. I am attempting to do this by

2 answers
  • 2021-01-26 16:44

    You can use something like the following:

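    library(XML)  # provides htmlParse, xpathApply and xmlAttrs
    # Parse each URL into an HTML document tree
    xData <- lapply(Data$urls, htmlParse)
    # For each document, collect the attributes of every node whose title mentions the insert
    ppiData <- lapply(xData, FUN = xpathApply, path = "/descendant-or-self::*[contains(@title, 'Patient Package Insert')]", fun = xmlAttrs)
    ppiData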
    
    [[1]]
    [[1]][[1]]
                       title                     href                    class 
    "Patient Package Insert"            "#nlm42230-3"            "nlmlinktrue" 
    
    
    [[2]]
    [[2]][[1]]
                       title                     href                    class 
    "Patient Package Insert"            "#nlm42230-3"           "nlmlinkfalse" 
    
    
    [[3]]
    [[3]][[1]]
                       title                     href                    class 
    "Patient Package Insert"            "#nlm42230-3"           "nlmlinkfalse" 
    

    For this simple example, you could convert the result to a data frame:

    ppiData <- lapply(ppiData, function(x){data.frame(as.list(x[[1]]))})
    ppiData <- do.call(rbind, ppiData)
    
    > ppiData
                       title        href        class
    1 Patient Package Insert #nlm42230-3  nlmlinktrue
    2 Patient Package Insert #nlm42230-3 nlmlinkfalse
    3 Patient Package Insert #nlm42230-3 nlmlinkfalse
    

    With your real data set, the second step may be a bit more involved, since a document can contain several matching entries or none at all.
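
    A minimal sketch of that second step, assuming some pages have no matching node and others may have several. The inline HTML strings, the `pages` vector, and the `extract_ppi` helper are hypothetical stand-ins, not the asker's data; real pages would be parsed from `Data$urls` as above.

    ```r
    library(XML)

    # Hypothetical stand-ins for downloaded label pages (not the real data set)
    pages <- c(
      '<html><body><a title="Patient Package Insert" href="#nlm42230-3" class="nlmlinktrue">Patient Package Insert</a></body></html>',
      '<html><body><a title="Patient Package Insert" href="#nlm42230-3" class="nlmlinkfalse">Patient Package Insert</a></body></html>',
      '<html><body><p>No insert link here</p></body></html>'
    )
    docs <- lapply(pages, htmlParse, asText = TRUE)

    # Return one row per matching node; a page without matches yields zero rows
    extract_ppi <- function(doc, id) {
      nodes <- getNodeSet(doc, "//*[contains(@title, 'Patient Package Insert')]")
      get_attr <- function(name) {
        vapply(nodes, xmlGetAttr, character(1), name = name, default = NA_character_)
      }
      data.frame(
        doc   = rep(id, length(nodes)),
        title = get_attr("title"),
        href  = get_attr("href"),
        class = get_attr("class"),
        stringsAsFactors = FALSE
      )
    }

    ppi <- do.call(rbind, Map(extract_ppi, docs, seq_along(docs)))

    # Number of documents with at least one matching link
    n_with_link <- length(unique(ppi$doc))
    ```

    Here `table(ppi$class)` would show how the `class` values split across documents. What the two class values actually signify is not established in this thread, so that still needs checking against the pages themselves.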

  • 2021-01-26 16:45

    If you look at the full parsed HTML structure that is returned, rather than just the grep-style match, you can find:

    $body$div$div$fieldset$div$ul$li$a$.attrs
                       title                     href                    class 
    "Patient Package Insert"            "#nlm42230-3"           "nlmlinkfalse" 
    

    ... but the item above it has a class value of "nlmlinktrue". So you may need to go through all of the (unfortunately unnamed) $body$div$div$fieldset$div$ul$li$a$text nodes to find the "Patient Package Insert" item, and then check the class value in its $body$div$div$fieldset$div$ul$li$a$.attrs.
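
    One way to avoid walking the list by position is to select the link by its text with XPath and read its class directly. This is only a sketch: the HTML fragment below is a hypothetical stand-in (the "Description" item and its href are invented for illustration), and it assumes the document is parsed with `htmlParse` rather than converted to a list.

    ```r
    library(XML)

    # Hypothetical fragment: two sibling links, only one is the insert link
    html <- '<html><body><ul>
      <li><a title="Description" href="#other-section" class="nlmlinktrue">Description</a></li>
      <li><a title="Patient Package Insert" href="#nlm42230-3" class="nlmlinkfalse">Patient Package Insert</a></li>
    </ul></body></html>'
    doc <- htmlParse(html, asText = TRUE)

    # Select <a> nodes by their visible text rather than by list position
    links <- getNodeSet(doc, "//a[normalize-space(.) = 'Patient Package Insert']")

    # Read the class attribute of each matching link
    ppi_class <- vapply(links, xmlGetAttr, character(1),
                        name = "class", default = NA_character_)
    ppi_class
    ```

    Matching on the text rather than on `@title` keeps the check tied to what the page displays, which may matter if the two ever differ.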

    When I do that by hand on the third item I get:

    Data$insert[[3]]$body[14]$div[12]$div[2]$fieldset[3]$div[2]$ul[27]$li[2]
    
    $a
    $a$text
    [1] "Patient Package Insert"
    
    $a$.attrs
                       title                     href                    class 
    "Patient Package Insert"            "#nlm42230-3"           "nlmlinkfalse" 
    

    To do it by hand, you can use grep at each level to find the next node that contains "Package Insert":

    > head(sapply(Data$insert[[3]], FUN = grep, pattern = "Package Insert"))
    $head
    integer(0)
    
    $body
    [1] 14
    
    $.attrs
    integer(0)
    
    > head(sapply(Data$insert[[3]]$body[14], FUN = grep, pattern = "Package Insert"))
    div 
     12 
    > 
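
    The level-by-level grep can also be automated with a small recursive search. The sketch below assumes the data is a nested list of the kind shown above; the `tree` object and the `find_path` helper are hypothetical and only illustrate the idea.

    ```r
    # Hypothetical nested list shaped like the parsed label above
    tree <- list(
      head = list(title = "Label"),
      body = list(
        div = list(p = "Intro"),
        div = list(ul = list(
          li = list(a = list(text = "Description")),
          li = list(a = list(
            text   = "Patient Package Insert",
            .attrs = c(title = "Patient Package Insert", class = "nlmlinkfalse")
          ))
        ))
      )
    )

    # Collect the index path to every leaf that contains the pattern
    find_path <- function(x, pattern, path = integer(0)) {
      if (!is.list(x)) {
        return(if (any(grepl(pattern, x, fixed = TRUE))) list(path) else list())
      }
      hits <- list()
      for (i in seq_along(x)) {
        hits <- c(hits, find_path(x[[i]], pattern, c(path, i)))
      }
      hits
    }

    paths <- find_path(tree, "Package Insert")

    # An index path can be used directly for recursive [[ ]] indexing
    tree[[paths[[1]]]]
    ```

    Each element of `paths` is an integer vector of positions, so it works even though the repeated `div` and `li` names cannot be used to tell siblings apart.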
    