Scrape values from HTML select/option tags in R

长情又很酷 2021-01-21 18:59

I'm trying (fairly unsuccessfully) to scrape some data from a website (www.majidata.co.ke) using R. I've managed to scrape the HTML and parse it but now a little unsure how to

2 answers
  • 2021-01-21 19:29

    The very new rvest package makes quick work of this and lets you use sane CSS selectors, too.

    UPDATED: now incorporates the second request (see comments below).

    library(rvest)
    library(dplyr)
    
    # gets data from the second popup
    # returns a data frame of town_id, town_name, area_id, area_name
    addArea <- function(town_id, town_name) {
    
      # make the AJAX URL and grab the data
      url <- sprintf("http://www.majidata.go.ke/ajax-list-area.php?reg=towns&type=projects&id=%s",
                     town_id)
      subunits <- read_html(url)  # read_html() replaces the deprecated html()
    
      # reformat into a data frame with the town data
      data.frame(town_id=town_id,
                 town_name=town_name,
                 area_id=subunits %>% html_nodes("option") %>% html_attr("value"),
                 area_name=subunits %>% html_nodes("option") %>% html_text(),
                 stringsAsFactors=FALSE)[-1,]
    
    }
    
    # get data from the first popup and put it into a data frame
    # (read_html() replaces the deprecated html())
    majidata <- read_html("http://www.majidata.go.ke/town.php?MID=MTE=&SMID=MTM=")
    maji <- data.frame(town_id=majidata %>% html_nodes("#town option") %>% html_attr("value"),
                       town_name=majidata %>% html_nodes("#town option") %>% html_text(),
                       stringsAsFactors=FALSE)[-1,]
    
    # pass in the name and id to our addArea function and make the result into
    # a data frame with all the data (town and area)
    combined <- do.call("rbind.data.frame",
                        mapply(addArea, maji$town_id,  maji$town_name,
                               SIMPLIFY=FALSE, USE.NAMES=FALSE))
    
    # row names aren't super-important, but let's keep them tidy
    rownames(combined) <- NULL
    
    str(combined)
    
    ## 'data.frame':    1964 obs. of  4 variables:
    ##  $ town_id  : chr  "611" "635" "625" "628" ...
    ##  $ town_name: chr  "AHERO" "AKALA" "AWASI" "AWENDO" ...
    ##  $ area_id  : chr  "60603030101" "60107050201" "60603020101" "61103040101" ...
    ##  $ area_name: chr  "AHERO" "AKALA" "AWASI" "ANINDO" ...
    
    
    head(combined)
    
    ##   town_id town_name     area_id area_name
    ## 1     611     AHERO 60603030101     AHERO
    ## 2     635     AKALA 60107050201     AKALA
    ## 3     625     AWASI 60603020101     AWASI
    ## 4     628    AWENDO 61103040101    ANINDO
    ## 5     628    AWENDO 61103050401      SARE
    ## 6     749    BAHATI 73101010101    BAHATI
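
    To try the same pattern without hitting the site, here is a minimal offline sketch. The HTML strings and the `read_area()` helper are made up for illustration (they are not the real responses from the site); only the selector and the `mapply()` / `do.call(rbind, ...)` logic mirror the code above.

    ```r
    library(rvest)

    # Hypothetical stand-in for the first popup (not the real page)
    town_html <- '<select id="town">
      <option value="0">[SELECT TOWN]</option>
      <option value="628">AWENDO</option>
      <option value="749">BAHATI</option>
    </select>'

    # Hypothetical stand-ins for the per-town AJAX responses, keyed by town id
    area_html <- list(
      "628" = '<select><option value="0">[SELECT AREA]</option>
               <option value="61103040101">ANINDO</option>
               <option value="61103050401">SARE</option></select>',
      "749" = '<select><option value="0">[SELECT AREA]</option>
               <option value="73101010101">BAHATI</option></select>'
    )

    # Hypothetical helper: same shape as addArea(), but reads from area_html
    read_area <- function(town_id, town_name) {
      opts <- html_nodes(read_html(area_html[[town_id]]), "option")
      data.frame(town_id = town_id,
                 town_name = town_name,
                 area_id = html_attr(opts, "value"),
                 area_name = html_text(opts),
                 stringsAsFactors = FALSE)[-1, ]  # drop the placeholder row
    }

    # Towns from the first select, minus the placeholder option
    town_opts <- html_nodes(read_html(town_html), "#town option")
    towns <- data.frame(town_id = html_attr(town_opts, "value"),
                        town_name = html_text(town_opts),
                        stringsAsFactors = FALSE)[-1, ]

    # One data frame per town, stacked into a single result
    combined <- do.call(rbind,
                        mapply(read_area, towns$town_id, towns$town_name,
                               SIMPLIFY = FALSE, USE.NAMES = FALSE))
    rownames(combined) <- NULL
    combined
    ```

    `SIMPLIFY=FALSE` matters here: it keeps `mapply()` returning a list of data frames, which is what `rbind` needs.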
    
  • 2021-01-21 19:30

    Using XPath expressions with HTML is almost always a better choice than regex. Given this data, you can extract what you're after with

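    library(XML)  # provides getNodeSet, xmlRoot, xmlGetAttr, xmlValue

    # majidata_html is assumed to be the already-parsed document from the question
    options <- getNodeSet(xmlRoot(majidata_html), "//select[@id='town']/option")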
    
    ids <- sapply(options, xmlGetAttr, "value")
    names <- sapply(options, xmlValue)
    
    data.frame(ID=ids, Name=names)
    

    which returns

       ID          Name
    1   0 [SELECT TOWN]
    2 611         AHERO
    3 635         AKALA
    4 625         AWASI
    5 628        AWENDO
    6 749        BAHATI
    ...
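
    If the `[SELECT TOWN]` placeholder is unwanted, the XPath itself can filter it out. This is a sketch on a made-up snippet (not the real page), and it assumes the placeholder is the option whose value is "0", as in the output above.

    ```r
    library(XML)

    # Hypothetical snippet standing in for the parsed page
    snippet <- '<select id="town">
      <option value="0">[SELECT TOWN]</option>
      <option value="611">AHERO</option>
      <option value="635">AKALA</option>
    </select>'

    doc <- htmlParse(snippet, asText = TRUE)

    # Predicate [@value!='0'] skips the placeholder option
    opts <- getNodeSet(doc, "//select[@id='town']/option[@value!='0']")

    towns <- data.frame(ID = sapply(opts, xmlGetAttr, "value"),
                        Name = sapply(opts, xmlValue),
                        stringsAsFactors = FALSE)
    towns
    ```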
    