I'm trying (fairly unsuccessfully) to scrape some data from a website (www.majidata.co.ke) using R. I've managed to scrape the HTML and parse it but now a little unsure how to
The very new rvest package makes quick work of this and lets you use sane CSS selectors, too.
UPDATED: now incorporates the second request (see comments below).
library(rvest)
library(dplyr)
# gets data from the second popup
# returns a data frame of town_id, town_name, area_id, area_name
addArea <- function(town_id, town_name) {
  # make the AJAX URL and grab the data
  url <- sprintf("http://www.majidata.go.ke/ajax-list-area.php?reg=towns&type=projects&id=%s",
                 town_id)
  # read_html() replaces html(), which was deprecated in later rvest releases
  subunits <- read_html(url)
  # reformat into a data frame with the town data;
  # [-1,] drops the first <option>, which is the placeholder entry
  data.frame(town_id=town_id,
             town_name=town_name,
             area_id=subunits %>% html_nodes("option") %>% html_attr("value"),
             area_name=subunits %>% html_nodes("option") %>% html_text(),
             stringsAsFactors=FALSE)[-1,]
}
# get data from the first popup and put it into a data frame
# (read_html() replaces html(), which was deprecated in later rvest releases)
majidata <- read_html("http://www.majidata.go.ke/town.php?MID=MTE=&SMID=MTM=")
maji <- data.frame(town_id=majidata %>% html_nodes("#town option") %>% html_attr("value"),
                   town_name=majidata %>% html_nodes("#town option") %>% html_text(),
                   stringsAsFactors=FALSE)[-1,]
# pass in the name and id to our addArea function and make the result into
# a data frame with all the data (town and area)
combined <- do.call("rbind.data.frame",
                    mapply(addArea, maji$town_id, maji$town_name,
                           SIMPLIFY=FALSE, USE.NAMES=FALSE))
# row names aren't super-important, but let's keep them tidy
rownames(combined) <- NULL
str(combined)
## 'data.frame': 1964 obs. of 4 variables:
## $ town_id : chr "611" "635" "625" "628" ...
## $ town_name: chr "AHERO" "AKALA" "AWASI" "AWENDO" ...
## $ area_id : chr "60603030101" "60107050201" "60603020101" "61103040101" ...
## $ area_name: chr "AHERO" "AKALA" "AWASI" "ANINDO" ...
head(combined)
##   town_id town_name     area_id area_name
## 1     611     AHERO 60603030101     AHERO
## 2     635     AKALA 60107050201     AKALA
## 3     625     AWASI 60603020101     AWASI
## 4     628    AWENDO 61103040101    ANINDO
## 5     628    AWENDO 61103050401      SARE
## 6     749    BAHATI 73101010101    BAHATI
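To try the selector logic without hitting the site, here is a minimal offline sketch. The `snippet` string is a hand-written stand-in for the page's first dropdown (it is not fetched from majidata), and it assumes an rvest version that provides `read_html()`:

```r
library(rvest)

# Hypothetical stand-in for the "#town" dropdown; not real page source
snippet <- '<select id="town">
  <option value="0">[SELECT TOWN]</option>
  <option value="611">AHERO</option>
  <option value="635">AKALA</option>
</select>'

page <- read_html(snippet)

# the CSS selector matches every <option> inside the element with id "town"
opts <- html_nodes(page, "#town option")

# build the data frame, then drop the first row (the placeholder option)
towns <- data.frame(town_id = html_attr(opts, "value"),
                    town_name = html_text(opts),
                    stringsAsFactors = FALSE)[-1, ]
rownames(towns) <- NULL
towns
```

The `[-1, ]` step is the same trick used above: it relies on the placeholder always being the first `<option>`.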
Using XPath expressions with HTML is almost always a better choice than regex. Given this data, you can extract what you're after with:
library(XML)
# majidata_html is assumed to be the page already parsed with the XML package
options <- getNodeSet(xmlRoot(majidata_html), "//select[@id='town']/option")
ids <- sapply(options, xmlGetAttr, "value")
names <- sapply(options, xmlValue)
data.frame(ID=ids, Name=names)
which returns
   ID          Name
1   0 [SELECT TOWN]
2 611         AHERO
3 635         AKALA
4 625         AWASI
5 628        AWENDO
6 749        BAHATI
...
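The output above still includes the `[SELECT TOWN]` placeholder. One way to skip it is to filter inside the XPath expression itself. The sketch below is only an illustration: `snippet` is a hand-written stand-in for the dropdown, and it assumes the placeholder always carries `value="0"`, as in the output shown.

```r
library(XML)

# Hypothetical stand-in for the "town" dropdown; not real page source
snippet <- '<select id="town">
  <option value="0">[SELECT TOWN]</option>
  <option value="611">AHERO</option>
  <option value="635">AKALA</option>
</select>'

# asText = TRUE tells htmlParse() the argument is markup, not a file or URL
doc <- htmlParse(snippet, asText = TRUE)

# the predicate [@value != '0'] leaves out the placeholder option
opts <- getNodeSet(doc, "//select[@id='town']/option[@value != '0']")

towns <- data.frame(ID = sapply(opts, xmlGetAttr, "value"),
                    Name = sapply(opts, xmlValue),
                    stringsAsFactors = FALSE)
towns
```

Filtering in the XPath keeps the intent visible in one place, whereas dropping the first row by position depends on the placeholder's location in the list.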