As an intern in an economic research team, I was tasked with finding a way to automatically collect specific data from a real estate ad website using R.
I assume that
You can use the XML package in R to scrape this data. Here is a piece of code that should help.
# DEFINE UTILITY FUNCTIONS
library(XML)

# Function to Get Links to Ads by Page
get_ad_links = function(page){
  # construct url to page
  url_base = "http://www.leboncoin.fr/ventes_immobilieres/offres/nord_pas_de_calais/"
  url = paste(url_base, "?o=", page, "&zz=", 59000, sep = "")
  # parse page into an html tree (separate name, so `page` is not overwritten)
  doc = htmlTreeParse(url, useInternalNodes = TRUE)
  # extract links to ads on page
  xp_exp = "//td/a[contains(@href, 'ventes_immobilieres')]"
  ad_links = xpathSApply(doc, xp_exp, xmlGetAttr, "href")
  return(ad_links)
}

# Function to Get Ad Details by Ad URL
get_ad_details = function(ad_url){
  # parse ad url to html tree
  doc = htmlTreeParse(ad_url, useInternalNodes = TRUE)
  # extract labels and values using xpath expression
  labels = xpathSApply(doc, "//span[contains(@class, 'ad')]/label", xmlValue)
  values1 = xpathSApply(doc, "//span[contains(@class, 'ad')]/strong", xmlValue)
  values2 = xpathSApply(doc, "//span[contains(@class, 'ad')]//a", xmlValue)
  values = c(values1, values2)
  # convert to a one-row data frame and add labels
  # (this assumes labels and values come back in matching order and number)
  mydf = as.data.frame(t(values))
  names(mydf) = labels
  return(mydf)
}
Here is how you would use these functions to extract information into a data frame.
# grab ad links from page 1
ad_links = get_ad_links(page = 1)
# grab ad details for first 5 links from page 1
library(plyr)
ad_details = ldply(ad_links[1:5], get_ad_details, .progress = 'text')
This returns the following output:
Prix :     Ville :      Frais d'agence inclus :  Type de bien :  Pièces :  Surface :  Classe énergie :  GES :
469 000 €  59000 Lille  Oui                      Maison          8         250 m2     F (de 331 à 450)
469 000 €  59000 Lille  Oui                      Maison          8         250 m2     F (de 331 à 450)
140 000 €  59000 Lille                           Appartement     2         50 m2      D (de 151 à 230)  E (de 36 à 55)
140 000 €  59000 Lille                           Appartement     2         50 m2      D (de 151 à 230)  E (de 36 à 55)
170 000 €  59000 Lille                           Appartement               50 m2      D (de 151 à 230)  D (de 21 à 35)
You can easily use the apply family of functions to loop over multiple pages to get details of all ads. Be mindful of two things. First, check the legality of scraping the website. Second, use Sys.sleep in your looping function so that the servers are not bombarded with requests.
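As a rough sketch of that loop, something like the following might work. The helper name scrape_pages and its arguments (get_links, get_details, delay) are my own for illustration and are not part of any package; the two functions are passed in as arguments so the loop can be tried with stand-ins before hitting the real site. The one-second default pause is an arbitrary assumption, not a value recommended by the website.

```r
# Hypothetical helper (illustrative only): loop over result pages,
# pausing before every ad request so the server is not flooded.
scrape_pages = function(pages, get_links, get_details, delay = 1){
  per_page = lapply(pages, function(p){
    # drop repeated links so each ad is fetched only once per page
    links = unique(get_links(p))
    lapply(links, function(link){
      Sys.sleep(delay)      # wait `delay` seconds before each request
      get_details(link)
    })
  })
  # flatten the list of pages into a single list of one-row data frames
  do.call(c, per_page)
}
```

With the functions defined earlier, usage could look like ad_list = scrape_pages(1:3, get_ad_links, get_ad_details, delay = 2), followed by ad_details = plyr::rbind.fill(ad_list) to stack the rows while filling columns that are missing in some ads with NA.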
Let me know how this works.