Web scraping with R over real estate ads

后端未结


As an intern in an economic research team, I was asked to find a way to automatically collect specific data from a real estate ad website using R.

I assume that


2 Answers

不知归路

2021-01-31 13:07

You can use the XML package in R to scrape this data. Here is a piece of code that should help.

# DEFINE UTILITY FUNCTIONS

# Function to Get Links to Ads by Page
get_ad_links = function(page){
  library(XML)  # library() fails loudly if XML is missing; require() only warns
  # construct url to page
  url_base = "http://www.leboncoin.fr/ventes_immobilieres/offres/nord_pas_de_calais/"
  url      = paste(url_base, "?o=", page, "&zz=", 59000, sep = "")
  # parse into a separate variable instead of overwriting the page number
  doc      = htmlTreeParse(url, useInternalNodes = TRUE)

  # extract links to ads on page
  xp_exp   = "//td/a[contains(@href, 'ventes_immobilieres')]"
  ad_links = xpathSApply(doc, xp_exp, xmlGetAttr, "href")
  return(ad_links)
}

# Function to Get Ad Details by Ad URL
get_ad_details = function(ad_url){
   library(XML)
   # parse ad url to html tree
   doc = htmlTreeParse(ad_url, useInternalNodes = TRUE)

   # extract labels and values using xpath expression
   labels  = xpathSApply(doc, "//span[contains(@class, 'ad')]/label", xmlValue)
   values1 = xpathSApply(doc, "//span[contains(@class, 'ad')]/strong", xmlValue)
   values2 = xpathSApply(doc, "//span[contains(@class, 'ad')]//a", xmlValue)
   values  = c(values1, values2)

   # convert to a one-row data frame and use the labels as column names
   mydf        = as.data.frame(t(values), stringsAsFactors = FALSE)
   names(mydf) = labels
   return(mydf)
}

Here is how you would use these functions to extract information into a data frame.

# grab ad links from page 1
ad_links = get_ad_links(page = 1)

# grab ad details for first 5 links from page 1
# (head() avoids NA entries if the page has fewer than 5 links)
library(plyr)
ad_details = ldply(head(ad_links, 5), get_ad_details, .progress = 'text')

This returns the following output:

Prix :     Ville :  Frais d'agence inclus :  Type de bien :  Pièces :  Surface :  Classe énergie :          GES : 
469 000 € 59000 Lille                      Oui          Maison         8     250 m2  F (de 331 à 450)           <NA>
469 000 € 59000 Lille                      Oui          Maison         8     250 m2  F (de 331 à 450)           <NA>
140 000 € 59000 Lille                     <NA>     Appartement         2      50 m2  D (de 151 à 230) E (de 36 à 55)
140 000 € 59000 Lille                     <NA>     Appartement         2      50 m2  D (de 151 à 230) E (de 36 à 55)
170 000 € 59000 Lille                     <NA>     Appartement      <NA>      50 m2  D (de 151 à 230) D (de 21 à 35)
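
All of these values come back as text, so prices and surfaces need converting before any analysis. Here is a minimal sketch of one way to do that, using a small stand-in data frame with values copied from the output above. The column names (prix, surface, prix_eur, surface_m2, prix_per_m2) and the helper to_number are made up for the illustration; the scraped data frame uses the French labels as column names.

```r
# Stand-in for two columns of the scraped data frame
# (values copied from the output above; column names are hypothetical)
ads <- data.frame(
  prix    = c("469 000 \u20ac", "140 000 \u20ac"),
  surface = c("250 m2", "50 m2"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop everything that is not a digit, then convert
to_number <- function(x) as.numeric(gsub("[^0-9]", "", x))

ads$prix_eur <- to_number(ads$prix)

# Strip the unit first, otherwise the "2" of "m2" would be kept as a digit
ads$surface_m2 <- to_number(sub("\\s*m2$", "", ads$surface))

ads$prix_per_m2 <- ads$prix_eur / ads$surface_m2
```

Removing every non-digit character also takes care of the thousands separator, whether it is a plain or a non-breaking space.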

You can use the apply family of functions to loop over multiple pages and get the details of all ads. Two things to be mindful of: first, check whether scraping the website is legal; second, call Sys.sleep inside your looping function so that the servers are not bombarded with requests.
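
As a rough sketch of such a loop: the helpers scrape_pages and bind_fill below are hypothetical names, not part of any package, and the two-second default delay is an arbitrary assumption. The link and detail functions are passed in as arguments so the loop can be tried offline with stand-ins; in real use you would pass get_ad_links and get_ad_details.

```r
# Hypothetical helper: visit each page, then each ad, pausing between requests
scrape_pages <- function(pages, get_links, get_details, delay = 2) {
  results <- list()
  for (p in pages) {
    links <- unique(get_links(p))   # the same ad may be linked more than once
    for (link in links) {
      # a failed request yields NULL instead of aborting the whole run
      ad <- tryCatch(get_details(link), error = function(e) NULL)
      if (!is.null(ad)) results[[length(results) + 1]] <- ad
      Sys.sleep(delay)              # be polite to the server
    }
  }
  results
}

# Hypothetical helper: row-bind data frames whose columns differ,
# filling the missing columns with NA
bind_fill <- function(dfs) {
  cols <- unique(unlist(lapply(dfs, names)))
  filled <- lapply(dfs, function(d) {
    for (col in setdiff(cols, names(d))) d[[col]] <- NA
    d[cols]
  })
  do.call(rbind, filled)
}

# Offline demonstration with stand-in functions (no network access)
demo_links   <- function(p) paste0("ad_", p, "_", c(1, 1, 2))
demo_details <- function(u) data.frame(url = u, stringsAsFactors = FALSE)
demo <- bind_fill(scrape_pages(1:2, demo_links, demo_details, delay = 0))
```

With the functions from this answer, the call would look like scrape_pages(1:3, get_ad_links, get_ad_details). Since plyr is already loaded, rbind.fill from that package could replace bind_fill.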

Let me know how this works


离开以前

2021-01-31 13:10

That's quite a big question, so you need to break it down into smaller ones, and see which bits you get stuck on.

Is the problem with retrieving a web page? (Watch out for proxy server issues.) Or is the tricky bit extracting the useful pieces of data from it? (You'll probably need to use XPath for this.)

Take a look at the web-scraping example on Rosetta Code and browse these SO questions for more information.
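
To work on the extraction step separately from the retrieval step, you can try XPath expressions on an HTML string held in memory, with no network access at all. The markup below is made up for the illustration; it only imitates the structure that the XPath expressions in the first answer assume, and the real page may differ.

```r
library(XML)

# Made-up markup imitating the label/value structure of an ad page
html <- '<html><body>
  <span class="ad"><label>Prix :</label><strong>140 000</strong></span>
  <span class="ad"><label>Ville :</label><a href="#">59000 Lille</a></span>
  <span class="ad"><label>Surface :</label><strong>50 m2</strong></span>
</body></html>'

# asText = TRUE tells htmlParse that the argument is content, not a file or URL
doc <- htmlParse(html, asText = TRUE)

labels <- xpathSApply(doc, "//span[contains(@class, 'ad')]/label", xmlValue)

# A union (|) of two paths in a single query; the matched nodes come back
# in document order, so each value lines up with its label
values <- xpathSApply(
  doc,
  "//span[contains(@class, 'ad')]/strong | //span[contains(@class, 'ad')]/a",
  xmlValue
)

# One-row data frame with the labels as column names
ad <- setNames(as.data.frame(t(values), stringsAsFactors = FALSE), labels)
```

Once the expressions work on a saved copy of a page, any remaining failure is on the retrieval side (URL, proxy, encoding).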
