Web scraping with R over real estate ads

北海茫月 2021-01-31 12:18

As an intern in an economic research team, I was tasked with finding a way to automatically collect specific data from a real estate ad website using R.

I assume that

2 answers
  • 2021-01-31 13:07

    You can use the XML package in R to scrape this data. Here is a piece of code that should help.

    # DEFINE UTILITY FUNCTIONS
    
    # Function to Get Links to Ads by Page
    get_ad_links = function(page){
      library(XML)  # fails loudly if the package is missing, unlike require()
      # construct url to page
      url_base = "http://www.leboncoin.fr/ventes_immobilieres/offres/nord_pas_de_calais/"
      url      = paste0(url_base, "?o=", page, "&zz=", 59000)
      # parse into a separate variable rather than overwriting the page number
      doc      = htmlTreeParse(url, useInternalNodes = TRUE)
    
      # extract links to ads on page
      xp_exp   = "//td/a[contains(@href, 'ventes_immobilieres')]"
      ad_links = xpathSApply(doc, xp_exp, xmlGetAttr, "href")
      return(ad_links)
    }
    
    # Function to Get Ad Details by Ad URL
    get_ad_details = function(ad_url){
      library(XML)  # fails loudly if the package is missing, unlike require()
      # parse ad url to html tree
      doc = htmlTreeParse(ad_url, useInternalNodes = TRUE)
    
      # extract labels and values using xpath expression
      labels  = xpathSApply(doc, "//span[contains(@class, 'ad')]/label", xmlValue)
      values1 = xpathSApply(doc, "//span[contains(@class, 'ad')]/strong", xmlValue)
      values2 = xpathSApply(doc, "//span[contains(@class, 'ad')]//a", xmlValue)
      # note: this assumes the <strong> values precede the <a> values in label order
      values  = c(values1, values2)
    
      # convert to a one-row data frame and add labels
      mydf        = as.data.frame(t(values))
      names(mydf) = labels
      return(mydf)
    }
    

    Here is how you would use these functions to extract information into a data frame.

    # grab ad links from page 1
    ad_links = get_ad_links(page = 1)
    
    # grab ad details for first 5 links from page 1
    library(plyr)
    ad_details = ldply(ad_links[1:5], get_ad_details, .progress = 'text')
    

    This returns the following output:

    Prix :     Ville :  Frais d'agence inclus :  Type de bien :  Pièces :  Surface :  Classe énergie :          GES : 
    469 000 € 59000 Lille                      Oui          Maison         8     250 m2  F (de 331 à 450)           <NA>
    469 000 € 59000 Lille                      Oui          Maison         8     250 m2  F (de 331 à 450)           <NA>
    140 000 € 59000 Lille                     <NA>     Appartement         2      50 m2  D (de 151 à 230) E (de 36 à 55)
    140 000 € 59000 Lille                     <NA>     Appartement         2      50 m2  D (de 151 à 230) E (de 36 à 55)
    170 000 € 59000 Lille                     <NA>     Appartement      <NA>      50 m2  D (de 151 à 230) D (de 21 à 35)
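
    All of the scraped values come back as text, so numeric analysis needs a conversion step. The helpers below are a minimal sketch of that step, not part of the original answer: `parse_price` and `parse_surface` are hypothetical names, and the code assumes prices and surfaces are whole numbers formatted as in the output above.

    ```r
    # Hypothetical helpers (assumption: whole-number values, as in the output above)
    parse_price = function(x){
      # drop everything except digits, e.g. spaces and the currency sign
      as.numeric(gsub("[^0-9]", "", x))
    }

    parse_surface = function(x){
      # remove the trailing "m2" unit first so that its digit is not kept
      as.numeric(gsub("[^0-9]", "", sub("m2\\s*$", "", x)))
    }

    # "\u20ac" is the euro sign; missing values stay NA
    prices   = parse_price(c("469 000 \u20ac", "140 000 \u20ac", NA))
    surfaces = parse_surface(c("250 m2", "50 m2"))
    ```

    Both functions are vectorised, so they can be applied directly to the price and surface columns of the data frame.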
    

    You can use the apply family of functions to loop over multiple pages and collect the details of all ads. Two things to be mindful of: first, check whether scraping the website is legal; second, call Sys.sleep inside your loop so that the servers are not bombarded with requests.
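
    A paced loop over several pages might look like the sketch below. `scrape_pages` is a hypothetical helper, not part of the XML or plyr packages. It takes the link and detail functions as arguments, so `get_ad_links` and `get_ad_details` from above can be passed in, and the one-second default delay is an arbitrary assumption.

    ```r
    # Hypothetical helper: visit each page, then each ad, pausing between requests
    scrape_pages = function(pages, get_links, get_details, delay = 1){
      results = list()
      for (p in pages){
        links = unique(get_links(p))   # drop duplicate links found on a page
        for (link in links){
          results[[length(results) + 1]] = get_details(link)
          Sys.sleep(delay)             # pause so the server is not bombarded
        }
      }
      results                          # a list of one-row data frames
    }
    ```

    For example, `ads = scrape_pages(1:3, get_ad_links, get_ad_details, delay = 2)` followed by `plyr::rbind.fill(ads)` would combine the rows while filling missing columns with NA.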

    Let me know how this works

  • 2021-01-31 13:10

    That's quite a big question, so you need to break it down into smaller ones, and see which bits you get stuck on.

    Is the problem with retrieving a web page? (Watch out for proxy server issues.) Or is the tricky bit extracting the useful pieces of data from it? (You'll probably need to use XPath for this.)
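
    One way to separate the two problems is to test the extraction step on a page that is already in hand. The sketch below uses the XML package on an inline HTML fragment. The fragment is made up for illustration and only assumed to resemble an ad page.

    ```r
    library(XML)

    # Hypothetical HTML fragment standing in for a downloaded ad page
    html = '<html><body>
      <span class="ad"><label>Prix :</label><strong>140 000</strong></span>
      <span class="ad"><label>Ville :</label><a href="#">59000 Lille</a></span>
    </body></html>'

    # Retrieval is skipped here: parse the text directly instead of a URL
    doc = htmlParse(html, asText = TRUE)

    # Extraction: XPath expressions pick out the nodes of interest
    labels = xpathSApply(doc, "//span[@class='ad']/label", xmlValue)
    prices = xpathSApply(doc, "//span[@class='ad']/strong", xmlValue)
    ```

    If the XPath expressions work on a saved or inline page but the full script still fails, the problem is more likely in retrieving the page than in extracting from it.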

    Take a look at the web-scraping example on Rosetta Code, and browse related Stack Overflow questions for more information.
