Web scraping and looping through pages with R

余生分开走 2021-01-13 12:06

I am learning data scraping and, on top of that, I am quite a beginner with R (for work I use Stata; I use R only for very specific tasks). In order to learn scraping, I am

2 answers
  • 2021-01-13 12:40

    Consider several adjustments:

    • Adjust the function to receive a URL parameter. Right now, profilescrape is not used anywhere in the function, so the function takes whatever URL is assigned in the global environment.

      getProfile <- function(URL) { 
         ...
      }
      
    • Adjust the ending of the function to return the needed object. Without an explicit return, R returns the value of the last expression evaluated, and str() only prints its summary and returns NULL. Therefore, replace str(onet_df) with return(onet_df).

    • Build the URL dynamically inside the loop and pass it to the function as an argument:

      URL <- paste0(...) 
      record_profile <- getProfile(URL)
      
    • Initialize a list of the specified length (2 x 20) before the loop. Then, on each iteration, assign to the next index rather than growing the object inside the loop, which is memory inefficient.

      MHP_codes <- c(324585, 449807)  #therapist identifier 
      withinpage_codes <- c(1:20)     #therapist running number 
      
      df_list <- vector(mode = "list",
                        length = length(MHP_codes) * length(withinpage_codes))
      
      j <- 1
      for(code1 in withinpage_codes) { 
          for(code2 in MHP_codes) {
              URL <- paste0('https://www.psychologytoday.com/us/therapists/illinois/', code2, '?sid=5d87f874630bd&ref=', code1, '&rec_next=1&tr=NextProf') 
              df_list[[j]] <- tryCatch(getProfile(URL), 
                                       error = function(e) NULL)
              j <- j + 1 
          } 
      }
      
    • Call rbind.fill (from the plyr package) once, outside the loop, to combine all data frames together:

      final_df <- rbind.fill(df_list)
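
The adjustments above can be tried end to end without touching the network. In this minimal sketch, `fake_profile` is a hypothetical stand-in for `getProfile` (it builds a one-row data frame and fails on purpose for one ID), and base R's `do.call(rbind, ...)` is swapped in for `rbind.fill` so the example runs without plyr:

```r
# Hypothetical stand-in for getProfile(): it takes its input as a parameter
# and returns a data frame explicitly. It fails for one ID on purpose.
fake_profile <- function(id) {
  if (id == 2) stop("profile not found")
  onet_df <- data.frame(ID = id, Name = paste("Therapist", id))
  return(onet_df)
}

ids <- 1:4

# Preallocate the list once, then fill it by index inside the loop
df_list <- vector(mode = "list", length = length(ids))

j <- 1
for (id in ids) {
  # A failed profile yields NULL instead of stopping the loop
  df_list[[j]] <- tryCatch(fake_profile(id), error = function(e) NULL)
  j <- j + 1
}

# Combine once, after the loop, skipping the failed (NULL) entries
final_df <- do.call(rbind, Filter(Negate(is.null), df_list))
print(final_df)
```

The failing ID simply drops out of `final_df`; the other three profiles are still bound together.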
      

    With that said, consider an apply-family solution, specifically Map (a wrapper to mapply). Doing so avoids the bookkeeping of initializing a list and an incrementing index variable, and "hides" the loop in a compact statement.

    # ALL POSSIBLE PAIRINGS
    web_codes_df <- expand.grid(MHP_codes = c(324585, 449807),
                                withinpage_codes = c(1:20))
    
    # MOVE URL ASSIGNMENT INSIDE FUNCTION
    getProfile <- function(code1, code2) { 
       URL <- paste0('https://www.psychologytoday.com/us/therapists/illinois/', code2, '?sid=5d87f874630bd&ref=', code1, '&rec_next=1&tr=NextProf')
    
        # ...same code as before...
    }
    
    # ELEMENT-WISE LOOP PASSING PARAMS IN PARALLEL TO FUNCTION
    df_list <- Map(function(code1, code2) tryCatch(getProfile(code1, code2), 
                                                   error = function(e) NULL),
                   code1 = web_codes_df$withinpage_codes,   # running number, used as ref= in the URL
                   code2 = web_codes_df$MHP_codes)          # therapist identifier, used in the URL path
    
    final_df <- rbind.fill(df_list)
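
To see how Map pairs its arguments, the sketch below replaces the scraping step with a hypothetical `describe_pair` function that only records the two codes it receives, so no request is made. Following the loop version earlier in this answer, `code1` is taken to be the within-page running number and `code2` the therapist identifier. Base R's `do.call(rbind, ...)` stands in for `rbind.fill` so that the example needs no packages:

```r
# All pairings of the two code vectors: 2 x 20 = 40 rows
web_codes_df <- expand.grid(MHP_codes = c(324585, 449807),
                            withinpage_codes = 1:20)

# Hypothetical stand-in for getProfile(): records the pair it was given
describe_pair <- function(code1, code2) {
  data.frame(ref = code1, therapist_id = code2)
}

# Map walks both vectors in parallel: call i receives element i of each
df_list <- Map(function(code1, code2) tryCatch(describe_pair(code1, code2),
                                               error = function(e) NULL),
               code1 = web_codes_df$withinpage_codes,
               code2 = web_codes_df$MHP_codes)

final_df <- do.call(rbind, df_list)
print(head(final_df, 3))
```

Each row of the `expand.grid` result becomes exactly one call, so the list needs no preallocation and no counter.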
    
  • 2021-01-13 12:50

    One of the users, Parfait, helped me sort out the issues, so a very big thank you goes to this user. Below I post the script. I apologize if it is not precisely commented.

    Here is the code.

    # Load packages
    library('rvest') # to scrape
    library('xml2')  # to handle missing values (this works with html_node, not with html_nodes)
    library('plyr')  # to bind together different data sets
    
    # Check and set the working directory
    getwd()
    setwd("~/YOUR OWN FOLDER HERE")
    
    # DEFINE SCRAPING FUNCTION
    # Note: the URL argument is the page already parsed with read_html(), not a URL string
    getProfile <- function(URL) {
    
        ## NAME
        # Use a CSS selector to locate the name
        nam_html <- html_node(URL, '.contact-name')
        # Convert the name node to text
        nam <- html_text(nam_html)
        # Data preprocessing: remove '\n' (for the other fields I keep '\n',
        # to help me separate each item within the same type of information)
        nam <- gsub("\n", "", nam)
        # Convert from text to factor
        nam <- as.factor(nam)
        # To remove blank spaces as well, do this:
        #   variable <- gsub(" ", "", variable)
    
        ## MODALITIES
        # Use a CSS selector to locate the modality
        mod_html <- html_node(URL, '.attributes-modality .copy-small')
        # Convert the modality node to text
        mod <- html_text(mod_html)
        # Convert from text to factor
        mod <- as.factor(mod)
    
        ## Combine the fields into a data frame
        onet_df <- data.frame(Name = nam,
                              Modality = mod)
    
        return(onet_df)
    }
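
One way to check the selectors without hitting the site is to parse a small HTML string. The markup below is invented for illustration and only mimics the two class names used in the function; the real page structure may differ. `extract_fields` is a condensed, hypothetical copy of `getProfile` without the factor conversion. The sketch also shows why `html_node` is convenient here: when a selector matches nothing, `html_text` gives `NA` rather than an error, so the data frame still has one row.

```r
library(rvest)
library(xml2)

# Invented markup that mimics the selectors; not the real page
page_full <- read_html('<html><body>
  <div class="contact-name">\nJane Doe\n</div>
  <div class="attributes-modality"><span class="copy-small">Individuals</span></div>
</body></html>')

# Same idea, but the modality block is absent
page_missing <- read_html(
  '<html><body><div class="contact-name">John Roe</div></body></html>')

# Condensed, hypothetical copy of getProfile()
extract_fields <- function(page) {
  nam <- html_text(html_node(page, ".contact-name"))
  nam <- gsub("\n", "", nam)
  mod <- html_text(html_node(page, ".attributes-modality .copy-small"))
  data.frame(Name = nam, Modality = mod)
}

res_full <- extract_fields(page_full)
res_missing <- extract_fields(page_missing)
print(res_full)
print(res_missing)
```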
    

    Then I apply this function in a loop to a few therapists. For illustrative purposes, I take four adjacent therapist IDs, without knowing a priori whether each of these IDs has actually been assigned (I do this because I want to see what happens if the loop stumbles on a non-existent link).

    MHP_codes <- c(163805:163808) # therapist identifier
    df_list <- vector(mode = "list", length = length(MHP_codes))
    
    j <- 1
    for(code1 in MHP_codes) {
        URL <- paste0('https://www.psychologytoday.com/us/therapists/illinois/', code1)
        # Read the HTML and scrape it inside tryCatch, so that a
        # non-existent link yields NULL instead of stopping the loop
        df_list[[j]] <- tryCatch(getProfile(read_html(URL)),
                                 error = function(e) NULL)
        j <- j + 1
    }
    
    final_df <- rbind.fill(df_list)
    save(final_df, file = "final_df.Rda")
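
The final binding step can also be checked on its own. The sketch below uses made-up data frames to show what `rbind.fill` does with the kind of list the loop produces; as far as I understand plyr's documentation, `NULL` inputs are dropped and columns missing from one data frame are filled with `NA`.

```r
library(plyr)

# Made-up results: one complete profile, one failed request (NULL),
# and one profile that lacks the Modality column
df_list <- list(
  data.frame(Name = "A", Modality = "Individuals"),
  NULL,
  data.frame(Name = "B")
)

final_df <- rbind.fill(df_list)
print(final_df)
```

This is why returning `NULL` from the error handler is enough: failed profiles never reach the combined data frame.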
    