Web scraping and looping through pages with R

余生分开走 2021-01-13 12:06

I am learning data scraping and, on top of that, I am quite a beginner with R (for work I use Stata; I use R only for very specific tasks). In order to learn scraping, I am

2 Answers
  •  终归单人心
    2021-01-13 12:40

    Consider several adjustments:

    • Adjust the function to receive a URL parameter. Right now, profilescrape is not used anywhere in the function, so the function takes whatever URL is assigned in the global environment.

      getProfile <- function(URL) { 
         ...
      }
      
    • Adjust the end of the function to return the needed object. Without an explicit return, an R function returns the value of its last evaluated expression, and str() only prints a summary and returns NULL. Therefore, replace str(onet_df) with return(onet_df).

    • Build the URL dynamically inside the loop and pass it to the function as an argument:

      URL <- paste0(...) 
      record_profile <- getProfile(URL)
      
    • Initialize a list with the required length (2 x 20) before the loop. Then, on each iteration, assign to the loop index rather than growing the object inside the loop, which is memory inefficient.

      MHP_codes <- c(324585, 449807)  #therapist identifier 
      withinpage_codes <- c(1:20)     #therapist running number 
      
      df_list <- vector(mode = "list",
                        length = length(MHP_codes) * length(withinpage_codes))
      
      j <- 1
      for(code1 in withinpage_codes) { 
          for(code2 in MHP_codes) {
              URL <- paste0('https://www.psychologytoday.com/us/therapists/illinois/', code2, '?sid=5d87f874630bd&ref=', code1, '&rec_next=1&tr=NextProf') 
              df_list[[j]] <- tryCatch(getProfile(URL), 
                                       error = function(e) NULL)
              j <- j + 1 
          } 
      }
      
    • Call rbind.fill (from the plyr package) once, outside the loop, to combine all data frames:

      final_df <- rbind.fill(df_list)
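
    The adjustments above can be tried out without touching the website. The following is a minimal sketch, not the original scraper: getProfile here is a hypothetical stub that builds a one-row data frame from the URL and fails on purpose for some inputs, the example.com address is a placeholder, and base R's do.call(rbind, ...) stands in for rbind.fill, which is enough here because every stub data frame has the same columns.

    ```r
    # Hypothetical stand-in for the real scraper: no download happens here.
    getProfile <- function(URL) {
      # Simulate a request that fails for one of the running numbers
      if (grepl("ref=3", URL, fixed = TRUE)) stop("simulated failed request")
      onet_df <- data.frame(url = URL, n_chars = nchar(URL),
                            stringsAsFactors = FALSE)
      return(onet_df)  # str(onet_df) here would print a summary and return NULL
    }

    MHP_codes <- c(324585, 449807)  # therapist identifiers (from the answer)
    withinpage_codes <- 1:3         # shortened running numbers for the demo

    # Pre-allocate one slot per (running number, therapist) pair
    df_list <- vector(mode = "list",
                      length = length(MHP_codes) * length(withinpage_codes))

    j <- 1
    for (code1 in withinpage_codes) {
      for (code2 in MHP_codes) {
        # Placeholder URL with the same shape as the real one
        URL <- paste0("https://example.com/therapists/", code2, "?ref=", code1)
        # A failed request yields NULL instead of stopping the whole loop
        df_list[[j]] <- tryCatch(getProfile(URL), error = function(e) NULL)
        j <- j + 1
      }
    }

    # Keep only the slots that hold a data frame, then bind once
    final_df <- do.call(rbind, Filter(Negate(is.null), df_list))
    ```

    With these stub inputs, two of the six requests fail, so final_df ends up with four rows. When profiles have differing columns, as real scraped pages may, rbind.fill is the better fit because it pads missing columns with NA.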
      

    With that said, consider an apply-family solution, specifically Map (a wrapper around mapply). It avoids the bookkeeping of initializing the list and incrementing a counter, and it "hides" the loop in one compact statement.

    # ALL POSSIBLE PAIRINGS
    web_codes_df <- expand.grid(MHP_codes = c(324585, 449807),
                                withinpage_codes = c(1:20))
    
    # MOVE URL ASSIGNMENT INSIDE FUNCTION
    getProfile <- function(code1, code2) { 
       URL <- paste0('https://www.psychologytoday.com/us/therapists/illinois/', code2, '?sid=5d87f874630bd&ref=', code1, '&rec_next=1&tr=NextProf')
    
        # ...same code as before...
    }
    
    # ELEMENT-WISE LOOP PASSING PARAMS IN PARALLEL TO FUNCTION
    df_list <- Map(function(code1, code2) tryCatch(getProfile(code1, code2), 
                                                   error = function(e) NULL),
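                   # code1 is the within-page number and code2 the therapist ID,
                   # matching how getProfile builds the URL
                   code1 = web_codes_df$withinpage_codes,
                   code2 = web_codes_df$MHP_codes)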
    
    final_df <- rbind.fill(df_list)
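
    To see what expand.grid and Map produce before running this against the site, here is a small offline sketch. It rests on assumptions: getProfile is a hypothetical stub that returns a one-row data frame and fails on purpose for one running number, the example.com URL is a placeholder, and do.call(rbind, ...) replaces rbind.fill because the stub frames all share the same columns. The stub treats code1 as the within-page running number and code2 as the therapist ID, which is how the URL is assembled.

    ```r
    # All possible pairings: 2 therapist IDs x 3 running numbers = 6 rows
    web_codes_df <- expand.grid(MHP_codes = c(324585, 449807),
                                withinpage_codes = 1:3)

    # Hypothetical stub in place of the real scraper (no network access)
    getProfile <- function(code1, code2) {
      URL <- paste0("https://example.com/therapists/", code2, "?ref=", code1)
      if (code1 == 3) stop("simulated failed request")
      data.frame(therapist = code2, ref = code1, url = URL,
                 stringsAsFactors = FALSE)
    }

    # Map walks both columns in parallel, one pair per call;
    # tryCatch turns a failed call into NULL so the rest still run
    df_list <- Map(function(code1, code2) tryCatch(getProfile(code1, code2),
                                                   error = function(e) NULL),
                   code1 = web_codes_df$withinpage_codes,
                   code2 = web_codes_df$MHP_codes)

    # Drop the NULL results, then bind the remaining data frames once
    final_df <- do.call(rbind, Filter(Negate(is.null), df_list))
    ```

    Map always returns one list element per pairing, so df_list has six entries here, two of them NULL, and final_df keeps the four successful rows.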
    
