I am learning data scraping and, on top of that, I am quite a debutant with R (for work I use STATA, I use R only for very specific tasks). In order to learn scraping, I am
Consider several adjustments:
Adjust the function to receive a URL parameter. Right now, profilescrape is not used anywhere in the function, so the function takes whatever URL is assigned in the global environment.
getProfile <- function(URL) {
...
}
Adjust the end of the function to return the needed object. Without return, R returns the value of the last evaluated expression, and str() only prints its argument and returns NULL. Therefore, replace str(onet_df) with return(onet_df).
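To illustrate the point about the return value, here is a minimal sketch. The two functions and the toy data frame are hypothetical stand-ins, not the scraping code from the question:

```r
# Hypothetical stand-ins for the profile function; onet_df is a toy data frame.
get_with_str <- function() {
  onet_df <- data.frame(name = "A", phone = "555-0100")
  str(onet_df)      # prints the structure; the call itself evaluates to NULL
}

get_with_return <- function() {
  onet_df <- data.frame(name = "A", phone = "555-0100")
  return(onet_df)   # hands the data frame back to the caller
}

res_str <- get_with_str()
res_ret <- get_with_return()

is.null(res_str)        # TRUE: nothing usable was returned
is.data.frame(res_ret)  # TRUE
```

With the str() ending, anything assigned from the function call is NULL, so there is nothing to bind together later.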
Pass the dynamic URL to the function inside the loop, without re-declaring it with the function keyword:
URL <- paste0(...)
record_profile <- getProfile(URL)
Initialize a list of the required length (2 x 20) before the loop. Then, on each iteration, assign to the list by index rather than growing an object inside the loop, which is memory inefficient.
MHP_codes <- c(324585, 449807)   # therapist identifier
withinpage_codes <- c(1:20)      # therapist running number
df_list <- vector(mode = "list",
                  length = length(MHP_codes) * length(withinpage_codes))
j <- 1
for (code1 in withinpage_codes) {
  for (code2 in MHP_codes) {
    URL <- paste0('https://www.psychologytoday.com/us/therapists/illinois/', code2, '?sid=5d87f874630bd&ref=', code1, '&rec_next=1&tr=NextProf')
    # a failed request yields NULL instead of stopping the loop
    df_list[[j]] <- tryCatch(getProfile(URL),
                             error = function(e) NULL)
    j <- j + 1
  }
}
Call rbind.fill (from the plyr package) once outside the loop to combine all data frames together:
final_df <- rbind.fill(df_list)
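The preallocate, fill and bind-once pattern can be tried without any network access. This is only a sketch: fake_profile is a hypothetical stand-in for getProfile, and base R's do.call(rbind, ...) is swapped in for rbind.fill so that no package is needed:

```r
# Hypothetical stand-in for getProfile(): fails for one input to mimic a dead page.
fake_profile <- function(id) {
  if (id == 3) stop("page not found")
  data.frame(id = id, label = paste0("profile_", id))
}

ids <- 1:5
df_list <- vector(mode = "list", length = length(ids))  # list allocated up front

for (j in seq_along(ids)) {
  # An error is converted to NULL so one bad page does not abort the run.
  df_list[[j]] <- tryCatch(fake_profile(ids[j]),
                           error = function(e) NULL)
}

# Bind once, after the loop, keeping only the entries that hold a data frame.
final_df <- do.call(rbind, Filter(Negate(is.null), df_list))
nrow(final_df)  # 4: the failing id contributes no row
```

The design point is the same as in the loop above: the expensive combine step runs a single time instead of once per iteration.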
With that said, consider an apply family solution, specifically Map (a wrapper to mapply). Doing so, you avoid the bookkeeping of initializing a list and an incrementing counter, and you "hide" the loop in a compact statement.
# ALL POSSIBLE PAIRINGS
web_codes_df <- expand.grid(MHP_codes = c(324585, 449807),
                            withinpage_codes = c(1:20))
# MOVE URL ASSIGNMENT INSIDE FUNCTION
getProfile <- function(code1, code2) {
  URL <- paste0('https://www.psychologytoday.com/us/therapists/illinois/', code2, '?sid=5d87f874630bd&ref=', code1, '&rec_next=1&tr=NextProf')
  # ...same code as before...
}
# ELEMENT-WISE LOOP PASSING PARAMS IN PARALLEL TO FUNCTION
# code1 is the running number (ref=), code2 is the therapist identifier (URL path)
df_list <- Map(function(code1, code2) tryCatch(getProfile(code1, code2),
                                               error = function(e) NULL),
               code1 = web_codes_df$withinpage_codes,
               code2 = web_codes_df$MHP_codes)
final_df <- rbind.fill(df_list)
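To see how Map pairs the arguments, here is a small offline sketch. The ID values and the make_row helper are hypothetical, and do.call(rbind, ...) again stands in for rbind.fill:

```r
# All pairings of two hypothetical ID vectors (values are illustrative only).
grid <- expand.grid(therapist = c(101, 202), position = 1:3)

# Hypothetical stand-in for getProfile(): one row per pairing.
make_row <- function(position, therapist) {
  data.frame(therapist = therapist, position = position)
}

# Map walks both vectors in parallel: call i receives element i of each vector.
rows <- Map(make_row,
            position  = grid$position,
            therapist = grid$therapist)

combined <- do.call(rbind, rows)
nrow(combined)  # 6, one row per pairing
```

Because the arguments are matched by name, it is worth checking that each column of the grid is passed to the parameter that plays that role inside the function.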