Web scraping and looping through pages with R

I am learning data scraping and, on top of that, I am quite a beginner with R (for work I use Stata; I use R only for very specific tasks). In order to learn scraping, I am

2 answers

终归单人心

2021-01-13 12:40

Consider several adjustments:

Adjust the function to receive a URL parameter. Right now, profilescrape is not used anywhere in the function, which instead takes whatever URL is assigned in the global environment.
```
getProfile <- function(URL) { 
   ...
}
```
Adjust the end of the function to return the needed object. Without an explicit return, R returns the value of the last expression evaluated, and str() returns NULL. Therefore, replace str(onet_df) with return(onet_df).
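
A minimal sketch of the difference, using a made-up one-row data frame rather than a scraped profile. Both function names are hypothetical stand-ins for getProfile.
```r
# Hypothetical stand-ins for getProfile(); the data frame is invented for illustration
ends_with_str <- function() {
  onet_df <- data.frame(Name = "A", Modality = "B")
  str(onet_df)      # prints the structure, but the value of str() is NULL
}

ends_with_return <- function() {
  onet_df <- data.frame(Name = "A", Modality = "B")
  return(onet_df)   # hands the data frame back to the caller
}

res_str <- ends_with_str()
res_ret <- ends_with_return()

is.null(res_str)        # TRUE
is.data.frame(res_ret)  # TRUE
```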
Build the URL dynamically inside the loop and pass it to the function:
```
URL <- paste0(...) 
record_profile <- getProfile(URL)
```

Initialize a list of the required length (2 x 20) before the loop. Then, on each iteration, assign to the current index rather than growing the object inside the loop, which is memory inefficient.

```
MHP_codes <- c(324585, 449807)  # therapist identifier
withinpage_codes <- 1:20        # therapist running number

df_list <- vector(mode = "list",
                  length = length(MHP_codes) * length(withinpage_codes))

j <- 1
for (code1 in withinpage_codes) {
    for (code2 in MHP_codes) {
        URL <- paste0('https://www.psychologytoday.com/us/therapists/illinois/', code2, '?sid=5d87f874630bd&ref=', code1, '&rec_next=1&tr=NextProf')
        df_list[[j]] <- tryCatch(getProfile(URL),
                                 error = function(e) NULL)
        j <- j + 1
    }
}
```

Call rbind.fill (from the plyr package) once outside the loop to combine all data frames:
```
final_df <- rbind.fill(df_list)
```
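
The preallocate, tryCatch, combine pattern can be tried offline with a sketch like the following. It assumes a hypothetical fake_profile function that fails for odd ids to mimic a dead link, and it swaps rbind.fill for base R do.call(rbind, ...), which is enough here because every data frame has the same columns.
```r
# Hypothetical stand-in for getProfile(): fails for odd ids to mimic a dead link
fake_profile <- function(id) {
  if (id %% 2 == 1) stop("page not found")
  data.frame(id = id, name = paste0("therapist_", id))
}

ids <- 1:6
df_list <- vector(mode = "list", length = length(ids))  # preallocate

for (j in seq_along(ids)) {
  result <- tryCatch(fake_profile(ids[j]), error = function(e) NULL)
  # Store only successful results; failed slots keep their initial NULL
  if (!is.null(result)) df_list[[j]] <- result
}

# Base R alternative to plyr::rbind.fill when all data frames share columns
final_df <- do.call(rbind, Filter(Negate(is.null), df_list))
nrow(final_df)  # 3
```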

That said, consider an apply-family solution, specifically Map (a wrapper around mapply). It removes the bookkeeping of preallocating the list and incrementing a counter, and it "hides" the loop in one compact statement.

```
# ALL POSSIBLE PAIRINGS
web_codes_df <- expand.grid(MHP_codes = c(324585, 449807),
                            withinpage_codes = 1:20)

# MOVE URL ASSIGNMENT INSIDE FUNCTION
getProfile <- function(code1, code2) {
    URL <- paste0('https://www.psychologytoday.com/us/therapists/illinois/', code2, '?sid=5d87f874630bd&ref=', code1, '&rec_next=1&tr=NextProf')

    # ...same code as before...
}

# ELEMENT-WISE LOOP PASSING PARAMS IN PARALLEL TO FUNCTION
# code1 is the within-page running number and code2 the therapist identifier,
# matching the roles they play in the URL and in the nested loop above
df_list <- Map(function(code1, code2) tryCatch(getProfile(code1, code2),
                                               error = function(e) NULL),
               code1 = web_codes_df$withinpage_codes,
               code2 = web_codes_df$MHP_codes)

final_df <- rbind.fill(df_list)
```
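
To see how Map pairs the arguments without making any request, here is a small sketch. build_url is a hypothetical helper and its query string is simplified for illustration; only the code values come from the answer.
```r
# All pairings of the two code vectors (values taken from the answer)
web_codes_df <- expand.grid(MHP_codes = c(324585, 449807),
                            withinpage_codes = 1:20)

# Hypothetical helper: builds a URL string only, no request is made
build_url <- function(code1, code2) {
  paste0("https://www.psychologytoday.com/us/therapists/illinois/",
         code2, "?ref=", code1)
}

# Map walks both vectors in step: row i of the grid supplies the i-th call
url_list <- Map(build_url,
                code1 = web_codes_df$withinpage_codes,
                code2 = web_codes_df$MHP_codes)

length(url_list)  # 40
url_list[[1]]
```

Because expand.grid already enumerates every combination, one flat Map call covers what the nested for loops did.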

渐次进展

2021-01-13 12:50

One of the users, Parfait, helped me sort out the issues, so a very big thank you goes to this user. Below I post the script. I apologize if it is not precisely commented.

Here is the code.

```
# Load packages
library('rvest')  # to scrape
library('xml2')   # to handle missing values (works with html_node, not with html_nodes)
library('plyr')   # to bind together different data sets

# Set the working directory (replace with your own folder)
getwd()
setwd("~/YOUR OWN FOLDER HERE")

# DEFINE SCRAPING FUNCTION
# `page` is a parsed HTML document (the result of read_html()), not a URL string
getProfile <- function(page) {

    ## NAME
    # Select the name node with a CSS selector and convert it to text
    nam_html <- html_node(page, '.contact-name')
    nam <- html_text(nam_html)
    # Remove '\n' (for other fields, keep '\n' to help separate
    # the items within the same type of information)
    nam <- gsub("\n", "", nam)
    # To remove excess spaces as well: nam <- gsub(" ", "", nam)
    # Convert from text to factor
    nam <- as.factor(nam)

    ## MODALITIES
    # Select the modality node with a CSS selector and convert it to text
    mod_html <- html_node(page, '.attributes-modality .copy-small')
    mod <- html_text(mod_html)
    mod <- as.factor(mod)

    ## Combine the fields into a one-row data frame
    onet_df <- data.frame(Name = nam,
                          Modality = mod)

    return(onet_df)
}
```
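
The remark about missing values can be checked offline with a sketch like this one. The HTML fragment is made up: it has a name but no modality block, and the selectors are the ones used in the function.
```r
library(rvest)  # html_node(), html_text()
library(xml2)   # read_html()

# Made-up HTML fragment: it has a name but no modality block
page <- read_html('<html><body><div class="contact-name">Jane Doe</div></body></html>')

nam <- html_text(html_node(page, ".contact-name"))
mod <- html_text(html_node(page, ".attributes-modality .copy-small"))

nam         # "Jane Doe"
is.na(mod)  # TRUE: the missing node becomes NA instead of being dropped
```

Since html_node returns exactly one result per page, a profile without a given field still yields a one-row data frame with NA in that column.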

Then I apply this function in a loop to a few therapists. For illustrative purposes, I take four adjacent therapist IDs without knowing a priori whether each of them has actually been assigned (I do this because I want to see what happens if the loop stumbles on a nonexistent link).

```
MHP_codes <- 163805:163808  # therapist identifiers
df_list <- vector(mode = "list", length = length(MHP_codes))

j <- 1
for (code1 in MHP_codes) {
    URL <- paste0('https://www.psychologytoday.com/us/therapists/illinois/', code1)
    # Read the HTML inside tryCatch so that a nonexistent link yields NULL
    # instead of stopping the loop
    df_list[[j]] <- tryCatch(getProfile(read_html(URL)),
                             error = function(e) NULL)
    j <- j + 1
}

final_df <- rbind.fill(df_list)
save(final_df, file = "final_df.Rda")
```
