Using R for webscraping: HTTP error 503 despite using long pauses in program

Submitted by 随声附和 on 2021-02-19 05:59:25

Question


I'm trying to search the ProQuest Archiver using R. I'm interested in finding the number of articles for a newspaper containing a certain keyword.

It generally works well using the rvest package. However, the program sometimes breaks down. See this minimal example:

library(xml2)
library(rvest)

# Retrieve the title of the first search hit on the page of search results
for (p in seq(0, 150, 10)) {
  searchURL <- paste("http://pqasb.pqarchiver.com/djreprints/results.html?st=advanced&QryTxt=bankruptcy&sortby=CHRON&datetype=6&frommonth=01&fromday=01&fromyear=1908&tomonth=12&today=31&toyear=1908&By=&Title=&at_hist=article&at_hist=editorial_article&at_hist=front_page&type=historic&start=", p, sep="")
  htmlWeb <- read_html(searchURL)
  nodeWeb <- html_node(htmlWeb, ".text tr:nth-child(1) .result_title a") 
  textWeb <- html_text(nodeWeb)
  print(textWeb)
  Sys.sleep(0.1)
}

This works for me sometimes. But if I run this or similar scripts a couple of times, it breaks down at the same point: I consistently get an error on the 12th iteration (p = 120):

Error in open.connection(x, "rb") : HTTP error 503.

I tried circumventing this by inserting pauses of escalating length, but that doesn't help.

I've also considered:

  • saving which result pages cannot be reached and writing separate scripts for those cases,
  • changing my IP partway through the program, or
  • quitting and restarting R partway through the program.
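For the retry idea, a small wrapper around read_html() with tryCatch() and escalating waits is one way to sketch it (the function name read_html_retry and its parameters are illustrative, not from the original post):

```r
library(xml2)

# Sketch of a retry helper: attempt read_html() up to `tries` times,
# doubling the wait after each failure (1s, 2s, 4s, ...).
read_html_retry <- function(url, tries = 5, base_wait = 1) {
  for (attempt in seq_len(tries)) {
    result <- tryCatch(read_html(url), error = function(e) NULL)
    if (!is.null(result)) return(result)
    Sys.sleep(base_wait * 2^(attempt - 1))
  }
  NULL  # caller can record the failing URL and move on
}
```

Calling read_html_retry(searchURL) in place of read_html(searchURL) would let the loop skip a page that still fails after all attempts instead of stopping with an error.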

Thanks in advance for any comments.


Answer 1:


Try being a bit more human-like in the delays. This works for me (multiple tries):

library(xml2)
library(httr)
library(rvest)
library(purrr)
library(dplyr)

to_get <- seq(0, 150, 10)
pb <- progress_estimated(length(to_get))

map_chr(to_get, function(i) {
  pb$tick()$print()
  searchURL <- paste("http://pqasb.pqarchiver.com/djreprints/results.html?st=advanced&QryTxt=bankruptcy&sortby=CHRON&datetype=6&frommonth=01&fromday=01&fromyear=1908&tomonth=12&today=31&toyear=1908&By=&Title=&at_hist=article&at_hist=editorial_article&at_hist=front_page&type=historic&start=", i, sep="")
  htmlWeb <- read_html(searchURL)
  nodeWeb <- html_node(htmlWeb, "td > font.result_title > a")
  textWeb <- html_text(nodeWeb)
  Sys.sleep(sample(10, 1) * 0.1)
  textWeb
}) -> titles

print(trimws(titles))

##  [1] "NEWSPAPER SPECIALS."                                      
##  [2] "NEWSPAPER SPECIALS."                                      
##  [3] "New Jersey Ice Co. Insolvent."                            
##  [4] "NEWSPAPER SPECIALS."                                      
##  [5] "NEWSPAPER SPECIALS"                                       
##  [6] "AMERICAN ICE BEGINNING BUSY SEASON IN IMPROVED CONDITION."
##  [7] "NEWSPAPER SPECIALS"                                       
##  [8] "THE GERMAN REICHSBANK."                                   
##  [9] "U.S. Exploration Co. Bankrupt."                           
## [10] "CHICAGO TRACTION."                                        
## [11] "INCREASING FREIGHT RATES."                                
## [12] "A.O. BROWN & CO."                                         
## [13] "BROAD STREET GOSSIP"                                      
## [14] "Meadows, Williams & Co."                                  
## [15] "FAILURES IN OCTOBER."                                     
## [16] "Supplementary Receiver for Heinze & Co." 

I randomized the sleep call value, simplified the CSS target a bit, added a progress bar, and automagically made a character vector. You probably ultimately want a data.frame from this data; see ?purrr::map_df for that.
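Following that ?purrr::map_df pointer, each iteration could return a one-row tibble instead of a bare string, and map_df() row-binds them into one data frame. A sketch (the wrapper name fetch_titles_df and the start/title column names are my choices):

```r
library(xml2)
library(rvest)
library(purrr)
library(dplyr)

# Sketch: same scrape as above, but each iteration returns a one-row
# tibble, which map_df() row-binds into a single data frame.
fetch_titles_df <- function(to_get) {
  map_df(to_get, function(i) {
    searchURL <- paste0("http://pqasb.pqarchiver.com/djreprints/results.html?st=advanced&QryTxt=bankruptcy&sortby=CHRON&datetype=6&frommonth=01&fromday=01&fromyear=1908&tomonth=12&today=31&toyear=1908&By=&Title=&at_hist=article&at_hist=editorial_article&at_hist=front_page&type=historic&start=", i)
    htmlWeb <- read_html(searchURL)
    textWeb <- html_text(html_node(htmlWeb, "td > font.result_title > a"))
    Sys.sleep(sample(10, 1) * 0.1)  # same human-like random pause
    tibble(start = i, title = trimws(textWeb))
  })
}
```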




Answer 2:


In the end, we used a combination of:

  1. random pauses,
  2. randomly changing the user agent (as suggested by hrbrmstr's comment), and
  3. retrying several times when a URL access returns an error.

It still happens that we cannot access all URLs; in those cases we just record where the failure occurred and move on.
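Put together, those three steps might look like the sketch below. The user-agent strings and the helper name fetch_with_retries are illustrative; httr::GET() with httr::user_agent() is used to set the User-Agent header, and a NULL return stands in for "record the failure and go on":

```r
library(httr)
library(rvest)
library(xml2)

# A few example user-agent strings to rotate through (illustrative).
user_agents <- c(
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6)",
  "Mozilla/5.0 (X11; Linux x86_64)"
)

# Sketch combining random pauses, a random user agent, and retries.
fetch_with_retries <- function(url, tries = 3) {
  for (attempt in seq_len(tries)) {
    resp <- tryCatch(
      GET(url, user_agent(sample(user_agents, 1))),
      error = function(e) NULL
    )
    if (!is.null(resp) && status_code(resp) == 200) {
      return(read_html(content(resp, as = "text", encoding = "UTF-8")))
    }
    Sys.sleep(runif(1, 1, 5))  # random pause before retrying
  }
  NULL  # record the failing URL and move on
}
```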

Thanks for your comments!



Source: https://stackoverflow.com/questions/38119447/using-r-for-webscraping-http-error-503-despite-using-long-pauses-in-program
