How can I Scrape a CGI-Bin with rvest and R?

Submitted by 断了今生、忘了曾经 on 2019-12-24 07:19:25

Question


I am trying to use rvest to scrape the results of a web form that pops up in a cgi-bin. However, when I run the script I get back "0 results within 200 miles". My code is below; I appreciate any feedback and help. The main website is http://www.zmax.com/, which has the search box that launches the cgi-bin.

library(rvest)
library(purrr)
library(plyr)
library(dplyr)

x <- read_html('http://www.nearestoutlet.com/cgi-bin/smi/findsmi.pl')

y <- x %>% html_node('table') %>% html_table(fill = TRUE)

I have also tried:

y <- x %>% 
  html_node('td div td, p') %>% 
  html_text()

I am unsure of where I am going wrong in returning the data that is on the form.


Answer 1:


Strangely enough, neither the main site nor the provider they use for outlet lookups prevents scraping via T&C or the Robots Exclusion Protocol (robots.txt). ¯\_(ツ)_/¯

You should really get familiar with your browser's Developer Tools: they would have shown you that the main site makes an HTTP POST request to the lookup site, rather than the GET request that browsers normally make and that read_html() issues. Here's what you need to do to make a successful request (we'll pick a zip code near-ish you):

library(httr)
library(rvest)
library(dplyr) # for data_frame(), mutate(), filter(), group_by(), summarise()
library(tidyr) # for separate()

POST(
  url = "http://www.nearestoutlet.com/cgi-bin/smi/findsmi.pl", 
  body = list(zipcode = "48127"), 
  encode = "form"
) -> res

res is an httr response object and one would normally just do:

content(res, as="parsed")

to get a parsed object ready for XML/HTML dissection. But, there are weird encoding issues (at least for me) on that site forcing us to have to do:

content(res, as="raw") %>% read_html() -> pg

You should cat(as.character(pg)) to see how ugly the HTML is. It's nested tables, but not in a good way. The entries you see there are all <tr> elements with no <table> breaks. Thankfully? there are only singular <td> elements in each of those <tr> elements. So, we can grab them all in one fell swoop by targeting the correct <table>:

rows <- html_nodes(pg, "table[width='300'] > tr > td")
rows
## {xml_nodeset (60)}
##  [1] <td width="300" height="19" bgcolor="#8B0101"><p align="left"><font face="Tahoma" color="#FFFFFF" style="font-size: 11px"><b>O\u0092REILLY AUTO PARTS</b></fo ...
##  [2] <td width="300" height="2"><font face="Tahoma" style="font-size: 11px">6938 NORTH TELEGRAPH ROAD</font></td>
##  [3] <td width="300" height="2"><font face="Tahoma" style="font-size: 11px">Dearborn Heights, MI  48127</font></td>
##  [4] <td width="300" height="2"><font face="Tahoma" style="font-size: 11px">(313) 792-9134</font></td>
##  [5] <td width="300" height="2"><font face="Tahoma" style="font-size: 11px"><a href="#" onclick="window.open('http://maps.google.com/maps?q=6938+NORTH+TELEGRAPH+R ...
##  [6] <td width="300" height="6"></td>
##  [7] <td width="300" height="19" bgcolor="#8B0101"><p align="left"><font face="Tahoma" color="#FFFFFF" style="font-size: 11px"><b>Advance Auto Parts</b></font></p ...
##  [8] <td width="300" height="2"><font face="Tahoma" style="font-size: 11px">8120 North Telegraph Road</font></td>
##  [9] <td width="300" height="2"><font face="Tahoma" style="font-size: 11px">Dearborn Heights, MI  48127</font></td>
## [10] <td width="300" height="2"><font face="Tahoma" style="font-size: 11px">(313) 528-4920</font></td>
## [11] <td width="300" height="2"><font face="Tahoma" style="font-size: 11px"><a href="#" onclick="window.open('http://maps.google.com/maps?q=8120+North+Telegraph+R ...
## [12] <td width="300" height="6"></td>
## [13] <td width="300" height="19" bgcolor="#8B0101"><p align="left"><font face="Tahoma" color="#FFFFFF" style="font-size: 11px"><b>Pep Boys</b></font></p></td>
## [14] <td width="300" height="2"><font face="Tahoma" style="font-size: 11px">8955 TELEGRAPH RD</font></td>
## [15] <td width="300" height="2"><font face="Tahoma" style="font-size: 11px">Redford, MI  48239</font></td>
## [16] <td width="300" height="2"><font face="Tahoma" style="font-size: 11px">(313) 532-5750</font></td>
## [17] <td width="300" height="2"><font face="Tahoma" style="font-size: 11px"><a href="#" onclick="window.open('http://maps.google.com/maps?q=8955+TELEGRAPH+RD+Redf ...
## [18] <td width="300" height="6"></td>
## [19] <td width="300" height="19" bgcolor="#8B0101"><p align="left"><font face="Tahoma" color="#FFFFFF" style="font-size: 11px"><b>O\u0092REILLY AUTO PARTS</b></fo ...
## [20] <td width="300" height="2"><font face="Tahoma" style="font-size: 11px">27207 PLYMOUTH ROAD</font></td>
## ...

There are many approaches one could take to make a data frame out of that mess. One simple one uses the fact that the store titles have a set background color while the other rows do not. This makes the code a bit fragile, but we can make it less so by testing only for the presence of a background color. Why do we even need to do this? Well, we need to mark the start and end of each record, and one easy way to do that is to cumsum() a logical vector, knowing that FALSE == 0 and TRUE == 1. Why does that matter? Because the running sum increments at every title row, which gives us an implicit grouping column:

data_frame(
  record = !is.na(html_attr(rows, "bgcolor")),
  text = html_text(rows, trim=TRUE)
) %>% 
  mutate(record = cumsum(record)) -> xdf
## # A tibble: 60 x 2
##    record                        text
##     <int>                       <chr>
##  1      1  "O\u0092REILLY AUTO PARTS"
##  2      1   6938 NORTH TELEGRAPH ROAD
##  3      1 Dearborn Heights, MI  48127
##  4      1              (313) 792-9134
##  5      1                0 miles away
##  6      1                            
##  7      2          Advance Auto Parts
##  8      2   8120 North Telegraph Road
##  9      2 Dearborn Heights, MI  48127
## 10      2              (313) 528-4920
## # ... with 50 more rows
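As an aside, the cumsum()-over-a-logical trick isn't R-specific. Here is an illustrative Python sketch of the same record-marking idea (mock flags, not the real scrape output), with itertools.accumulate standing in for cumsum():

```python
# Running-sum grouping: a boolean "is this a store-title row?" flag becomes
# a group id, because False counts as 0 and True as 1 when summed.
from itertools import accumulate

is_title = [True, False, False, True, False, True]       # mock flags
record_ids = list(accumulate(int(t) for t in is_title))  # running sum
print(record_ids)  # [1, 1, 1, 2, 2, 3]
```

Every title row bumps the running sum by one, so all rows belonging to the same store share a record id.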

Now, we need to remove the empty rows with filter() and do some munging to get the data into a decent shape for a data frame. This is super-fragile code: this particular snippet can handle missing phone-number data, but that's about it. If there's ever a second address line, you'll need to modify this approach or use a different one:

filter(xdf, text != "") %>% 
  group_by(record) %>% 
  summarise(x = paste0(text, collapse="|")) %>% 
  separate(x, c("store", "address1", "city_state_zip", "phone_and_or_distance"), sep="\\|", extra="merge")
## # A tibble: 10 x 5
##    record                      store                  address1              city_state_zip       phone_and_or_distance
##  *  <int>                      <chr>                     <chr>                       <chr>                       <chr>
##  1      1 "O\u0092REILLY AUTO PARTS" 6938 NORTH TELEGRAPH ROAD Dearborn Heights, MI  48127 (313) 792-9134|0 miles away
##  2      2         Advance Auto Parts 8120 North Telegraph Road Dearborn Heights, MI  48127 (313) 528-4920|0 miles away
##  3      3                   Pep Boys         8955 TELEGRAPH RD          Redford, MI  48239 (313) 532-5750|2 miles away
##  4      4 "O\u0092REILLY AUTO PARTS"       27207 PLYMOUTH ROAD          Redford, MI  48239 (313) 937-1787|2 miles away
##  5      5 "O\u0092REILLY AUTO PARTS"      14975 TELEGRAPH ROAD          Redford, MI  48239 (313) 538-3584|2 miles away
##  6      6                   AutoZone           24250 FIVE MILE          Redford, MI  48239 (313) 527-6877|2 miles away
##  7      7 "O\u0092REILLY AUTO PARTS"        5940 MIDDLEBELT RD      Garden City, MI  48135 (734) 525-1607|3 miles away
##  8      8                   AutoZone        6228 MIDDLEBELT RD      Garden City, MI  48135 (734) 513-2233|3 miles away
##  9      9         Advance Auto Parts       3845 S Telegraph Rd         Dearborn, MI  48124 (313) 274-6549|3 miles away
## 10     10 "O\u0092REILLY AUTO PARTS"     27565 MICHIGAN AVENUE          Inkster, MI  48141 (313) 724-8544|3 miles away 

Just in case the process was non-obvious, we:

  • group the rows by our freshly created record column
  • smush all the text into one string, each part separated with |'s
  • separate out all the individual bits
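If it helps to see those three steps outside the tidyverse, here is an illustrative Python sketch of the same group / smush / separate logic (made-up rows in the shape of xdf, not the real scrape output); split("|", 3) plays the role of separate()'s extra="merge":

```python
# Group rows by record id, join each group's text with "|", then split the
# first three fields back out, merging whatever is left (phone/distance).
from collections import defaultdict

rows = [  # (record, text) pairs; mock data only
    (1, "O'REILLY AUTO PARTS"),
    (1, "6938 NORTH TELEGRAPH ROAD"),
    (1, "Dearborn Heights, MI  48127"),
    (1, "(313) 792-9134"),
    (1, "0 miles away"),
]

groups = defaultdict(list)
for rec, text in rows:
    groups[rec].append(text)

for rec, parts in sorted(groups.items()):
    store, address1, city_state_zip, rest = "|".join(parts).split("|", 3)
    print(store, "/", city_state_zip, "/", rest)
```

The maxsplit argument of 3 leaves the phone number and distance fused in the last field, exactly like the phone_and_or_distance column above.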

That should hopefully explain the fragility.

Granted, you only wanted the "how to get to the content" part, but hopefully this saved you some more time.



Source: https://stackoverflow.com/questions/46747475/how-can-i-scrape-a-cgi-bin-with-rvest-and-r
