Question
I am trying to use rvest to scrape the results of a web form that are served by a cgi-bin script. However, when I run my script, all I get back is "0 results within 200 miles". My code is below; I appreciate any feedback and help. The main website, http://www.zmax.com/, has the search box that launches the cgi-bin script.
library(rvest)
library(purrr)
library(plyr)
library(dplyr)
x <- read_html('http://www.nearestoutlet.com/cgi-bin/smi/findsmi.pl')
y <- x %>% html_node('table') %>% html_table(fill = TRUE)
I have also tried:
y <- x %>%
  html_node('td div td, p') %>%
  html_text()
I am unsure of where I am going wrong in returning the data that is on the form.
Answer 1:
Strangely enough, neither the main site nor the provider they use for outlet lookups prevents scraping via terms and conditions (T&C) or the Robots Exclusion Protocol (REP). ¯\_(ツ)_/¯
You should really get familiar with browser Developer Tools, as you would have been able to see that the main site makes an HTTP POST request to the lookup site, rather than the GET request that browsers normally make and that read_html() makes. Here's what you need to do to get successful requests (we'll pick a zip code near-ish you):
library(httr)
library(rvest)
# Submit the form the same way the main site does: a form-encoded POST
POST(
  url = "http://www.nearestoutlet.com/cgi-bin/smi/findsmi.pl",
  body = list(zipcode = "48127"),
  encode = "form"
) -> res
res is an httr response object and one would normally just do:
content(res, as="parsed")
to get a parsed object ready for XML/HTML dissection. But, there are weird encoding issues (at least for me) on that site forcing us to have to do:
content(res, as="raw") %>% read_html() -> pg
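If you want to look up more than one zip code, the two steps above can be wrapped in a small function. This is only a sketch: the name fetch_outlet_page is made up for illustration, the stop_for_status() check is an addition that is not part of the steps above, and it assumes the httr and xml2 packages are installed (rvest's read_html() comes from xml2).

```r
# Hypothetical wrapper; the name fetch_outlet_page is not from the answer.
# Packages are referenced with :: so nothing has to be attached first,
# but httr and xml2 must be installed before the function is called.
fetch_outlet_page <- function(zipcode) {
  res <- httr::POST(
    url = "http://www.nearestoutlet.com/cgi-bin/smi/findsmi.pl",
    # Pass zip codes as strings so leading zeros are not lost
    body = list(zipcode = as.character(zipcode)),
    encode = "form"
  )
  # Added check (an assumption, not in the answer): stop on HTTP errors
  # instead of parsing an error page
  httr::stop_for_status(res)
  # Parse from the raw bytes, as above, to sidestep the encoding issue
  xml2::read_html(httr::content(res, as = "raw"))
}

# Usage (needs network access, so it is left commented out):
# pg <- fetch_outlet_page("48127")
```

Keeping the request and the parsing in one place means the rest of the code only ever deals with the parsed page.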
You should cat(as.character(pg)) to see how ugly the HTML is. It's nested tables, but not in a good way. The entries you see there are all <tr> elements with no <table> breaks. Thankfully(?), there is only a single <td> element in each of those <tr> elements, so we can grab them all in one fell swoop by targeting the correct <table>:
rows <- html_nodes(pg, "table[width='300'] > tr > td")
rows
## {xml_nodeset (60)}
## [1] <td width="300" height="19" bgcolor="#8B0101"><p align="left"><font face="Tahoma" color="#FFFFFF" style="font-size: 11px"><b>O\u0092REILLY AUTO PARTS</b></fo ...
## [2] <td width="300" height="2"><font face="Tahoma" style="font-size: 11px">6938 NORTH TELEGRAPH ROAD</font></td>
## [3] <td width="300" height="2"><font face="Tahoma" style="font-size: 11px">Dearborn Heights, MI 48127</font></td>
## [4] <td width="300" height="2"><font face="Tahoma" style="font-size: 11px">(313) 792-9134</font></td>
## [5] <td width="300" height="2"><font face="Tahoma" style="font-size: 11px"><a href="#" onclick="window.open('http://maps.google.com/maps?q=6938+NORTH+TELEGRAPH+R ...
## [6] <td width="300" height="6"></td>
## [7] <td width="300" height="19" bgcolor="#8B0101"><p align="left"><font face="Tahoma" color="#FFFFFF" style="font-size: 11px"><b>Advance Auto Parts</b></font></p ...
## [8] <td width="300" height="2"><font face="Tahoma" style="font-size: 11px">8120 North Telegraph Road</font></td>
## [9] <td width="300" height="2"><font face="Tahoma" style="font-size: 11px">Dearborn Heights, MI 48127</font></td>
## [10] <td width="300" height="2"><font face="Tahoma" style="font-size: 11px">(313) 528-4920</font></td>
## [11] <td width="300" height="2"><font face="Tahoma" style="font-size: 11px"><a href="#" onclick="window.open('http://maps.google.com/maps?q=8120+North+Telegraph+R ...
## [12] <td width="300" height="6"></td>
## [13] <td width="300" height="19" bgcolor="#8B0101"><p align="left"><font face="Tahoma" color="#FFFFFF" style="font-size: 11px"><b>Pep Boys</b></font></p></td>
## [14] <td width="300" height="2"><font face="Tahoma" style="font-size: 11px">8955 TELEGRAPH RD</font></td>
## [15] <td width="300" height="2"><font face="Tahoma" style="font-size: 11px">Redford, MI 48239</font></td>
## [16] <td width="300" height="2"><font face="Tahoma" style="font-size: 11px">(313) 532-5750</font></td>
## [17] <td width="300" height="2"><font face="Tahoma" style="font-size: 11px"><a href="#" onclick="window.open('http://maps.google.com/maps?q=8955+TELEGRAPH+RD+Redf ...
## [18] <td width="300" height="6"></td>
## [19] <td width="300" height="19" bgcolor="#8B0101"><p align="left"><font face="Tahoma" color="#FFFFFF" style="font-size: 11px"><b>O\u0092REILLY AUTO PARTS</b></fo ...
## [20] <td width="300" height="2"><font face="Tahoma" style="font-size: 11px">27207 PLYMOUTH ROAD</font></td>
## ...
There are many approaches one could take to make a data frame out of that mess. One simple one uses the fact that the store titles have a set background color while the other cells do not. This makes the code a bit fragile, but we can make it less fragile by testing only for the presence of a background color rather than a specific value. Why do we even need to do this? Well, we need to mark the start and end of each record, and one easy way to do that is to cumsum() a logical vector, knowing that FALSE == 0 and TRUE == 1. Why does that matter? The running total goes up by one at every store title and stays flat in between, so we can create an implicit grouping column that way:
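library(dplyr)  # tibble(), mutate() and the verbs used further down
tibble(  # tibble() replaces the deprecated data_frame()
  record = !is.na(html_attr(rows, "bgcolor")),
  text = html_text(rows, trim=TRUE)
) %>%
  mutate(record = cumsum(record)) -> xdf
xdf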
## # A tibble: 60 x 2
## record text
## <int> <chr>
## 1 1 "O\u0092REILLY AUTO PARTS"
## 2 1 6938 NORTH TELEGRAPH ROAD
## 3 1 Dearborn Heights, MI 48127
## 4 1 (313) 792-9134
## 5 1 0 miles away
## 6 1
## 7 2 Advance Auto Parts
## 8 2 8120 North Telegraph Road
## 9 2 Dearborn Heights, MI 48127
## 10 2 (313) 528-4920
## # ... with 50 more rows
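To see the cumsum() trick in isolation, here is a minimal base R sketch on made-up data. The vectors is_title and cells are toy stand-ins for the scraped rows, not real output from the site.

```r
# Toy stand-in for the scraped cells: TRUE marks a store-title cell
# (the ones with a bgcolor attribute), FALSE marks every other cell.
is_title <- c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE)
cells <- c("Store A", "1 Main St", "Town, MI 00001", "1 miles away",
           "Store B", "2 Oak Ave", "2 miles away",
           "Store C", "3 miles away")

# cumsum() coerces TRUE to 1 and FALSE to 0, so the running total
# increases by one at each title and stays flat in between.
record <- cumsum(is_title)
print(record)

# The running total works as a grouping key: one element per store.
by_store <- split(cells, record)
print(by_store)
```

Every cell between one title and the next shares the same record number, which is all group_by() needs later on.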
Now we need to remove the empty rows with filter() and do some munging to get the data into a decent form for making a data frame. This is super fragile code: this particular snippet can handle missing phone number data, but that's about it. If there's a second address line, you'll need to modify this approach or use a different one:
library(tidyr)  # separate() lives in tidyr
filter(xdf, text != "") %>%
  group_by(record) %>%
  summarise(x = paste0(text, collapse="|")) %>%
  separate(x, c("store", "address1", "city_state_zip", "phone_and_or_distance"), sep="\\|", extra="merge")
## # A tibble: 10 x 5
## record store address1 city_state_zip phone_and_or_distance
## * <int> <chr> <chr> <chr> <chr>
## 1 1 "O\u0092REILLY AUTO PARTS" 6938 NORTH TELEGRAPH ROAD Dearborn Heights, MI 48127 (313) 792-9134|0 miles away
## 2 2 Advance Auto Parts 8120 North Telegraph Road Dearborn Heights, MI 48127 (313) 528-4920|0 miles away
## 3 3 Pep Boys 8955 TELEGRAPH RD Redford, MI 48239 (313) 532-5750|2 miles away
## 4 4 "O\u0092REILLY AUTO PARTS" 27207 PLYMOUTH ROAD Redford, MI 48239 (313) 937-1787|2 miles away
## 5 5 "O\u0092REILLY AUTO PARTS" 14975 TELEGRAPH ROAD Redford, MI 48239 (313) 538-3584|2 miles away
## 6 6 AutoZone 24250 FIVE MILE Redford, MI 48239 (313) 527-6877|2 miles away
## 7 7 "O\u0092REILLY AUTO PARTS" 5940 MIDDLEBELT RD Garden City, MI 48135 (734) 525-1607|3 miles away
## 8 8 AutoZone 6228 MIDDLEBELT RD Garden City, MI 48135 (734) 513-2233|3 miles away
## 9 9 Advance Auto Parts 3845 S Telegraph Rd Dearborn, MI 48124 (313) 274-6549|3 miles away
## 10 10 "O\u0092REILLY AUTO PARTS" 27565 MICHIGAN AVENUE Inkster, MI 48141 (313) 724-8544|3 miles away
Just in case the process was non-obvious, we:
- group the rows by our freshly created record column
- smush all the text into one string, each part separated with |'s
- separate out all the individual bits
That should hopefully help explain the fragility.
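The last column still mixes the phone number and the distance. One possible way to pull them apart is sketched below in base R. The helper name split_phone_distance is hypothetical, and the sketch assumes that a record with no phone number leaves only the "N miles away" text in that column.

```r
# Hypothetical helper (not from the answer above): split strings such as
# "(313) 792-9134|0 miles away" into a phone number and a distance.
split_phone_distance <- function(x) {
  # Phone: a "(ddd) ddd-dddd" pattern, NA when there is none
  phone <- rep(NA_character_, length(x))
  p <- regexpr("\\([0-9]{3}\\) [0-9]{3}-[0-9]{4}", x)
  phone[p != -1] <- regmatches(x, p)

  # Distance: the digits directly in front of " miles away", NA when absent
  miles <- rep(NA_integer_, length(x))
  d <- regexpr("[0-9]+(?= miles? away)", x, perl = TRUE)
  miles[d != -1] <- as.integer(regmatches(x, d))

  data.frame(phone = phone, miles_away = miles, stringsAsFactors = FALSE)
}

# Usage on two made-up values, the second one without a phone number
example <- c("(313) 792-9134|0 miles away", "2 miles away")
print(split_phone_distance(example))
```

Matching each piece by its own pattern, rather than by position, is what lets the missing-phone case fall through to NA instead of shifting the distance into the wrong column.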
Granted, you only wanted the "how to get to the content" part, but hopefully this saved you some more time.
Source: https://stackoverflow.com/questions/46747475/how-can-i-scrape-a-cgi-bin-with-rvest-and-r