Question
I'm using RSelenium on my MacBook to scrape publicly available .csv files. None of the other questions posed so far had answers that were particularly helpful for me. Please don't mark this as a duplicate.
With respect to Firefox, I can't disable the dialog box. I've tried a number of different things.
According to Firefox, the MIME type of the file I'm trying to download is text/csv; charset=UTF-8. However, executing the following code still causes the dialog box to appear:
library(RSelenium)

# Firefox profile intended to suppress the download dialog
fprof <- makeFirefoxProfile(list(
  browser.download.dir = "~/Scrape",
  browser.download.folderList = 2L,
  browser.download.manager.showWhenStarting = FALSE,
  browser.download.manager.showAlertOnComplete = FALSE,
  browser.helperApps.neverAsk.openFile = "text/csv; charset=UTF-8",
  browser.helperApps.neverAsk.saveToDisk = "text/csv; charset=UTF-8"
))

browser <- remoteDriver(port = 5556, browserName = "firefox", extraCapabilities = fprof)
I've tried a number of different edits, including changing the MIME type to text/csv as well as application/octet-stream. Neither works. I've also created a Firefox profile with these preferences already set to avoid the dialog box. That's also no luck.
I tried moving to Chrome, but there I encounter another issue: after 100 items, Chrome won't let me automatically download any more files. My scraping function is rather complex, and the only solution posted to a similar problem wasn't very clear.
I define the following capabilities for Chrome, but they don't disable the 100-download limit.
eCaps <- list(
  chromeOptions = list(
    prefs = list(
      "profile.default_content_settings.popups" = 0L,
      "download.prompt_for_download" = FALSE,
      "download.default_directory" = "~/Desktop/WebScrape"
    )
  )
)

browser <- remoteDriver(port = 5556, browserName = "chrome", extraCapabilities = eCaps)
I'm happy to take any suggestions. I've spent hours trying to figure this issue out. Any help is appreciated.
Edit: to provide more details, I'm a researcher and PhD Candidate interested in criminal justice reform. I'm pulling data from http://casesearch.courts.state.md.us/casesearch/ to examine cases of different types and jurisdictions in Maryland. A data request submitted to the Circuit Court has been accepted; however, the custodians may not be able to provide it to me in a reasonable time (up to several months). Therefore, I am scraping the data myself.
The code I've written so far automatically gets through the terms and conditions page, types in a letter of the alphabet (say, A), selects Circuit Court only, chooses a set of dates, picks a jurisdiction, and then searches for all cases. At the bottom of the page there is an option to download the records in .csv form, and I have the code click it. I condition all of my code on the presence of error messages; if they pop up, I go back and update the dates until the message goes away.
Chrome limits me to 100 downloads. Since posting the code earlier today, I've added a step that combines the records into a larger .csv file and then deletes all of the similarly named part files once it reaches the end of the search dates I've chosen for a particular letter of the alphabet. This will work for most counties, but I'll run into issues with the Circuit Courts of Anne Arundel County, Baltimore City, Baltimore County, Howard County, and Montgomery County; in those jurisdictions I essentially have to download records per day, given the level of policing and crime. That means thousands of .csv files, and the Chrome limit makes it really cumbersome.
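For reference, a minimal sketch of that combine-and-delete step, assuming the per-search downloads land in ~/Desktop/WebScrape and share a common filename prefix (the "caseSearch" prefix and the output name below are hypothetical):

# Sketch: combine the per-search CSVs for one letter, then remove the parts.
# Assumes all parts have identical columns and a shared, hypothetical prefix "caseSearch".
out_dir <- "~/Desktop/WebScrape"
parts <- list.files(out_dir, pattern = "^caseSearch.*\\.csv$", full.names = TRUE)

combined <- do.call(rbind, lapply(parts, read.csv, stringsAsFactors = FALSE))
write.csv(combined, file.path(out_dir, "letter-A-combined.csv"), row.names = FALSE)

invisible(file.remove(parts))  # delete the individual downloads once combined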
If someone can help me purge this dialog box issue from my R code, then I would be very thankful. I am sure others have or will have the same question.
Answer 1:
I recall answering a question or two about a really similar state legal portal site, though they may be slightly different. I will also 100% disagree that this is not a duplicate question. The way you chose to attack the problem may be somewhat novel (it's not, but you get what I mean), but just because you chose a bad way to attack it doesn't mean the actual thing isn't a dup of 100 other questions about iterative scraping and maintaining state.
So, first off: Selenium is 100% not necessary.
Second: that site has a ridiculously short session timeout, which may be a factor in why you get an error. That dialog may still "appear" in what I'm showing below, but I'll address one possible way to work around it if it does.
We just need to prime the httr verbs to act like a browser and use the underlying libcurl library's (via the curl package) ability to hold a session to get what you want.
The following is modestly annotated but you've figured out Selenium so you are actually all kinds of amazing and I'm going to leave it sparse unless you want more info in each step. The basic idiom is to:
- prime the session
- fill in the form (this digital equivalent binds you the same as the in-person click, so it's cool)
- start a search
- on the results page:
  - find the CSV link & download the file
  - find the "Next" link and go to it

Do the last step in a loop as many times as you need.
library(httr)
library(rvest)  # for html_nodes()/html_attr() below

# Start scraping ----------------------------------------------------------

httr::GET( # set up cookies & session
  url = "http://casesearch.courts.state.md.us/casesearch/",
  httr::user_agent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:64.0) Gecko/20100101 Firefox/64.0"),
  httr::accept("text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"),
  verbose() # remove when done monitoring
) -> res
# Say "yes" ---------------------------------------------------------------
httr::POST( # say "yes" to the i agree
url = "http://casesearch.courts.state.md.us/casesearch/processDisclaimer.jis",
httr::user_agent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:64.0) Gecko/20100101 Firefox/64.0"),
httr::accept("text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"),
httr::add_headers(
`Referer` = 'http://casesearch.courts.state.md.us/casesearch/'
),
body = list(
disclaimer = "Y",
action = "Continue"
),
encode = "form",
verbose() # remove when done monitoring
) -> res
# Search! -----------------------------------------------------------------

httr::POST( # search!
  url = "http://casesearch.courts.state.md.us/casesearch/inquirySearch.jis",
  httr::user_agent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:64.0) Gecko/20100101 Firefox/64.0"),
  httr::accept("text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"),
  httr::add_headers(
    `Referer` = 'http://casesearch.courts.state.md.us/casesearch/inquirySearch.jis'
  ),
  body = list(
    lastName = "SMITH",
    firstName = "",
    middleName = "",
    partyType = "",
    site = "00",
    courtSystem = "B",
    countyName = "",
    filingStart = "",
    filingEnd = "",
    filingDate = "",
    company = "N",
    action = "Search"
  ),
  encode = "form",
  verbose() # remove when done monitoring
) -> res
# Get CSV URL and download it ---------------------------------------------

pg <- httr::content(res)

html_nodes(pg, xpath = ".//span[contains(@class, 'export csv')]/..") %>%
  html_attr("href") -> csv_url

httr::GET(
  url = sprintf("http://casesearch.courts.state.md.us/%s", csv_url),
  httr::user_agent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:64.0) Gecko/20100101 Firefox/64.0"),
  httr::write_disk("some-file-name-you-increment-01.csv")
)
# Get the Next URL and go to it -------------------------------------------

html_nodes(pg, xpath = ".//a[contains(., 'Next')]")[1] %>%
  html_attr("href") -> next_url

httr::GET(
  url = sprintf("http://casesearch.courts.state.md.us/%s", next_url),
  httr::user_agent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:64.0) Gecko/20100101 Firefox/64.0")
) -> res
# Get CSV … lather / rinse / repeat ---------------------------------------

pg <- httr::content(res)

html_nodes(pg, xpath = ".//span[contains(@class, 'export csv')]/..") %>%
  html_attr("href") -> csv_url

httr::GET(
  url = sprintf("http://casesearch.courts.state.md.us/%s", csv_url),
  httr::user_agent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:64.0) Gecko/20100101 Firefox/64.0"),
  httr::write_disk("some-file-name-you-increment-02.csv")
)
# Prbly put ^^ in an iterator ---------------------------------------------
So, as I said, the site is pretty aggressive with regard to sessions. You can test for a non-search-results page or a re-ack page and then do the same basic POST to re-submit and refresh the session. Also, in my work through it there was a query parameter, d-16544-p=2, and the 2 after the = is the page number, so you may just be able to use it (or whatever it gives you for the increment variable) and start with the last page caught (so you'd need to keep track of that).
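To make that concrete, here is a minimal, untested sketch of the iteration under those assumptions: the d-16544-p parameter controls paging on the results URL, the disclaimer POST from earlier refreshes a lapsed session, and the helpers fetch_results_page() and looks_like_disclaimer() are hypothetical names, not part of httr or rvest.

# Hypothetical iterator: page through results, re-acknowledging the disclaimer if the session lapses.
ua <- httr::user_agent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:64.0) Gecko/20100101 Firefox/64.0")

looks_like_disclaimer <- function(pg) {
  # assumption: the re-ack page contains the disclaimer form action
  length(html_nodes(pg, xpath = ".//form[contains(@action, 'processDisclaimer')]")) > 0
}

re_acknowledge <- function() {
  httr::POST(
    url = "http://casesearch.courts.state.md.us/casesearch/processDisclaimer.jis",
    ua,
    body = list(disclaimer = "Y", action = "Continue"),
    encode = "form"
  )
}

fetch_results_page <- function(page_num) {
  # assumption: the results URL accepts the d-16544-p paging parameter noted above
  httr::GET(
    url = "http://casesearch.courts.state.md.us/casesearch/inquirySearch.jis",
    ua,
    query = list(`d-16544-p` = page_num)
  )
}

for (i in 1:50) {  # however many pages you need
  res <- fetch_results_page(i)
  pg  <- httr::content(res)

  if (looks_like_disclaimer(pg)) {   # session lapsed: re-ack and retry this page
    re_acknowledge()
    res <- fetch_results_page(i)
    pg  <- httr::content(res)
  }

  html_nodes(pg, xpath = ".//span[contains(@class, 'export csv')]/..") %>%
    html_attr("href") -> csv_url

  httr::GET(
    url = sprintf("http://casesearch.courts.state.md.us/%s", csv_url),
    ua,
    httr::write_disk(sprintf("results-page-%02d.csv", i), overwrite = TRUE)
  )
}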
Source: https://stackoverflow.com/questions/53296807/disable-dialog-box-save-as-rselenium