Question
I want to fill in a web form, submit my query, and download the resulting data. Some fields offer either a drop-down menu or a free-text search box, and any field can be left blank (if all fields are left blank, the entire database is downloaded). Clicking the "Search and Download" button should trigger a file download.
Here is what I have tried (selecting all records for species "Salmo salar") based on this question. I used the Developer Tools in my browser (Opera) to inspect the page elements and identify the names of all the possible fields:
library(httr)
url <- "https://nzffdms.niwa.co.nz/search"
fd <- list(
  search_catchment_no_name = "",
  search_river_lake = "",
  search_sampling_locality = "",
  search_fishing_method = "",
  search_start_year = "",
  search_end_year = "",
  search_species = "Salmo salar", # species of interest
  search_download_format = 1, # select csv file format
  submit = "Search and Download"
)
POST(url, body = fd, encode = "form")
I had hoped this would download a csv file (all records for species "Salmo salar"), but no file is downloaded. Instead, it outputs this (a list of 10; only the first part is shown):
Response [https://nzffdms.niwa.co.nz/search]
Date: 2019-10-02 23:35
Status: 200
Content-Type: text/html; charset=utf-8
Size: 19.1 kB
<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; c...
<meta name="title" content="NZ Freshwater Fish Database...
<meta name="description" content="NIWA NZ Freshwater Fish...
<meta name="keywords" content="NIWA, NZ, Freshwater Fish" />
<meta name="language" content="en" />
<meta name="robots" content="index, follow />
...
Edit
I think the issue is with how I am calling the "Search and Download" button. When inspecting the web page, most fields look like this:
# end year field
<input maxlength="4" class="form-control" type="text" name="search[end_year]" id="search_end_year">
But the "Search and Download" button element doesn't have a name or id attribute:
<input type="submit" value="Search and Download" class="btn btn-primary btn-md">
Also, I have just noticed there is a hidden field. Maybe I need to define this?
<input type="hidden" name="search[_csrf_token]" value="d1530f09c1ce8110b5163bd100cb0d67" id="search__csrf_token">
Any advice on how I can get the file downloading would be much appreciated.
Answer 1:
First, check robots.txt on the website. It is commented out as of Oct 3, 2019.
Then read the terms and conditions at https://nzffdms.niwa.co.nz/terms and https://www.niwa.co.nz/freshwater-and-estuaries/nzffd/user-guide/tips, and make sure you comply with them.
It is also important to throttle the requests made below.
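One way to throttle is to pause between consecutive requests. The following is only a minimal sketch in base R: `throttled_lapply` and `fake_query` are hypothetical names, the delay value is arbitrary, and the second ID (12) is a placeholder rather than a real species code. In practice `fake_query` would be replaced by a function that wraps the `POST()` call.

```r
# Hypothetical helper: call `f` on each element of `xs`, pausing `delay`
# seconds between consecutive calls so the server is not hit in a burst.
throttled_lapply <- function(xs, f, delay = 2) {
  results <- vector("list", length(xs))
  for (i in seq_along(xs)) {
    if (i > 1) Sys.sleep(delay)  # wait before every call except the first
    results[[i]] <- f(xs[[i]])
  }
  results
}

# Stand-in for a real request function (assumption: a real one would wrap POST()).
fake_query <- function(species_id) paste("queried species", species_id)

started <- Sys.time()
res <- throttled_lapply(c(68, 12), fake_query, delay = 0.2)  # 12 is a placeholder ID
elapsed <- as.numeric(difftime(Sys.time(), started, units = "secs"))
```

Choose the delay to suit the site's terms; a fixed pause is the simplest option, not necessarily the most polite one.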
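After checking all the terms and conditions, you can use the code below to query for your data. Compared with the attempt in the question, it keys each field by its name attribute (for example search[end_year]) rather than its id (search_end_year), reads the hidden CSRF token from the search page and sends it back, and posts to /doSearch instead of /search: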
library(httr)
library(xml2)

# Fetch the search page first: it holds the option lists and the CSRF token
gr <- GET("https://nzffdms.niwa.co.nz/search")
doc <- read_html(content(gr, "text")) # doc <- read_html(gr) works as well

# Build a NAME/VALUE lookup table from the <option>s of a <select> field
getTbl <- function(x) {
  do.call(rbind, lapply(xml_find_all(doc, paste0(".//select[@name='search", x, "']/option")),
    function(n) data.frame(NAME = xml_text(n), VALUE = xml_attr(n, "value"))))
}
fishing_method <- getTbl("[fishing_method]")
species <- getTbl("[species][]") # look up the VALUE of the species of interest here

# The hidden token has to be sent back with the form
csrf_token <- xml_attr(xml_find_all(doc, ".//input[@name='search[_csrf_token]']"), "value")

# Fields are keyed by their name attribute (search[...]), not their id
fd <- list(
  "search[catchment_no_name]" = "",
  "search[river_lake]" = "",
  "search[sampling_locality]" = "",
  "search[fishing_method]" = "",
  "search[species][]" = "",
  "search[species][]" = 68, # option value of the species of interest
  "search[start_year]" = "",
  "search[end_year]" = "",
  "search[download_format]" = "1", # csv
  "search[_csrf_token]" = csrf_token
)

r <- POST("https://nzffdms.niwa.co.nz/doSearch", body = fd, encode = "form")
read.csv(text = content(r, "text", encoding = "UTF-8"))
Output:
   card m    y catchname  catch        locality time  org map    east   north altitude penet fishmeth effort pass spcode abund number minl maxl  nzreach
1  3964 1 1981   Waiau R 797.49       Lake Gunn   NA niwa d41 2122400 5581200      477   225      ang     NA   NA salsal    NA     NA   NA   NA 15006671
2  3965 1 1981   Waiau R 797.49     Lake Fergus   NA niwa d41 2123700 5584400      483   229      ang     NA   NA salsal    NA     NA   NA   NA 15006092
3 15975 1 2003   Waiau R 797.40 Excelsior Creek 1330 niwa d44 2095800 5495800      190    94      efp     80    1 salsal    NA      2  102  105 15030686
4 50772 1 1940   Waiau R 797.49 Upukerora River   NA  unk d43 2098500 5519900      210   146      unk     NA   NA salsal    NA     NA   NA   NA 15020897
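Since the question asks for a downloaded file rather than a data frame, the response body can also be written to disk. This is a rough sketch in base R: `save_body` is a hypothetical helper, the file name is arbitrary, and `demo_raw` is an in-memory stand-in for the bytes that `content(r, "raw")` would return from the real response.

```r
# Hypothetical helper: write raw response bytes to `path` and return the path.
save_body <- function(body_raw, path) {
  writeBin(body_raw, path)
  invisible(path)
}

# Stand-in for content(r, "raw"): a tiny CSV built in memory (assumption).
demo_raw <- charToRaw("card,spcode\n3964,salsal\n")

out_file <- file.path(tempdir(), "salmo_salar.csv")
save_body(demo_raw, out_file)

# Read the saved file back to confirm it is a valid CSV.
saved <- read.csv(out_file)
```

With the real response this would be `save_body(content(r, "raw"), out_file)`. Alternatively, httr can stream the body straight to a file by passing `write_disk(out_file)` as an extra argument to `POST()`.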
Source: https://stackoverflow.com/questions/58159645/fill-in-web-form-submit-and-download-results