Question
Happy New Year!
I have just started to learn Julia, and the first mini-challenge I have set myself is to scrape data from a large list of URLs.
I have about 50k URLs (which I successfully parsed from a JSON file with Julia using regex) in a CSV file. I want to scrape each one and return a matched string ("/page/12345/view", where 12345 is any integer).
I managed to do so using HTTP and Queryverse (I had started with CSV and CSVFiles, but I am trying out different packages for learning purposes), but the script seems to stop after just under 2k URLs. I can't see an error such as a timeout.
May I ask if anyone can advise what I'm doing wrong, or how I could approach it differently? Explanations/links to learning resources would also be great!
using HTTP, Queryverse

URLs = load("urls.csv") |> DataFrame

# Pattern to extract, e.g. "/page/12345/view"
patternid = r"\/page\/[0-9]+\/view"

touch("ids.txt")
f = open("ids.txt", "a")

for row in eachrow(URLs)
    urlResponse = HTTP.get(row[:url])
    # Skip URLs that come back as 404
    if Int(urlResponse.status) == 404
        continue
    end
    urlHTML = String(urlResponse.body)
    urlIDmatch = match(patternid, urlHTML)
    write(f, urlIDmatch.match, "\n")
end

close(f)
Answer 1:
There can always be a server that detects your scraper and intentionally takes a very long time to respond.
Basically, since scraping is an I/O-intensive operation, you should do it with a large number of asynchronous tasks. Moreover, this should be combined with the readtimeout parameter of the get function. Hence your code will look more or less like this:
asyncmap(1:nrow(URLs); ntasks=50) do n
    row = URLs[n, :]
    urlResponse = HTTP.get(row[:url], readtimeout=10)
    # the rest of your code comes here
end
Even if some servers are delaying transmission, many connections will always be making progress.
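Putting the answer's pattern together with the loop body from the question, a minimal end-to-end sketch might look like the one below (the fetchid helper is introduced here only for illustration). Two assumptions worth flagging: HTTP.jl's get throws an exception for non-2xx responses by default, so status_exception=false is passed to make the explicit 404 check from the question reachable, and the nothing guard covers pages where the pattern does not match (the unguarded urlIDmatch.match in the question would throw in that case).

using HTTP, Queryverse

URLs = load("urls.csv") |> DataFrame
patternid = r"\/page\/[0-9]+\/view"

# Fetch one URL and return the matched id string, or `nothing` on failure.
function fetchid(url)
    try
        # status_exception=false stops HTTP.get from throwing on 4xx/5xx,
        # so the status code can be inspected as in the original loop.
        r = HTTP.get(url; readtimeout=10, status_exception=false)
        if r.status == 404
            return nothing
        end
        m = match(patternid, String(r.body))
        return m === nothing ? nothing : m.match
    catch e
        # A timed-out or otherwise failed request is logged and skipped
        # instead of aborting the whole run.
        @warn "request failed" url exception=e
        return nothing
    end
end

# Run up to 50 requests concurrently; results come back in input order.
ids = asyncmap(eachrow(URLs); ntasks=50) do row
    fetchid(row[:url])
end

# Write all matches in one pass so concurrent tasks never touch the file.
open("ids.txt", "w") do f
    for id in ids
        id === nothing || write(f, id, "\n")
    end
end

Collecting the results with asyncmap and writing them at the end is a deliberate design choice: it keeps file I/O out of the concurrent section, and with ntasks=50 a few slow or unresponsive servers can no longer stall the whole run.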
Source: https://stackoverflow.com/questions/65530350/scraping-string-from-a-large-number-of-urls-with-julia