Question
Happy New Year!
I have just started to learn Julia, and the first mini-challenge I have set myself is to scrape data from a large list of URLs.
I have about 50k URLs (which I successfully parsed from a JSON file with Julia using regex) in a CSV file. I want to scrape each one and return a matched string ("/page/12345/view", where 12345 is any integer).
I managed to do so using HTTP and Queryverse (I had started with CSV and CSVFiles, but I am trying out different packages for learning purposes), but the script seems to stop after just under 2k URLs. I can't see an error such as a timeout.
May I ask if anyone can advise what I'm doing wrong, or how I could approach it differently? Explanations/links to learning resources would also be great!
using HTTP, Queryverse

URLs = load("urls.csv") |> DataFrame

# Pattern to extract, e.g. "/page/12345/view"
patternid = r"\/page\/[0-9]+\/view"

touch("ids.txt")
f = open("ids.txt", "a")

for row in eachrow(URLs)
    urlResponse = HTTP.get(row[:url])
    # Skip URLs that come back as 404
    if Int(urlResponse.status) == 404
        continue
    end
    urlHTML = String(urlResponse.body)
    urlIDmatch = match(patternid, urlHTML)
    write(f, urlIDmatch.match, "\n")
end

close(f)
Answer 1:
There can always be a server that detects your scraper and intentionally takes a very long time to respond.
Basically, since scraping is an I/O-intensive operation, you should do it with a large number of asynchronous tasks. Moreover, this should be combined with the readtimeout parameter of the get function. Hence your code will look more or less like this:
asyncmap(1:nrow(URLs); ntasks=50) do n
    row = URLs[n, :]
    urlResponse = HTTP.get(row[:url], readtimeout=10)
    # the rest of your code comes here
end
Even if some servers are delaying transmission, many connections will always be making progress.
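Putting the answer's pattern together with the loop body from the question, a minimal end-to-end sketch might look like the one below (the fetchid helper is introduced here only for illustration). Two assumptions worth flagging: HTTP.jl's get throws an exception for non-2xx responses by default, so status_exception=false is passed to make the explicit 404 check from the question reachable, and the nothing guard covers pages where the pattern does not match (the unguarded urlIDmatch.match in the question would throw in that case).

using HTTP, Queryverse

URLs = load("urls.csv") |> DataFrame
patternid = r"\/page\/[0-9]+\/view"

# Fetch one URL and return the matched id string, or `nothing` on failure.
function fetchid(url)
    try
        # status_exception=false stops HTTP.get from throwing on 4xx/5xx,
        # so the status code can be inspected as in the original loop.
        r = HTTP.get(url; readtimeout=10, status_exception=false)
        if r.status == 404
            return nothing
        end
        m = match(patternid, String(r.body))
        return m === nothing ? nothing : m.match
    catch e
        # A timed-out or otherwise failed request is logged and skipped
        # instead of aborting the whole run.
        @warn "request failed" url exception=e
        return nothing
    end
end

# Run up to 50 requests concurrently; results come back in input order.
ids = asyncmap(eachrow(URLs); ntasks=50) do row
    fetchid(row[:url])
end

# Write all matches in one pass so concurrent tasks never touch the file.
open("ids.txt", "w") do f
    for id in ids
        id === nothing || write(f, id, "\n")
    end
end

Collecting the results with asyncmap and writing them at the end is a deliberate design choice: it keeps file I/O out of the concurrent section, and with ntasks=50 a few slow or unresponsive servers can no longer stall the whole run.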
Source: https://stackoverflow.com/questions/65530350/scraping-string-from-a-large-number-of-urls-with-julia