Question
I am working on a crawler and I have to extract data from 200-300 links on Google Scholar. I have a working parser which gets data from the pages (each page lists 1-10 people profiles as the result of my query; I extract the proper links, go to the next page, and do it again). While running my program I got the error below:
org.jsoup.HttpStatusException: HTTP error fetching URL. Status=503, URL=https://ipv4.google.com/sorry/IndexRedirect?continue=https://scholar.google.pl/citations%3Fmauthors%3DAGH%2BUniversity%2Bof%2BScience%2Band%2BTechnology%26hl%3Dpl%26view_op%3Dsearch_authors&q=CGMSBFMKrI0YiJHfqgUiGQDxp4NLfGBv6zgPSjfyQ9LBi5F-K1EbGwQ
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:537)
I know this is caused by Google's simple protection against robots. How can I improve my connection
Connection connection =
Jsoup.connect(url)
.userAgent("Mozilla/5.0")
.timeout(10000)
.followRedirects(true);
so that I don't get a temporary ban? I know there is a way to check the response, like this:
Connection.Response response =
Jsoup.connect(url)
.userAgent("Mozilla/5.0")
.timeout(10000)
.execute();
int statusCode = response.statusCode();
if (statusCode == 200) { ... }
else if (statusCode == 503) { /* do reconnect magic */ }
But what should I do when I get a 503 error? Do I have to use a proxy? A random wait time between connections? I hope there is a better idea than saving my results to a file, doing a manual hard-restart of the router, and trying again with a new IP :P
Answer 1:
You have already provided your own answers...
Do I have to use a proxy?
Of course. You should already have set up a bunch of proxies for your crawling activity.
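A minimal sketch of routing a Jsoup request through one of your proxies, assuming Jsoup 1.9+ (which added Connection.proxy(host, port)); the host "my.proxy.example" and port 8080 are placeholder values, not real proxy settings:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;

public class ProxyFetch {
    public static Document fetchThroughProxy(String url) throws IOException {
        // "my.proxy.example" / 8080 are placeholders for one of your own proxies
        return Jsoup.connect(url)
                .proxy("my.proxy.example", 8080)   // requires Jsoup 1.9+
                .userAgent("Mozilla/5.0")
                .timeout(10000)
                .followRedirects(true)
                .get();
    }
}

In practice you would rotate through a pool of such proxies so no single IP sends all the requests.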
A random wait time between connections?
Yes. Use a random wait between 3000 and 5000 ms.
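Here is a minimal sketch of what the "reconnect magic" branch from your question could look like: a random 3-5 second pause before each attempt plus a retry loop when the status is not 200. The names fetchPolitely and maxRetries are made up for illustration, not part of any library:

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;
import java.util.concurrent.ThreadLocalRandom;

public class PoliteFetcher {

    // Fetch a URL, sleeping 3000-5000 ms before each attempt and
    // retrying a few times when Google answers with 503.
    public static Document fetchPolitely(String url, int maxRetries)
            throws IOException, InterruptedException {
        for (int attempt = 0; attempt < maxRetries; attempt++) {
            // random pause between 3000 and 5000 ms
            Thread.sleep(ThreadLocalRandom.current().nextLong(3000, 5001));

            Connection.Response response = Jsoup.connect(url)
                    .userAgent("Mozilla/5.0")
                    .timeout(10000)
                    .followRedirects(true)
                    .ignoreHttpErrors(true)   // don't throw on 503; inspect the status code instead
                    .execute();

            if (response.statusCode() == 200) {
                return response.parse();
            }
            // anything else (e.g. 503): loop around, wait again, and retry
        }
        throw new IOException("Still blocked after " + maxRetries + " attempts: " + url);
    }
}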
Alternatively, you could use an online captcha-solving service when you hit the URL https://ipv4.google.com/sorry/IndexRedirect.... Don't hit it too often or you'll get banned.
Happy coding :)
Source: https://stackoverflow.com/questions/30281650/org-jsoup-httpstatusexception-http-error-fetching-url-status-503-google-schol