Question
I built a screen-scraping module that works fine, but with certain limitations. Now I want to remove those limitations, but I am running into unpredictable and varied errors. Before you jump to conclusions, let me explain what is actually happening. Initially I used screen scraping to retrieve results for a set of keywords (search terms) from one of Google's regional search engines at a time: co.in, co.uk, nl, de, com.
But now I have to run the scraping logic for multiple search engines and multiple keywords in a loop.
Let's look at an example:
keyword      se             company   rank
telephony    google.co.in   airtel    01
telephony    google.co.in   bsnl      04
telephony    google.co.in   aircel    06
telephony    google.co.in   idea      03
mobile op    google.co.uk   airtel    09
mobile op    google.co.uk   bsnl      04
and so on, for more than 6 keywords, all of the search engines shown, and every company.
Initially I was retrieving results for one keyword, one search engine, and all companies, but now I have to cover the full list of keywords, search engines, and companies, so I simply wrapped everything in loops. Then I ran into these errors:
- "Allowed memory size of 343322111 bytes exhausted" (to get past this I raised the limit with ini_set('memory_limit', ...))
- after some requests, Google started serving a captcha.
To get around the captcha I used sleep() or usleep(), but that did not solve the problem; in the end I got ERROR: connection reset.
I can't pass 30 seconds or more to usleep(); retrieving the data would then take hours. My code fetches 5 pages of Google results, which means 50 results. The library I'm using is
simple_html_dom.php
It works fine for 1 page, but not for more than 3 pages. What should I do or use instead?
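For reference, here is a minimal sketch of the kind of nested loop described above, assuming simple_html_dom's file_get_html(); the 'h3 a' selector and the URL parameters are illustrative guesses, not taken from the original code:

    include 'simple_html_dom.php';

    $keywords = array('telephony', 'mobile op');          // 6+ in practice
    $engines  = array('google.co.in', 'google.co.uk');    // plus nl, de, com

    foreach ($keywords as $kw) {
        foreach ($engines as $se) {
            for ($page = 0; $page < 5; $page++) {         // 5 pages per query
                $url = 'http://www.' . $se . '/search?q=' . urlencode($kw)
                     . '&start=' . ($page * 10);
                $html = file_get_html($url);              // simple_html_dom loader
                if (!$html) { continue; }
                foreach ($html->find('h3 a') as $link) {
                    // match company names in $link->plaintext, record the rank
                }
                // note: nothing frees $html here, so memory grows per page,
                // and the rapid-fire requests are what trigger the captcha
                usleep(500000);
            }
        }
    }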
Answer 1:
The captcha is Google's way of telling you that they have detected you using their search commercially, and that they want you to use their paid service from now on: http://code.google.com/intl/en/apis/customsearch/v1/overview.html
As for the memory problem, we can't help you without seeing some of your code. (But to conserve memory, at the very least extract just the keywords instead of keeping complete pages or DOM parse trees around.)
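For example, a minimal sketch of that advice with simple_html_dom (the 'h3 a' selector is illustrative; clear() plus unset() is the library's usual way to release a parse tree):

    $html = file_get_html($url);
    $titles = array();
    foreach ($html->find('h3 a') as $link) {
        $titles[] = $link->plaintext;   // keep only the strings you need
    }
    $html->clear();                     // release the DOM parse tree
    unset($html);                       // drop the page object itself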
Answer 2:
It doesn't matter what delay you use - it won't solve your problem. What you need to do is either use their API, whose results are inconsistent with the real results you see, or sign up for 100 proxies and iterate through them in round-robin fashion. You can easily scrape Google 24/7 with 100 proxies or so, and it only costs about $100. Make sure you clear cookies after every request, and set a good user agent (nothing dumb that makes Google think you're a bad bot).
I would rather do that than pay for their API, which gives you a limited number of calls and wastes your money. Yes, I know it's technically against their TOS, but what you're doing seems harmless.
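A minimal round-robin sketch using cURL, assuming a $proxies list from your provider; the proxy addresses and the user agent string are placeholders:

    $proxies = array('1.2.3.4:8080', '5.6.7.8:8080');   // ~100 in practice
    $urls    = array('http://www.google.com/search?q=telephony');

    function fetch_via_proxy($url, $proxy) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_PROXY, $proxy);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_USERAGENT,
            'Mozilla/5.0 (Windows NT 6.1; rv:2.0) Gecko/20100101 Firefox/4.0');
        curl_setopt($ch, CURLOPT_COOKIEFILE, '');  // in-memory cookie jar
        $body = curl_exec($ch);
        curl_close($ch);   // closing the handle discards the cookies too
        return $body;
    }

    $i = 0;
    foreach ($urls as $url) {
        $proxy = $proxies[$i++ % count($proxies)];  // round robin
        $page  = fetch_via_proxy($url, $proxy);
        // ... parse $page
    }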
Answer 3:
The sleep() function combined with &num=100 in the query solves the problem. Using &num=100 reduces the number of requests to Google by a factor of 10, and I used a 5-second delay between every request, which Google seems to treat as valid, genuine, human traffic.
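A sketch of that approach, assuming the standard q and num query parameters (one request per keyword instead of ten):

    include 'simple_html_dom.php';

    $keywords = array('telephony', 'mobile op');
    foreach ($keywords as $kw) {
        $url  = 'http://www.google.co.in/search?q=' . urlencode($kw) . '&num=100';
        $html = file_get_html($url);   // one request now covers 100 results
        // ... extract company ranks from $html
        $html->clear();
        unset($html);
        sleep(5);                      // 5 s between requests reads as human
    }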
Answer 4:
Instead of fetching the first 5 pages of 10 results each, fetch 1 page of 50 results!
Make sure you send a typical user agent so you don't look like a bot. To appear even less suspicious, occasionally follow some of the result links through Google's redirect URL, as a real user would.
You can also rent proxies, but the techniques above should be sufficient for most cases.
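A short sketch of both suggestions; since simple_html_dom fetches through PHP's HTTP stream wrapper, ini_set('user_agent', ...) applies to it, and the UA string below is just an example of a typical browser agent:

    include 'simple_html_dom.php';

    // send a typical browser user agent with every HTTP-wrapper request
    ini_set('user_agent',
        'Mozilla/5.0 (Windows NT 6.1; rv:2.0) Gecko/20100101 Firefox/4.0');

    $keyword = 'telephony';
    $url  = 'http://www.google.com/search?q=' . urlencode($keyword) . '&num=50';
    $html = file_get_html($url);   // one page of 50 results instead of 5 pages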
Source: https://stackoverflow.com/questions/5513083/screen-scraping-in-php-problem