I was trying to learn web scraping and I am facing a freaky issue... My task is to search Google for news on a topic in a certain date range and count the number of results.
To add to Vikas' answer, Google will also fail to use 'custom date range' for some user-agents. That is, for certain user-agents, Google will simply search for 'recent' results instead of your specified date range.
I haven't detected a clear pattern in which user-agents will break the custom date range. It seems that including a language is a factor.
Here are some examples of user-agents that break cdr
:
Mozilla/5.0 (Windows; U; Windows NT 6.1; fr-FR) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27
Mozilla/4.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/5.0)
There are a couple of things that is causing this issue. First, it wants day and month parts of date in 2 digits and it is also expecting a user-agent string of some popular browser. Following code should work:
import requests, bs4
headers = {
"User-Agent":
"Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36"
}
payload = {'as_epq': 'James Clark', 'tbs':'cdr:1,cd_min:01/01/2015,cd_max:01/01/2015', 'tbm':'nws'}
r = requests.get("https://www.google.com/search", params=payload, headers=headers)
soup = bs4.BeautifulSoup(r.content, 'html5lib')
print soup.find(id='resultStats').text