Wrong number of results in Google Scrape with Python


I am learning web scraping and have run into a strange issue. My task is to search Google for news on a topic within a given date range and count the number of results.


2 Answers

生来不讨喜

2021-01-03 17:10

To add to Vikas' answer, Google will also fail to use 'custom date range' for some user-agents. That is, for certain user-agents, Google will simply search for 'recent' results instead of your specified date range.

I haven't detected a clear pattern in which user-agents will break the custom date range. It seems that including a language is a factor.

Here are some examples of user-agents that break the custom date range (cdr):

Mozilla/5.0 (Windows; U; Windows NT 6.1; fr-FR) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27

Mozilla/4.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/5.0)
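Since there is no clear pattern, one option is to test candidate user-agents empirically: run the same date-restricted query with a user-agent known to work and with each candidate, then compare the reported result counts. The sketch below assumes that approach. `agents_breaking_cdr` and `fetch_stats` are hypothetical names, not part of any library; `fetch_stats` is a callable you supply that runs the query with a given user-agent and returns the result-count text. A mismatch is only a hint, since counts may differ for other reasons.

```python
def agents_breaking_cdr(reference_agent, candidates, fetch_stats):
    """Return the candidate user-agents whose result stats differ from the reference.

    fetch_stats is a hypothetical callable: it takes a user-agent string,
    runs the date-restricted search with it, and returns the result-count text.
    """
    expected = fetch_stats(reference_agent)
    # Keep the original order so the output is easy to match against the input list.
    return [ua for ua in candidates if fetch_stats(ua) != expected]
```

In practice `fetch_stats` would wrap a real HTTP request; for a dry run it can be any function or a dictionary's `get` method that maps user-agents to canned responses.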


长情又很酷

2021-01-03 17:15

Two things are causing this issue. First, Google wants the day and month parts of each date as two digits. Second, it expects the user-agent string of a popular browser. The following code should work:

import bs4
import requests

# Google expects the user-agent string of a popular browser.
headers = {
    "User-Agent":
        "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36"
}
# The day and month in cd_min/cd_max must each be two digits.
payload = {'as_epq': 'James Clark', 'tbs': 'cdr:1,cd_min:01/01/2015,cd_max:01/01/2015', 'tbm': 'nws'}
r = requests.get("https://www.google.com/search", params=payload, headers=headers)

soup = bs4.BeautifulSoup(r.content, 'html5lib')  # requires the html5lib package
print(soup.find(id='resultStats').text)
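To avoid hand-writing the zero-padded dates, the `tbs` value can be built from `datetime.date` objects. This is a minimal sketch: `build_cdr_tbs` is a hypothetical helper name, and it assumes month/day/year order, which the example date above (01/01/2015) does not settle either way.

```python
from datetime import date


def build_cdr_tbs(start, end):
    """Build the 'tbs' value for a custom date range from two date objects."""
    # Assumption: month/day/year order. %m and %d always produce two digits.
    fmt = "%m/%d/%Y"
    return "cdr:1,cd_min:{},cd_max:{}".format(start.strftime(fmt), end.strftime(fmt))


payload = {
    "as_epq": "James Clark",
    "tbs": build_cdr_tbs(date(2015, 1, 1), date(2015, 1, 1)),
    "tbm": "nws",
}
```

The resulting `payload` can be passed to `requests.get` exactly as in the code above.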
