Wrong number of results in Google Scrape with Python

情歌与酒 2021-01-03 16:48

I was trying to learn web scraping and I am facing a freaky issue... My task is to search Google for news on a topic in a certain date range and count the number of results.

2 Answers
  • 2021-01-03 17:10

    To add to Vikas' answer, Google will also fail to use 'custom date range' for some user-agents. That is, for certain user-agents, Google will simply search for 'recent' results instead of your specified date range.

    I haven't detected a clear pattern in which user-agents break the custom date range. Including a language code in the user-agent (such as fr-FR in the first example below) seems to be a factor.

    Here are some examples of user-agents that break the custom date range (cdr):

    Mozilla/5.0 (Windows; U; Windows NT 6.1; fr-FR) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27

    Mozilla/4.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/5.0)

  • 2021-01-03 17:15

    A couple of things are causing this issue. First, Google wants the day and month parts of the date as two digits. Second, it expects the user-agent string of a popular browser. The following code should work:

    import requests
    import bs4

    headers = {
        "User-Agent":
            "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36"
    }
    # Exact phrase, custom date range (cdr) with two-digit day and month, news results only
    payload = {'as_epq': 'James Clark', 'tbs': 'cdr:1,cd_min:01/01/2015,cd_max:01/01/2015', 'tbm': 'nws'}
    r = requests.get("https://www.google.com/search", params=payload, headers=headers)

    soup = bs4.BeautifulSoup(r.content, 'html5lib')
    # The result count is read from the element with id 'resultStats'
    print(soup.find(id='resultStats').text)
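
    To avoid the two-digit problem when the dates come from variables, the tbs value can be generated rather than typed by hand. This is only a minimal sketch: build_tbs and build_search_url are hypothetical helper names, and it assumes Google reads the dates as MM/DD/YYYY, which the all-ones example above does not confirm.

```python
from datetime import date
from urllib.parse import urlencode


def build_tbs(start: date, end: date) -> str:
    """Return a custom-date-range (cdr) value with zero-padded month and day."""
    # Assumption: dates are interpreted as MM/DD/YYYY.
    fmt = "%m/%d/%Y"
    return f"cdr:1,cd_min:{start.strftime(fmt)},cd_max:{end.strftime(fmt)}"


def build_search_url(phrase: str, start: date, end: date) -> str:
    """Build a news-search URL for an exact phrase within a date range."""
    params = {"as_epq": phrase, "tbs": build_tbs(start, end), "tbm": "nws"}
    return "https://www.google.com/search?" + urlencode(params)


print(build_tbs(date(2015, 1, 1), date(2015, 3, 9)))
# cdr:1,cd_min:01/01/2015,cd_max:03/09/2015
```

    The string returned by build_tbs can be used directly as the 'tbs' entry of the payload dictionary passed to requests.get.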
    