Web scraping Google News with Python

放肆的年华 submitted on 2020-04-26 14:54:48

Question


I am creating a web scraper for different news outlets. For the New York Times and the Guardian it was easy, since they have their own APIs.

Now I want to scrape results from the newspaper GulfTimes.com. They do not provide an advanced search on their website, so I resorted to Google News. However, the Google News API has been deprecated. What I want is to retrieve the number of results from an advanced search such as keyword = "Egypt", begin_date = "10/02/2011" and end_date = "10/05/2011".

This is feasible in the Google News UI by setting the source to "Gulf Times", entering the query and dates, and counting the results manually. But when I try to do this with Python, I get a 403 error, which is understandable.

Any idea how I could do this? Or is there another service besides Google News that would allow it? Keep in mind that I would issue almost 500 requests at once.

import json
import urllib2
import cookielib
import re
from bs4 import BeautifulSoup


def run():
   Query = "Egypt"
   Month = "3"
   FromDay = "2"
   ToDay = "4"
   Year = "13"
   url='https://www.google.com/search?pz=1&cf=all&ned=us&hl=en&tbm=nws&gl=us&as_q='+Query+'&as_occt=any&as_drrb=b&as_mindate='+Month+'%2F'+FromDay+'%2F'+Year+'&as_maxdate='+Month+'%2F'+ToDay+'%2F'+Year+'&tbs=cdr%3A1%2Ccd_min%3A3%2F1%2F13%2Ccd_max%3A3%2F2%2F13&as_nsrc=Gulf%20Times&authuser=0'
   cj = cookielib.CookieJar()
   opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
   request = urllib2.Request(url)   
   response = opener.open(request)
   htmlFile = BeautifulSoup(response)
   print htmlFile


run()

Answer 1:


You can use the excellent requests library:

import requests

URL = 'https://www.google.com/search?pz=1&cf=all&ned=us&hl=en&tbm=nws&gl=us&as_q={query}&as_occt=any&as_drrb=b&as_mindate={month}%2F{from_day}%2F{year}&as_maxdate={month}%2F{to_day}%2F{year}&tbs=cdr%3A1%2Ccd_min%3A3%2F1%2F13%2Ccd_max%3A3%2F2%2F13&as_nsrc=Gulf%20Times&authuser=0'


def run(**params):
    response = requests.get(URL.format(**params))
    print(response.content, response.status_code)


run(query="Egypt", month=3, from_day=2, to_day=2, year=13)

And you'll get status_code=200.
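Note that the URL template above keeps the `tbs` date range fixed regardless of the arguments passed to `run`. A minimal sketch of building the whole query string from the parameters, using only the standard library, could look like the following. The helper name `build_news_url` is hypothetical, the parameter names are copied from the URL above, and whether Google still honors each of them is an assumption.

```python
from urllib.parse import urlencode


def build_news_url(query, source, month, from_day, to_day, year):
    """Build a Google News search URL for one source and one date range."""
    min_date = f"{month}/{from_day}/{year}"
    max_date = f"{month}/{to_day}/{year}"
    params = {
        "hl": "en",
        "gl": "us",
        "tbm": "nws",          # news search
        "as_q": query,
        "as_occt": "any",
        "as_drrb": "b",
        "as_mindate": min_date,
        "as_maxdate": max_date,
        # Derive the custom date range from the same values instead of hardcoding it.
        "tbs": f"cdr:1,cd_min:{min_date},cd_max:{max_date}",
        "as_nsrc": source,
    }
    # urlencode percent-escapes "/", ":" and "," and turns spaces into "+".
    return "https://www.google.com/search?" + urlencode(params)


url = build_news_url("Egypt", "Gulf Times", 3, 2, 4, 13)
print(url)
```

The resulting string can be passed to `requests.get` in place of `URL.format(**params)`.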

Also, take a look at the Scrapy project. Few tools make web scraping simpler.
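The question mentions a 403 from urllib2 and roughly 500 requests. One plausible cause of the 403 (an assumption, not confirmed in this thread) is the default User-Agent that the standard library sends. The sketch below sets an explicit User-Agent and spaces the requests out instead of sending them all at once. The names `make_request`, `fetch_all`, `fetch_status` and the `BROWSER_UA` value are illustrative, and the two-second delay is an arbitrary choice.

```python
import time
import urllib.request

# Illustrative value; any browser-like string could be used here.
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64)"


def make_request(url, user_agent=BROWSER_UA):
    """Wrap a URL in a Request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_status(request):
    """Open the request and return the HTTP status code.

    urlopen raises urllib.error.HTTPError for responses such as 403.
    """
    with urllib.request.urlopen(request) as response:
        return response.status


def fetch_all(urls, fetch, delay_seconds=2.0, sleep=time.sleep):
    """Fetch the URLs one by one, pausing between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # pause before every request except the first
        results.append(fetch(make_request(url)))
    return results


# Usage (performs real network requests, so it is left commented out):
# statuses = fetch_all(list_of_urls, fetch_status)
```

Passing `fetch` and `sleep` as arguments keeps the pacing logic separate from the network call, so the loop can be tried out with a stand-in function first.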




Answer 2:


You can scrape it easily like this:

from bs4 import BeautifulSoup
import requests

url="https://news.google.co.in/"
code=requests.get(url)
soup=BeautifulSoup(code.text,'html5lib')
for title in soup.find_all('span', class_="titletext"):
    print(title.text)



Answer 3:


Disclosure: I work at SerpApi.


You can use the google-search-results package to extract data from Google News. Check out the demo on Repl.it.

from serpapi.google_search_results import GoogleSearchResults

month = 4
from_day = 2
to_day = 3
year = 2020

params = {
    "engine": "google",
    "q": "Trump",
    "google_domain": "google.com",
    "tbm": "nws",
    "tbs": f"cdr:1,cd_min:{month}/{from_day}/{year},cd_max:{month}/{to_day}/{year}",
}

client = GoogleSearchResults(params)
data = client.get_dict()

print("News results")

for result in data['news_results']:
    print(f"""
Title: {result['title']}
Snippet: {result['snippet']}
Date: {result['date']}
""")

Part of JSON response

{
  "news_results": [
    {
      "position": 1,
      "title": "Trump Promotes Oil Deal That May Not Exist",
      "link": "https://www.nytimes.com/2020/04/02/us/politics/trump-russia-saudi-arabia-oil.html",
      "source": "The New York Times",
      "date": "15 hours ago",
      "snippet": "WASHINGTON — When oil prices crashed in early March after a dispute between \nRussia and Saudi Arabia, President Trump put a positive spin on the news.",
      "thumbnail": ""
    },
    {
      "position": 2,
      "title": "Trump’s Oil Summit",
      "link": "https://www.wsj.com/articles/trumps-oil-summit-11585870063",
      "source": "Wall Street Journal",
      "date": "Opinion · 16 hours ago",
      "snippet": "Trump's Oil Summit. Tariffs and quotas won't solve a price shock caused by \na pandemic and a Saudi Arabia-Russia feud.",
      "thumbnail": ""
    }
  ]
}

Output

News results

Title: Trump Promotes Oil Deal That May Not Exist
Snippet: WASHINGTON — When oil prices crashed in early March after a dispute between 
Russia and Saudi Arabia, President Trump put a positive spin on the news.
Date: 15 hours ago


Title: Trump’s Oil Summit
Snippet: Trump's Oil Summit. Tariffs and quotas won't solve a price shock caused by 
a pandemic and a Saudi Arabia-Russia feud.
Date: Opinion · 16 hours ago


Title: OPEC and allies reportedly set for video meeting as analysts pour 
skepticism on Trump's intervention
Snippet: “Donald Trump's tweet … It's nonsense, really,” Patrick Armstrong, chief 
investment officer at Plurimi Investment Managers, told CNBC's “Squawk Box 
Europe” on ...
Date: 5 hours ago


Title: Trump again tests negative for coronavirus
Snippet: President Donald Trump on Thursday again tested negative for the 
coronavirus after being tested by the White House physician, according to 
two White House ...
Date: 17 hours ago


Title: Trump passes the buck as deadly ventilator shortage looms
Snippet: (CNN) President Donald Trump is pinning the blame on states for a shortage 
of ventilators that governors warn could effectively condemn coronavirus 
patients to ...
Date: 10 hours ago

For more information, check out the SerpApi documentation or the live playground.
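The question asks for the number of results from one source. Assuming a response shaped like the JSON excerpt above (a `news_results` list whose items carry a `source` field), a per-source count could be sketched as follows. The `count_by_source` helper and the trimmed sample are illustrative and not part of the SerpApi package, and the count covers only the results present in a single response.

```python
import json
from collections import Counter

# Trimmed sample with the same shape as the JSON excerpt above.
SAMPLE_RESPONSE = """
{
  "news_results": [
    {"position": 1,
     "title": "Trump Promotes Oil Deal That May Not Exist",
     "source": "The New York Times"},
    {"position": 2,
     "title": "Trump's Oil Summit",
     "source": "Wall Street Journal"}
  ]
}
"""


def count_by_source(data):
    """Count the news results per source in one response dict."""
    results = data.get("news_results", [])
    return Counter(item.get("source", "unknown") for item in results)


data = json.loads(SAMPLE_RESPONSE)
counts = count_by_source(data)
print(sum(counts.values()))          # total results in this response
print(counts["The New York Times"])  # results from a single source
```

With the real client, the dict returned by `client.get_dict()` would take the place of `data`.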



Source: https://stackoverflow.com/questions/15550655/web-scraping-google-news-with-python
