Question
I am creating a web scraper for several news outlets. The New York Times and The Guardian were easy, since they have their own APIs.
Now I want to scrape results from the newspaper GulfTimes.com. Its website does not provide an advanced search, so I resorted to Google News. However, the Google News API has been deprecated. What I want is to retrieve the number of results for an advanced search such as keyword = "Egypt", begin_date = "10/02/2011" and end_date = "10/05/2011".
This is feasible in the Google News UI: set the source to "Gulf Times", enter the query and dates, and count the results manually. But when I try to do this with Python, I get a 403 error, which is understandable.
Any idea how I could do this? Or is there another service besides Google News that would allow it? Keep in mind that I would issue almost 500 requests at once.
import json
import urllib2
import cookielib
import re
from bs4 import BeautifulSoup

def run():
    Query = "Egypt"
    Month = "3"
    FromDay = "2"
    ToDay = "4"
    Year = "13"
    url='https://www.google.com/search?pz=1&cf=all&ned=us&hl=en&tbm=nws&gl=us&as_q='+Query+'&as_occt=any&as_drrb=b&as_mindate='+Month+'%2F'+FromDay+'%2F'+Year+'&as_maxdate='+Month+'%2F'+ToDay+'%2F'+Year+'&tbs=cdr%3A1%2Ccd_min%3A3%2F1%2F13%2Ccd_max%3A3%2F2%2F13&as_nsrc=Gulf%20Times&authuser=0'
    cj = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    request = urllib2.Request(url)
    response = opener.open(request)
    htmlFile = BeautifulSoup(response)
    print htmlFile

run()
Answer 1:
You can use the excellent requests library:
import requests

# The date range is filled into as_mindate/as_maxdate and into the tbs filter.
URL = 'https://www.google.com/search?pz=1&cf=all&ned=us&hl=en&tbm=nws&gl=us&as_q={query}&as_occt=any&as_drrb=b&as_mindate={month}%2F{from_day}%2F{year}&as_maxdate={month}%2F{to_day}%2F{year}&tbs=cdr%3A1%2Ccd_min%3A{month}%2F{from_day}%2F{year}%2Ccd_max%3A{month}%2F{to_day}%2F{year}&as_nsrc=Gulf%20Times&authuser=0'

def run(**params):
    response = requests.get(URL.format(**params))
    print(response.content)
    print(response.status_code)

run(query="Egypt", month=3, from_day=2, to_day=2, year=13)
With this you'll get status_code=200.
Also, take a look at the Scrapy project. It makes web scraping much simpler.
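The hand-escaped query string above is easy to get wrong, so here is a minimal sketch that lets the standard library do the escaping. The function name build_news_url and the reduced parameter set are assumptions made for illustration; the parameter names themselves (as_q, as_nsrc, tbm, tbs) are the ones from the URL in the question, and Google may change or ignore them at any time.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.google.com/search"


def build_news_url(query, source, start, end):
    """Build a Google News search URL for a date range.

    start and end are (month, day, year) tuples, e.g. (3, 2, 13).
    This helper is hypothetical; it only assembles the URL.
    """
    min_date = "{}/{}/{}".format(*start)
    max_date = "{}/{}/{}".format(*end)
    params = {
        "hl": "en",
        "gl": "us",
        "tbm": "nws",       # news results
        "as_q": query,
        "as_nsrc": source,  # restrict to one news source
        "tbs": "cdr:1,cd_min:{},cd_max:{}".format(min_date, max_date),
    }
    # urlencode percent-escapes ':', ',' and '/', and encodes spaces as '+'.
    return BASE_URL + "?" + urlencode(params)


url = build_news_url("Egypt", "Gulf Times", (3, 2, 13), (3, 4, 13))
print(url)
```

The resulting string can be passed to requests.get in place of the formatted URL. Whether Google answers with 200 or 403 depends on its own rate limiting and is outside what this sketch controls.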
Answer 2:
You can scrape the titles in a simple way like this:
from bs4 import BeautifulSoup
import requests

url = "https://news.google.co.in/"
code = requests.get(url)
soup = BeautifulSoup(code.text, 'html5lib')

# Print the text of every headline span on the page.
for title in soup.find_all('span', class_="titletext"):
    print(title.text)
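If BeautifulSoup or html5lib is not installed, roughly the same extraction can be sketched with the standard library's html.parser. This is a swapped-in technique, not what the answer uses: the class TitleTextParser and the function extract_titles are hypothetical names, and the sample markup is made up to resemble a page with "titletext" spans.

```python
from html.parser import HTMLParser


class TitleTextParser(HTMLParser):
    """Collect the text of every <span class="titletext"> element."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._depth = 0    # > 0 while inside a matching span
        self._chunks = []  # text pieces of the current title

    def handle_starttag(self, tag, attrs):
        if self._depth:
            # Track nested spans so the right closing tag ends the title.
            if tag == "span":
                self._depth += 1
        elif tag == "span":
            classes = (dict(attrs).get("class") or "").split()
            if "titletext" in classes:
                self._depth = 1
                self._chunks = []

    def handle_endtag(self, tag):
        if self._depth and tag == "span":
            self._depth -= 1
            if self._depth == 0:
                self.titles.append("".join(self._chunks).strip())

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


def extract_titles(html):
    """Return the list of title strings found in an HTML document."""
    parser = TitleTextParser()
    parser.feed(html)
    parser.close()
    return parser.titles


# Made-up sample markup, for illustration only.
sample_html = (
    '<div><span class="titletext">Egypt <b>votes</b></span>'
    '<span class="other">ignored</span>'
    '<span class="titletext big">Second headline</span></div>'
)
titles = extract_titles(sample_html)
print(len(titles), titles)
```

Calling len() on the returned list gives the count the question asks for, but only for the titles present in the HTML that was actually fetched.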
Answer 3:
Disclosure: I work at SerpApi.
You can use the google-search-results package to extract data from Google News. Check a demo at Repl.it.
from serpapi.google_search_results import GoogleSearchResults

month = 4
from_day = 2
to_day = 3
year = 2020

params = {
    "engine": "google",
    "q": "Trump",
    "google_domain": "google.com",
    "tbm": "nws",
    "tbs": f"cdr:1,cd_min:{month}/{from_day}/{year},cd_max:{month}/{to_day}/{year}",
}

client = GoogleSearchResults(params)
data = client.get_dict()

print("News results")
for result in data['news_results']:
    print(f"""
Title: {result['title']}
Snippet: {result['snippet']}
Date: {result['date']}
""")
Part of JSON response
{
  "news_results": [
    {
      "position": 1,
      "title": "Trump Promotes Oil Deal That May Not Exist",
      "link": "https://www.nytimes.com/2020/04/02/us/politics/trump-russia-saudi-arabia-oil.html",
      "source": "The New York Times",
      "date": "15 hours ago",
      "snippet": "WASHINGTON — When oil prices crashed in early March after a dispute between \nRussia and Saudi Arabia, President Trump put a positive spin on the news.",
      "thumbnail": ""
    },
    {
      "position": 2,
      "title": "Trump’s Oil Summit",
      "link": "https://www.wsj.com/articles/trumps-oil-summit-11585870063",
      "source": "Wall Street Journal",
      "date": "Opinion · 16 hours ago",
      "snippet": "Trump's Oil Summit. Tariffs and quotas won't solve a price shock caused by \na pandemic and a Saudi Arabia-Russia feud.",
      "thumbnail": ""
    }
  ]
}
Output
News results
Title: Trump Promotes Oil Deal That May Not Exist
Snippet: WASHINGTON — When oil prices crashed in early March after a dispute between
Russia and Saudi Arabia, President Trump put a positive spin on the news.
Date: 15 hours ago
Title: Trump’s Oil Summit
Snippet: Trump's Oil Summit. Tariffs and quotas won't solve a price shock caused by
a pandemic and a Saudi Arabia-Russia feud.
Date: Opinion · 16 hours ago
Title: OPEC and allies reportedly set for video meeting as analysts pour
skepticism on Trump's intervention
Snippet: “Donald Trump's tweet … It's nonsense, really,” Patrick Armstrong, chief
investment officer at Plurimi Investment Managers, told CNBC's “Squawk Box
Europe” on ...
Date: 5 hours ago
Title: Trump again tests negative for coronavirus
Snippet: President Donald Trump on Thursday again tested negative for the
coronavirus after being tested by the White House physician, according to
two White House ...
Date: 17 hours ago
Title: Trump passes the buck as deadly ventilator shortage looms
Snippet: (CNN) President Donald Trump is pinning the blame on states for a shortage
of ventilators that governors warn could effectively condemn coronavirus
patients to ...
Date: 10 hours ago
If you want more information, check out the SerpApi documentation or the live playground.
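Since the question asks for the number of results rather than their text, here is a minimal sketch that counts entries in a response shaped like the JSON above, optionally restricted to one source. The helper count_results is a hypothetical name, and the sample dictionary is trimmed from the response shown in this answer. It only counts what is present in a single response, so pagination is not covered.

```python
def count_results(data, source=None):
    """Count news results, optionally only those from one source.

    data is a dict with a "news_results" list, as in the JSON above.
    The source comparison ignores letter case.
    """
    results = data.get("news_results", [])
    if source is None:
        return len(results)
    wanted = source.casefold()
    return sum(1 for r in results if r.get("source", "").casefold() == wanted)


# Trimmed copy of the response shown above, for illustration.
sample = {
    "news_results": [
        {"position": 1,
         "title": "Trump Promotes Oil Deal That May Not Exist",
         "source": "The New York Times"},
        {"position": 2,
         "title": "Trump’s Oil Summit",
         "source": "Wall Street Journal"},
    ]
}

total = count_results(sample)
nyt_only = count_results(sample, source="the new york times")
print(total, nyt_only)
```

For the original use case, the same call with source="Gulf Times" would count only the entries whose "source" field matches that name.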
Source: https://stackoverflow.com/questions/15550655/web-scraping-google-news-with-python