Question
I've been trying to figure out how to iterate over result pages to scrape multiple news articles.
This is the page I want to scrape, along with the pages that follow it: https://www.startribune.com/search/?page=1&q=China%20COVID-19&refresh=true
I tried the code below, but it doesn't produce the correct result:
import requests

def scrap(url):
    user_agent = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; Touch; rv:11.0) like Gecko'}
    urls = [f"{url}{x}" for x in range(1,10)]
    params = {
        'q': 'China%20COVID-19'
    }
    for page in urls:
        response = requests.get(url=page,
                                headers=user_agent,
                                params=params)
        print(page)

print(scrap('https://www.startribune.com/search/'))
Please suggest improvements or solutions!
The results I expect are:
https://www.startribune.com/search/?page=1&q=China%20COVID-19&refresh=true
https://www.startribune.com/search/?page=2&q=China%20COVID-19&refresh=true
...
https://www.startribune.com/search/?page=9&q=China%20COVID-19&refresh=true
Answer 1:
As mentioned in the comments, make sure the params are complete. Also pass the search term unencoded: requests URL-encodes param values itself, so a pre-encoded 'China%20COVID-19' is sent as 'China%2520COVID-19'.

import requests

def scrap(url):
    user_agent = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; Touch; rv:11.0) like Gecko'}
    params = {
        'q': 'China COVID-19',  # unencoded; requests encodes the space itself
        'refresh': 'true',
    }
    for page_no in range(1, 10):
        # Only the page number changes between requests.
        params['page'] = page_no
        response = requests.get(url=url,
                                headers=user_agent,
                                params=params)
        print(response.request.url)
        # e.g. https://www.startribune.com/search/?q=China+COVID-19&refresh=true&page=1

scrap('https://www.startribune.com/search/')
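To get URLs in exactly the form the question expects (page first, the space written as %20), the query string can also be built with the standard library before any request is made. The following is a minimal sketch, not part of the original answer: build_search_urls and its parameters are hypothetical names, and it assumes the site accepts the three parameters shown in the question.

```python
from urllib.parse import quote, urlencode


def build_search_urls(base_url, query, first_page=1, last_page=9):
    """Return one search URL per page (hypothetical helper for illustration)."""
    urls = []
    for page_no in range(first_page, last_page + 1):
        # Dict order is preserved, so 'page' comes first in the query string.
        params = {'page': page_no, 'q': query, 'refresh': 'true'}
        # quote_via=quote encodes a space as %20; the default (quote_plus) emits '+'.
        urls.append(f"{base_url}?{urlencode(params, quote_via=quote)}")
    return urls


urls = build_search_urls('https://www.startribune.com/search/', 'China COVID-19')
for url in urls:
    print(url)
```

Each resulting URL can then be passed to requests.get(url, headers=user_agent) without a params argument, which avoids encoding the query a second time.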
Source: https://stackoverflow.com/questions/61938026/how-to-iterate-pages-to-scrape-web-news