How to iterate pages to scrape web news

Submitted by 落爺英雄遲暮 on 2020-06-01 05:12:27

Question


I've been trying to figure out how to iterate pages to scrape multiple news articles.

This is the page I want to scrape, along with the pages that follow it: https://www.startribune.com/search/?page=1&q=China%20COVID-19&refresh=true

I tried the code below, but it doesn't return the correct result:

def scrap(url):
    user_agent = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; Touch; rv:11.0) like Gecko'}
    urls = [f"{url}{x}" for x in range(1,10)]
    params = {
        'q': 'China%20COVID-19'
    }
    for page in urls:
        response = requests.get(url=page,
                                headers=user_agent,
                                params=params) 
    print(page)

print(scrap('https://www.startribune.com/search/'))

Please suggest improvements or solutions!

The results I expect are:

https://www.startribune.com/search/?page=1&q=China%20COVID-19&refresh=true 
https://www.startribune.com/search/?page=2&q=China%20COVID-19&refresh=true
...
https://www.startribune.com/search/?page=9&q=China%20COVID-19&refresh=true

Answer 1:


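As mentioned in the comments, make sure the params are complete. Rather than appending the page number to the URL path, pass page as a query parameter, and print the URL inside the loop so every page is shown: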

import requests


def scrap(url):
    user_agent = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; Touch; rv:11.0) like Gecko'}
    params = {
        # Pass the raw query text; requests URL-encodes it. A pre-encoded
        # 'China%20COVID-19' is encoded again and sent as 'China%2520COVID-19'.
        'q': 'China COVID-19',
        'refresh': 'true',
    }
    for page_no in range(1, 10):
        params['page'] = page_no  # only the page number changes between requests
        response = requests.get(url=url,
                                headers=user_agent,
                                params=params)
        print(response.request.url)
        # e.g. https://www.startribune.com/search/?q=China+COVID-19&refresh=true&page=1

scrap('https://www.startribune.com/search/')
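
If you want to check the generated URLs without sending any requests, one option is to build them with the standard library first. This is only a sketch: the helper name build_search_urls and its parameters are made up for illustration and are not part of requests or of the site. It uses urllib.parse.urlencode with quote_via=quote so that the space is written as %20, matching the expected URLs listed in the question.

```python
from urllib.parse import quote, urlencode


def build_search_urls(base_url, query, first_page=1, last_page=9):
    """Return one search URL per page (hypothetical helper, for illustration)."""
    urls = []
    for page_no in range(first_page, last_page + 1):
        # Key order here determines the order of parameters in the URL.
        params = {"page": page_no, "q": query, "refresh": "true"}
        # quote_via=quote encodes spaces as %20 instead of the default '+'.
        urls.append(f"{base_url}?{urlencode(params, quote_via=quote)}")
    return urls


urls = build_search_urls("https://www.startribune.com/search/", "China COVID-19")
for search_url in urls:
    print(search_url)
```

Each of these strings can then be passed to requests.get(search_url, headers=user_agent) as-is, without a separate params argument, since the query string is already encoded.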


Source: https://stackoverflow.com/questions/61938026/how-to-iterate-pages-to-scrape-web-news
