Scrape page with “load more results” button

Front-end · Unresolved · 1 answer · 803 views
心在旅途 asked 2021-01-03 05:52

I am trying to scrape the following page with requests and BeautifulSoup/Lxml

https://www.reuters.com/search/news?blob=s

1 answer
  • 2021-01-03 06:55

    Here's a quick script that shows how this can be done with Selenium:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    import time
    
    url = "https://www.reuters.com/search/news?blob=soybean&sortBy=date&dateRange=all"
    
    # PhantomJS support has been removed from Selenium; use headless Chrome instead.
    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    driver.get(url)
    page_num = 0
    
    # Keep clicking the "load more results" button until it disappears.
    while driver.find_elements(By.CSS_SELECTOR, '.search-result-more-txt'):
        driver.find_element(By.CSS_SELECTOR, '.search-result-more-txt').click()
        page_num += 1
        print("getting page number " + str(page_num))
        time.sleep(1)  # give the next batch of results time to load
    
    html = driver.page_source
    

    I don't know how to do this with requests alone. There seem to be a lot of articles about soybeans on Reuters; I had already triggered over 250 "page loads" by the time I finished writing this answer.

    Once you have loaded all, or some large number of, pages, you can extract the data by passing html into Beautiful Soup:

    from bs4 import BeautifulSoup
    
    soup = BeautifulSoup(html, 'lxml')
    links = soup.find_all('div', attrs={"class": 'search-result-indiv'})
    # Skip any result blocks that contain no link.
    articles = [div.find('a')['href'] for div in links if div.find('a')]
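    The extraction step above can be exercised offline on a small HTML fragment. This sketch reuses the class name from the answer, but the sample markup itself is made up for illustration (it uses the stdlib `html.parser` backend so `lxml` is not required):

    ```python
    from bs4 import BeautifulSoup

    # Hypothetical sample of the search-results markup, using the
    # 'search-result-indiv' class name from the answer above.
    sample_html = """
    <div class="search-result-indiv"><a href="/article/one">One</a></div>
    <div class="search-result-indiv"><a href="/article/two">Two</a></div>
    <div class="search-result-indiv"></div>
    """

    soup = BeautifulSoup(sample_html, "html.parser")
    results = soup.find_all("div", attrs={"class": "search-result-indiv"})
    # Guard against result blocks that contain no link, as in the answer.
    hrefs = [div.find("a")["href"] for div in results if div.find("a")]
    print(hrefs)  # ['/article/one', '/article/two']
    ```

    The same guard matters on the live page, where ads or empty result containers can share the result class but carry no anchor tag.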
    