I am trying to scrape the following page with requests and BeautifulSoup/lxml: https://www.reuters.com/search/news?blob=s
Here's a quick script that should show how this can be done with Selenium:
from selenium import webdriver
import time

url = "https://www.reuters.com/search/news?blob=soybean&sortBy=date&dateRange=all"
driver = webdriver.PhantomJS()
driver.get(url)
html = driver.page_source.encode('utf-8')
page_num = 0

# Keep clicking the "load more" button while it is still on the page
while driver.find_elements_by_css_selector('.search-result-more-txt'):
    driver.find_element_by_css_selector('.search-result-more-txt').click()
    page_num += 1
    print("getting page number " + str(page_num))
    time.sleep(1)

# Grab the fully expanded page once all the results have been loaded
html = driver.page_source.encode('utf-8')
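Note that PhantomJS is deprecated and recent Selenium releases have removed the old find_element_by_* helpers, so on Selenium 4+ you would need something along these lines instead (a rough sketch using headless Chrome, untested against the current Reuters markup):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
import time

options = Options()
options.add_argument("--headless")  # run Chrome without opening a window
driver = webdriver.Chrome(options=options)
driver.get("https://www.reuters.com/search/news?blob=soybean&sortBy=date&dateRange=all")

# Same idea as above: click the "load more" button until it disappears
while driver.find_elements(By.CSS_SELECTOR, '.search-result-more-txt'):
    driver.find_element(By.CSS_SELECTOR, '.search-result-more-txt').click()
    time.sleep(1)

html = driver.page_source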
I don't know how to do this with requests alone, since the extra results are loaded by JavaScript when the button is clicked. There seem to be lots of articles about soybeans on Reuters; I've already done over 250 "page loads" as I finish writing this answer.
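If you want to confirm that plain requests only sees the first batch of results, a quick check like this (a minimal sketch; the User-Agent header and the selector are assumptions carried over from the Selenium version above) will show how few result divs are in the static HTML:

import requests
from bs4 import BeautifulSoup

url = "https://www.reuters.com/search/news?blob=soybean&sortBy=date&dateRange=all"
resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(resp.text, 'lxml')

# Only results present in the initial HTML are counted here;
# everything behind the "load more" button is missing
print(len(soup.find_all('div', attrs={"class": 'search-result-indiv'})))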
Once you have loaded all, or some large number, of the pages, you can extract the data by passing html into Beautiful Soup:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
links = soup.find_all('div', attrs={"class": 'search-result-indiv'})
# Pull the href from each result's anchor tag, skipping divs without a link
articles = [div.find('a')['href'] for div in links if div.find('a')]
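If the hrefs you get back are relative paths (I haven't verified this, so treat the base URL as an assumption), you can turn them into absolute, de-duplicated URLs like so:

from urllib.parse import urljoin

base = "https://www.reuters.com"  # assumed base for relative article links
article_urls = sorted({urljoin(base, href) for href in articles})
print(article_urls[:10])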