Scraping a specific website with a search box and javascripts in Python

。_饼干妹妹 submitted on 2021-02-11 14:30:22

Question


The website https://sray.arabesque.com/dashboard has a search box (an "input" element in the HTML). I want to enter a company name in the search box, choose the first suggestion for that name in the dropdown menu (e.g., "Anglo American plc"), go to the URL with the information about that company, let the JavaScript run so that I get the full HTML of the resulting page, and then scrape the GC Score, ESG Score, and Temperature Score shown at the bottom.

!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin
!pip install selenium

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

wd = webdriver.Chrome('chromedriver',options=options)

companies = ['Anglo American plc']

for company in companies:
  # dryscrape.start_xvfb()
  # session = dryscrape.Session()
  # session.visit("https://srayapi.arabesque.com/api/sray/company/history/004BTP-E")
  wd.get('https://sray.arabesque.com/dashboard/')
  # print(wd.page_source)
  e = wd.find_element_by_id(id_='mat-input-0')
  e.send_keys(company)
  e.send_keys(Keys.ENTER)
  # execute_script is a method of the driver, not of the element
  innerHTML = wd.execute_script("return document.body.innerHTML")
  print(innerHTML)

I don't quite understand how to visit the URL with the info about Anglo American and scrape it, given that the URL is not known until the company name has been entered in the search box.


Answer 1:


You can do that with Selenium. There are a couple of things you need to update:

When running headless, you need to set a window size explicitly.

Use WebDriverWait() so that each element is ready before you interact with it; this avoids synchronization issues.

Code:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
options.add_argument('--window-size=1920,1080')

wd = webdriver.Chrome(options=options)

companies = ['Anglo American plc']

for company in companies:
  wd.get('https://sray.arabesque.com/dashboard/')
  WebDriverWait(wd, 20).until(EC.element_to_be_clickable((By.XPATH, "//a[text()='list']"))).click()
  WebDriverWait(wd, 20).until(EC.element_to_be_clickable((By.XPATH, "//input[@id='mat-input-0']"))).send_keys(company)
  # Pick the suggestion that matches the company currently being processed
  WebDriverWait(wd, 20).until(EC.element_to_be_clickable((By.XPATH, f"//span[contains(.,'{company}')]"))).click()
  WebDriverWait(wd, 20).until(EC.element_to_be_clickable((By.XPATH, "(//span[normalize-space(.)='Open dashboard'])[1]"))).click()
  WebDriverWait(wd,10).until(EC.visibility_of_element_located((By.CSS_SELECTOR,"div.mat-tab-labels")))
  # find_element(By.XPATH, ...) works in both Selenium 3 and Selenium 4
  print(wd.find_element(By.XPATH, "//div[@class='mat-tab-label-content'][contains(.,'GC Score')]/span").text)
  print(wd.find_element(By.XPATH, "//div[@class='mat-tab-label-content'][contains(.,'ESG Score')]/span").text)
  print(wd.find_element(By.XPATH, "//div[@class='mat-tab-label-content'][contains(.,'Temp')]/span").text)

Output:

57.03
53.78
2.7°C
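
The scraped values arrive as strings, and the temperature carries a unit suffix. If numbers are needed for further processing, a small helper along these lines could convert them. This is only a sketch: `parse_score` and the dictionary keys are hypothetical names, and it assumes each label contains at most one plain decimal number.

```python
import re

def parse_score(text):
    """Return the first decimal number in a scraped label such as '2.7°C', or None."""
    match = re.search(r"-?\d+(?:\.\d+)?", text)
    return float(match.group()) if match else None

# Strings copied from the output above
scores = {
    "gc": parse_score("57.03"),
    "esg": parse_score("53.78"),
    "temperature_c": parse_score("2.7°C"),
}
print(scores)
```

Returning None for a label without a number lets the caller decide how to handle a company whose score is missing, rather than failing in the middle of the loop.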


 



Answer 2:


Without knowing exactly why you want to use Selenium to run the search and then load another page, here is what I would do to get the data you are looking for:

import requests
import json

session = requests.Session()
url = 'https://srayapi.arabesque.com/api/sray/q'
response = session.get(url).json()

rays = response['data']['rays']
# Keep the matches so they can be used below
results = [ray for ray in rays if ray['name'].startswith('Anglo American')]

Then do whatever you want with the results. For the ESG, GC, and temperature scores, perhaps:

myObj = [{result['name']: {'gc': result['gc'], 'esg': result['esg'], 'temp': result['score_near']}} for result in results]
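
The filtering and reshaping steps can also be folded into one function that takes the list of rays and a name prefix. The following is a minimal sketch that runs offline: `extract_scores` is a hypothetical helper, the field names ('name', 'gc', 'esg', 'score_near') are taken from the snippets above, and `sample_rays` is made-up data standing in for response['data']['rays'], not real API output.

```python
def extract_scores(rays, name_prefix):
    """Map each company whose name starts with name_prefix to its gc, esg and temp values."""
    return {
        ray["name"]: {
            "gc": ray.get("gc"),
            "esg": ray.get("esg"),
            "temp": ray.get("score_near"),
        }
        for ray in rays
        if (ray.get("name") or "").startswith(name_prefix)
    }

# Hypothetical sample shaped like response['data']['rays']; the numbers are placeholders
sample_rays = [
    {"name": "Anglo American plc", "gc": 57.03, "esg": 53.78, "score_near": 2.7},
    {"name": "Another Company", "gc": 40.0, "esg": 45.0, "score_near": 3.1},
]
print(extract_scores(sample_rays, "Anglo American"))
```

Using `.get()` means an entry that lacks one of the fields yields None for that score instead of raising a KeyError, which may be preferable when looping over many companies.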


Source: https://stackoverflow.com/questions/62892495/scraping-a-specific-website-with-a-search-box-and-javascripts-in-python
