WebScraping JavaScript-Rendered Content using Selenium in Python

Submitted by 浪尽此生 on 2021-02-02 02:08:45

Question


I am very new to web scraping and have been trying to use Selenium's functions to simulate a browser accessing the Texas public contracting webpage and then download embedded PDFs. The website is this: http://www.txsmartbuy.com/sp.

So far, I've successfully used Selenium to select an option in one of the dropdown menus "Agency Name" and to click the search button. I've listed my Python code below.

import os
os.chdir("/Users/fsouza/Desktop") #Setting up directory

from bs4 import BeautifulSoup #Downloading pertinent Python packages
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By

chromedriver = "/Users/fsouza/Desktop/chromedriver" #Setting up Chrome driver
driver = webdriver.Chrome(executable_path=chromedriver)
driver.get("http://www.txsmartbuy.com/sp")
delay = 3 #Seconds

WebDriverWait(driver, delay).until(EC.presence_of_element_located((By.XPATH, "//select[@id='agency-name-filter']/option[69]")))    
health = driver.find_element_by_xpath("//select[@id='agency-name-filter']/option[68]")
health.click()
search = driver.find_element_by_id("spBtnSearch")
search.click()

Once I get to the results page, I get stuck.

First, I can't access any of the resulting links using the HTML page source. But if I manually inspect individual links in Chrome, I do find the pertinent tags (<a href...) relating to individual results. I'm guessing this is because of JavaScript-rendered content.

Second, even if Selenium were able to see these individual tags, they have no class or id. The best way to reach them, I think, would be to select the <a> tags by their order on the page (see code below), but this didn't work either. Instead, the call clicks some other 'visible' tag (something in the footer, which I don't need).

Third, assuming these things did work, how can I figure out the number of <a> tags showing on the page (in order to loop this code over and over for every single result)?

driver.execute_script("document.getElementsByTagName('a')[27].click()")

I would appreciate your attention to this, and please excuse any stupidity on my part, considering that I'm just starting out.


Answer 1:


To scrape JavaScript-rendered content with Selenium, you need to:

  • Use WebDriverWait with element_to_be_clickable() for each element you click.

  • Use WebDriverWait with visibility_of_all_elements_located() to collect the result links.

  • Open each link in a new tab by holding Ctrl while calling click() through ActionChains.

  • Use WebDriverWait with number_of_windows_to_be() and switch to the new tab to scrape it.

  • Close the tab and switch back to the main page.

  • Code Block:

      from selenium import webdriver
      from selenium.webdriver.common.by import By
      from selenium.webdriver.support.ui import WebDriverWait
      from selenium.webdriver.support import expected_conditions as EC
      from selenium.webdriver.common.action_chains import ActionChains
      from selenium.webdriver.common.keys import Keys
      import time
    
      options = webdriver.ChromeOptions() 
      options.add_argument("start-maximized")
      options.add_experimental_option("excludeSwitches", ["enable-automation"])
      options.add_experimental_option('useAutomationExtension', False)
      driver = webdriver.Chrome(options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
      driver.get("http://www.txsmartbuy.com/sp")
      windows_before  = driver.current_window_handle
      WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//select[@id='agency-name-filter' and @name='agency-name']"))).click()
      WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//select[@id='agency-name-filter' and @name='agency-name']//option[contains(., 'Health & Human Services Commission - 529')]"))).click()
      WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//a[@id='spBtnSearch']/i[@class='icon-search']"))).click()
      # Wait until every result link is visible, then visit each one in turn
      for link in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.XPATH, "//table/tbody//tr/td/strong/a"))):
          # Ctrl+click opens the link in a new tab and leaves the results page loaded
          ActionChains(driver).key_down(Keys.CONTROL).click(link).key_up(Keys.CONTROL).perform()
          WebDriverWait(driver, 10).until(EC.number_of_windows_to_be(2))
          windows_after = driver.window_handles
          # The new tab is the only handle that differs from the original window
          new_window = [x for x in windows_after if x != windows_before][0]
          driver.switch_to.window(new_window)  # switch_to_window() is deprecated
          time.sleep(3)
          print("Focus on the newly opened tab and here you can scrape the page")
          driver.close()
          driver.switch_to.window(windows_before)
      driver.quit()
    
  • Console Output:

      Focus on the newly opened tab and here you can scrape the page
      Focus on the newly opened tab and here you can scrape the page
      Focus on the newly opened tab and here you can scrape the page
      .
      .
    


References

You can find a couple of relevant detailed discussions in:

  • How to open multiple hrefs within a webtable to scrape through selenium
  • StaleElementReferenceException even after adding the wait while collecting the data from the wikipedia using web-scraping
  • Unable to access the remaining elements by xpaths in a loop after accessing the first element- Webscraping Selenium Python
  • How to open each product within a website in a new tab for scraping using Selenium through Python
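
The window-handle bookkeeping in the loop above can be isolated into a small helper. The following is only a sketch: the function name find_new_handle and the handle strings are made up for illustration, and in a real script the two lists would come from driver.window_handles before and after the Ctrl+click.

```python
def find_new_handle(handles_before, handles_after):
    """Return the one handle that is in handles_after but not in handles_before."""
    known = set(handles_before)
    new = [h for h in handles_after if h not in known]
    if len(new) != 1:
        # Zero means the tab has not opened yet; more than one means stray tabs.
        raise ValueError(f"expected exactly one new window, found {len(new)}")
    return new[0]


# Hypothetical handle strings; real values come from driver.window_handles.
before = ["main-window"]
after = ["main-window", "result-tab"]
print(find_new_handle(before, after))
```

Failing loudly when the count is not exactly one makes a missed click or a leftover tab easier to notice than indexing [0] on a list comprehension.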



Answer 2:


To get the <a> tags you mean in the results, use the following XPath:

//tbody//tr//td//strong//a

After clicking the search button, you can extract them in a loop. First, wait until all the elements are located with visibility_of_all_elements_located; len(elements) then gives the number of result links on the page:

search.click()

elements = WebDriverWait(driver, 60).until(EC.visibility_of_all_elements_located((By.XPATH, "//tbody//tr//td//strong//a")))

print(len(elements))

for element in elements:
    get_text = element.text 
    print(get_text)
    url_number = element.get_attribute('onclick').replace('window.open("/sp/', '').replace('");return false;', '')
    get_url = 'http://www.txsmartbuy.com/sp/' + url_number
    print(get_url)

One of the results:

IFB HHS0006862, Blanket, San Angelo Canteen Resale. 529-96596. http://www.txsmartbuy.com/sp/HHS0006862
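
The chained replace() calls above depend on the exact onclick text and fail with an AttributeError if a link has no onclick attribute (get_attribute returns None in that case). As a possible alternative, here is a minimal sketch that pulls the path out with a regular expression. The helper name url_from_onclick and the BASE_URL constant are my own, and the sample string is reconstructed from the replace() calls and the result shown above, not copied from the live page.

```python
import re
from urllib.parse import urljoin

BASE_URL = "http://www.txsmartbuy.com"

# Capture the first quoted argument passed to window.open(...)
_OPEN_RE = re.compile(r"""window\.open\(\s*["']([^"']+)["']""")


def url_from_onclick(onclick, base=BASE_URL):
    """Return the absolute URL opened by an onclick handler, or None if absent."""
    if not onclick:
        return None
    match = _OPEN_RE.search(onclick)
    if match is None:
        return None
    return urljoin(base, match.group(1))


# Sample attribute value, reconstructed from the answer's replace() calls.
sample = 'window.open("/sp/HHS0006862");return false;'
print(url_from_onclick(sample))  # http://www.txsmartbuy.com/sp/HHS0006862
```

Inside the loop this would be called as url_from_onclick(element.get_attribute('onclick')), skipping any element for which it returns None.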



Source: https://stackoverflow.com/questions/59144599/webscraping-javascript-rendered-content-using-selenium-in-python
