How to scrape the first element of each parent from The Wall Street Journal market-data quotes using Selenium and Python?

Submitted by て烟熏妆下的殇ゞ on 2021-02-05 10:46:21

Question


Here is the HTML that I'm trying to scrape:

[HTML snippet not reproduced; it shows the financials table, where each 'tr' holds a row label such as 'Sales/Revenue' in its first 'td']

I am trying to get the first instance of 'td' under each 'tr' using Selenium (BeautifulSoup won't work for this site). The list is very long, so I am trying to do it iteratively. Here is my code:

from selenium import webdriver
import os


# define path to chrome driver
chrome_driver = os.path.abspath('C:/Users/USER/Desktop/chromedriver.exe')
browser = webdriver.Chrome(chrome_driver)
browser.get("https://www.wsj.com/market-data/quotes/MET/financials/annual/income-statement")

# get entire table
table = browser.find_element_by_xpath('//*[@id="cr_cashflow"]/div[2]/div/table')

# raises "WebElement is not iterable": find_element_by_tag_name (singular) returns one element
for row in table.find_element_by_tag_name('tr'):
    td = row.find_element_by_tag_name('td')
    print(td.text)

# raises "WebElement is not subscriptable": each row is a single element, not a list
for row in table.find_elements_by_tag_name('tr'):
    print(row[0].text)

I tried both of the for loops above; the first raised an error saying the WebElement is not iterable, while the second said it is not subscriptable. What is the difference between the two? And how would I change my code so that it returns "Sales/Revenue, Premiums Earned, ..."?
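For reference, the difference is that find_element_by_tag_name (singular) returns a single WebElement, which is neither iterable nor subscriptable, while find_elements_by_tag_name (plural) returns a list of WebElements. A minimal sketch of a corrected loop, assuming the table has actually rendered, combines the two:

# iterate over the LIST returned by the plural find_elements_* call,
# then take the single first 'td' inside each row
for row in table.find_elements_by_tag_name('tr'):
    td = row.find_element_by_tag_name('td')
    print(td.text)

(Note: as the answers below explain, this may still print nothing if the page never serves the table to an automated browser.)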


Answer 1:


To get the first 'td' under each 'tr', use this CSS selector:

table.cr_dataTable tbody tr td[class]:nth-child(1)

Here td[class]:nth-child(1) matches the first cell of each row that also carries a class attribute, i.e. the label cell. Try the following code:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import os

chrome_driver = os.path.abspath('C:/Users/USER/Desktop/chromedriver.exe')
browser = webdriver.Chrome(chrome_driver)

browser.get('https://www.wsj.com/market-data/quotes/MET/financials/annual/income-statement')

elements = WebDriverWait(browser, 20).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'table.cr_dataTable tbody tr td[class]:nth-child(1)')))
for element in elements:
    print(element.text)

browser.quit()



Answer 2:


You can try getting the table with pandas, as in Trying to scrape table using Pandas from Selenium's result:

from selenium import webdriver
import pandas as pd
import os


# define path to chrome driver
chrome_driver = os.path.abspath('C:/Users/USER/Desktop/chromedriver.exe')
browser = webdriver.Chrome(chrome_driver)
browser.get("https://www.wsj.com/market-data/quotes/MET/financials/annual/income-statement")

# get table
df = pd.read_html(browser.page_source)[0]

# keep only the string row labels from the first column
val = [i for i in df["Fiscal year is January-December. All values USD Millions."].values if isinstance(i, str)]
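As a hedged follow-up (the print loop and cleanup below are not part of the original answer), the extracted first-column labels can then be printed and the browser closed:

# print each row label from the first column,
# e.g. 'Sales/Revenue', 'Premiums Earned', ...
for label in val:
    print(label)

browser.quit()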



Answer 3:


I took your code, simplified the structure, and ran the test with minimal lines of code as follows:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


options = webdriver.ChromeOptions()
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(options=options, executable_path=r'C:\WebDrivers\chromedriver.exe')
driver.get('https://www.wsj.com/market-data/quotes/MET/financials/annual/income-statement')
print(driver.page_source)
print([my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table.cr_dataTable tbody tr>td[class]")))])
print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//table[@class='cr_dataTable']//tbody//tr/td[@class]")))])

Similarly, as per your observation, I hit the same roadblock: my tests didn't yield any results.

While inspecting the page source of the webpage, I observed that there is an EventListener within a <script> which validates certain page metrics, some of which are:

  • window.utag_data
  • window.utag_data.page_performance
  • window.PerformanceTiming
  • window.PerformanceObserver
  • newrelic
  • first-contentful-paint

Page Source:

<script>
    "use strict";

    if (window.PerformanceTiming) {
      window.addEventListener('DOMContentLoaded', function () {
        if (window.utag_data && window.utag_data.page_performance) {
          var dcl = 'DCL ' + parseInt(performance.timing.domContentLoadedEventStart - performance.timing.domLoading);
          var pp = window.utag_data.page_performance.split('|');
          pp[1] = dcl;
          utag_data.page_performance = pp.join('|');
        } else {
          console.warn('No utag_data.page_performance available');
        }
      });
    }

    if (window.PerformanceTiming && window.PerformanceObserver) {
      var observer = new PerformanceObserver(function (list) {
        var entries = list.getEntries();

        var _loop = function _loop(i) {
          var entry = entries[i];
          var metricName = entry.name;
          var time = Math.round(entry.startTime + entry.duration);

          if (typeof newrelic !== 'undefined') {
            newrelic.setCustomAttribute(metricName, time);
          }

          if (entry.name === 'first-contentful-paint' && window.utag_data && window.utag_data.page_performance) {
            var fcp = 'FCP ' + parseInt(entry.startTime);
            var pp = utag_data.page_performance.split('|');
            pp[0] = fcp;
            utag_data.page_performance = pp.join('|');
          } else {
            window.addEventListener('DOMContentLoaded', function () {
              if (window.utag_data && window.utag_data.page_performance) {
                var _fcp = 'FCP ' + parseInt(entry.startTime);

                var _pp = utag_data.page_performance.split('|');

                _pp[0] = _fcp;
                utag_data.page_performance = _pp.join('|');
              } else {
                console.warn('No utag_data.page_performance available');
              }
            });
          }
        };

        for (var i = 0; i < entries.length; i++) {
          _loop(i);
        }
      });

      if (window.PerformancePaintTiming) {
        observer.observe({
          entryTypes: ['paint', 'mark', 'measure']
        });
      } else {
        observer.observe({
          entryTypes: ['mark', 'measure']
        });
      }
    }
  </script>

  <script>
    if (window && typeof newrelic !== 'undefined') {
      newrelic.setCustomAttribute('browserWidth', window.innerWidth);
    }
  </script>

  <title>MET | MetLife Inc. Annual Income Statement - WSJ</title>
  <link rel="canonical" href="https://www.wsj.com/market-data/quotes/MET/financials/annual/income-statement">

Conclusion

This is a clear indication that the website is protected by rigorous bot-management techniques, and navigation by a Selenium-driven, WebDriver-initiated browsing context gets detected and subsequently blocked.
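As a side note going beyond the original answer, a commonly suggested (and by no means guaranteed) way to probe and partially reduce such detection is to read the navigator.webdriver flag and mask it through the Chrome DevTools Protocol before navigating; a minimal sketch, assuming a Chromium-based driver and the same local chromedriver path used above:

from selenium import webdriver

options = webdriver.ChromeOptions()
# drop the most obvious automation fingerprints
options.add_argument('--disable-blink-features=AutomationControlled')
options.add_experimental_option('excludeSwitches', ['enable-automation'])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(options=options, executable_path=r'C:\WebDrivers\chromedriver.exe')

# overwrite navigator.webdriver before any page script runs
driver.execute_cdp_cmd('Page.addScriptToEvaluateOnNewDocument', {
    'source': "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"
})

driver.get('https://www.wsj.com/market-data/quotes/MET/financials/annual/income-statement')
# verify what the page sees; prints None once the flag is masked
print(driver.execute_script('return navigator.webdriver'))

driver.quit()

Even with this, sites backed by dedicated bot-management services may still block the session based on other signals.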


Reference

You can find relevant discussions in:

  • Can a website detect when you are using selenium with chromedriver?


Source: https://stackoverflow.com/questions/62605541/how-to-scrape-the-first-element-of-each-parent-using-from-the-wall-street-journa
