How to scrape the JavaScript-based site https://marketchameleon.com/Calendar/Earnings using Selenium and Python?

谁都会走 submitted on 2021-01-29 06:18:43

Question


I am trying to get earnings dates from https://marketchameleon.com/Calendar/Earnings. The site uses a JavaScript loader to load the earnings table, but when I open the page with Selenium the table does not appear. I tried both the Chrome and Firefox drivers.

A sample of the code:

import os
from selenium import webdriver

firefox_driver_path = os.path.abspath('../firefoxdriver_win32/geckodriver.exe')
options = webdriver.FirefoxOptions()
# Firefox runs JavaScript by default
options.add_argument("--enable-javascript")
driver = webdriver.Firefox(executable_path=firefox_driver_path, options=options)
driver.get("https://marketchameleon.com/Calendar/Earnings")

How can I get the data?


Answer 1:


I took your code, added a few tweaks, and ran a test to extract the earnings dates from https://marketchameleon.com/Calendar/Earnings as follows:

  • Code Block:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    # Launch Chrome maximized, without the enable-automation switch and the automation extension
    options = webdriver.ChromeOptions()
    options.add_argument("start-maximized")
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    options.add_experimental_option('useAutomationExtension', False)
    # executable_path is the Selenium 3 style; Selenium 4 takes a Service object instead
    driver = webdriver.Chrome(options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
    driver.get('https://marketchameleon.com/Calendar/Earnings')
    # Wait up to 5 seconds for the date header, located once by CSS selector and once by XPath
    print(WebDriverWait(driver, 5).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "table.dateselect_menu_h_table tr > th > span"))).text)
    print(WebDriverWait(driver, 5).until(EC.visibility_of_element_located((By.XPATH, "//table[@class='dateselect_menu_h_table']//tr/th/span"))).get_attribute("innerHTML"))

Observation

Like you, I hit the same roadblock: when the page is opened through Selenium, the earnings table doesn't load:

[Screenshot: marketchameleon]


Deep Dive

While inspecting the DOM tree of the webpage, I found that some of the <script> and other tags refer to the keyword akam. For example:

  • !function(){if(BOOMR=a.BOOMR||{},BOOMR.plugins=BOOMR.plugins||{},!BOOMR.plugins.AK){var e=""=="true"?1:0,t="",n="gertvyrrfrzvsxxfd3ta-f-81b1f5d51-clientnsv4-s.akamaihd.net"
  • <script type="text/javascript" src="https://marketchameleon.com/akam/11/4e7414cb" defer=""></script>
  • <noscript><img src="https://marketchameleon.com/akam/11/pixel_4e7414cb?a=dD03OTIxZTlmM2QwMWVhMDkxODhjNzQwN2E3NmFkNzRiMDQ5ODBkOGU0JmpzPW9mZg==" style="visibility: hidden; position: absolute; left: -999px; top: -999px;" /></noscript>
  • <link id="dnsprefetchlink" rel="dns-prefetch" href="//gertvyrrfrzvsxxfd3ta-f-81b1f5d51-clientnsv4-s.akamaihd.net">

This is a clear indication that the website is protected by Bot Manager, an advanced bot detection service provided by Akamai, and that the response gets blocked.
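
The inspection above can be sketched as a small helper that scans page HTML for the two markers quoted in the list: the /akam/ path and an akamaihd.net host. The function name `find_akamai_markers` and the regular expressions are assumptions made for illustration; they are not part of any Akamai or Selenium API. A match only suggests that Akamai scripts are present on the page, it does not prove that a response was blocked.

```python
import re

# Hypothetical marker patterns, modeled on the snippets quoted above.
AKAMAI_PATTERNS = {
    "akam_path": re.compile(r"/akam/\d+/[0-9A-Za-z_]+"),
    "akamaihd_host": re.compile(r"[\w.-]+\.akamaihd\.net"),
}


def find_akamai_markers(html):
    """Return a dict mapping marker name to the sorted unique matches found in html."""
    found = {}
    for name, pattern in AKAMAI_PATTERNS.items():
        matches = sorted(set(pattern.findall(html)))
        if matches:
            found[name] = matches
    return found


# Two of the tags quoted above, used as sample input.
sample = (
    '<script type="text/javascript" '
    'src="https://marketchameleon.com/akam/11/4e7414cb" defer=""></script>'
    '<link id="dnsprefetchlink" rel="dns-prefetch" '
    'href="//gertvyrrfrzvsxxfd3ta-f-81b1f5d51-clientnsv4-s.akamaihd.net">'
)
print(find_akamai_markers(sample))
```

With a live Selenium session, the same helper could be called as `find_akamai_markers(driver.page_source)` after `driver.get(...)`.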


Bot Manager

As per the article Bot Manager - Foundations:

[Image: akamai_detection]


Conclusion

So it can be concluded that the request for the data is detected as coming from a Selenium-driven WebDriver instance, and the response is blocked.


References

Some relevant documentation:

  • Bot Manager
  • Bot Manager : Foundations

tl; dr

A couple of relevant discussions:

  • Can a website detect when you are using selenium with chromedriver?
  • Selenium webdriver: Modifying navigator.webdriver flag to prevent selenium detection
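
The `navigator.webdriver` flag named in the second discussion is a standard browser property that reports true while the browser is under WebDriver control, so it is one signal a page can read. A minimal sketch of checking it from Python follows; the helper names `read_webdriver_flag` and `demo` are hypothetical, and whether Bot Manager actually relies on this flag is not established by anything above.

```python
def read_webdriver_flag(driver):
    """Return what the current page sees in navigator.webdriver."""
    return driver.execute_script("return navigator.webdriver")


def demo():
    # Assumes Selenium is installed and a Chrome driver is available to it.
    # Defined but not called, so nothing launches when this block is loaded.
    from selenium import webdriver

    driver = webdriver.Chrome()
    try:
        driver.get("https://marketchameleon.com/Calendar/Earnings")
        print("navigator.webdriver =", read_webdriver_flag(driver))
    finally:
        driver.quit()
```

Calling `demo()` in a local session prints the flag as the page sees it, which helps when comparing a plain session against one started with modified options.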


Source: https://stackoverflow.com/questions/62353469/how-to-scrape-the-javascript-based-site-https-marketchameleon-com-calendar-ear
