Scrape dynamic content created by JavaScript using Python

庸人自扰 2020-12-04 00:51

I want to scrape DIV content created by a JavaScript function, using a Python script. I tried BS4 (BeautifulSoup), but with it I'm not able to get the dynamic data. Instead it ...

1 answer
  • 2020-12-04 01:26

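    The initial HTML does not contain the data you want to scrape, which is why BeautifulSoup alone is not enough: it only parses the markup it is given and never runs the page's JavaScript. You can load the page with Selenium, let the browser execute the scripts, and then scrape the rendered content.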
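To see why a plain parser comes back empty, here is a minimal sketch that uses only the standard library, so it runs without a browser. The `STATIC_HTML` string is a hypothetical stand-in for what the server sends before any script runs (it is not the real page source), and `TargetChildCounter` and `count_target_divs` are illustrative names, not part of any library.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for the markup sent by the server (an assumption,
# not the actual page source): #dataTarget exists but is still empty.
STATIC_HTML = """
<html><body>
  <button id="getData" disabled>Get Data</button>
  <div id="dataTarget"></div>
  <script>/* fills #dataTarget later, inside the browser */</script>
</body></html>
"""


class TargetChildCounter(HTMLParser):
    """Counts <div> elements nested anywhere inside #dataTarget."""

    def __init__(self):
        super().__init__()
        self.depth = 0       # > 0 while the parser is inside #dataTarget
        self.child_divs = 0

    def handle_starttag(self, tag, attrs):
        if tag != 'div':
            return
        if self.depth > 0:
            # A <div> found inside the target container.
            self.child_divs += 1
            self.depth += 1
        elif dict(attrs).get('id') == 'dataTarget':
            self.depth = 1

    def handle_endtag(self, tag):
        if tag == 'div' and self.depth > 0:
            self.depth -= 1


def count_target_divs(html):
    parser = TargetChildCounter()
    parser.feed(html)
    return parser.child_divs


print(count_target_divs(STATIC_HTML))
```

A static parser sees the empty container because the script that fills it never executes. Once a browser has run that script, the same check on the rendered page source would find the nested `div`.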

    Code:

    import json
    
    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.common.by import By
    from selenium.common.exceptions import TimeoutException
    
    html = None
    url = 'http://demo-tableau.bitballoon.com/'
    selector = '#dataTarget > div'
    delay = 10  # seconds
    
    browser = webdriver.Chrome()
    browser.get(url)
    
    try:
        # wait for button to be enabled
        WebDriverWait(browser, delay).until(
            EC.element_to_be_clickable((By.ID, 'getData'))
        )
        button = browser.find_element(By.ID, 'getData')
        button.click()
    
        # wait for data to be loaded
        WebDriverWait(browser, delay).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, selector))
        )
    except TimeoutException:
        print('Loading took too much time!')
    else:
        html = browser.page_source
    finally:
        browser.quit()
    
    if html:
        soup = BeautifulSoup(html, 'lxml')
        raw_data = soup.select_one(selector).text
        data = json.loads(raw_data)
    
        import pprint
        pprint.pprint(data)
    

    Output:

    [[{'formattedValue': 'Atlantic', 'value': 'Atlantic'},
      {'formattedValue': '6/26/2010 3:00:00 AM', 'value': '2010-06-26 03:00:00'},
      {'formattedValue': 'ALEX', 'value': 'ALEX'},
      {'formattedValue': '16.70000', 'value': '16.7'},
      {'formattedValue': '-84.40000', 'value': '-84.4'},
      {'formattedValue': '30', 'value': '30'}],
      ...
    ]
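
The output above is a list of rows, where each row is a list of cells with a `formattedValue` and a `value`. If only the raw values are needed, the nesting can be flattened with a small helper. This is a sketch under that assumption about the structure; `rows_to_values` is a hypothetical name, and the sample row is copied from the output shown above.

```python
def rows_to_values(data, key='value'):
    """Return one plain list per row, keeping only `key` from each cell."""
    return [[cell[key] for cell in row] for row in data]


# First row of the output shown above.
sample = [[
    {'formattedValue': 'Atlantic', 'value': 'Atlantic'},
    {'formattedValue': '6/26/2010 3:00:00 AM', 'value': '2010-06-26 03:00:00'},
    {'formattedValue': 'ALEX', 'value': 'ALEX'},
    {'formattedValue': '16.70000', 'value': '16.7'},
    {'formattedValue': '-84.40000', 'value': '-84.4'},
    {'formattedValue': '30', 'value': '30'},
]]

print(rows_to_values(sample))
```

Passing `key='formattedValue'` returns the display strings instead. The values stay strings either way; converting them to numbers or dates is left to the caller, since the column types are not stated in the page.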
    

    The code assumes that the button is initially disabled (<button id="getData" onclick="getUnderlyingData()" disabled>Get Data</button>) and that the data is loaded only when the button is clicked, not automatically. Therefore, you need to delete this line from your page: setTimeout(function(){ getUnderlyingData(); }, 3000);

    You can find a working demo of your example here: http://demo-tableau.bitballoon.com/.
