Use BeautifulSoup to obtain “View Element” code instead of “View Source” code

悲&欢浪女 2021-02-04 23:13

I'm using the following code to obtain all content from a webpage (see url in code):

import urllib2
from bs4 import BeautifulSoup


        
1 answer
  •  醉梦人生
    2021-02-04 23:17

    The first thing to understand is that neither BeautifulSoup nor urllib2 is a browser. urllib2 only downloads the initial "static" page; it cannot execute JavaScript as a real browser would. Hence, you will always get the "View Page Source" content, never the DOM that "View Element" shows after scripts have run.
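
    To make the difference concrete, here is a minimal sketch that parses a made-up static page (the markup and the "player" id are assumptions for illustration, not taken from the real site). The script that would create the element is kept as plain text and never runs:

```python
from bs4 import BeautifulSoup

# Hypothetical page: <div id="player"> would only exist after the script runs.
STATIC_HTML = """
<html>
  <body>
    <div id="content"></div>
    <script>
      var player = document.createElement("div");
      player.id = "player";
      document.getElementById("content").appendChild(player);
    </script>
  </body>
</html>
"""

soup = BeautifulSoup(STATIC_HTML, "html.parser")

# The script is present as text; BeautifulSoup does not execute it.
scripts = soup.find_all("script")
# The element the script would create is therefore missing from the tree.
player = soup.find(id="player")

print(len(scripts))  # 1
print(player)        # None
```

    A browser would show the "player" div in its element inspector, because it runs the script before displaying the DOM.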

    To solve your problem, fire up a real browser via selenium, wait for the page to load, get the .page_source and pass it to BeautifulSoup to parse:

    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    
    
    driver = webdriver.Firefox()
    driver.get("http://racing4everyone.eu/2015/10/25/formula-e-201516formula-e-201516-round01-china-race/")
    
    # wait for the page to load
    wait = WebDriverWait(driver, 10)
    wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, ".fluid-width-video-wrapper")))
    
    # get the page source
    page_source = driver.page_source
    
    # quit() ends the whole WebDriver session, not just the current window
    driver.quit()
    
    # parse the HTML
    soup = BeautifulSoup(page_source, "html.parser")
    script = soup.find_all("script")
    print(script)
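
    Once the rendered source is available, the rest is ordinary BeautifulSoup work. As a rough sketch, assuming you want to tell external scripts apart from inline ones, something like the following could be used. The SAMPLE_PAGE_SOURCE string and the split_scripts helper are hypothetical stand-ins; in practice you would pass driver.page_source:

```python
from bs4 import BeautifulSoup

# Made-up stand-in for driver.page_source
SAMPLE_PAGE_SOURCE = """
<html>
  <head>
    <script src="/static/player.js"></script>
    <script>var config = {"autoplay": false};</script>
  </head>
  <body><div class="fluid-width-video-wrapper"></div></body>
</html>
"""


def split_scripts(page_source):
    """Return (external_srcs, inline_bodies) for all <script> tags."""
    soup = BeautifulSoup(page_source, "html.parser")
    external, inline = [], []
    for tag in soup.find_all("script"):
        if tag.has_attr("src"):
            # external script: only the URL is in the page
            external.append(tag["src"])
        else:
            # inline script: .string holds the raw JavaScript text
            inline.append((tag.string or "").strip())
    return external, inline


external, inline = split_scripts(SAMPLE_PAGE_SOURCE)
print(external)
print(inline)
```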
    

    This is the general approach, but your case is slightly different: there is an iframe element that contains the video player. An iframe holds a separate document, so to access the script elements inside it you need to switch to the frame first and then get the .page_source:

    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    
    
    driver = webdriver.Firefox()
    driver.get("http://racing4everyone.eu/2015/10/25/formula-e-201516formula-e-201516-round01-china-race/")
    
    # wait for the page to load, switch to iframe
    wait = WebDriverWait(driver, 10)
    frame = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "iframe[src*=video]")))
    driver.switch_to.frame(frame)
    wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, ".controls")))
    
    # get the page source
    page_source = driver.page_source
    
    # quit() ends the whole WebDriver session, not just the current window
    driver.quit()
    
    # parse the HTML
    soup = BeautifulSoup(page_source, "html.parser")
    script = soup.find_all("script")
    print(script)
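
    The iframe[src*=video] selector matches any iframe whose src attribute contains the substring "video". BeautifulSoup understands the same selector, so a small sketch on made-up markup (the example.com URLs are assumptions) can show both what it matches and why switching frames is needed: the outer page only carries the iframe tag, not the document behind it.

```python
from bs4 import BeautifulSoup

# Made-up outer page with two iframes; only one src contains "video".
OUTER_PAGE = """
<html><body>
  <iframe src="https://ads.example.com/banner"></iframe>
  <iframe src="https://player.example.com/video/123"></iframe>
</body></html>
"""

soup = BeautifulSoup(OUTER_PAGE, "html.parser")

# [src*="video"] is a substring match on the src attribute
video_frames = soup.select('iframe[src*="video"]')
frame_srcs = [frame["src"] for frame in video_frames]

# The parser keeps the iframe tag but never loads the document behind src,
# so no scripts from inside the frame show up in the outer page.
outer_scripts = soup.find_all("script")

print(frame_srcs)
print(outer_scripts)
```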
    
