I'm using the following code to obtain all content from a webpage (see URL in code):
import urllib2
from bs4 import BeautifulSoup
The first thing to understand is that neither BeautifulSoup nor urllib2 is a browser. urllib2 only gets/downloads the initial "static" page for you - it cannot execute JavaScript the way a real browser does. Hence, you will always get the "View Page Source" content.
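For comparison, here is a minimal sketch of what the urllib2 route gives you (reusing the URL from the code below) - it only ever sees the server-rendered markup, so anything the page builds with JavaScript after load will simply not be there:
import urllib2
from bs4 import BeautifulSoup

# This only downloads the initial "View Page Source" HTML - no JavaScript is executed
html = urllib2.urlopen("http://racing4everyone.eu/2015/10/25/formula-e-201516formula-e-201516-round01-china-race/").read()

soup = BeautifulSoup(html, "html.parser")
# Script tags that the browser would add or modify at runtime are missing here
print(soup.find_all("script"))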
To solve your problem, fire up a real browser via selenium, wait for the page to load, get the .page_source and pass it to BeautifulSoup to parse:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Firefox()
driver.get("http://racing4everyone.eu/2015/10/25/formula-e-201516formula-e-201516-round01-china-race/")
# wait for the page to load
wait = WebDriverWait(driver, 10)
wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, ".fluid-width-video-wrapper")))
# get the page source
page_source = driver.page_source
driver.close()
# parse the HTML
soup = BeautifulSoup(page_source, "html.parser")
script = soup.find_all("script")
print(script)
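As a side note, find_all() returns a list of bs4 Tag objects; if you are after the inline JavaScript text rather than the complete tags, something like this (purely illustrative) does the trick:
for tag in script:
    # .string is None for script tags that only reference an external src
    if tag.string:
        print(tag.string)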
This is the general approach, but your case is a little bit different - there is an iframe element which contains the video player. If you want to access the script elements inside the iframe, you would need to switch to it first and then get the .page_source:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Firefox()
driver.get("http://racing4everyone.eu/2015/10/25/formula-e-201516formula-e-201516-round01-china-race/")
# wait for the page to load, switch to iframe
wait = WebDriverWait(driver, 10)
frame = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "iframe[src*=video]")))
driver.switch_to.frame(frame)
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, ".controls")))
# get the page source
page_source = driver.page_source
driver.close()
# parse the HTML
soup = BeautifulSoup(page_source, "html.parser")
script = soup.find_all("script")
print(script)
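If the end goal is to pull an actual video URL out of those script elements, a rough follow-up sketch could look like the one below - note that the .mp4/.m3u8 pattern and the idea that the player's inline configuration embeds a direct media link are assumptions about this particular embed, not something the page is guaranteed to expose:
import re

for tag in script:
    text = tag.string or ""
    # Look for direct media links inside the player's inline configuration (assumed pattern)
    for link in re.findall(r'https?://[^"\']+\.(?:mp4|m3u8)[^"\']*', text):
        print(link)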