Question
I'm writing a script to rip m4a files from a specific website. I'm currently using BS4 and Selenium for this.
I'm having trouble getting the file link. It is not in the page's HTML source; I can only find it in the browser console. The link I'm trying to get is shown in this image (https://imgur.com/a/DLwcE0p), labeled "audio_url_m4a:".
Here's some sample code I'm using:
import time

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

# Ask ChromeDriver to record every browser console message.
d = DesiredCapabilities.CHROME
d['loggingPrefs'] = {'browser': 'ALL'}
driver = webdriver.Chrome(r'chromedriver path', desired_capabilities=d)

# (lots of code doing other things not relevant to the post)

for URL in audm_URL:  # audm_URL is a list of URLs constructed earlier
    driver.get(URL)
    time.sleep(3)  # give the page a moment to load
    for entry in driver.get_log('browser'):
        print(entry)
Here is the output I get:
{'level': 'SEVERE', 'message': 'https://audm.herokuapp.com/favicon.ico - Failed to load resource: the server responded with a status of 404 (Not Found)', 'source': 'network', 'timestamp': 1611291689357}
{'level': 'SEVERE', 'message': 'https://cdn.segment.com/analytics.js/v1/5DOhLj2nIgYtQeSfn9YF5gpAiPqRtWSc/analytics.min.js - Failed to load resource: net::ERR_NAME_NOT_RESOLVED', 'source': 'network', 'timestamp': 1611291689357}
Most questions about grabbing things from the console point me towards grabbing the logs, but none of them explain how to grab other variables like this one. Any ideas?
Here's a link to a random audio page that I want to grab the file from: https://audm.herokuapp.com/player-embed?pub=newyorker&articleID=5fe0b9b09fabedf20ec1f70c
Thanks everyone!
Answer 1:
driver.get(
    "https://audm.herokuapp.com/player-embed?pub=newyorker&articleID=5fe0b9b09fabedf20ec1f70c")
# Wait for a button to appear, then click it.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "button"))).click()
# Wait for the player's video element and read its src attribute.
src = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, ".react-player video"))).get_attribute("src")
print(src)
If you just want to get the src, you can use the code above.
It needs these imports:
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
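Once src holds the audio URL, saving the file is a separate step. Here is a minimal sketch, assuming the URL can be fetched with a plain HTTP GET without cookies or extra headers (the thread does not confirm this for this site). The names filename_from_url and download_audio are hypothetical helpers for illustration, not part of Selenium:

```python
import os
import posixpath
import urllib.request
from urllib.parse import unquote, urlparse


def filename_from_url(url, default="audio.m4a"):
    """Return the last path segment of url, or default if the path has none."""
    name = posixpath.basename(unquote(urlparse(url).path))
    return name or default


def download_audio(url, directory="."):
    """Fetch url with a plain GET, save it under directory, return the path."""
    target = os.path.join(directory, filename_from_url(url))
    urllib.request.urlretrieve(url, target)
    return target
```

With the snippet above, calling download_audio(src) after print(src) would write the file into the current directory, provided the assumption about plain GET access holds.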
If you want to get it through the console log instead, use the following (it seems to work only in headless mode; I am investigating):
import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.headless = True
# Ask ChromeDriver to record every browser console message.
capabilities = webdriver.DesiredCapabilities().CHROME.copy()
capabilities['loggingPrefs'] = {'browser': 'ALL'}
driver = webdriver.Chrome(options=options, desired_capabilities=capabilities)
driver.maximize_window()
time.sleep(3)
driver.get(
    "https://audm.herokuapp.com/player-embed?pub=newyorker&articleID=5fe0b9b09fabedf20ec1f70c")
for entry in driver.get_log('browser'):
    print(entry)
Update
In headless mode w3c is false, which is why it works there.
For non-headless mode you have to add:
options.add_experimental_option('w3c', False)
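Each entry returned by driver.get_log('browser') is a dict like the ones printed in the question, so picking a URL out of the log is plain string processing. Here is a minimal sketch, assuming the page writes the m4a URL into the console message as literal text (the thread does not confirm this; a logged object may show up only as a placeholder). The name find_m4a_urls is a hypothetical helper for illustration:

```python
import re

# Matches an http(s) URL whose path ends in .m4a, with an optional query string.
M4A_URL = re.compile(r"https?://[^\s\"'\\]+?\.m4a(?:\?[^\s\"'\\]*)?")


def find_m4a_urls(entries):
    """Return unique .m4a URLs found in log entry messages, in first-seen order."""
    found = []
    for entry in entries:
        for url in M4A_URL.findall(entry.get("message", "")):
            if url not in found:
                found.append(url)
    return found
```

With a live driver this would be called as find_m4a_urls(driver.get_log('browser')). If it returns an empty list, the URL is probably not present in the log text, and reading the src attribute as shown earlier is the more direct route.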
Answer 2:
This did the trick. I was looking at it the wrong way and wasn't trying to get an src. Thanks for the input!
Source: https://stackoverflow.com/questions/65839595/capturing-info-from-console-using-python