Question
I am trying to scrape odds from a site that displays current odds from different agencies, for an assignment on the effects of market competition. I am using Requests and BeautifulSoup to extract the relevant data. However, after running:
import requests
from bs4 import BeautifulSoup
url = "https://www.bestodds.com.au/odds/cricket/ICC-World-Twenty20/Sri-Lanka-v-Afghanistan_71992/"
r = requests.get(url)
print(r.text)
it does not print any odds, yet if I inspect the element on the page I can see them in the HTML. How do I get Requests to fetch them so that I can extract them in Python?
Answer 1:
requests is not well suited to this case: the site is highly dynamic and uses multiple XHR requests and JavaScript to build the page. A quicker and much less painful way to get the desired information is to use a real browser automated via selenium.
Here is some example code to get you started; it uses the headless PhantomJS browser:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Note: Selenium 4 removed PhantomJS support and the find_element(s)_by_* helpers.
# On Selenium 4, use a headless Chrome or Firefox driver and
# find_elements(By.CSS_SELECTOR, ...) / find_element(By.CSS_SELECTOR, ...) instead.
driver = webdriver.PhantomJS()
driver.get("https://www.bestodds.com.au/odds/cricket/ICC-World-Twenty20/Sri-Lanka-v-Afghanistan_71992/")

# wait (up to 10 seconds) until the first odds table is visible
wait = WebDriverWait(driver, 10)
wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, ".odds-comparison")))

# print the description of every odds table on the page
for comparison in driver.find_elements_by_css_selector(".odds-comparison"):
    description = comparison.find_element_by_css_selector(".description").text
    print(description)

driver.close()
It prints all the odds table descriptions on the page:
MATCH ODDS
MOST SIXES
TOP SRI LANKA BATSMAN
TOP AFGHANISTAN BATSMAN
Answer 2:
It is better to use urlopen:
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "https://www.bestodds.com.au/odds/cricket/ICC-World-Twenty20/Sri-Lanka-v-Afghanistan_71992/"
response = urlopen(url)
# name the parser explicitly so BeautifulSoup does not have to guess one
htmltext = BeautifulSoup(response, "html.parser")
print(htmltext)
After that, you can find whatever you want:
# find() returns None when nothing matches, so check the result before using .text
Liste_page = htmltext.find('div', {"id": "pager"}).text
Tr = htmltext.find('table', {"class": "additional_data"}).find_next('tbody').text
Answer 3:
The data is most likely loaded dynamically.
It is not in the HTML.
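A quick way to confirm this is to look in the raw response text for a class that is visible in the browser inspector, such as the odds-comparison class used in the Selenium example above. The sketch below uses only the standard library; static_page and rendered_page are made-up strings standing in for r.text and for what the browser shows after its scripts have run, not the real site's markup.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collects every CSS class name that appears on a start tag."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())


def static_html_has_class(html_text, class_name):
    """Return True if class_name occurs on any tag in the raw HTML."""
    collector = ClassCollector()
    collector.feed(html_text)
    return class_name in collector.classes


# Hypothetical bodies: what the server sends vs. what the browser shows after JS runs.
static_page = '<div id="app"></div><script src="bundle.js"></script>'
rendered_page = '<div id="app"><div class="odds-comparison">...</div></div>'

print(static_html_has_class(static_page, "odds-comparison"))    # False
print(static_html_has_class(rendered_page, "odds-comparison"))  # True
```

If the check comes back False for the text Requests returned, the element is being added by JavaScript after the initial page load.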
You can try to understand which requests are used to retrieve the real data, or try using e.g. selenium webdriver to simulate a real browser (this second option will be much slower).
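As a rough sketch of the first option: once the browser's Network tab shows which XHR request carries the odds, that endpoint can often be called directly and its JSON parsed without rendering the page. Everything specific here is an assumption for illustration only: the endpoint URL is whatever you find in the Network tab, and the markets/quotes/agency/price payload shape, the agency names, and the prices are all made up. The real response will almost certainly look different.

```python
import json
from urllib.request import Request, urlopen


def fetch_json(endpoint_url):
    """Fetch and decode a JSON document from an XHR endpoint."""
    request = Request(endpoint_url, headers={"Accept": "application/json"})
    with urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


def extract_odds(payload):
    """Flatten a (hypothetical) payload into (market, agency, price) rows."""
    rows = []
    for market in payload.get("markets", []):
        for quote in market.get("quotes", []):
            rows.append((market["name"], quote["agency"], quote["price"]))
    return rows


# Hypothetical payload; read the real structure from the Network tab.
sample_payload = {
    "markets": [
        {
            "name": "MATCH ODDS",
            "quotes": [
                {"agency": "Agency A", "price": 1.25},
                {"agency": "Agency B", "price": 1.3},
            ],
        }
    ]
}

for row in extract_odds(sample_payload):
    print(row)
```

In real use you would pass the result of fetch_json(...) to extract_odds instead of sample_payload, after adjusting the key names to match the actual response.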
Beware that you are most likely violating the site's terms of use. This can easily get you into trouble. They may also deliberately serve you bad data.
Source: https://stackoverflow.com/questions/36060624/all-elements-from-html-not-being-extracted-by-requests-and-beautifulsoup-in-pyth