All elements from html not being extracted by Requests and BeautifulSoup in Python

Posted by 故事扮演 on 2020-12-15 06:11:25

Question


I am trying to scrape odds from a site that displays current odds from different agencies, for an assignment on the effects of market competition. I am using Requests and BeautifulSoup to extract the relevant data. However, after running:

import requests
from bs4 import BeautifulSoup

url = "https://www.bestodds.com.au/odds/cricket/ICC-World-Twenty20/Sri-Lanka-v-Afghanistan_71992/"

r = requests.get(url)
print(r.text)

It does not print any odds, yet when I inspect the element on the page I can see them in the HTML. How do I get Requests to fetch them so that I can extract them in Python?


Answer 1:


requests is not well suited to this case: the site is quite dynamic and uses multiple XHR requests and JavaScript to build the page. A quicker and much less painful way to get the desired information is to use a real browser automated via selenium.

Here is some example code to get you started. It uses the headless PhantomJS browser:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


driver = webdriver.PhantomJS()
driver.get("https://www.bestodds.com.au/odds/cricket/ICC-World-Twenty20/Sri-Lanka-v-Afghanistan_71992/")

# waiting for the page to load
wait = WebDriverWait(driver, 10)
wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, ".odds-comparison")))

for comparison in driver.find_elements_by_css_selector(".odds-comparison"):
    description = comparison.find_element_by_css_selector(".description").text
    print(description)

driver.close()

It prints all the odds table descriptions on the page:

MATCH ODDS
MOST SIXES
TOP SRI LANKA BATSMAN
TOP AFGHANISTAN BATSMAN



Answer 2:


It is better to use urlopen:

   from bs4 import BeautifulSoup
   from urllib.request import urlopen

   url = "https://www.bestodds.com.au/odds/cricket/ICC-World-Twenty20/Sri-Lanka-v-Afghanistan_71992/"

   response = urlopen(url)
   # Name the parser explicitly; otherwise BeautifulSoup picks one and warns
   htmltext = BeautifulSoup(response, "html.parser")
   print(htmltext)

After that, you can find whatever you want:

   Liste_page = htmltext.find('div', {"id": "pager"}).text
   Tr = htmltext.find('table', {"class": "additional_data"}).find_next('tbody').text



Answer 3:


The data is most likely loaded dynamically.

It is not in the HTML.
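
One rough way to confirm this is to parse the raw response text and check whether any element carries the class you see in the browser inspector. The `ClassFinder` and `has_class` helpers and the two sample strings below are illustrative assumptions; `odds-comparison` is the class name used in answer 1. The sketch uses only the standard library so that it runs without extra installs.

```python
from html.parser import HTMLParser


class ClassFinder(HTMLParser):
    """Records whether any start tag carries the target CSS class."""

    def __init__(self, target):
        super().__init__()
        self.target = target
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        classes = (dict(attrs).get("class") or "").split()
        if self.target in classes:
            self.found = True


def has_class(html, class_name):
    """Return True if any element in the markup has the given class."""
    finder = ClassFinder(class_name)
    finder.feed(html)
    return finder.found


# Hypothetical markup: what a server might send versus what the browser
# shows after JavaScript has run.
static_html = '<div id="app"></div><script src="bundle.js"></script>'
rendered_html = '<div id="app"><div class="odds-comparison wide">...</div></div>'

print(has_class(static_html, "odds-comparison"))    # False
print(has_class(rendered_html, "odds-comparison"))  # True
```

In practice you would pass `r.text` from `requests.get(url)` to `has_class`; a `False` result suggests the element is added later by JavaScript.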

You can try to work out which requests the page makes to retrieve the real data (for example, by watching the network tab in your browser's developer tools), or use something like selenium webdriver to simulate a real browser (this second option will be much slower).
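
As a sketch of the first option: once you have identified the request that carries the odds, that URL can often be fetched directly and parsed as JSON. Everything specific below is an assumption made for illustration: the payload shape (`markets`, `name`, `quotes`, `agency`, `price`), the helper names, and the sample data are hypothetical and not taken from the site.

```python
import json
from urllib.request import Request, urlopen


def fetch_json(url):
    """Fetch a URL and decode the body as JSON (performs a network call)."""
    # Some sites reject requests without a browser-like User-Agent.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


def best_prices(payload):
    """Map each market name to its highest (agency, price) pair."""
    best = {}
    for market in payload.get("markets", []):
        quotes = market.get("quotes", [])
        if quotes:  # skip markets with no quotes
            top = max(quotes, key=lambda q: q["price"])
            best[market["name"]] = (top["agency"], top["price"])
    return best


# Hypothetical payload standing in for a real XHR response.
sample = json.loads("""
{"markets": [
  {"name": "MATCH ODDS", "quotes": [
    {"agency": "A", "price": 1.25},
    {"agency": "B", "price": 1.30}
  ]},
  {"name": "MOST SIXES", "quotes": []}
]}
""")

print(best_prices(sample))  # {'MATCH ODDS': ('B', 1.3)}
```

With a real endpoint you would call `best_prices(fetch_json(endpoint_url))`, where `endpoint_url` is whatever address the request you identified uses, and adapt the key names to the actual response.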

Beware that you are most likely violating that site's terms of use. This can easily get you into trouble. They may also deliberately serve you bad data.



Source: https://stackoverflow.com/questions/36060624/all-elements-from-html-not-being-extracted-by-requests-and-beautifulsoup-in-pyth
