Parsing a site with BeautifulSoup

野的像风 2021-01-14 14:57

I'm trying to learn how to parse HTML with Python, and I'm currently stuck: soup.findAll returns an empty array, even though there are elements that should be found here.

2 answers
  • 2021-01-14 15:35

    Apparently, the page only loads the "odds" parts once it is opened in a browser, so the raw HTML you download does not contain them. You could use Selenium with the Chrome driver instead.

    Note that you need to download the Chrome driver and place it in your .../python/ directory. Make sure the Chrome driver version matches the version of the Chrome browser you have installed.

    from bs4 import BeautifulSoup
    from selenium import webdriver
    
    # Reduce Chrome's console logging
    options = webdriver.ChromeOptions()
    options.add_argument('--log-level=3')
    # Newer Selenium releases expect `options=`; `chrome_options=` is deprecated
    browser = webdriver.Chrome(options=options)
    
    url = 'https://www.oddsportal.com/matches/tennis/20191114/'
    try:
        browser.get(url)
        # page_source is the page as the browser currently holds it,
        # including content added by JavaScript
        soup = BeautifulSoup(browser.page_source, "html.parser")
    finally:
        # Always close the browser, even if loading the page fails
        browser.quit()
    
    info = soup.find_all('tr', {'class': 'odd deactivate'})
    print(info)
    
  • 2021-01-14 15:47

    I'm trying to learn how to parse HTML with Python

    You happened to pick a webpage which isn't very beginner-friendly when it comes to web scraping. Broadly speaking, most webpages use one or both of these two common methods for loading and displaying data:

    • The user makes a request to a server (visits a page, for example). The server gets the necessary data from a database. The server generates an HTML response using a templating engine, and returns the response for the user's browser to render.
    • The user makes a request to a server. The server returns an HTML-skeleton response which gets populated with data dynamically by making other requests / using APIs etc.
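
A quick way to tell which of the two types a page is: look for the elements you want in the raw HTML the server sent, before any JavaScript has run. The sketch below does this with only the standard library (`html.parser`); with BeautifulSoup the equivalent check is `soup.find_all('tr')`. The two HTML strings and the names `RowCounter` and `count_rows` are invented for illustration and are not taken from the actual site.

```python
from html.parser import HTMLParser


class RowCounter(HTMLParser):
    """Counts <tr> start tags in a piece of HTML."""

    def __init__(self):
        super().__init__()
        self.rows = 0

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows += 1


def count_rows(html):
    """Return how many <tr> elements appear in the given HTML string."""
    counter = RowCounter()
    counter.feed(html)
    return counter.rows


# Invented stand-ins for the two kinds of server response.
server_rendered = (
    "<table><tr><td>Giron Marcos</td></tr>"
    "<tr><td>Holt Brandon</td></tr></table>"
)
skeleton = '<div id="table-matches"></div><script src="app.js"></script>'

print(count_rows(server_rendered))  # rows are already in the raw HTML
print(count_rows(skeleton))         # no rows until JavaScript adds them
```

If the count is zero for the raw response but the rows are visible in the browser, the page is most likely of the second type.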

    The webpage you picked is of the second type. Just because you can see the <tr> elements in the "Elements" tab of Chrome's Dev Tools doesn't mean that that's what the server sent you. By looking at the "Network" tab of Chrome's Dev Tools you can see that requests are made to these two resources:

    • https://fb.oddsportal.com/ajax-next-games/2/0/1/20191114/yje3d.dat?=1574007087150
    • https://fb.oddsportal.com/ajax-next-games-odds/2/0/X0/20191114/1/yje3d.dat?=1574007087151

    (The query string parameters will not be the same for you. Visiting those URLs also won't be very interesting unless you provide the right payload.)

    The first resource seems to be a jQuery script which makes a request, the response of which contains HTML (this is your table). It looks something like this:

    You can see that they seem to have assigned unique IDs to each of the matches. Giron Marcos vs. Holt Brandon in this case has an ID of ATM9GmXG.

    The second resource is similar. It's also a jQuery script which seems to be making a request to their main API. The response this time is JSON, which is generally the most convenient format to work with when web scraping. Here's what part of that looks like (notice the same ID):
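
The snippet below is not that response. It is a minimal sketch of why JSON is convenient: once the body is parsed, a match can be looked up directly by its ID instead of searching through HTML. The payload, its field names (`matches`, `odds`) and the odds values are all invented; only the ID ATM9GmXG and the player names come from the answer above.

```python
import json

# Invented payload for illustration only; the real response uses its own
# field names and structure.
raw_response = (
    '{"matches": {"ATM9GmXG": {"home": "Giron Marcos", '
    '"away": "Holt Brandon", "odds": [1.5, 2.5]}}}'
)


def odds_for_match(raw, match_id):
    """Return the odds list for match_id, or None if the ID is absent."""
    data = json.loads(raw)
    match = data.get("matches", {}).get(match_id)
    return None if match is None else match["odds"]


print(odds_for_match(raw_response, "ATM9GmXG"))
```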
