i\'m trying to learn how to parse html with python and i`m currently stuck with soup.findAll return me an empty array,therefore there are elements which could be found Here
Apparently, the page only loades the "odds" parts once it is called in a browser. So you could use Selenium and Chrome driver.
Note that you need to download the Chrome driver and place the driver in your .../python/
directory. Make sure you choose a matching driver version, meaning a version of Chrome driver that matches the version of the Chrome browser you have installed.
from bs4 import BeautifulSoup
from urllib.request import urlopen
import requests, time, traceback, random, csv, codecs, re, os
# Webdriver
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
options = webdriver.ChromeOptions()
options.add_argument('log-level=3')
browser = webdriver.Chrome(chrome_options=options)
url = 'https://www.oddsportal.com/matches/tennis/20191114/'
browser.get(url)
soup = BeautifulSoup(browser.page_source, "html.parser")
info = soup.findAll('tr', {'class':'odd deactivate'})
print(info)
i'm trying to learn how to parse html with python
You happened to pick a webpage which isn't very beginner-friendly when it comes to webscraping. Broadly speaking, most webpages use one or both of these two common methods for loading / displaying data:
The webpage you picked is of the second type. Just because you can see the <tr>
elements in the "Elements" tab of Chrome's Dev Tools doesn't mean that that's what the server sent you. By looking at the network tab of Chrome's Dev Tools you can see that a request is made to these two resources:
https://fb.oddsportal.com/ajax-next-games/2/0/1/20191114/yje3d.dat?=1574007087150
https://fb.oddsportal.com/ajax-next-games-odds/2/0/X0/20191114/1/yje3d.dat?=1574007087151
(The Query String parameters will not be the same for you. Visiting those urls also won't be very interesting unless you provide the right payload.)
The first resource seems to be a jQuery script which makes a request, the response of which contains HTML (this is your table). It looks something like this:
You can see that they seem to have assigned unique IDs to each of the matches. Giron Marcos vs. Holt Brandon in this case has an ID of ATM9GmXG
.
The second resource is similar. It's also a jQuery script which seems to be making a request to their main API. The response this time is JSON, which is always desirable for webscraping. Here's what part of that looks like (notice the same ID):