Parsing a site with BeautifulSoup

后端未结


I'm trying to learn how to parse HTML with Python, and I'm currently stuck: soup.findAll returns an empty list, even though the page contains elements that should match.


2 Answers

面向向阳花

2021-01-14 15:35

Apparently, the page only loads the "odds" parts once it is opened in a browser, so you could use Selenium with the Chrome driver.

Note that you need to download the Chrome driver and place the driver in your .../python/ directory. Make sure you choose a matching driver version, meaning a version of Chrome driver that matches the version of the Chrome browser you have installed.

from bs4 import BeautifulSoup
from selenium import webdriver

# Reduce the amount of console logging produced by Chrome
options = webdriver.ChromeOptions()
options.add_argument('log-level=3')
# The `chrome_options=` keyword is deprecated; pass the options via `options=`
browser = webdriver.Chrome(options=options)

url = 'https://www.oddsportal.com/matches/tennis/20191114/'
browser.get(url)
# page_source is the HTML as the browser currently sees it, not the raw server response
soup = BeautifulSoup(browser.page_source, "html.parser")
browser.quit()

# Matches rows whose class attribute is exactly "odd deactivate"
info = soup.find_all('tr', {'class': 'odd deactivate'})
print(info)
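
Because the rows are injected by JavaScript, page_source may be read before they exist. Below is a minimal sketch of waiting for them explicitly. It assumes the rows still carry the classes odd and deactivate used in the snippet above, and the 15-second timeout is an arbitrary choice. fetch_rendered_html and find_rows are hypothetical helper names, not part of Selenium or BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Assumption: the rows still carry both classes used in the answer above.
ROW_SELECTOR = "tr.odd.deactivate"


def fetch_rendered_html(url, timeout=15):
    """Open `url` in Chrome and return the HTML once a matching row exists."""
    # Imported lazily so find_rows() below can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    browser = webdriver.Chrome()
    try:
        browser.get(url)
        # Raises TimeoutException if no matching row appears within `timeout` seconds.
        WebDriverWait(browser, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, ROW_SELECTOR))
        )
        return browser.page_source
    finally:
        # Always close the browser, even if the wait times out.
        browser.quit()


def find_rows(html):
    """Return every <tr> that has both the `odd` and `deactivate` classes."""
    return BeautifulSoup(html, "html.parser").select(ROW_SELECTOR)
```

With these helpers, find_rows(fetch_rendered_html(url)) would replace the get / page_source / find_all sequence. Unlike the exact-string match above, the CSS selector also matches rows that have additional classes.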


南方客

2021-01-14 15:47

"i'm trying to learn how to parse html with python"

You happened to pick a webpage which isn't very beginner-friendly when it comes to webscraping. Broadly speaking, most webpages use one or both of these two common methods for loading / displaying data:
- The user makes a request to a server (visits a page, for example). The server gets the necessary data from a database. The server generates an HTML response using a templating engine, and returns the response for the user's browser to render.
- The user makes a request to a server. The server returns an HTML-skeleton response which gets populated with data dynamically by making other requests / using APIs etc.

The webpage you picked is of the second type. Just because you can see the <tr> elements in the "Elements" tab of Chrome's Dev Tools doesn't mean that that's what the server sent you. By looking at the "Network" tab of Chrome's Dev Tools you can see that a request is made to these two resources:

https://fb.oddsportal.com/ajax-next-games/2/0/1/20191114/yje3d.dat?=1574007087150
https://fb.oddsportal.com/ajax-next-games-odds/2/0/X0/20191114/1/yje3d.dat?=1574007087151

(The Query String parameters will not be the same for you. Visiting those urls also won't be very interesting unless you provide the right payload.)
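
To make the difference between the two page types concrete, here is a small sketch of what BeautifulSoup sees in each case. Both HTML strings are invented for illustration and are not copied from the site. count_rows is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for illustration (not the site's real HTML).
# What the server might send: an empty table plus a script that fills it in later.
SKELETON_HTML = """
<table id="matches"></table>
<script src="/load-matches.js"></script>
"""

# What the "Elements" tab might show after that script has run in a browser.
RENDERED_HTML = """
<table id="matches">
  <tr class="odd"><td>Player A - Player B</td></tr>
</table>
"""


def count_rows(html):
    """Count the <tr> elements BeautifulSoup can see in `html`."""
    return len(BeautifulSoup(html, "html.parser").find_all("tr"))


print(count_rows(SKELETON_HTML))  # 0: the rows are not in the server's response
print(count_rows(RENDERED_HTML))  # 1: the rows exist only after JavaScript ran
```

BeautifulSoup only parses the text it is given and never executes scripts, which is why parsing the raw server response of a page of the second type yields an empty result.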

The first resource seems to be a jQuery script which makes a request, the response of which contains HTML (this is your table). It looks something like this:

You can see that they seem to have assigned unique IDs to each of the matches. Giron Marcos vs. Holt Brandon in this case has an ID of ATM9GmXG.

The second resource is similar. It's also a jQuery script which seems to be making a request to their main API. The response this time is JSON, which is always desirable for webscraping. Here's what part of that looks like (notice the same ID):
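
The site's actual payload is not reproduced in the sketch below. It is only a rough illustration of how a JSON body embedded in a script-style response could be pulled out with the standard library, assuming the JSON object is the first {...} in the text. SAMPLE_RESPONSE, its keys and values, and extract_json are all invented for illustration and do not reflect the site's real format.

```python
import json

# Invented example of a JSON object wrapped in a JavaScript call.
# This is NOT real output from the site; the keys and numbers are placeholders.
SAMPLE_RESPONSE = 'callback("/some/path", {"odds": {"MATCH_ID": [1.5, 2.5]}});'


def extract_json(text):
    """Parse the first JSON object embedded in `text` and return it."""
    # Assumption: the first "{" in the text starts the JSON object.
    start = text.index("{")
    # raw_decode parses one JSON value from `start` and ignores trailing text.
    payload, _end = json.JSONDecoder().raw_decode(text, start)
    return payload


data = extract_json(SAMPLE_RESPONSE)
print(data["odds"]["MATCH_ID"])  # [1.5, 2.5]
```

Once the payload is a Python dict, a match's entry can be looked up by ID without any HTML parsing.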