How to get only the first class's data between two identical classes

Question

On the https://www.hltv.org/matches page, matches are divided by date, but every date block uses the same class. For example:

This is today's match class

<div class="match-day"><div class="standard-headline">2018-05-01</div>

This is tomorrow's match class.

<div class="match-day"><div class="standard-headline">2018-05-02</div>

What I'm trying to do is get the links under the "standard-headline" class, but only for today's matches. In other words, I only want the links from the first "match-day" block.

Here is my code.

import urllib.request
from bs4 import BeautifulSoup
headers = {}  # Headers tell the server about the client, such as the operating system and browser.
headers['User-Agent'] = 'Mozilla/5.0'  # I set a user agent because HLTV treats my connection as a bot otherwise.
hltv = urllib.request.Request('https://www.hltv.org/matches', headers=headers)  # Basically connecting to website
session = urllib.request.urlopen(hltv)
sauce = session.read()  # Getting the source of website
soup = BeautifulSoup(sauce, 'lxml')

matchlinks = []
# Getting the match pages' links.
for section in soup.find_all('div', class_='upcoming-matches'):  # Each "upcoming-matches" container in the source.
    for link in section.find_all('a'):  # Every "a" tag inside that container (not the whole page).
        clearlink = link.get('href')  # The link target; None if the tag has no href.
        if clearlink and clearlink.startswith('/matches/'):  # Keep only links that start with "/matches/".
            matchlinks.append('https://hltv.org' + clearlink)  # Add the absolute URL to the list.

Answer 1:

The website shows today's matches first (at the top), followed by the next days' matches. So, if you want today's matches, you can simply use find(), which returns the first match found.

Using this will give you what you want:

today = soup.find('div', class_='match-day')
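
To go from that tag to the links the question asks for, one option is to search inside `today` instead of the whole soup. The sketch below runs on a small hand-written HTML fragment that stands in for the real page. The fragment, the match URLs, and the helper name `first_day_links` are assumptions for illustration, and it uses the built-in `html.parser` so that it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the HLTV page; the real markup may differ.
SAMPLE_HTML = """
<div class="upcoming-matches">
  <div class="match-day"><span class="standard-headline">2018-05-01</span>
    <a href="/matches/1/team-a-vs-team-b">A vs B</a>
    <a href="/matches/2/team-c-vs-team-d">C vs D</a>
  </div>
  <div class="match-day"><span class="standard-headline">2018-05-02</span>
    <a href="/matches/3/team-e-vs-team-f">E vs F</a>
  </div>
</div>
"""


def first_day_links(soup, base='https://hltv.org'):
    """Return absolute match links found inside the first "match-day" block."""
    today = soup.find('div', class_='match-day')  # find() returns only the first hit
    if today is None:  # layout changed, or no matches are listed
        return []
    links = []
    for a in today.find_all('a', href=True):  # skip <a> tags that have no href
        if a['href'].startswith('/matches/'):
            links.append(base + a['href'])
    return links


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
matchlinks = first_day_links(soup)
print(matchlinks)
```

Because the search starts from `today`, links that belong to later "match-day" blocks are never visited.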

If you want to specify the date explicitly, you can find the tag that contains the date by passing text='2018-05-02' to the find() method. Note that in the page source the tag is <span class="standard-headline">2018-05-02</span>, not a <div> tag. After getting this tag, use .parent to get the enclosing <div class="match-day"> tag.

today = soup.find('span', text='2018-05-02').parent

To make the solution more generic, you can use datetime.date.today(), converted to a string, instead of the hard-coded date.

today = soup.find('span', text=str(datetime.date.today())).parent

You'll have to import the datetime module for this.
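
The date-based lookup can be sketched as a small helper that also handles the case where no headline matches, since calling .parent on the None that find() returns would raise an AttributeError. The helper name `match_day_for` and the sample HTML are assumptions for illustration. The sketch uses `string=`, which is the newer name for the `text=` argument in recent Beautiful Soup versions, and a fixed date so the result does not depend on when it is run.

```python
import datetime

from bs4 import BeautifulSoup

# Hand-written stand-in for the HLTV page; the real markup may differ.
SAMPLE_HTML = """
<div class="match-day"><span class="standard-headline">2018-05-01</span>
  <a href="/matches/1/team-a-vs-team-b">A vs B</a>
</div>
<div class="match-day"><span class="standard-headline">2018-05-02</span>
  <a href="/matches/3/team-e-vs-team-f">E vs F</a>
</div>
"""


def match_day_for(soup, day):
    """Return the block whose headline equals `day` (a datetime.date), or None."""
    # isoformat() gives 'YYYY-MM-DD', the format used in the headlines above.
    headline = soup.find('span', string=day.isoformat())
    return headline.parent if headline is not None else None


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# Fixed date for a repeatable demo; datetime.date.today() would go here in practice.
block = match_day_for(soup, datetime.date(2018, 5, 2))
hrefs = [a['href'] for a in block.find_all('a', href=True)]
print(hrefs)

missing = match_day_for(soup, datetime.date(2018, 5, 3))
print(missing)
```

One caveat: datetime.date.today() uses the local date of the machine running the script, which may not match the date shown on the site if the two are in different time zones.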

Source: https://stackoverflow.com/questions/50120344/how-to-get-only-first-class-data-between-two-same-classes

Tags

python

python-3.x

beautifulsoup

urllib