Python beautifulsoup extract value without identifier

问题

I am facing a problem and don't know how to solve it properly. I want to extract the price (so in the first example 130€, in the second 130€).

the problem is that the attributes are changing all the time. so I am unable to do something like this, because I am scraping hundreds of sites and and on each site the first 2 chars of the "id" attribute may differ:

tag = soup_expose_html.find('span', attrs={'id' : re.compile(r'(07_content$)')})

Even if I would use something like this it wont work, because there is no link to the price and I would probably get some other value:

tag = soup_expose_html.find('span', attrs={'id' : re.compile(r'([0-9]{2}_content$)')})

Example html code:

<span id="07_lbl" class="lbl">Price:</span>
<span id="07_content" class="content">130  €</span>
<span id="08_lbl" class="lbl">Value:</span>
<span id="08_content" class="content">90000  €</span>


<span id="03_lbl" class="lbl">Price:</span>
<span id="03_content" class="content">130  €</span>
<span id="04_lbl" class="lbl">Value:</span>
<span id="04_content" class="content">90000  €</span>

The only thing I can imagine of at the moment is to identify the price tag with something like "text = 'Price:'" and after that get .next_sibling and extract the string. but I am not sure if there is better way to do it. Any suggestions? :-)

回答1:

Here is how you would easily extract only the price values like you had in mind in your original post.

html = """
        <span id="07_lbl" class="lbl">Price:</span>
        <span id="07_content" class="content">130  €</span>
        <span id="08_lbl" class="lbl">Value:</span>
        <span id="08_content" class="content">90000  €</span>


        <span id="03_lbl" class="lbl">Price:</span>
        <span id="03_content" class="content">130  €</span>
        <span id="04_lbl" class="lbl">Value:</span>
        <span id="04_content" class="content">90000  €</span>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html)

price_texts = soup.find_all('span', text='Price:')
for element in price_texts:
    # .next_sibling() might work, too, with a parent element present
    price_value = element.find_next_sibling('span')
    print price_value.get_text()

# It prints:
# 130  €
# 130  €

This solution has less code and, IMO, is more clear.

回答2:

How about a findAll solution?
First collect all possibles id prefixes and then iterate them and get all elements

>>> from bs4 import BeautifulSoup
>>> import re
>>> html = """
...         <span id="07_lbl" class="lbl">Price:</span>
...         <span id="07_content" class="content">130  €</span>
...         <span id="08_lbl" class="lbl">Value:</span>
...         <span id="08_content" class="content">90000  €</span>
... 
... 
...         <span id="03_lbl" class="lbl">Price:</span>
...         <span id="03_content" class="content">130  €</span>
...         <span id="04_lbl" class="lbl">Value:</span>
...         <span id="04_content" class="content">90000  €</span>
... """
>>> 
>>> soup = BeautifulSoup(html)
>>> span_id_prefixes = [
...     span['id'].replace("_content","")
...     for span in soup.findAll('span', attrs={'id' : re.compile(r'(_content$)')})
... ]
>>> for prefix in span_id_prefixes:
...     lbl     = soup.find('span', attrs={'id' : '%s_lbl' % prefix})
...     content = soup.find('span', attrs={'id' : '%s_content' % prefix})
...     if lbl and content:
...         print lbl.text, content.text
... 
Price: 130  €
Value: 90000  €
Price: 130  €
Value: 90000  €

回答3:

Try Beautiful soup selects function. It uses css selectors:

for span in soup_expose_html.select("span[id$=_content]"):
    print span.text

the result is a list with all spans which have an id ending with _content

来源：https://stackoverflow.com/questions/26094958/python-beautifulsoup-extract-value-without-identifier

标签

python

regex

beautifulsoup