Find data within HTML tags using Python

问题

I have the following HTML code I am trying to scrape from a website:

<td>Net Taxes Due<td>
<td class="value-column">$2,370.00</td>
<td class="value-column">$2,408.00</td>

What I am trying to accomplish is to search the page to find the text "Net Taxes Due" within the tag, find the siblings of the tag, and send the results into a Pandas data frame.

I have the following code:

soup = BeautifulSoup(url, "html.parser")
table = soup.select('#Net Taxes Due')

cells = table.find_next_siblings('td')
cells = [ele.text.strip() for ele in cells]

df = pd.DataFrame(np.array(cells))

print(df)

I've been all over the web looking for a solution and can't come up with something. Appreciate any help.

Thanks!

回答1:

Make sure to add the tag name along with your search string. This is how you can do that:

from bs4 import BeautifulSoup

htmldoc = """
<tr>
    <td>Net Taxes Due</td>
    <td class="value-column">$2,370.00</td>
    <td class="value-column">$2,408.00</td>
</tr>
"""    
soup = BeautifulSoup(htmldoc, "html.parser")
item = soup.find('td',text='Net Taxes Due').find_next_sibling("td")
print(item)

回答2:

In the following I expected to use indices 1 and 2 but 2 and 3 seems to work when using lxml.html and xpath

import requests
from lxml.html import fromstring
# url = ''
# tree = html.fromstring( requests.get(url).content)
h = '''
<td>Net Taxes Due<td>
<td class="value-column">$2,370.00</td>
<td class="value-column">$2,408.00</td>

'''
tree = fromstring(h)
links = [link.text for link in tree.xpath('//td[text() = "Net Taxes Due"]/following-sibling::td[2] | //td[text() = "Net Taxes Due"]/following-sibling::td[3]' )]
print(links)

回答3:

Your .select() call is not correct. # in a selector is used to match an element's ID, not its text contents, so #Net means to look for an element with id="Net". Spaces in a selector mean to look for descendants that match each successive selector. So #Net Taxes Due searches for something like:

<div id="Net">
    <taxes>
        <due>...</due>
    </taxes>
</div>

To search for an element containing a specific string, use .find() with the string keyword:

table = soup.find(string="Net Taxes Due")

回答4:

Assuming that there's an actual HTML table involved:

<html>
<table>
<tr>
<td>Net Taxes Due</td>
<td class="value-column">$2,370.00</td>
<td class="value-column">$2,408.00</td>
</tr>
</table>
</html>

soup = BeautifulSoup(url, "html.parser")
table = soup.find('tr')
df = [x.text for x in table.findAll('td', {'class':'value-column'})]

回答5:

These should work. If you are using bs4 4.7.0, you "could" use select. But if you are on an older version, or just prefer the find interface, you can use that. Basically as stated earlier, you cannot reference content with #, that is an ID.

import bs4

markup = """
<td>Net Taxes Due</td>
<td class="value-column">$2,370.00</td>
<td class="value-column">$2,408.00</td>
"""

# Version 4.7.0
soup = bs4.BeautifulSoup(markup, "html.parser")
cells = soup.select('td:contains("Net Taxes Due") ~ td.value-column')
cells = [ele.text.strip() for ele in cells]
print(cells)

# Version < 4.7.0 or if you prefer find
soup = bs4.BeautifulSoup(markup, "html.parser")
cells = soup.find('td', text="Net Taxes Due").find_next_siblings('td')
cells = [ele.text.strip() for ele in cells]
print(cells)

You would get this

['$2,370.00', '$2,408.00']
['$2,370.00', '$2,408.00']

来源：https://stackoverflow.com/questions/54045382/find-data-within-html-tags-using-python

标签

python

web-scraping

beautifulsoup