I am new to Python and to Beautiful Soup as well! I have heard that BS is a great tool for parsing and extracting content. So here I am:
I want to take the content
Use the "text" attribute to get the text inside each "td".
1) First read table DOM using tag or ID
soup = BeautifulSoup(self.driver.page_source, "html.parser")
htnm_migration_table = soup.find("table", {'id':'htnm_migration_table'})
2) Read tbody
tbody = htnm_migration_table.find('tbody')
3) Read all tr from tbody tag
trs = tbody.find_all('tr')
4) get all tds using tr
for tr in trs:
    tds = tr.find_all('td')
    for td in tds:
        print(td.text)
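The four steps above can be put together roughly as follows. This is only a sketch: the html string is a made-up stand-in for self.driver.page_source, and table_rows is a hypothetical helper name, not something from Beautiful Soup itself.

```python
from bs4 import BeautifulSoup

# Hypothetical page source standing in for self.driver.page_source.
html = """
<table id="htnm_migration_table">
  <tbody>
    <tr><td>alpha</td><td>1</td></tr>
    <tr><td>beta</td><td>2</td></tr>
  </tbody>
</table>
"""


def table_rows(page_source, table_id):
    """Return the text of every td in the table, grouped by row."""
    soup = BeautifulSoup(page_source, "html.parser")

    # 1) Find the table by its id.
    table = soup.find("table", {"id": table_id})
    if table is None:
        return []

    # 2) Read tbody; fall back to the table itself if the page has none.
    body = table.find("tbody")
    if body is None:
        body = table

    # 3) and 4) Walk every tr and collect the text of its td cells.
    rows = []
    for tr in body.find_all("tr"):
        rows.append([td.get_text(strip=True) for td in tr.find_all("td")])
    return rows


rows = table_rows(html, "htnm_migration_table")
for row in rows:
    print(row)
```

Checking for None before each nested search avoids an AttributeError when the table is missing from the page.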
I find Beautiful Soup a very efficient tool, so keep learning it :-) It can parse a page with invalid markup, so it should be able to handle the page you refer to. You may want to use BeautifulSoup(html).prettify()
if you want a reformatted page source with the markup repaired.
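A small sketch of what prettify() does with markup that is missing its closing tags. The broken_html string is invented for illustration, and I am assuming the built-in "html.parser" here; other parsers (lxml, html5lib) may repair the same input differently.

```python
from bs4 import BeautifulSoup

# Made-up fragment: none of the tags are closed.
broken_html = "<table><tr><td>This is a sample text"

soup = BeautifulSoup(broken_html, "html.parser")

# prettify() returns a string with one tag per line, indented,
# and with the closing tags the parser had to add.
pretty = soup.prettify()
print(pretty)

# The parsed tree is searchable even though the input was incomplete.
cell_text = soup.find("td").get_text(strip=True)
print(cell_text)
```

Note that prettify() adds its own line breaks and indentation around text, so it is meant for reading the structure rather than for extracting exact strings.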
As for your question: soup.findAll(...) returns a list of matches, so you cannot call find() on its result directly. The result of soup.find(...)
is a single Beautiful Soup object, and you can make a second search in it, like this:
table_soup = soup.find('table', attrs={'class':'bp_ergebnis_tab_info'})
your_sample_text = table_soup.find("td").renderContents().strip()
print(your_sample_text)
First find the table (as you are doing). Using find rather than findAll returns the first match (rather than a list of all matches, in which case we'd have to add an extra [0] to take the first element of the list):
table = soup.find('table' ,attrs={'class':'bp_ergebnis_tab_info'})
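To make the difference concrete, here is a short sketch comparing the two calls. It assumes Beautiful Soup 4, where find_all is the current spelling of findAll, and the html string is made up for the example.

```python
from bs4 import BeautifulSoup

# Made-up page with two tables sharing the same class.
html = (
    '<table class="bp_ergebnis_tab_info"><tr><td>first</td></tr></table>'
    '<table class="bp_ergebnis_tab_info"><tr><td>second</td></tr></table>'
)
soup = BeautifulSoup(html, "html.parser")

# find_all returns a list-like collection of every match ...
tables = soup.find_all("table", attrs={"class": "bp_ergebnis_tab_info"})
print(len(tables))

# ... so a further search needs an index first.
via_index = tables[0].find("td").get_text()

# find returns just the first match, ready for another search.
table = soup.find("table", attrs={"class": "bp_ergebnis_tab_info"})
via_find = table.find("td").get_text()
print(via_index == via_find)

# When nothing matches, find returns None and find_all returns an empty list.
missing = soup.find("table", attrs={"class": "no_such_class"})
none_found = soup.find_all("table", attrs={"class": "no_such_class"})
```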
Then use find again to find the first td:
first_td = table.find('td')
Then use renderContents() to extract the textual contents:
text = first_td.renderContents()
... and the job is done (though you may also want to use strip() to remove leading and trailing spaces):
trimmed_text = text.strip()
This should give:
>>> print trimmed_text
This is a sample text
>>>
as desired.
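For anyone on Beautiful Soup 4 with Python 3, the same walk-through might look like the sketch below. It swaps renderContents() for get_text(strip=True), which returns a str with the surrounding whitespace already removed. The html string and the first_cell_text helper are assumptions made up for this example.

```python
from bs4 import BeautifulSoup

# Made-up page source; the real page would come from your HTTP client.
html = """
<table class="bp_ergebnis_tab_info">
  <tr>
    <td>
      This is a sample text
    </td>
    <td>another cell</td>
  </tr>
</table>
"""


def first_cell_text(page_source, css_class):
    """Return the trimmed text of the first td in the first matching table."""
    soup = BeautifulSoup(page_source, "html.parser")

    # First match only, so no [0] is needed.
    table = soup.find("table", attrs={"class": css_class})
    if table is None:
        return None

    first_td = table.find("td")
    if first_td is None:
        return None

    # strip=True trims leading and trailing whitespace from the text.
    return first_td.get_text(strip=True)


trimmed_text = first_cell_text(html, "bp_ergebnis_tab_info")
print(trimmed_text)
```

Returning None for a missing table or cell is just one possible choice; raising an exception would work equally well if a missing table should be treated as an error.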