Trying to scrape some HTML from something like this. Sometimes the data I need is in div[0], sometimes div[1], etc.
Imagine everyone takes 3-5 classes. One of them is al
(1) To get only the Biology grade, it is almost a one-liner:
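import bs4, re
# html is assumed to hold the page source as a string
soup = bs4.BeautifulSoup(html, 'html.parser')
# find every text node that contains 'Biology', whichever div it is in
scores_string = soup.find_all(text=re.compile('Biology'))
# the grade is the last whitespace-separated token of each match
scores = [score_string.split()[-1] for score_string in scores_string]
print(scores_string)
print(scores)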
The output looks like this:
[u'Biology A+', u'Biology B', u'Biology B', u'Biology B', u'Biology B']
[u'A+', u'B', u'B', u'B', u'B']
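Since the question's HTML is not shown here, this is a minimal self-contained sketch of the same idea. The markup below is invented for illustration, and it uses `string=`, the name Beautiful Soup 4.4.0 and later gives to the `text=` argument.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: the real page's structure is an assumption.
html = """
<div class="student">
  <div>Math A</div>
  <div>Biology A+</div>
  <div>History B</div>
</div>
<div class="student">
  <div>Biology B</div>
  <div>Art C</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Match text nodes containing "Biology", wherever they sit among the divs.
matches = soup.find_all(string=re.compile("Biology"))

# The grade is the last whitespace-separated token of each match.
grades = [m.split()[-1] for m in matches]
print(grades)
```

Because the search is on the text rather than on a div index, it does not matter whether Biology is the first, second, or last class listed.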
(2) Once you have located the matching strings, you may need their parent tags for further tasks:
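import bs4, re
# html is assumed to hold the page source as a string
soup = bs4.BeautifulSoup(html, 'html.parser')
# text nodes that contain 'Biology'
scores = soup.find_all(text=re.compile('Biology'))
# .parent is the tag that directly encloses each text node
divs = [score.parent for score in scores]
print(divs)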
Output looks like this:
[Biology A+,
Biology B,
Biology B,
Biology B,
Biology B]
*In conclusion, you can use .parent, find_parent(), find_next_sibling(), find_previous_sibling(), etc. to move around the HTML tree.*
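Here is a rough sketch of that kind of navigation, starting from the matched text and moving up and sideways. The markup, the `student` class, and the `alice` id are all made up for the example, so adjust them to the real page.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup, invented for illustration.
html = """
<div class="student" id="alice">
  <div>Math A</div>
  <div>Biology A+</div>
  <div>History B</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Start from the matching text node and step up to its enclosing tag.
biology_text = soup.find(string=re.compile("Biology"))
biology_div = biology_text.parent

# Move sideways: the nearest <div> siblings before and after.
# Passing a tag name skips the whitespace strings between the tags.
previous_div = biology_div.find_previous_sibling("div")
next_div = biology_div.find_next_sibling("div")

# Move up: the nearest ancestor <div> with class "student".
student_div = biology_div.find_parent("div", class_="student")

print(previous_div.get_text())
print(next_div.get_text())
print(student_div["id"])
```

With this sample markup the three prints give the neighbouring classes and the id of the enclosing student block.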
More information about how to navigate the tree is in the Beautiful Soup documentation. Good luck with your work.