Question
In this answer, @Andrej Kesely uses the following code to remove unnecessary elements (ads, large blank spaces, ...) from the HTML of this URL.
import requests
from bs4 import BeautifulSoup

url = 'https://www.collinsdictionary.com/dictionary/french-english/aimer'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0'}
soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')

# Remove every element matched by the CSS selector list from the tree.
for script in soup.select('script, .hcdcrt, #ad_contentslot_1, #ad_contentslot_2'):
    script.extract()

print(soup.h2.text)
# Print the inner HTML of the first element with class "hom".
print(''.join(map(str, soup.select_one('.hom').contents)))
It seems to me that those unnecessary elements are identified by the CSS selectors script (a tag name), .hcdcrt (a class), and #ad_contentslot_1 and #ad_contentslot_2 (ids).
Could you please explain how to examine the HTML structure (by pressing F12) to pin them down?
Answer 1:
@bigbounty's comment solved my problem. I am posting it here to remove my question from the unanswered list.
One way is to right-click the page in Chrome and inspect it, then visualize the HTML DOM using livedom.validator.nu or any other online service.
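For readers who would rather stay in Python, the same kind of DOM overview can be approximated with the standard library. The following is only a minimal sketch, not what livedom.validator.nu or the browser does internally: the DomOutline class and the sample markup are hypothetical and were written for illustration, not copied from the Collins page. It prints one indented line per element, written as tag#id.class, so candidate selectors such as .hcdcrt or #ad_contentslot_1 are easy to spot.

```python
from html.parser import HTMLParser

# Void elements have no closing tag, so they must not increase the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class DomOutline(HTMLParser):
    """Collect one indented line per start tag, written like a CSS selector."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        label = tag
        if attrs.get("id"):
            label += "#" + attrs["id"]
        for cls in (attrs.get("class") or "").split():
            label += "." + cls
        self.lines.append("  " * self.depth + label)
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth = max(0, self.depth - 1)


# Hypothetical sample markup (an assumption, not the real page source).
sample = """
<div id="main">
  <h2>aimer</h2>
  <div id="ad_contentslot_1" class="hcdcrt"></div>
  <div class="hom"><span class="pos">verb</span></div>
  <script>var x = 1;</script>
</div>
"""

outline = DomOutline()
outline.feed(sample)
print("\n".join(outline.lines))
```

To try it on a real page, the downloaded HTML text could be passed to feed() instead of sample. Indentation is only approximate when the markup has unclosed or misnested tags, since this sketch does not repair the tree the way a browser does.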
Source: https://stackoverflow.com/questions/63109765/how-to-determine-these-elements-of-html