How to identify these HTML elements?

Submitted by 偶尔善良 on 2020-08-10 20:49:30

Question


In this answer, @Andrej Kesely uses the following code to remove unnecessary elements (ads, large blank areas, ...) from the HTML of this URL.

import requests
from bs4 import BeautifulSoup

url = 'https://www.collinsdictionary.com/dictionary/french-english/aimer'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0'}
soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')

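# Remove every element matched by the selector list: <script> tags,
# elements with class "hcdcrt", and the two ad slots selected by id.
for tag in soup.select('script, .hcdcrt, #ad_contentslot_1, #ad_contentslot_2'):
    tag.extract()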

print(soup.h2.text)
print(''.join(map(str, soup.select_one('.hom').contents)))

It seems to me that the unnecessary elements are the ones matched by the CSS selectors script, .hcdcrt, #ad_contentslot_1, and #ad_contentslot_2.
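To make that selector string concrete, here is a minimal sketch run on a made-up HTML fragment. The fragment and its text are assumptions for illustration only, not the real Collins page. A comma-separated selector matches the union of its parts: script matches by tag name, .hcdcrt by class, and #ad_contentslot_1 by id.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that imitates the kind of markup being cleaned;
# the real page is far larger and may be structured differently.
html = """
<div class="hom">
  <span class="orth">aimer</span>
  <script>var x = 1;</script>
  <div class="hcdcrt">spacer</div>
  <div id="ad_contentslot_1">ad one</div>
  <span class="sense">to love</span>
  <div id="ad_contentslot_2">ad two</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
selector = "script, .hcdcrt, #ad_contentslot_1, #ad_contentslot_2"

# select() returns every element matched by any part of the selector list.
matched = soup.select(selector)
matched_names = [(tag.name, tag.get("class"), tag.get("id")) for tag in matched]

# extract() detaches each matched element from the tree.
for tag in matched:
    tag.extract()

remaining_text = soup.select_one(".hom").get_text(" ", strip=True)
print(matched_names)
print(remaining_text)
```

Printing matched_names before removing anything is a cheap way to check that a selector hits only the elements you intend to drop.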

Could you please explain how to inspect the HTML structure (by pressing F12) to pin these elements down?


Answer 1:


@bigbounty's comment solved my problem. I am posting it here so that the question no longer appears in the unanswered list.

One way is to right-click in Chrome and visualize the HTML DOM using livedom.validator.nu or any other online service.
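As a complement to inspecting the DOM visually, the same survey can be roughed out in code. The sketch below is only an illustration, not part of the original comment: survey_selectors is a hypothetical helper, and the sample markup is invented. It counts how often each id and class occurs, so that names such as ad slots or repeated spacer containers stand out as candidates for removal.

```python
from collections import Counter

from bs4 import BeautifulSoup


def survey_selectors(html):
    """Count each id and class in the document, written as CSS selectors."""
    soup = BeautifulSoup(html, "html.parser")
    counts = Counter()
    for tag in soup.find_all(True):  # True matches every tag
        if tag.get("id"):
            counts["#" + tag["id"]] += 1
        # html.parser exposes the class attribute as a list of names.
        for cls in tag.get("class", []):
            counts["." + cls] += 1
    return counts


# Invented sample; with a real page, pass the downloaded HTML instead.
sample = """
<div class="hom">
  <div class="hcdcrt"></div>
  <div id="ad_contentslot_1" class="ad"></div>
  <div class="hcdcrt"></div>
  <span class="sense">to love</span>
</div>
"""

counts = survey_selectors(sample)
for selector, n in counts.most_common():
    print(selector, n)
```

The counts only suggest candidates. Whether a given class or id really wraps an ad or empty space still has to be confirmed by looking at the element in the browser's inspector.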



Source: https://stackoverflow.com/questions/63109765/how-to-determine-these-elements-of-html
