BeautifulSoup different parsers

前端未结

关注

 1  899

could anyone elaborate more about the difference between parsers like html.parser and html5lib? I\'ve stumbled across a weird behavior where when using html.parser it ignore

相关标签:

1条回答

闹比i

2021-01-20 00:12

You can use lxml which is very fast and can use find_all or select to get all tags.

from bs4 import BeautifulSoup
html = """
<html>
<head></head>
<body>
<!--[if lte IE 8]> <!-- data-module-name="test"--> <![endif]-->
 <![endif]-->
    <a href="test"></a>
    <a href="test"></a>
    <a href="test"></a>
    <a href="test"></a>
   <!--[if lte IE 8]>
  <![endif]-->
  </body>
</html>
"""

soup = BeautifulSoup(html, 'lxml')
tags = soup.find_all('a')
print(tags)

from bs4 import BeautifulSoup
html = """
<html>
<head></head>
<body>
<!--[if lte IE 8]> <!-- data-module-name="test"--> <![endif]-->
 <![endif]-->
    <a href="test"></a>
    <a href="test"></a>
    <a href="test"></a>
    <a href="test"></a>
   <!--[if lte IE 8]>
  <![endif]-->
  </body>
</html>
"""

soup = BeautifulSoup(html, 'lxml')
tags = soup.select('a')
print(tags)

0 讨论(0)