BeautifulSoup: Study Notes

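BeautifulSoup is a Python library for extracting data from HTML and XML files. Compared with regular expressions it is simpler and more convenient to use, and it often saves a great deal of time.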

Official documentation (Chinese edition): https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html

The notes below summarize the main points.

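Available parsers

The main parsers are:

Parser	Usage	Advantages	Disadvantages
Python standard library	`BeautifulSoup(markup, "html.parser")`	Built into Python; moderate speed; lenient with malformed documents	Less lenient in versions before Python 2.7.3 or 3.2.2
lxml HTML parser	`BeautifulSoup(markup, "lxml")`	Very fast; lenient with malformed documents	Requires a C library
lxml XML parser	`BeautifulSoup(markup, ["lxml", "xml"])` or `BeautifulSoup(markup, "xml")`	Very fast; the only XML parser supported	Requires a C library
html5lib	`BeautifulSoup(markup, "html5lib")`	Best leniency; parses pages the way a browser does; produces valid HTML5	Very slow; requires an external Python package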

lxml is not bundled with Beautiful Soup and sometimes has to be installed separately:

pip install lxml

Once it is installed, pass 'lxml' as the parser when parsing a page:

soup = BeautifulSoup(html_doc, 'lxml')
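If a script has to run both with and without lxml installed, one option is to choose the parser at runtime. The sketch below is only an illustration: make_soup is a hypothetical helper name (not part of Beautiful Soup), and it assumes that falling back to the built-in html.parser is acceptable for the documents being parsed.

```python
from bs4 import BeautifulSoup

html_doc = "<p>Hello, <b>world</b></p>"


def make_soup(markup):
    """Hypothetical helper: prefer lxml if importable, else use html.parser."""
    try:
        import lxml  # noqa: F401  (only checks that the package is available)
        parser = "lxml"
    except ImportError:
        parser = "html.parser"
    return BeautifulSoup(markup, parser)


soup = make_soup(html_doc)
print(soup.b.get_text())
```

Naming the parser explicitly also avoids the warning that Beautiful Soup emits when it has to guess one.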

Kinds of objects

(1) Tag

A Tag object corresponds to a tag in the original document:

>>> soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
>>> tag = soup.b
>>> type(tag)
<class 'bs4.element.Tag'>

A Tag has many attributes and methods, including:

　　.name

>>> soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
>>> tag = soup.b
>>> type(tag)
<class 'bs4.element.Tag'>
>>> tag.name
'b'

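Attributes

A tag's attributes can be accessed as if the tag were a dictionary:

>>> soup = BeautifulSoup('<p class="body strikeout"></p>')
>>> soup.p['class']
['body', 'strikeout']  # a multi-valued attribute is returned as a list

Attributes are also available through .attrs.

Values are usually strings; multi-valued attributes such as class are returned as lists.

>>> tag.attrs
{'class': ['boldest']}
>>> tag.attrs["class"]
['boldest']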
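To make the difference between single-valued and multi-valued attributes concrete, here is a small sketch. The markup is made up for illustration, and html.parser is used only so the example has no extra dependencies.

```python
from bs4 import BeautifulSoup

# Illustrative markup, not taken from a real page.
markup = '<a class="sister big" href="http://example.com/elsie" id="link1">Elsie</a>'
soup = BeautifulSoup(markup, "html.parser")
tag = soup.a

classes = tag["class"]      # multi-valued attribute: a list of strings
href = tag["href"]          # single-valued attribute: a plain string
missing = tag.get("title")  # .get() returns None instead of raising KeyError
attrs = tag.attrs           # all attributes as a dict

print(classes, href, missing)
```

Using tag.get(...) is usually the safer choice when an attribute may be absent, since tag["title"] would raise a KeyError here.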

　　.get_text()

get_text() returns all the text inside a tag, including the text of its descendants.

The .text property returns the same result.

In [1]: soup.body.get_text()
Out[1]: "The Dormouse's story\nOnce upon a time there were three little sisters; and their names were\nElsie,\nLacie and\nTillie;\nand they lived at the bottom of a well.\n...\n"
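The raw result of get_text() keeps all the whitespace from the page. A minimal sketch of the separator and strip arguments, on made-up markup, shows one way to get cleaner output:

```python
from bs4 import BeautifulSoup

# Illustrative markup with stray spaces around the names.
soup = BeautifulSoup("<div><p> Elsie </p><p> Lacie </p></div>", "html.parser")

plain = soup.div.get_text()                            # strings joined as-is
joined = soup.div.get_text(separator="|", strip=True)  # stripped, then joined

print(repr(plain))
print(repr(joined))
```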

　　(2)NavigableString

NavigableString means a string that can be navigated. The text wrapped inside a tag is usually a NavigableString.

In [1]: soup = BeautifulSoup('<p>No longer bold</p>')
In [2]: soup.p.string
Out[2]: 'No longer bold'
In [3]: type(soup.p.string)
Out[3]: bs4.element.NavigableString

　　(3)BeautifulSoup

The BeautifulSoup object represents the parsed document as a whole.

　　(4)Comment

A Comment object represents a comment in the page; it is a special type of NavigableString.
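Because a Comment is a string subclass, it can be mistaken for ordinary text. The sketch below, on made-up markup, shows one way to pick out comments with isinstance; it assumes a Beautiful Soup version that accepts the string argument of find_all (4.4.0 or later).

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Illustrative markup containing visible text and a comment.
soup = BeautifulSoup("<p>Visible<!-- hidden note --></p>", "html.parser")

# A function passed as `string` is called with each string node.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))

for c in comments:
    print(type(c).__name__, repr(str(c)))
```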

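Navigating the tree

A Tag may contain several strings or other Tags; these are the Tag's children. Beautiful Soup provides many attributes for navigating and iterating over them.

Note: string nodes in Beautiful Soup do not support these attributes, because a string cannot have children.

Child nodes

(1) Getting a tag by its name

Tag names can be chained like a path to drill down into the document, for example soup.body.p.a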

>>>soup.head
# <head><title>The Dormouse's story</title></head>

>>>soup.title
# <title>The Dormouse's story</title>

find_all("a") returns all <a> tags, while find("a") returns only the first one.

(2) .contents and .children

.contents returns all of a node's direct children, including NavigableString objects, as a list.

In [1]: soup.head.contents
Out[1]: [<title>The Dormouse's story</title>]

.children also gives a node's direct children, but as an iterator rather than a list, so it is meant to be looped over.

In [1]: tags = soup.head.children
In [2]: tags
Out[2]: <list_iterator at 0x110f76940>
In [3]: for tag in tags:
   ...:     print(tag)
<title>The Dormouse's story</title>
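A short sketch of the two side by side, on made-up markup, showing that string nodes count as children too:

```python
from bs4 import BeautifulSoup

# Illustrative markup: two <li> tags with a bare string between them.
soup = BeautifulSoup("<ul><li>one</li>text<li>two</li></ul>", "html.parser")
ul = soup.ul

child_count = len(ul.contents)                 # .contents is a list
kinds = [type(c).__name__ for c in ul.children]  # .children is an iterator

print(child_count, kinds)
```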

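(3) .descendants

.descendants yields all of a node's descendants (children, grandchildren, and so on). As with .children, the result has to be iterated over or converted to a list.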
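The difference from .children is easiest to see on a nested fragment. A minimal sketch on made-up markup:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<div><p>Hi <b>there</b></p></div>", "html.parser")
div = soup.div

direct = list(div.children)        # only the <p> tag
everything = list(div.descendants)  # <p>, "Hi ", <b>, "there"

# Keep only the tags among the descendants.
tag_names = [d.name for d in everything if isinstance(d, Tag)]

print(len(direct), len(everything), tag_names)
```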

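(4) .string, .strings, .stripped_strings

If a tag has only one child and that child is a NavigableString, .string returns it. The same holds when the only child is another tag that itself has a .string, which is why the example below works on the whole document: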

from bs4 import BeautifulSoup
soup = BeautifulSoup('<html><b class="boldest" name="ven">Extremely bold</b></html>', 'lxml')
print(soup.string)
# Extremely bold

When a tag contains more than one string, .string cannot tell which one is meant and returns None. Use .strings in that case; it returns a generator, so iterate over it or wrap it in list().

In [1]: list(soup.body.strings) 
Out[1]: ["The Dormouse's story",
 '\n',
 'Once upon a time there were three little sisters; and their names were\n', 'Elsie',
 ',\n', 
'Lacie', 
' and\n', 
'Tillie', 
';\nand they lived at the bottom of a well.', 
'\n', 
'...', 
'\n']
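The output above contains many whitespace-only strings. .stripped_strings, listed in the heading but not shown, skips those and strips the rest. A minimal sketch on made-up markup:

```python
from bs4 import BeautifulSoup

# Illustrative markup with newlines between tags and spaces around the names.
markup = "<div>\n<p> Elsie </p>\n<p> Lacie </p>\n</div>"
soup = BeautifulSoup(markup, "html.parser")

single = soup.div.string                   # None: the div has several children
raw = list(soup.div.strings)               # includes the '\n' strings
clean = list(soup.div.stripped_strings)    # whitespace-only strings dropped

print(single, raw, clean)
```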

Parent nodes

(1) .parent

Sometimes we need a node's parent, that is, the node that directly encloses the current node.

In [1]: soup.b.parent
Out[1]: <p class="title"><b>The Dormouse's story</b></p>

(2) .parents

.parents walks upward from the current node and yields every ancestor up to the top of the document.

In [1]: [i.name for i in soup.b.parents]
Out[1]: ['p', 'body', 'html', '[document]']

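Sibling nodes

Siblings are nodes that share the same parent.

(1) next_sibling and previous_sibling

Which sibling is returned depends on the current node's position: next_sibling is the sibling immediately after the current node, and previous_sibling is the one immediately before it. The first of a group of siblings therefore has no previous_sibling and the last has no next_sibling; both return None. Note that the adjacent sibling is often a string (such as whitespace) rather than a tag.

(2) next_siblings and previous_siblings

Correspondingly, next_siblings iterates over all siblings after the current node, and previous_siblings over all siblings before it.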
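A small sketch of the sibling attributes on made-up markup. It also uses find_next_sibling() and find_previous_siblings(), which skip over the string nodes between tags:

```python
from bs4 import BeautifulSoup

markup = ('<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a>'
          ' and <a id="link3">Tillie</a></p>')
soup = BeautifulSoup(markup, "html.parser")

first = soup.a
last = soup.find(id="link3")

before_first = first.previous_sibling   # None: nothing precedes the first <a>
after_first = first.next_sibling        # the string ", ", not the next <a>
next_tag_id = first.find_next_sibling("a")["id"]

# previous siblings are yielded nearest-first
earlier_ids = [a["id"] for a in last.find_previous_siblings("a")]

print(before_first, repr(after_first), next_tag_id, earlier_ids)
```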

(3) next_element(s) and previous_element(s)

　　pass

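Searching the tree

(1) find() and find_all()

find() returns the first match; find_all() returns all matches.

Calling a tag or the soup object directly is shorthand for find_all(): soup("a") is equivalent to soup.find_all("a").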
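The two methods also differ in what they return when nothing matches, which matters when chaining calls. A minimal sketch on made-up markup:

```python
from bs4 import BeautifulSoup

markup = '<p><a href="/one">one</a><a href="/two">two</a></p>'
soup = BeautifulSoup(markup, "html.parser")

first = soup.find("a")               # a single Tag
all_links = soup.find_all("a")       # a list-like ResultSet of Tags
shorthand = soup("a")                # same as soup.find_all("a")

missing_one = soup.find("table")     # None when nothing matches
missing_all = soup.find_all("table")  # empty list when nothing matches

print(first["href"], len(all_links), missing_one, missing_all)
```

Since find() can return None, checking the result before accessing attributes on it avoids an AttributeError.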

Searching by name

>>> soup.find_all('b')
[<b>The Dormouse's story</b>]
>>> soup.find_all(["a", "b"])
[<b>The Dormouse's story</b>, 
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, 
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, 
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Searching by attribute

In [1]: soup.find_all(attrs={'class': 'sister'})
In [2]: soup.find_all(class_='sister')
In [3]: soup.find_all("a", {'class': 'sister'})
Out[1]:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
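The forms above can be compared in a short sketch. The markup, including the data-kind attribute, is made up for illustration. Attributes whose names are not valid Python identifiers (such as data-*) have to go through the attrs dictionary.

```python
from bs4 import BeautifulSoup

markup = ('<div><a class="sister" id="link1" data-kind="girl">Elsie</a>'
          '<a class="brother" id="link2">Tom</a></div>')
soup = BeautifulSoup(markup, "html.parser")

by_class = soup.find_all("a", class_="sister")        # class_ avoids the keyword
by_id = soup.find_all(id="link2")                     # any attribute as a keyword
by_data = soup.find_all(attrs={"data-kind": "girl"})  # names with a hyphen

print(by_class, by_id, by_data)
```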

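Searching by text

A string passed as text must match a whole string exactly. (Beautiful Soup 4.4.0 and later also accept this argument under the name string.)

It can also be combined with a tag name to find the tags whose text matches.

>>> soup.find_all(text="Elsie")
['Elsie']
>>> soup.find_all(text=["Tillie", "Elsie", "Lacie"])
['Elsie', 'Lacie', 'Tillie']

>>> soup.find_all("a", text="Tillie")
[<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]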

Restricting the search to direct children

By default find_all() searches all of a tag's descendants. Setting recursive=False restricts the search to the tag's direct children.

>>> soup.html.find_all("title")
[<title>The Dormouse's story</title>]

>>> soup.html.find_all("title", recursive=False)
[]

Limiting the number of results (limit)

soup.find_all("a", limit=2)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Filtering with regular expressions

Beautiful Soup also works with the re module: pass a pattern compiled with re.compile() to find_all() to search with a regular expression.

In [1]: import re
In [2]: tags = soup.find_all(re.compile("^b"))
In [3]: [i.name for i in tags]
Out[3]: ['body', 'b']

As shown, this finds the two tags whose names start with 'b'.

Regular expressions can filter on attribute values in the same way.

In [1]: soup.find_all(attrs={'class': re.compile("si")})

In [2]: soup.find_all(class_=re.compile("si"))

Out[1]: [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
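The same idea works for any attribute, not just class. A minimal sketch that filters links by their href, using made-up markup in the style of the examples above:

```python
import re

from bs4 import BeautifulSoup

markup = (
    '<p><a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>'
    '<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>'
    '<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a></p>'
)
soup = BeautifulSoup(markup, "html.parser")

# The pattern is applied with search(), so anchor it when position matters.
links = soup.find_all("a", href=re.compile(r"(elsie|lacie)$"))
ids = [a["id"] for a in links]

print(ids)
```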

CSS selectors

Beautiful Soup also supports searching with CSS selectors. Pass a selector string to select() to find tags using CSS selector syntax.

>>> soup.select("title") 
[<title>The Dormouse's story</title>] 
>>> soup.select("p > a") 
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, 
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, 
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
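A few more selector forms, sketched on made-up markup. This assumes a reasonably recent Beautiful Soup, where select_one() is available and selectors are handled by the soupsieve package installed alongside it.

```python
from bs4 import BeautifulSoup

markup = (
    '<p class="story">'
    '<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>'
    '<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>'
    '<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>'
    '</p>'
)
soup = BeautifulSoup(markup, "html.parser")

by_class = soup.select("p.story > a.sister")     # class and child combinator
by_id = soup.select_one("#link2")                # first match only, or None
by_suffix = soup.select('a[href$="tillie"]')     # attribute ends with "tillie"

print(len(by_class), by_id.get_text(), by_suffix[0]["id"])
```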

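Summary

These notes mainly cover searching and navigating the tree. Modifying the tree is not covered, because I have rarely needed it.

Source: https://www.cnblogs.com/pyven/p/9244842.html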
