soup

Web Scraping with BeautifulSoup

对着背影说爱祢 submitted on 2019-12-05 10:41:14
BeautifulSoup search syntax: the `find_all` method returns every node that matches the criteria, while the `find` method returns only the first match. Both take the signature `find_all(name, attrs, string)` — the node's tag name, its attributes, and its text.

### (1) Searching for nodes

```python
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc,
                     'html.parser',          # HTML parser
                     from_encoding='utf-8')  # encoding of the HTML document

# Find all nodes with tag a:
soup.find_all('a')
# Find all a nodes whose href matches /view/123.html:
soup.find_all('a', href='/view/123.html')
soup.find_all('a', href=re.compile(r'/view/123.html'))
# Find all div nodes with class abc and text "python":
soup.find_all('div', class_='abc', string='python')
```

### (2) Accessing node information

```python
# Get the tag name of a found node:
node.name
# Get the href attribute of a found a node:
node['href']
```
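To tie these calls together, here is a minimal runnable sketch; the HTML snippet and the `node` variable are invented purely for illustration:

```python
from bs4 import BeautifulSoup

# A tiny made-up document to search against.
html_doc = '<div class="abc"><a href="/view/123.html">python</a></div>'
soup = BeautifulSoup(html_doc, 'html.parser')

node = soup.find('a', href='/view/123.html')
print(node.name)        # a
print(node['href'])     # /view/123.html
print(node.get_text())  # python
```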

Scraping JoJo's Bizarre Adventure Part 7: Steel Ball Run from 漫画DB

笑着哭i submitted on 2019-12-03 00:06:47
SBR is my favorite part of the whole JoJo series, so today I scraped the manga to my local machine to read at my leisure.

```python
import os
import re
import time

import requests
from bs4 import BeautifulSoup
from requests import RequestException


def get_page(url):
    """Fetch a page and return its HTML text, or None on failure."""
    try:
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 '
                                 '(KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None


def get_pagesNumber(text):
    soup = BeautifulSoup(text, 'lxml')
    pagesNumber = soup.find(name='div', class_="d
```
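The excerpt cuts off inside `get_pagesNumber`. For completeness, a hypothetical sketch (reusing the imports above) of the download step such a scraper typically ends with; the helper name and the 0.5 s delay are my own, not the original post's code:

```python
def save_image(img_url, save_dir, index):
    # Hypothetical helper: download one page image to disk.
    os.makedirs(save_dir, exist_ok=True)
    response = requests.get(img_url)
    if response.status_code == 200:
        path = os.path.join(save_dir, '{}.jpg'.format(index))
        with open(path, 'wb') as f:
            f.write(response.content)  # image data is binary, so write bytes
    time.sleep(0.5)  # throttle requests to be polite to the server
```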

Scraping the Kugou TOP 100 with Python

Anonymous (unverified) submitted on 2019-12-02 22:56:40
```python
import time

import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36'
}


def get_info(url):
    req = requests.get(url=url, headers=headers)
    soup = BeautifulSoup(req.text, 'lxml')
    ranks = soup.select('.pc_temp_num')
    titles = soup.select('.pc_temp_songlist > ul > li > a')
    times = soup.select('.pc_temp_time')
    for rank, title, time in zip(ranks, titles, times):
        data = {
            'rank': rank.get_text().strip(),
            'title': title.get_text().split('-')[1
```
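The excerpt stops mid-dictionary. A plausible completion of the loop body, assuming the "artist - song" title format that the `split('-')` call implies; the `singer` key and the `print` are my guesses, not the original code (the loop variable is also renamed from `time` to `song_time` so it no longer shadows the `time` module):

```python
    for rank, title, song_time in zip(ranks, titles, times):
        data = {
            'rank': rank.get_text().strip(),
            'singer': title.get_text().split('-')[0].strip(),  # before the dash: artist
            'title': title.get_text().split('-')[1].strip(),   # after the dash: song name
            'time': song_time.get_text().strip(),
        }
        print(data)  # one dict per chart entry
```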

Python: Using bs4

Anonymous (unverified) submitted on 2019-12-02 22:11:45
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html>A Html Text</html>", "html.parser")
```

Supported parsers:

| Parser | Usage |
| --- | --- |
| lxml HTML | `BeautifulSoup(html, "lxml")` |
| lxml XML | `BeautifulSoup(html, ["lxml", "xml"])` or `BeautifulSoup(html, "xml")` |
| html5lib | `BeautifulSoup(html, "html5lib")` |

```python
soup.prettify()  # pretty-print the parse tree as a string

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
type(tag)        # <class 'bs4.element.Tag'>

# Name
tag.name         # 'b'

# Attributes -- class is a multi-valued attribute, so it comes back as a list
tag['class']     # ['boldest']
tag.attrs        # {'class': ['boldest']}
type(tag.attrs)  # <class 'dict'>

soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')
print(soup.p['class'])  # ['body', 'strikeout']
```
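Because `class` is multi-valued, it round-trips as a list, and `find_all` matches on any single one of the classes. A minimal sketch (the HTML here is invented):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="body strikeout">hi</p>', 'html.parser')

# Matching any one of the element's classes is enough:
print(soup.find_all('p', class_='strikeout'))  # [<p class="body strikeout">hi</p>]

# Assigning a list writes the attribute back as space-separated classes:
soup.p['class'] = ['body', 'bold']
print(soup.p)  # <p class="body bold">hi</p>
```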

test

蓝咒 submitted on 2019-12-01 08:52:46
```python
# -*- coding: utf-8 -*-
# @Time  : 2019/10/14 20:45
# @Author: 李成广(63)
# @Email : chengguang.li@dili.com
# @File  : Spider.py
# @Brief : Main spider program

import requests
from bs4 import BeautifulSoup

spider_url = 'https://www.doutula.com/photo/list/?page=1'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/66.0.3359.117 Safari/537.36'
}

page1 = requests.get(spider_url, headers=headers)
soup = BeautifulSoup(page1.text, "html.parser")
print(soup)

div = soup.find(name='div', attrs={'class': 'page-content text-center'})
print(div)
div2 = div.find(name=
```
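The excerpt stops inside the second `find`. A hypothetical continuation showing how one might pull image URLs out of that container; the `img` tags and the `data-original` attribute are assumptions about doutula's lazy-loading markup, not verified from the original post:

```python
# Hypothetical continuation: collect image URLs from the container above.
for img in div.find_all(name='img'):
    # Lazy-loaded pages often keep the real URL in data-original;
    # fall back to src when that attribute is absent.
    img_url = img.get('data-original') or img.get('src')
    if img_url:
        print(img_url)
```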

Scraping Adult Web Fiction

依然范特西╮ submitted on 2019-11-27 18:28:54
```python
# coding=utf-8
import requests
from bs4 import BeautifulSoup
import time
from multiprocessing import Pool
import threading
from requests.adapters import HTTPAdapter

# Session that retries failed connections up to 30 times.
rs = requests.Session()
rs.mount('http://', HTTPAdapter(max_retries=30))
rs.mount('https://', HTTPAdapter(max_retries=30))

# monkey.patch_all()

# class MyThread(threading.Thread):
#     """Thread subclass that can return the target's result."""
#     def __init__(self, target=None, args=()):
#         super(MyThread, self).__init__()
#         self.func = target
#         self.args = args
#
#     def run(self):
#         self.result = self.func(*self.args)
#
#     def get_result(self):
#         try:
#             return self.result #
```
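The commented-out `MyThread` class is cut off mid-method. For reference, the usual form of this "thread with a return value" pattern is sketched below; the `AttributeError` branch and the usage lines are my assumptions about how the original continued:

```python
import threading


class MyThread(threading.Thread):
    """Thread subclass that captures and exposes the target's return value."""

    def __init__(self, target=None, args=()):
        super(MyThread, self).__init__()
        self.func = target
        self.args = args

    def run(self):
        # Store the result so the caller can fetch it after join().
        self.result = self.func(*self.args)

    def get_result(self):
        try:
            return self.result
        except AttributeError:
            return None  # run() never executed, so there is no result


t = MyThread(target=pow, args=(2, 10))
t.start()
t.join()
print(t.get_result())  # 1024
```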

[Python web scraping] class and class_

心已入冬 submitted on 2019-11-27 08:58:31
When using the BeautifulSoup `find_all()` method to locate elements, matching on `class` raises a SyntaxError, because `class` cannot be used as a keyword argument name. Use `class_` instead and the error goes away:

```python
soup.find_all('div', class_='iimg-box-meta')
```

The reason: `class` is a reserved keyword in Python, and reserved words cannot be used as variable or parameter names, which is why `class_` exists. Python has 35 reserved keywords in total:

```
False    None     True     and      as
assert   async    await    break    class
continue def      del      elif     else
except   finally  for      from     global
if       import   in       is       lambda
nonlocal not      or       pass     raise
return   try      while    with     yield
```

```python
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537
```
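A minimal sketch contrasting the two working ways to match on class; the HTML is invented, and `attrs={'class': ...}` avoids the keyword clash entirely:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div class="iimg-box-meta">meta</div>', 'html.parser')

# soup.find_all('div', class='iimg-box-meta')  # SyntaxError: class is reserved
print(soup.find_all('div', class_='iimg-box-meta'))            # works
print(soup.find_all('div', attrs={'class': 'iimg-box-meta'}))  # equivalent
```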