soup

Web Scraping with BeautifulSoup

对着背影说爱祢 submitted on 2019-12-05 10:41:14
BeautifulSoup search syntax: the `find_all` method returns every node that matches the criteria, while the `find` method returns only the first match. Both take the signature `find_all(name, attrs, string)` — the node's tag name, its attributes, and its text.

### (1) Searching for nodes

```python
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc,
                     'html.parser',          # HTML parser
                     from_encoding='utf-8')  # encoding of the HTML document

# Find all nodes with tag a:
soup.find_all('a')
# Find all a nodes whose href matches /view/123.html:
soup.find_all('a', href='/view/123.html')
soup.find_all('a', href=re.compile(r'/view/123.html'))
# Find all div nodes with class abc and text "python":
soup.find_all('div', class_='abc', string='python')
```

### (2) Accessing node information

```python
# Get the tag name of a found node:
node.name
# Get the href attribute of a found a node:
node['href']
```
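To tie these calls together, here is a minimal runnable sketch; the HTML snippet and the `node` variable are invented purely for illustration:

```python
from bs4 import BeautifulSoup

# A tiny made-up document to search against.
html_doc = '<div class="abc"><a href="/view/123.html">python</a></div>'
soup = BeautifulSoup(html_doc, 'html.parser')

node = soup.find('a', href='/view/123.html')
print(node.name)        # a
print(node['href'])     # /view/123.html
print(node.get_text())  # python
```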

Scraping JoJo's Bizarre Adventure Part 7: Steel Ball Run from 漫画DB

笑着哭i submitted on 2019-12-03 00:06:47
SBR is my favorite part of the whole JoJo series, so today I scraped the manga to my local machine to read at my leisure.

```python
import os
import re
import time

import requests
from bs4 import BeautifulSoup
from requests import RequestException


def get_page(url):
    """Fetch a page and return its HTML text, or None on failure."""
    try:
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 '
                                 '(KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None


def get_pagesNumber(text):
    soup = BeautifulSoup(text, 'lxml')
    pagesNumber = soup.find(name='div', class_="d
```
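The excerpt cuts off inside `get_pagesNumber`. For completeness, a hypothetical sketch (reusing the imports above) of the download step such a scraper typically ends with; the helper name and the 0.5 s delay are my own, not the original post's code:

```python
def save_image(img_url, save_dir, index):
    # Hypothetical helper: download one page image to disk.
    os.makedirs(save_dir, exist_ok=True)
    response = requests.get(img_url)
    if response.status_code == 200:
        path = os.path.join(save_dir, '{}.jpg'.format(index))
        with open(path, 'wb') as f:
            f.write(response.content)  # image data is binary, so write bytes
    time.sleep(0.5)  # throttle requests to be polite to the server
```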

Scraping the Kugou TOP 100 with Python

Anonymous (unverified) submitted on 2019-12-02 22:56:40
```python
import time

import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36'
}


def get_info(url):
    req = requests.get(url=url, headers=headers)
    soup = BeautifulSoup(req.text, 'lxml')
    ranks = soup.select('.pc_temp_num')
    titles = soup.select('.pc_temp_songlist > ul > li > a')
    times = soup.select('.pc_temp_time')
    for rank, title, time in zip(ranks, titles, times):
        data = {
            'rank': rank.get_text().strip(),
            'title': title.get_text().split('-')[1
```
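The excerpt stops mid-dictionary. A plausible completion of the loop body, assuming the "artist - song" title format that the `split('-')` call implies; the `singer` key and the `print` are my guesses, not the original code (the loop variable is also renamed from `time` to `song_time` so it no longer shadows the `time` module):

```python
    for rank, title, song_time in zip(ranks, titles, times):
        data = {
            'rank': rank.get_text().strip(),
            'singer': title.get_text().split('-')[0].strip(),  # before the dash: artist
            'title': title.get_text().split('-')[1].strip(),   # after the dash: song name
            'time': song_time.get_text().strip(),
        }
        print(data)  # one dict per chart entry
```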

Python: Using bs4

Anonymous (unverified) submitted on 2019-12-02 22:11:45
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html>A Html Text</html>", "html.parser")
```

Supported parsers:

| Parser | Usage |
| --- | --- |
| lxml HTML | `BeautifulSoup(html, "lxml")` |
| lxml XML | `BeautifulSoup(html, ["lxml", "xml"])` or `BeautifulSoup(html, "xml")` |
| html5lib | `BeautifulSoup(html, "html5lib")` |

```python
soup.prettify()  # pretty-print the parse tree as a string

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
type(tag)        # <class 'bs4.element.Tag'>

# Name
tag.name         # 'b'

# Attributes -- class is a multi-valued attribute, so it comes back as a list
tag['class']     # ['boldest']
tag.attrs        # {'class': ['boldest']}
type(tag.attrs)  # <class 'dict'>

soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')
print(soup.p['class'])  # ['body', 'strikeout']
```
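Because `class` is multi-valued, it round-trips as a list, and `find_all` matches on any single one of the classes. A minimal sketch (the HTML here is invented):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="body strikeout">hi</p>', 'html.parser')

# Matching any one of the element's classes is enough:
print(soup.find_all('p', class_='strikeout'))  # [<p class="body strikeout">hi</p>]

# Assigning a list writes the attribute back as space-separated classes:
soup.p['class'] = ['body', 'bold']
print(soup.p)  # <p class="body bold">hi</p>
```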

test

蓝咒 submitted on 2019-12-01 08:52:46
```python
# -*- coding: utf-8 -*-
# @Time  : 2019/10/14 20:45
# @Author: 李成广(63)
# @Email : chengguang.li@dili.com
# @File  : Spider.py
# @Brief : Main spider program

import requests
from bs4 import BeautifulSoup

spider_url = 'https://www.doutula.com/photo/list/?page=1'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/66.0.3359.117 Safari/537.36'
}

page1 = requests.get(spider_url, headers=headers)
soup = BeautifulSoup(page1.text, "html.parser")
print(soup)

div = soup.find(name='div', attrs={'class': 'page-content text-center'})
print(div)
div2 = div.find(name=
```
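The excerpt stops inside the second `find`. A hypothetical continuation showing how one might pull image URLs out of that container; the `img` tags and the `data-original` attribute are assumptions about doutula's lazy-loading markup, not verified from the original post:

```python
# Hypothetical continuation: collect image URLs from the container above.
for img in div.find_all(name='img'):
    # Lazy-loaded pages often keep the real URL in data-original;
    # fall back to src when that attribute is absent.
    img_url = img.get('data-original') or img.get('src')
    if img_url:
        print(img_url)
```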

Scraping Adult Web Fiction

依然范特西╮ submitted on 2019-11-27 18:28:54
```python
# coding=utf-8
import requests
from bs4 import BeautifulSoup
import time
from multiprocessing import Pool
import threading
from requests.adapters import HTTPAdapter

# Session that retries failed connections up to 30 times.
rs = requests.Session()
rs.mount('http://', HTTPAdapter(max_retries=30))
rs.mount('https://', HTTPAdapter(max_retries=30))

# monkey.patch_all()

# class MyThread(threading.Thread):
#     """Thread subclass that can return the target's result."""
#     def __init__(self, target=None, args=()):
#         super(MyThread, self).__init__()
#         self.func = target
#         self.args = args
#
#     def run(self):
#         self.result = self.func(*self.args)
#
#     def get_result(self):
#         try:
#             return self.result #
```
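The commented-out `MyThread` class is cut off mid-method. For reference, the usual form of this "thread with a return value" pattern is sketched below; the `AttributeError` branch and the usage lines are my assumptions about how the original continued:

```python
import threading


class MyThread(threading.Thread):
    """Thread subclass that captures and exposes the target's return value."""

    def __init__(self, target=None, args=()):
        super(MyThread, self).__init__()
        self.func = target
        self.args = args

    def run(self):
        # Store the result so the caller can fetch it after join().
        self.result = self.func(*self.args)

    def get_result(self):
        try:
            return self.result
        except AttributeError:
            return None  # run() never executed, so there is no result


t = MyThread(target=pow, args=(2, 10))
t.start()
t.join()
print(t.get_result())  # 1024
```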

[Python web scraping] class and class_

心已入冬 submitted on 2019-11-27 08:58:31
When using the BeautifulSoup `find_all()` method to locate elements, matching on `class` raises a SyntaxError, because `class` cannot be used as a keyword argument name. Use `class_` instead and the error goes away:

```python
soup.find_all('div', class_='iimg-box-meta')
```

The reason: `class` is a reserved keyword in Python, and reserved words cannot be used as variable or parameter names, which is why `class_` exists. Python has 35 reserved keywords in total:

```
False    None     True     and      as
assert   async    await    break    class
continue def      del      elif     else
except   finally  for      from     global
if       import   in       is       lambda
nonlocal not      or       pass     raise
return   try      while    with     yield
```

```python
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537
```
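A minimal sketch contrasting the two working ways to match on class; the HTML is invented, and `attrs={'class': ...}` avoids the keyword clash entirely:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div class="iimg-box-meta">meta</div>', 'html.parser')

# soup.find_all('div', class='iimg-box-meta')  # SyntaxError: class is reserved
print(soup.find_all('div', class_='iimg-box-meta'))            # works
print(soup.find_all('div', attrs={'class': 'iimg-box-meta'}))  # equivalent
```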