
A small example of scraping data with Beautiful Soup

Submitted by 社会主义新天地 on 2020-04-15 19:24:10
This is a small example from a while back that I'm writing down now. More and more I feel that learning programming without taking notes or writing blog posts is learning wasted, so I'm recording these things here. The site being scraped requires no login; the main goal is to remember how a few BeautifulSoup functions are used. The code (Python 2, with the old BeautifulSoup 3 import):

import urllib2
import re
from BeautifulSoup import BeautifulSoup

url = "http://www.realestate.com.au/neighbourhoods/brendale-4500-qld"
response = urllib2.urlopen(url)  # fetch the page source
data = response.read()
soup = BeautifulSoup(''.join(data))  # parse the page structure with BeautifulSoup
# find and findAll are especially useful: they locate elements by tag name and attributes
a = soup.findAll('div', {'class': 'slide-section median-price-subsections trend'}, text=None)
# the result is a list; get() reads a specific attribute value from an entry
b = a[0].get('data-trend')
print b

Source: oschina. Link: https://my.oschina.net/u/1475074/blog
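For readers on Python 3, a minimal sketch of the same lookup with requests and bs4 (the modern package names); the CSS class and the data-trend attribute are taken from the snippet above, and the page layout may well have changed since:

import requests
from bs4 import BeautifulSoup

url = "http://www.realestate.com.au/neighbourhoods/brendale-4500-qld"
resp = requests.get(url)
soup = BeautifulSoup(resp.text, 'html.parser')
# find_all is the bs4 spelling of BS3's findAll
divs = soup.find_all('div', {'class': 'slide-section median-price-subsections trend'})
if divs:
    print(divs[0].get('data-trend'))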

A lyrics crawler

Submitted by 半城伤御伤魂 on 2020-02-26 09:16:11
Because I'm building a conversational chat system, I need a large corpus, so I decided to try using song lyrics as training data. I wrote a crawler myself and scraped the lyrics of roughly 230,000 songs. The lyrics were then used as question-answer pairs and fed to an LSTM-QA model for answer matching; after several rounds of experiments it reached a decent level and can basically hold a normal chat with you.

import re
import urllib
import urlparse
from BeautifulSoup import BeautifulSoup

url = u'http://www.lrcgc.com/'

def find_singers():
    singers_list = []
    response = urllib.urlopen('http://www.lrcgc.com/artist-00.html')
    data = response.read()
    soup = BeautifulSoup(data)
    # collect every link whose href points at a song-list page
    links = soup.findAll('a', href=re.compile(r'songlist.*.html'))
    for link in links:
        s = link.text
        l = link['href']
        singers_list.append([s, l])
    return singers_list

def find_songs(singer):
    singer_name, urls_0 = singer[0], singer[1]  # each entry is [name, href]; the excerpt is cut off here
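The excerpt ends inside find_songs. Purely to illustrate the pattern the crawler is following, here is a self-contained Python 3 sketch of what such a function might do; the lyric-link regex and page structure are assumptions, not taken from the post:

import re
import requests
from bs4 import BeautifulSoup

def find_songs(songlist_url):
    # fetch one singer's song-list page and collect (title, lyric_url) pairs
    resp = requests.get(songlist_url)
    soup = BeautifulSoup(resp.text, 'html.parser')
    songs = []
    for link in soup.find_all('a', href=re.compile(r'lyric-.*\.html')):  # hypothetical pattern
        songs.append((link.text, link['href']))
    return songs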

A little code test

Submitted by 我的未来我决定 on 2020-02-08 01:43:39
from bs4 import BeautifulSoup
from lxml import html, etree

file = 'hm.html'
htmlfile = open(file, 'r', encoding='utf-8')
htmlhandle = htmlfile.read()
soup = BeautifulSoup(htmlhandle, features='lxml')
# a = soup.text
a = soup.find_all(name='div', attrs={"class": "p"})[0].text
# a = soup.select('')
# print(a)
# The lines above scrape content from a local file.

# Scraping via a page URL:
from bs4 import BeautifulSoup
from lxml import html, etree

file = 'hm.html'
htmlfile = open(file, 'r', encoding='utf-8')
htmlhandle = htmlfile.read()
soup = BeautifulSoup(htmlhandle, features='lxml')
# a = soup.find_all  (the excerpt is cut off here)
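The URL-based half of the snippet is cut off. A minimal sketch of what it presumably builds toward, fetching the page over HTTP instead of from disk (using requests here is an assumption; the original may have used something else):

import requests
from bs4 import BeautifulSoup

url = 'https://example.com/hm.html'  # hypothetical URL standing in for the real page
resp = requests.get(url)
resp.encoding = resp.apparent_encoding  # guard against a mis-detected encoding
soup = BeautifulSoup(resp.text, features='lxml')
a = soup.find_all(name='div', attrs={"class": "p"})[0].text
print(a)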

Python crawler: the novel 庆余年, plus a rough word-cloud analysis

Submitted by 孤人 on 2020-01-31 02:45:42
Getting down to business: first you need the source material. I searched around and found a site called "落霞" (Luoxia). Without a second thought I hit F12, skimmed the page source, and found it was super simple.

from bs4 import BeautifulSoup
from requests import Session
from re import sub, DOTALL

sess = Session()
txt = []
url = 'https://www.luoxia.com/qing/48416.htm'

def find(url):
    res = sess.get(url)
    soup = BeautifulSoup(res.content, 'html.parser')
    title = soup.find('title')
    div = soup.find('div', id='nr1')  # the chapter body lives in div#nr1
    ps = div.find_all('p')
    page = title.text + '\n'
    print(page)
    for p in ps:
        page += p.text + '\n'
    txt.append(page)
    try:
        a = soup.find('a', rel='next')  # link to the next chapter
        href = a['href']
    except:  # the excerpt is cut off here
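The excerpt stops inside the try/except. As a sketch of a plausible continuation that keeps the same sess and txt, walking rel='next' links chapter by chapter until none remains (an inference from the code's shape, not the original post):

def crawl(start_url):
    url = start_url
    while url:
        res = sess.get(url)
        soup = BeautifulSoup(res.content, 'html.parser')
        page = soup.find('title').text + '\n'
        for p in soup.find('div', id='nr1').find_all('p'):
            page += p.text + '\n'
        txt.append(page)
        a = soup.find('a', rel='next')
        url = a['href'] if a else None  # stop on the last chapter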

Python crawler: the novel 庆余年, plus a rough word-cloud analysis

Submitted by ▼魔方 西西 on 2020-01-30 22:02:35
I really don't want to see anyone republish my articles again without permission or attribution, so I'm posting a synced copy on 博客园 myself first. Getting down to business: first you need the source material. I searched around and found a site called "落霞" (Luoxia). Without a second thought I hit F12, skimmed the page source, and found it was super simple. (The code is the same as in the previous entry and is not repeated here.)

Python: scraping web pages (4)

Submitted by 走远了吗. on 2020-01-29 13:45:13
Installing the Beautiful Soup library

Installation: open cmd with administrator privileges and enter

    pip install beautifulsoup4

Then test whether the Beautiful Soup library installed correctly, using the example site https://python123.io/ws/demo.html.

(1) Fetch the page with the requests library:

>>> import requests
>>> r = requests.get("https://python123.io/ws/demo.html")
>>> r.status_code
200
>>> r.encoding = r.apparent_encoding
>>> r.text
'<html><head><title>This is a python demo page</title></head>\r\n<body>\r\n<p class="title"><b>The demo python introduces several python courses.</b></p>\r\n<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to
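The excerpt ends mid-string. The natural next step, parsing the fetched page with BeautifulSoup, would look like this (standard bs4 calls, not copied from the truncated post):

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(r.text, "html.parser")
>>> soup.title
<title>This is a python demo page</title>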

Python web-scraping study (1): using BeautifulSoup

Submitted by 风流意气都作罢 on 2020-01-28 07:36:49
<table id="table" width="100%" border="0" align="center" cellpadding="0" cellspacing="0" class="from_w"> <tbody><tr> <td width="20%" align="right" class="tdlable">1</td> <td width="30%" align="left" class="tdvalue">2</td> <td width="20%" align="right" class="tdlable">1<font color="#FF0000">*</font> </td> <td width="30%" align="left" class="tdvalue">2 </td> </tr> <tr class="evenRow"> <td align="right" class="tdlable">2<font color="#FF0000">*</font> </td> <td class="tdvalue"></td> <td align="right" class="tdlable">1<font color="#FF0000">*</font> </td> <td class="tdvalue">1</td> </tr> <tr> <td

A first try at scraping vehicle listings from 58同城

Submitted by 落爺英雄遲暮 on 2019-12-08 22:02:46
Scraping used-car listings from 58同城. What a beginner learning Python lacks most is a sense of accomplishment, so let's start with a simple crawler... The code is a bit redundant.

#! python3
import requests, time, openpyxl
from bs4 import BeautifulSoup

header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3626.400 QQBrowser/10.4.3211.400'
}

def get_car_links(url):  # collect the detail-page URL of each car
    car_links = []
    res = requests.get(url, headers=header)
    res.raise_for_status()
    soup = BeautifulSoup(res.text, 'html.parser')
    links = soup.select('h5 > a')
    for link in links:
        car_links.append(link.get('href'))
    return car_links

def get_car_info():  # collect each car's details (the excerpt is cut off here)
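The excerpt stops at the signature of get_car_info. Given the openpyxl import at the top, the function presumably writes the scraped details into a spreadsheet; a sketch of that shape follows, in which the listing URL, selectors, and sheet layout are guesses rather than the original code:

def get_car_info():
    wb = openpyxl.Workbook()
    sheet = wb.active
    sheet.append(['title', 'link'])  # header row
    for link in get_car_links('https://bj.58.com/ershouche/'):  # hypothetical listing URL
        res = requests.get(link, headers=header)
        res.raise_for_status()
        soup = BeautifulSoup(res.text, 'html.parser')
        sheet.append([soup.title.string, link])  # page title as a stand-in for the real fields
        time.sleep(1)  # be polite to the server
    wb.save('cars.xlsx')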

Scraping Nanshan District rental listings from 58同城 with BeautifulSoup

Submitted by 折月煮酒 on 2019-12-08 22:02:11
# -*- coding: utf-8 -*-
import requests
import re
from bs4 import BeautifulSoup  # import BeautifulSoup from bs4
import pymysql

db = pymysql.connect(host='localhost', user='root', password='mysql123',
                     db='58tc', charset='utf8mb4',
                     cursorclass=pymysql.cursors.DictCursor)

# Create the MySQL table; commented out because I created it earlier:
# cursor.execute("DROP TABLE IF EXISTS employee")
# sql1 = """CREATE TABLE employee(
#     AREA VARCHAR(20),
#     SPACE VARCHAR(20),
#     TIME VARCHAR(20),
#     PRINCE VARCHAR(20),
#     HREF VARCHAR(300))"""
# cursor.execute(sql1)

# use the cursor() method to obtain a cursor
cursor = db.cursor()
sql = "INSERT INTO EX(`AREA`,`SPACE`,`TIME`,`PRINCE`
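The INSERT statement is cut off mid-string. A plausible completion, matching the five columns of the commented-out CREATE TABLE and the standard pymysql parameterized-execute pattern (the completion itself is a guess, including the row variables):

sql = ("INSERT INTO EX(`AREA`,`SPACE`,`TIME`,`PRINCE`,`HREF`) "
       "VALUES (%s, %s, %s, %s, %s)")
cursor.execute(sql, (area, space, time_, price, href))  # hypothetical variables holding one scraped row
db.commit()  # persist the insert; without commit() the row is lost when the connection closes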

Scraping 58同城 rental listings with Python

Submitted by 不羁的心 on 2019-12-08 21:59:25
The code:

# coding=utf-8
import sys
import csv
import requests
from bs4 import BeautifulSoup

reload(sys)
sys.setdefaultencoding('utf-8')  # Python 2 encoding workaround

# request header setup
def download(url):
    db_data = requests.get(url)
    soup = BeautifulSoup(db_data.text, 'lxml')
    titles = soup.select(
        'body > div.mainbox > div.main > div.content > div.listBox > ul > li > div.des > h2 > a:nth-of-type(1)')
    houses = soup.select(
        'body > div.mainbox > div.main > div.content > div.listBox > ul > li > div.des > p.room')
    oneaddresss = soup.select(
        'body > div.mainbox > div.main > div.content > div.listBox > ul > li > div
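The excerpt is cut off inside download(). Given the csv import at the top, the selected lists are presumably zipped together and written to a file; a sketch of that pattern in Python 3 style (the field names and file name are assumptions):

import csv

def save_rows(titles, houses, addresses, path='58_rent.csv'):
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['title', 'room', 'address'])  # header row
        for title, house, addr in zip(titles, houses, addresses):
            # get_text(strip=True) turns each selected tag into clean text
            writer.writerow([title.get_text(strip=True),
                             house.get_text(strip=True),
                             addr.get_text(strip=True)])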