
A small example of scraping data with Beautiful Soup

Submitted by 社会主义新天地 on 2020-04-15 19:24:10
This is a small example from a while back that I'm writing down now. More and more I feel that learning programming without taking notes or writing blog posts is learning wasted, so I'm recording these things here. The site being scraped requires no login; the main goal is to remember how a few BeautifulSoup functions are used. The code (Python 2, with the old BeautifulSoup 3 import):

import urllib2
import re
from BeautifulSoup import BeautifulSoup

url = "http://www.realestate.com.au/neighbourhoods/brendale-4500-qld"
response = urllib2.urlopen(url)  # fetch the page source
data = response.read()
soup = BeautifulSoup(''.join(data))  # parse the page structure with BeautifulSoup
# find and findAll are especially useful: they locate elements by tag name and attributes
a = soup.findAll('div', {'class': 'slide-section median-price-subsections trend'}, text=None)
# the result is a list; get() reads a specific attribute value from an entry
b = a[0].get('data-trend')
print b

Source: oschina. Link: https://my.oschina.net/u/1475074/blog
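For readers on Python 3, a minimal sketch of the same lookup with requests and bs4 (the modern package names); the CSS class and the data-trend attribute are taken from the snippet above, and the page layout may well have changed since:

import requests
from bs4 import BeautifulSoup

url = "http://www.realestate.com.au/neighbourhoods/brendale-4500-qld"
resp = requests.get(url)
soup = BeautifulSoup(resp.text, 'html.parser')
# find_all is the bs4 spelling of BS3's findAll
divs = soup.find_all('div', {'class': 'slide-section median-price-subsections trend'})
if divs:
    print(divs[0].get('data-trend'))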

A lyrics crawler

Submitted by 半城伤御伤魂 on 2020-02-26 09:16:11
Because I'm building a conversational chat system, I need a large corpus, so I decided to try using song lyrics as training data. I wrote a crawler myself and scraped the lyrics of roughly 230,000 songs. The lyrics were then used as question-answer pairs and fed to an LSTM-QA model for answer matching; after several rounds of experiments it reached a decent level and can basically hold a normal chat with you.

import re
import urllib
import urlparse
from BeautifulSoup import BeautifulSoup

url = u'http://www.lrcgc.com/'

def find_singers():
    singers_list = []
    response = urllib.urlopen('http://www.lrcgc.com/artist-00.html')
    data = response.read()
    soup = BeautifulSoup(data)
    # collect every link whose href points at a song-list page
    links = soup.findAll('a', href=re.compile(r'songlist.*.html'))
    for link in links:
        s = link.text
        l = link['href']
        singers_list.append([s, l])
    return singers_list

def find_songs(singer):
    singer_name, urls_0 = singer[0], singer[1]  # each entry is [name, href]; the excerpt is cut off here
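The excerpt ends inside find_songs. Purely to illustrate the pattern the crawler is following, here is a self-contained Python 3 sketch of what such a function might do; the lyric-link regex and page structure are assumptions, not taken from the post:

import re
import requests
from bs4 import BeautifulSoup

def find_songs(songlist_url):
    # fetch one singer's song-list page and collect (title, lyric_url) pairs
    resp = requests.get(songlist_url)
    soup = BeautifulSoup(resp.text, 'html.parser')
    songs = []
    for link in soup.find_all('a', href=re.compile(r'lyric-.*\.html')):  # hypothetical pattern
        songs.append((link.text, link['href']))
    return songs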

A little code test

Submitted by 我的未来我决定 on 2020-02-08 01:43:39
from bs4 import BeautifulSoup
from lxml import html, etree

file = 'hm.html'
htmlfile = open(file, 'r', encoding='utf-8')
htmlhandle = htmlfile.read()
soup = BeautifulSoup(htmlhandle, features='lxml')
# a = soup.text
a = soup.find_all(name='div', attrs={"class": "p"})[0].text
# a = soup.select('')
# print(a)
# The lines above scrape content from a local file.

# Scraping via a page URL:
from bs4 import BeautifulSoup
from lxml import html, etree

file = 'hm.html'
htmlfile = open(file, 'r', encoding='utf-8')
htmlhandle = htmlfile.read()
soup = BeautifulSoup(htmlhandle, features='lxml')
# a = soup.find_all  (the excerpt is cut off here)
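The URL-based half of the snippet is cut off. A minimal sketch of what it presumably builds toward, fetching the page over HTTP instead of from disk (using requests here is an assumption; the original may have used something else):

import requests
from bs4 import BeautifulSoup

url = 'https://example.com/hm.html'  # hypothetical URL standing in for the real page
resp = requests.get(url)
resp.encoding = resp.apparent_encoding  # guard against a mis-detected encoding
soup = BeautifulSoup(resp.text, features='lxml')
a = soup.find_all(name='div', attrs={"class": "p"})[0].text
print(a)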

Python crawler: the novel 庆余年, plus a rough word-cloud analysis

Submitted by 孤人 on 2020-01-31 02:45:42
Getting down to business: first you need the source material. I searched around and found a site called "落霞" (Luoxia). Without a second thought I hit F12, skimmed the page source, and found it was super simple.

from bs4 import BeautifulSoup
from requests import Session
from re import sub, DOTALL

sess = Session()
txt = []
url = 'https://www.luoxia.com/qing/48416.htm'

def find(url):
    res = sess.get(url)
    soup = BeautifulSoup(res.content, 'html.parser')
    title = soup.find('title')
    div = soup.find('div', id='nr1')  # the chapter body lives in div#nr1
    ps = div.find_all('p')
    page = title.text + '\n'
    print(page)
    for p in ps:
        page += p.text + '\n'
    txt.append(page)
    try:
        a = soup.find('a', rel='next')  # link to the next chapter
        href = a['href']
    except:  # the excerpt is cut off here
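The excerpt stops inside the try/except. As a sketch of a plausible continuation that keeps the same sess and txt, walking rel='next' links chapter by chapter until none remains (an inference from the code's shape, not the original post):

def crawl(start_url):
    url = start_url
    while url:
        res = sess.get(url)
        soup = BeautifulSoup(res.content, 'html.parser')
        page = soup.find('title').text + '\n'
        for p in soup.find('div', id='nr1').find_all('p'):
            page += p.text + '\n'
        txt.append(page)
        a = soup.find('a', rel='next')
        url = a['href'] if a else None  # stop on the last chapter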

Python crawler: the novel 庆余年, plus a rough word-cloud analysis

Submitted by ▼魔方 西西 on 2020-01-30 22:02:35
I really don't want to see anyone republish my articles again without permission or attribution, so I'm posting a synced copy on 博客园 myself first. Getting down to business: first you need the source material. I searched around and found a site called "落霞" (Luoxia). Without a second thought I hit F12, skimmed the page source, and found it was super simple. (The code is the same as in the previous entry and is not repeated here.)

Python: scraping web pages (4)

Submitted by 走远了吗. on 2020-01-29 13:45:13
Installing the Beautiful Soup library

Installation: open cmd with administrator privileges and enter

    pip install beautifulsoup4

Then test whether the Beautiful Soup library installed correctly, using the example site https://python123.io/ws/demo.html.

(1) Fetch the page with the requests library:

>>> import requests
>>> r = requests.get("https://python123.io/ws/demo.html")
>>> r.status_code
200
>>> r.encoding = r.apparent_encoding
>>> r.text
'<html><head><title>This is a python demo page</title></head>\r\n<body>\r\n<p class="title"><b>The demo python introduces several python courses.</b></p>\r\n<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to
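The excerpt ends mid-string. The natural next step, parsing the fetched page with BeautifulSoup, would look like this (standard bs4 calls, not copied from the truncated post):

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(r.text, "html.parser")
>>> soup.title
<title>This is a python demo page</title>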

Python web-scraping study (1): using BeautifulSoup

Submitted by 风流意气都作罢 on 2020-01-28 07:36:49
<table id="table" width="100%" border="0" align="center" cellpadding="0" cellspacing="0" class="from_w"> <tbody><tr> <td width="20%" align="right" class="tdlable">1</td> <td width="30%" align="left" class="tdvalue">2</td> <td width="20%" align="right" class="tdlable">1<font color="#FF0000">*</font> </td> <td width="30%" align="left" class="tdvalue">2 </td> </tr> <tr class="evenRow"> <td align="right" class="tdlable">2<font color="#FF0000">*</font> </td> <td class="tdvalue"></td> <td align="right" class="tdlable">1<font color="#FF0000">*</font> </td> <td class="tdvalue">1</td> </tr> <tr> <td

A first try at scraping vehicle listings from 58同城

Submitted by 落爺英雄遲暮 on 2019-12-08 22:02:46
Scraping used-car listings from 58同城. What a beginner learning Python lacks most is a sense of accomplishment, so let's start with a simple crawler... The code is a bit redundant.

#! python3
import requests, time, openpyxl
from bs4 import BeautifulSoup

header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3626.400 QQBrowser/10.4.3211.400'
}

def get_car_links(url):  # collect the detail-page URL of each car
    car_links = []
    res = requests.get(url, headers=header)
    res.raise_for_status()
    soup = BeautifulSoup(res.text, 'html.parser')
    links = soup.select('h5 > a')
    for link in links:
        car_links.append(link.get('href'))
    return car_links

def get_car_info():  # collect each car's details (the excerpt is cut off here)
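The excerpt stops at the signature of get_car_info. Given the openpyxl import at the top, the function presumably writes the scraped details into a spreadsheet; a sketch of that shape follows, in which the listing URL, selectors, and sheet layout are guesses rather than the original code:

def get_car_info():
    wb = openpyxl.Workbook()
    sheet = wb.active
    sheet.append(['title', 'link'])  # header row
    for link in get_car_links('https://bj.58.com/ershouche/'):  # hypothetical listing URL
        res = requests.get(link, headers=header)
        res.raise_for_status()
        soup = BeautifulSoup(res.text, 'html.parser')
        sheet.append([soup.title.string, link])  # page title as a stand-in for the real fields
        time.sleep(1)  # be polite to the server
    wb.save('cars.xlsx')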

Scraping Nanshan District rental listings from 58同城 with BeautifulSoup

Submitted by 折月煮酒 on 2019-12-08 22:02:11
# -*- coding: utf-8 -*-
import requests
import re
from bs4 import BeautifulSoup  # import BeautifulSoup from bs4
import pymysql

db = pymysql.connect(host='localhost', user='root', password='mysql123',
                     db='58tc', charset='utf8mb4',
                     cursorclass=pymysql.cursors.DictCursor)

# Create the MySQL table; commented out because I created it earlier:
# cursor.execute("DROP TABLE IF EXISTS employee")
# sql1 = """CREATE TABLE employee(
#     AREA VARCHAR(20),
#     SPACE VARCHAR(20),
#     TIME VARCHAR(20),
#     PRINCE VARCHAR(20),
#     HREF VARCHAR(300))"""
# cursor.execute(sql1)

# use the cursor() method to obtain a cursor
cursor = db.cursor()
sql = "INSERT INTO EX(`AREA`,`SPACE`,`TIME`,`PRINCE`
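The INSERT statement is cut off mid-string. A plausible completion, matching the five columns of the commented-out CREATE TABLE and the standard pymysql parameterized-execute pattern (the completion itself is a guess, including the row variables):

sql = ("INSERT INTO EX(`AREA`,`SPACE`,`TIME`,`PRINCE`,`HREF`) "
       "VALUES (%s, %s, %s, %s, %s)")
cursor.execute(sql, (area, space, time_, price, href))  # hypothetical variables holding one scraped row
db.commit()  # persist the insert; without commit() the row is lost when the connection closes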

Scraping 58同城 rental listings with Python

Submitted by 不羁的心 on 2019-12-08 21:59:25
The code:

# coding=utf-8
import sys
import csv
import requests
from bs4 import BeautifulSoup

reload(sys)
sys.setdefaultencoding('utf-8')  # Python 2 encoding workaround

# request header setup
def download(url):
    db_data = requests.get(url)
    soup = BeautifulSoup(db_data.text, 'lxml')
    titles = soup.select(
        'body > div.mainbox > div.main > div.content > div.listBox > ul > li > div.des > h2 > a:nth-of-type(1)')
    houses = soup.select(
        'body > div.mainbox > div.main > div.content > div.listBox > ul > li > div.des > p.room')
    oneaddresss = soup.select(
        'body > div.mainbox > div.main > div.content > div.listBox > ul > li > div
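The excerpt is cut off inside download(). Given the csv import at the top, the selected lists are presumably zipped together and written to a file; a sketch of that pattern in Python 3 style (the field names and file name are assumptions):

import csv

def save_rows(titles, houses, addresses, path='58_rent.csv'):
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['title', 'room', 'address'])  # header row
        for title, house, addr in zip(titles, houses, addresses):
            # get_text(strip=True) turns each selected tag into clean text
            writer.writerow([title.get_text(strip=True),
                             house.get_text(strip=True),
                             addr.get_text(strip=True)])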