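Taobao Price-Comparison Focused Crawler
Goal: fetch Taobao search result pages and extract the name and price of each product.
Analyzing the Taobao search URLs (using a search for "书包", i.e. "schoolbag", as the example):
First page: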
https://s.taobao.com/search?q=书包&imgfile=&js=1&stats_click=search_radio_all%3A1&initiative_id=staobaoz_20200208&ie=utf8
Page 2:
https://s.taobao.com/search?q=书包&imgfile=&js=1&stats_click=search_radio_all%3A1&initiative_id=staobaoz_20200208&ie=utf8&bcoffset=3&ntoffset=3&p4ppushleft=1%2C48&s=44
The URL now ends with s=44
Page 3:
https://s.taobao.com/search?q=书包&imgfile=&js=1&stats_click=search_radio_all%3A1&initiative_id=staobaoz_20200208&ie=utf8&bcoffset=0&ntoffset=6&p4ppushleft=1%2C48&s=88
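The URL now ends with s=88
The value of s grows by 44 per page, so s is presumably the index (counting from 0) of the first product on the page, with 44 products per page.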
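The pagination guess above can be sketched as a small helper. This is only an illustration under that guess: the names `build_search_url` and `ITEMS_PER_PAGE` are hypothetical, and the extra query parameters seen in the real URLs (`imgfile`, `stats_click`, and so on) are left out on the assumption that `q` and `s` are the ones that matter.

```python
from urllib.parse import urlencode

# Inferred from the observed URLs: s=44 on page 2, s=88 on page 3.
ITEMS_PER_PAGE = 44


def build_search_url(keyword, page):
    """Return the search URL for a 1-based page number (hypothetical helper)."""
    params = {"q": keyword, "s": ITEMS_PER_PAGE * (page - 1)}
    # urlencode percent-encodes the non-ASCII keyword as UTF-8
    return "https://s.taobao.com/search?" + urlencode(params)


for page in (1, 2, 3):
    print(build_search_url("书包", page))
```

Encoding the keyword explicitly keeps the URL valid even when it is pasted into tools that do not percent-encode non-ASCII characters on their own.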
# CrowTaobaoPrice.py
import requests
import re


def getHTMLText(url):
    """Fetch a page and return its text, or an empty string on failure."""
    # The cookie appears to come from a logged-in browser session; substitute your own.
    kv = {'cookie':'cna=gqooFT4fl1oCAbf3sQXLxrR0; thw=cn; hng=CN%7Czh-CN%7CCNY%7C156; tracknick=tb046952766; tg=0; x=e%3D1%26p%3D*%26s%3D0%26c%3D0%26f%3D0%26g%3D0%26t%3D0; miid=1315433990874482444; enc=asdA190i8zhK6XL9DxlQLn8UMgMGRtJ22re5ME9FCCQmIcH5WceP4yfjI4jPhjWkw31POluByeYviC0781gSxw%3D%3D; t=e8d646ac008863859cb76b3c6f31eed0; uc3=vt3=F8dBxdrF9id9QRrqNW4%3D&nk2=F5RFhBjpmfo539A%3D&lg2=UIHiLt3xD8xYTw%3D%3D&id2=VyyX7m4K70GdMw%3D%3D; lgc=tb046952766; uc4=id4=0%40VXtYgtaXdpABZNMxYAhkynmIHL4I&nk4=0%40FY4O7bfdQWK9lNbvTAUNYYR8JkRjjQ%3D%3D; _cc_=V32FPkk%2Fhw%3D%3D; mt=ci=-1_0; v=0; cookie2=1a87d1e434536a72345a7f18ea31a876; _tb_token_=e1761d856b3e3; alitrackid=www.taobao.com; lastalitrackid=www.taobao.com; JSESSIONID=FB23427EF5495F6FA58AF2DCE70EA371; l=cBjqLxAPv-vf-QB9BOfZZuI8Lz7TmQAfGsPzw4OMNICP_u5P8wGFWZ0FIDY2CnGVL6JDR3rCLlQpB28EuyUCdbWSHQLCgsDd.; isg=BPf3kWdeZfAwYuPhW9PZY1TrhutBvMsetWMICUmmGEaK-B86UY3Tbyoa3limEKOW; uc1=cookie14=UoTUO8HHNFm9Gw%3D%3D','user-agent':'Mozilla/5.0'}
    try:
        r = requests.get(url, headers=kv, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ""


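def parsePage(ilt, html):
    """Append a [price, title] pair to ilt for every product found in html."""
    # Capture groups return the bare values, so the scraped text is never
    # passed to eval() and titles that contain ':' are not truncated.
    plt = re.findall(r'"view_price":"([\d.]*)"', html)
    tlt = re.findall(r'"raw_title":"(.*?)"', html)
    # zip() stops at the shorter list if the two counts ever differ
    for price, title in zip(plt, tlt):
        ilt.append([price, title])

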
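def printGoodsList(ilt):
    """Print the collected items as a numbered, tab-separated table."""
    tplt = "{:4}\t{:8}\t{:16}"
    # Header columns: index, price, product name
    print(tplt.format("序号", "价格", "商品名称"))
    for count, g in enumerate(ilt, start=1):
        print(tplt.format(count, g[0], g[1]))

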
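def main():
    goods = '书包'  # search keyword ("schoolbag")
    depth = 3       # number of result pages to crawl
    start_url = 'https://s.taobao.com/search?q=' + goods
    infoList = []
    for i in range(depth):
        try:
            # Page i (counting from 0) starts at product offset 44*i
            url = start_url + '&s=' + str(44 * i)
            html = getHTMLText(url)
            parsePage(infoList, html)
        except Exception:
            # Skip a page that fails and carry on with the next one
            continue
    printGoodsList(infoList)


if __name__ == '__main__':
    main()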
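To make the extraction step more concrete, here is a minimal sketch that runs the same two regular expressions on a made-up fragment. The fragment in `sample_html` is hypothetical and only imitates the `"raw_title"` and `"view_price"` fields the crawler looks for; the real page source contains many more fields and may order them differently. The helper name `extract_items` is also an assumption. The quoted values are decoded with `json.loads`, so no scraped text is ever executed as code.

```python
import json
import re

# Hypothetical fragment imitating the data embedded in a search result page.
sample_html = (
    '{"raw_title":"小学生书包 轻便双肩包","view_price":"49.00"},'
    '{"raw_title":"Laptop bag: 15 inch","view_price":"189.00"}'
)


def extract_items(html):
    """Return [price, title] pairs found in the page source."""
    # Each group captures the value together with its surrounding quotes
    prices = re.findall(r'"view_price":("[\d.]*")', html)
    titles = re.findall(r'"raw_title":(".*?")', html)
    # json.loads turns a quoted JSON string into a plain Python str
    return [[json.loads(p), json.loads(t)] for p, t in zip(prices, titles)]


print(extract_items(sample_html))
```

The second sample title contains a colon on purpose: matching the whole quoted value keeps such titles intact, whereas splitting the match on `:` would cut them short.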
Output:
序号 价格 商品名称
1 49.00 镇店之宝 买就送笔袋文具12件套 还等什么亲
2 88.00 迪士尼书包小学生女童1-3-6一三年级公主女孩轻便儿童减负双肩包5
3 98.00 迪士尼书包小学生女童1-3-6年级冰雪公主女孩轻便儿童休闲双肩包8
4 29.00 小米双肩包米家小背包男女通用运动包日常休闲双肩包学生书包
5 189.00 小米双肩包商务旅行背包大容量书包男士时尚多功能笔记本电脑包
...
136 699.00 鳄鱼男士双肩包真皮大容量商务休闲旅行电脑背包时尚潮流牛皮书包
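Since the aim is price comparison, the collected list can also be ordered by price. The sketch below is one possible way to do that, not part of the original script: `sort_by_price` is a hypothetical helper, the three rows are copied from the output above, and every price is assumed to be a non-empty decimal string.

```python
# Three rows from the output above, in the [price, title] shape the crawler builds.
info_list = [
    ["49.00", "镇店之宝 买就送笔袋文具12件套 还等什么亲"],
    ["29.00", "小米双肩包米家小背包男女通用运动包日常休闲双肩包学生书包"],
    ["189.00", "小米双肩包商务旅行背包大容量书包男士时尚多功能笔记本电脑包"],
]


def sort_by_price(items):
    """Return the items ordered from cheapest to most expensive."""
    # Prices are strings; convert to float so that "189.00" sorts after "49.00"
    return sorted(items, key=lambda item: float(item[0]))


for rank, (price, title) in enumerate(sort_by_price(info_list), start=1):
    print(f"{rank:<4}\t{price:<8}\t{title}")
```

Sorting the price strings directly would compare them character by character and place "189.00" before "29.00", which is why the key converts them to numbers first.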
Reference article:
https://blog.csdn.net/weixin_43173093/article/details/87716555
Source: CSDN
Author: newnewrookie
Link: https://blog.csdn.net/qq_26059615/article/details/104222946