https://blog.csdn.net/xiaoduan_/article/details/80835231
While browsing Lagou, I noticed that its job-listing page is backed by a JSON API. With such a convenient data interface, how could I not scrape it? Since I'm partial to Spark, let's scrape the Spark job postings and store them in MongoDB.
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# @Author  : Anthony_Duan
# @Time    : 25/06/2018 15:53
# @File    : lagou.py
# @Software: PyCharm
import time

import requests
from fake_useragent import UserAgent
from pymongo import MongoClient

client = MongoClient()
db = client.lagou      # connect to the "lagou" database; created automatically if absent
my_set = db.spark_job  # the spark_job collection under the lagou database; also created automatically

headers = {
    "Cookie": "JSESSIONID=ABAAABAAAIAACBICB3D046BA1BEA314A00EA18BD6391426; SEARCH_ID=f8e30fdbd29e42f5bd02662ab2cef21f; user_trace_token=20180625154048-198ac502-c5d4-4114-907d-7dcca0c7dd47; _ga=GA1.2.810701623.1529912450; _gat=1; LGSID=20180625154049-149d80f2-784b-11e8-b069-525400f775ce; PRE_UTM=; PRE_HOST=static.dcxueyuan.com; PRE_SITE=https%3A%2F%2Fstatic.dcxueyuan.com%2Fcontent%2Fdisk%2Ftrain%2Fother%2F70b2c405-138b-4862-ad49-138656aef0d6.html; PRE_LAND=https%3A%2F%2Fwww.lagou.com%2Fjobs%2Flist_%25E7%2588%25AC%25E8%2599%25AB%3FlabelWords%3D%26fromSearch%3Dtrue%26suginput%3D; LGUID=20180625154049-149d839e-784b-11e8-b069-525400f775ce; X_HTTP_TOKEN=188845654580e592f42f58c18962c06c; LGRID=20180625154312-699841c7-784b-11e8-b06b-525400f775ce",
    "Referer": "https://www.lagou.com/jobs/list_%E7%88%AC%E8%99%AB?labelWords=&fromSearch=true&suginput="
}


def get_job_info(page, kd):
    for i in range(1, page):  # scrapes pages 1 .. page-1
        url = "https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false&isSchoolJob=0"
        payload = {
            "first": "true",
            "pn": i,
            "kd": kd
        }
        ua = UserAgent()
        headers['User-Agent'] = ua.random
        # To send form-encoded data (just like an HTML form submission), simply
        # pass a dict to the `data` argument of a POST request. If you instead
        # want key/value pairs placed in the URL after a question mark, pass
        # the dict via the `params` argument, e.g. http://bin.org/get?key=val.
        # This endpoint expects a form-encoded POST, so we use `data` here.
        response = requests.post(url, data=payload, headers=headers, timeout=20)
        if response.status_code == 200:
            job_json = response.json()['content']['positionResult']['result']
            my_set.insert_many(job_json)  # insert the list of job dicts into the collection
        else:
            print("something wrong\t")
            print(time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()) + "\n")
        print('Scraping page ' + str(i) + '\t')
        print(time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()))
        time.sleep(3)  # pause between requests to avoid hammering the site


if __name__ == '__main__':
    get_job_info(10, "spark")
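The `params`-versus-`data` distinction described in the comments can be checked offline with a prepared request. A minimal sketch; the httpbin.org URLs are placeholders only, and `.prepare()` performs no network I/O:

```python
import requests

payload = {"pn": 1, "kd": "spark"}

# With `params`, the dict is encoded into the URL's query string.
get_req = requests.Request("GET", "http://httpbin.org/get", params=payload).prepare()
print(get_req.url)    # http://httpbin.org/get?pn=1&kd=spark

# With `data`, the dict is form-encoded into the request body.
post_req = requests.Request("POST", "http://httpbin.org/post", data=payload).prepare()
print(post_req.body)  # pn=1&kd=spark
```

This is why passing `data=` to a GET request is usually a mistake: the payload goes into the body, where most servers ignore it for GETs.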
The scraped data looks roughly like this:
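Each element of the `result` list is one job posting as a dict. A minimal sketch of pulling out a few fields; the record below is entirely hypothetical, and field names such as `positionName` and `companyFullName` are assumptions about Lagou's response rather than a documented schema:

```python
# Hypothetical sample record, for illustration only.
job = {
    "positionName": "Spark开发工程师",
    "companyFullName": "某某科技有限公司",
    "city": "北京",
    "salary": "20k-40k",
}

# Format the fields we care about into a one-line summary.
summary = "{positionName} @ {companyFullName} ({city}, {salary})".format(**job)
print(summary)  # Spark开发工程师 @ 某某科技有限公司 (北京, 20k-40k)
```

The same dict-style access works on documents read back from MongoDB, since PyMongo returns each document as a plain dict.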
Source: Scraping Lagou job listings