Scraping Job Postings from Lagou

Anonymous (unverified), submitted 2019-12-03 00:37:01


https://blog.csdn.net/xiaoduan_/article/details/80835231

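Lagou serves its job search results through an endpoint that returns JSON. With a data interface that convenient, how could I not scrape it?
I like Spark, so let's scrape the Spark job postings and store them in MongoDB.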

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# @Author  : Anthony_Duan
# @Time    : 25/06/2018 15:53
# @File    : lagou.py
# @Software: PyCharm

import time

import requests
from fake_useragent import UserAgent
from pymongo import MongoClient

client = MongoClient()
db = client.lagou  # Use the "lagou" database; MongoDB creates it on first write
my_set = db.spark_job  # The spark_job collection in the lagou database; also created on first write

headers = {
    "Cookie": "JSESSIONID=ABAAABAAAIAACBICB3D046BA1BEA314A00EA18BD6391426; SEARCH_ID=f8e30fdbd29e42f5bd02662ab2cef21f; user_trace_token=20180625154048-198ac502-c5d4-4114-907d-7dcca0c7dd47; _ga=GA1.2.810701623.1529912450; _gat=1; LGSID=20180625154049-149d80f2-784b-11e8-b069-525400f775ce; PRE_UTM=; PRE_HOST=static.dcxueyuan.com; PRE_SITE=https%3A%2F%2Fstatic.dcxueyuan.com%2Fcontent%2Fdisk%2Ftrain%2Fother%2F70b2c405-138b-4862-ad49-138656aef0d6.html; PRE_LAND=https%3A%2F%2Fwww.lagou.com%2Fjobs%2Flist_%25E7%2588%25AC%25E8%2599%25AB%3FlabelWords%3D%26fromSearch%3Dtrue%26suginput%3D; LGUID=20180625154049-149d839e-784b-11e8-b069-525400f775ce; X_HTTP_TOKEN=188845654580e592f42f58c18962c06c; LGRID=20180625154312-699841c7-784b-11e8-b06b-525400f775ce",
    "Referer": "https://www.lagou.com/jobs/list_%E7%88%AC%E8%99%AB?labelWords=&fromSearch=true&suginput="
}


def get_job_info(page, kd):
    """Fetch result pages 1 through `page` for keyword `kd` and store the postings in MongoDB."""
    url = "https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false&isSchoolJob=0"
    ua = UserAgent()

    for i in range(1, page + 1):
        payload = {
            "first": "true",
            "pn": i,
            "kd": kd
        }
        headers['User-Agent'] = ua.random  # Use a random User-Agent for each request
        # To send form-encoded data, much like an HTML form, pass a dict to the data argument.
        # It travels in the request body, so the request is sent as a POST.
        # If you build the URL by hand instead, the data goes into the URL as key/value pairs after a question mark.
        # Requests lets you supply those pairs as a dict of strings through the params keyword argument,
        # for example http://httpbin.org/get?key=val.
        response = requests.post(url, data=payload, headers=headers, timeout=20)

        if response.status_code == 200:
            job_json = response.json()['content']['positionResult']['result']
            if job_json:  # insert_many rejects an empty list
                my_set.insert_many(job_json)  # Insert the job records into the collection
        else:
            print("something wrong\t")
            print(time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()) + "\n")

        print('Crawling page ' + str(i) + '\t')
        print(time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()))
        time.sleep(3)


if __name__ == '__main__':
    get_job_info(10, "spark")
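
The comments in the script contrast form-encoded data with URL query parameters. The sketch below shows the two shapes for the same payload, using only the standard library so that it runs without a network connection. The helper names `form_body` and `url_with_query` are made up for illustration, and the output is only roughly what Requests produces for a simple dict passed as `data` or `params`.

```python
from urllib.parse import urlencode


def form_body(payload):
    """Encode a dict the way an HTML form body is encoded (key=value pairs joined by &)."""
    return urlencode(payload)


def url_with_query(base, payload):
    """Append the same key/value pairs to a URL after a question mark."""
    return base + "?" + urlencode(payload)


if __name__ == "__main__":
    # Same payload shape as in the script: first, pn (page number), kd (keyword).
    payload = {"first": "true", "pn": 1, "kd": "spark"}
    print(form_body(payload))
    print(url_with_query("http://httpbin.org/get", payload))
```

The pairs are identical in both cases; what differs is where they travel, in the request body or in the URL.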

The scraped data looks roughly like this: [screenshot]
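
Each stored record is one element of the list found at `content` → `positionResult` → `result` in the response. The script indexes that path directly, which raises `KeyError` if the body has a different shape. Below is a minimal sketch of a more defensive lookup. The helper name `extract_positions` and both sample dicts are made up for illustration and do not show real Lagou data.

```python
def extract_positions(body):
    """Return the list at content -> positionResult -> result, or [] if the path is missing."""
    node = body
    for key in ("content", "positionResult", "result"):
        if not isinstance(node, dict) or key not in node:
            return []
        node = node[key]
    return node if isinstance(node, list) else []


if __name__ == "__main__":
    # Made-up bodies: one with the expected nesting, one without it.
    expected_shape = {"content": {"positionResult": {"result": [{"id": 1}, {"id": 2}]}}}
    other_shape = {"success": False, "msg": "placeholder"}

    print(len(extract_positions(expected_shape)))  # 2
    print(extract_positions(other_shape))          # []
```

Returning an empty list lets the caller skip the insert for that page instead of crashing partway through a run.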
