Hands-on 1: Building a Proxy IP Pool

Submitted by 。_饼干妹妹 on 2019-12-03 06:16:26

1. Scraping proxy IPs:

#!/usr/bin/env python
# -*- coding:utf-8 -*-
# Author:Meng Zhaoce
import requests
from bs4 import BeautifulSoup
from multiprocessing.dummy import Pool as ThreadPool  # thread-based Pool (multithreading module)
from pymongo import MongoClient
data = []

def getIp(page):
    url = 'https://www.xicidaili.com/nt/%d'%(page)
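    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36'
    }  # spoof a browser request header
    try:
        # send the request; the timeout keeps a stalled page from blocking a worker thread
        res = requests.get(url, headers=headers, timeout=10).text
    except requests.RequestException:
        return  # skip pages that fail to load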
    soup = BeautifulSoup(res,'lxml')
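    for i in soup.find_all('tr'):
        tds = i.find_all('td')
        if len(tds) < 3:
            continue  # rows without enough <td> cells (e.g. the header row) carry no proxy
        # column 1 holds the IP address, column 2 the port
        data.append({'ip': '%s:%s' % (tds[1].get_text(), tds[2].get_text()), 'verify': False})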

pool = ThreadPool(10)
pool.map(getIp, range(100))  # fetch pages 0 through 99 across the 10 worker threads
pool.close()
pool.join()
print(data)
print(len(data))

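db = MongoClient('127.0.0.1', 27017).test  # the "test" database on a local MongoDB
if data:  # insert_many raises an error when given an empty list
    db.ippool.insert_many(data)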

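Concepts covered here: an HTTP request library, a parsing library, a multithreading module, and a non-relational (NoSQL) database.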
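To make the multithreading part easier to see on its own, here is a minimal sketch of the same thread-pool pattern with the network call taken out. The names fake_get_ip and collect are hypothetical and not part of the original script, and the addresses are made-up placeholders from a documentation-only range. The sketch has each worker return its records instead of appending to a shared list, which is one possible variation rather than the author's approach.

```python
from multiprocessing.dummy import Pool as ThreadPool  # thread-based Pool


def fake_get_ip(page):
    """Hypothetical stand-in for getIp: returns records instead of mutating a global."""
    # No network access here; the address is a placeholder, not a real proxy.
    return [{'ip': '192.0.2.%d:8080' % page, 'verify': False}]


def collect(pages, workers=4):
    """Run fake_get_ip over the pages in a thread pool and flatten the results."""
    with ThreadPool(workers) as pool:
        # map blocks until every page is processed and keeps the input order
        per_page = pool.map(fake_get_ip, pages)
    return [record for records in per_page for record in records]


if __name__ == '__main__':
    print(collect(range(1, 4)))
```

Because map returns results in input order, the combined list is deterministic even though the pages are processed concurrently.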

 
