How to add proxies to BeautifulSoup crawler

Submitted by 痴心易碎 on 2020-03-17 12:07:52

Question


These are the definitions in the python crawler:

from __future__ import with_statement

from eventlet.green import urllib2
import eventlet
import re
import urlparse
from bs4 import BeautifulSoup, SoupStrainer
import sqlite3
import datetime

How do I add a rotating proxy (one proxy per open thread) to a recursive crawler built on BeautifulSoup?

I know how to add proxies if I were using Mechanize's browser:

br = Browser()
br.set_proxies({'http': 'http://username:password@proxy:port',
                'https': 'https://username:password@proxy:port'})

but I would like to know specifically what kind of solution BeautifulSoup would require.

Thank you very much for your help!


Answer 1


Have a look at this example of using BeautifulSoup with an HTTP proxy:

http://monzool.net/blog/2007/10/15/html-parsing-with-beautiful-soup/
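The linked post uses a urllib2 proxy handler. A minimal sketch of that approach, updated for Python 3's `urllib.request` (the proxy address and credentials below are placeholders, not real values):

```python
import urllib.request  # this module was `urllib2` in the Python 2 code above

# Build an opener that routes HTTP traffic through the given proxy.
proxy_handler = urllib.request.ProxyHandler({"http": "http://10.10.1.10:3128"})
opener = urllib.request.build_opener(proxy_handler)

# Fetch through the proxy, then parse the HTML as usual:
# html = opener.open("http://example.org").read()
# soup = BeautifulSoup(html, "html.parser")
```

BeautifulSoup itself never touches the network; the proxy has to be configured on whatever layer fetches the pages (urllib2/urllib.request here), and the fetched HTML is then handed to BeautifulSoup.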




Answer 2


Heads up: a less complex solution is available now, using the requests library:

import requests

proxies = {"http": "http://10.10.1.10:3128",
           "https": "http://10.10.1.10:1080"}

requests.get("http://example.org", proxies=proxies)

Then do your BeautifulSoup parsing as normal on the response content.

So if you want separate threads with different proxies, you just pass a different proxies dictionary for each request (e.g. drawn from a list of dicts).

This seems more straightforward to implement when you are already using requests / bs4, since it is just an extra keyword argument on your existing requests.get() call. You don't have to initialize/install/open a separate urllib handler for each thread.
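The "list of dicts" rotation mentioned above can be sketched like this, cycling through a pool so each request (or thread) picks up the next proxy in turn. The proxy addresses are placeholders, and `next_proxies` is an illustrative helper, not part of any library:

```python
from itertools import cycle

# Placeholder proxy pool; replace with real proxy addresses.
proxy_pool = cycle([
    {"http": "http://10.10.1.10:3128"},
    {"http": "http://10.10.1.11:3128"},
    {"http": "http://10.10.1.12:3128"},
])

def next_proxies():
    # Return the proxies dict to use for the next request,
    # wrapping around to the start of the pool when exhausted.
    return next(proxy_pool)

# In each worker thread:
# response = requests.get(url, proxies=next_proxies())
# soup = BeautifulSoup(response.text, "html.parser")
```

If many threads call `next_proxies()` concurrently, guarding the `next()` call with a `threading.Lock` is a safer choice, since iterators are not guaranteed to be thread-safe.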



Source: https://stackoverflow.com/questions/12464799/how-to-add-proxies-to-beautifulsoup-crawler
