Question
These are the definitions in the Python crawler:
from __future__ import with_statement
from eventlet.green import urllib2
import eventlet
import re
import urlparse
from bs4 import BeautifulSoup, SoupStrainer
import sqlite3
import datetime
How do I add a rotating proxy (one proxy per open thread) to a recursive crawler working with BeautifulSoup?
I know how to add proxies if I were using Mechanize's browser:
br = Browser()
br.set_proxies({'http': 'http://username:password@proxy:port',
                'https': 'https://username:password@proxy:port'})
but I would like to know specifically what kind of solution BeautifulSoup would require.
Thank you very much for your help!
Answer 1:
Have a look at this example of using BeautifulSoup with an HTTP proxy:
http://monzool.net/blog/2007/10/15/html-parsing-with-beautiful-soup/
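The key point behind that link is that BeautifulSoup itself never opens a network connection; it only parses markup. The proxy therefore belongs to whatever fetches the HTML, which in the question's Python 2 setup is the urllib2 opener. A minimal sketch, using the Python 3 equivalent (urllib.request) and a placeholder proxy address that is an assumption, not a working endpoint:

```python
# BeautifulSoup only parses HTML; the proxy is configured on the opener
# that downloads the page. Shown with urllib.request (the Python 3
# counterpart of the question's urllib2). The proxy URL is a placeholder.
import urllib.request

proxy_url = "http://username:password@proxy:port"  # hypothetical proxy

# Build an opener that routes every request through the proxy.
proxy_handler = urllib.request.ProxyHandler({"http": proxy_url,
                                             "https": proxy_url})
opener = urllib.request.build_opener(proxy_handler)

# Fetching and parsing (left as comments, since the proxy is a placeholder):
# html = opener.open("http://example.org").read()
# soup = BeautifulSoup(html, "html.parser")
```

Each green thread in the crawler can build its own opener with a different proxy, since openers are independent objects.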
Answer 2:
Heads up that there is a less complex solution to this available now, using the requests library:
import requests
proxies = {"http": "http://10.10.1.10:3128",
           "https": "http://10.10.1.10:1080"}
requests.get("http://example.org", proxies=proxies)
Then do your BeautifulSoup parsing as normal on the response.
So if you want separate threads with different proxies, you just pass a different dictionary to each request (e.g. drawn from a list of dicts).
This seems more straightforward to implement when your existing stack is already requests / bs4, since it is just an extra keyword argument added to your existing requests.get() call. You don't have to initialize/install/open separate urllib handlers for each thread.
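Drawing a different proxies dict per request can be sketched as a small thread-safe rotator over a pool. A minimal sketch, assuming a hypothetical PROXY_POOL of endpoints; only the rotation logic runs here, and the actual requests.get() call is left as a comment since it needs live proxies:

```python
# Round-robin proxy rotation, safe to call from multiple threads.
import itertools
import threading

PROXY_POOL = [  # hypothetical proxy endpoints
    {"http": "http://10.10.1.10:3128", "https": "http://10.10.1.10:1080"},
    {"http": "http://10.10.1.11:3128", "https": "http://10.10.1.11:1080"},
]

_cycle = itertools.cycle(PROXY_POOL)
_lock = threading.Lock()  # itertools.cycle is not thread-safe on its own


def next_proxies():
    """Return the next proxies dict from the pool, cycling forever."""
    with _lock:
        return next(_cycle)

# In each worker thread you would then do:
# resp = requests.get(url, proxies=next_proxies())
# soup = BeautifulSoup(resp.text, "html.parser")
```

The lock matters because advancing a shared iterator from several threads at once is a race; with it, each request gets exactly one proxy from the pool in turn.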
Source: https://stackoverflow.com/questions/12464799/how-to-add-proxies-to-beautifulsoup-crawler