Question
These are the definitions in the Python crawler:
from __future__ import with_statement
from eventlet.green import urllib2
import eventlet
import re
import urlparse
from bs4 import BeautifulSoup, SoupStrainer
import sqlite3
import datetime
How do I add a rotating proxy (one proxy per open thread) to a recursive crawler working with BeautifulSoup?
I know how to add proxies if I were using Mechanize's browser:
br = Browser()
br.set_proxies({'http': 'http://username:password@proxy:port',
                'https': 'https://username:password@proxy:port'})
but I would like to know specifically what kind of solution BeautifulSoup would require.
Thank you very much for your help!
Answer 1:
Have a look at this example of using BeautifulSoup with an HTTP proxy:
http://monzool.net/blog/2007/10/15/html-parsing-with-beautiful-soup/
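The key point behind that link is that BeautifulSoup itself never opens a network connection; it only parses markup. The proxy therefore belongs to whatever fetches the HTML, which in the question's Python 2 setup is the urllib2 opener. A minimal sketch, using the Python 3 equivalent (urllib.request) and a placeholder proxy address that is an assumption, not a working endpoint:

```python
# BeautifulSoup only parses HTML; the proxy is configured on the opener
# that downloads the page. Shown with urllib.request (the Python 3
# counterpart of the question's urllib2). The proxy URL is a placeholder.
import urllib.request

proxy_url = "http://username:password@proxy:port"  # hypothetical proxy

# Build an opener that routes every request through the proxy.
proxy_handler = urllib.request.ProxyHandler({"http": proxy_url,
                                             "https": proxy_url})
opener = urllib.request.build_opener(proxy_handler)

# Fetching and parsing (left as comments, since the proxy is a placeholder):
# html = opener.open("http://example.org").read()
# soup = BeautifulSoup(html, "html.parser")
```

Each green thread in the crawler can build its own opener with a different proxy, since openers are independent objects.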
Answer 2:
Heads up that there is a less complex solution to this available now, using the requests library:
import requests
proxies = {"http": "http://10.10.1.10:3128",
           "https": "http://10.10.1.10:1080"}
requests.get("http://example.org", proxies=proxies)
Then do your BeautifulSoup parsing as normal on the response.
So if you want separate threads with different proxies, you just pass a different dictionary to each request (e.g. drawn from a list of dicts).
This seems more straightforward to implement when your existing stack is already requests / bs4, since it is just an extra keyword argument added to your existing requests.get() call. You don't have to initialize/install/open separate urllib handlers for each thread.
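Drawing a different proxies dict per request can be sketched as a small thread-safe rotator over a pool. A minimal sketch, assuming a hypothetical PROXY_POOL of endpoints; only the rotation logic runs here, and the actual requests.get() call is left as a comment since it needs live proxies:

```python
# Round-robin proxy rotation, safe to call from multiple threads.
import itertools
import threading

PROXY_POOL = [  # hypothetical proxy endpoints
    {"http": "http://10.10.1.10:3128", "https": "http://10.10.1.10:1080"},
    {"http": "http://10.10.1.11:3128", "https": "http://10.10.1.11:1080"},
]

_cycle = itertools.cycle(PROXY_POOL)
_lock = threading.Lock()  # itertools.cycle is not thread-safe on its own


def next_proxies():
    """Return the next proxies dict from the pool, cycling forever."""
    with _lock:
        return next(_cycle)

# In each worker thread you would then do:
# resp = requests.get(url, proxies=next_proxies())
# soup = BeautifulSoup(resp.text, "html.parser")
```

The lock matters because advancing a shared iterator from several threads at once is a race; with it, each request gets exactly one proxy from the pool in turn.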
Source: https://stackoverflow.com/questions/12464799/how-to-add-proxies-to-beautifulsoup-crawler