Question
I am using Python to scrape pages. Until now I haven't had any complicated issues.
The site I'm trying to scrape uses a lot of security checks and has some mechanism to prevent scraping.
Using Requests and lxml I was able to scrape about 100-150 pages before getting IP-banned. Sometimes I even get banned on the first request (a new IP, not used before, in a different C block). I have tried spoofing headers and randomizing the time between requests; still the same result.
I have tried Selenium and got much better results. With Selenium I was able to scrape about 600-650 pages before getting banned. Here I also tried to randomize the requests (a 3-5 second delay between them, plus a time.sleep(300) call on every 300th request). Despite that, I'm still getting banned.
From this I can conclude that the site has some mechanism that bans an IP once it requests more than X pages in one open browser session, or something like that.
Based on your experience, what else should I try? Will closing and reopening the browser in Selenium help (for example, closing and reopening it after every 100 requests; see the sketch below)? I was thinking about trying proxies, but there are about a million pages and it would be very expensive.
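To make the restart idea concrete, here is a minimal sketch (Firefox as an example; urls and scrape_page are hypothetical placeholders for the page list and the parsing code):

import random
import time

from selenium import webdriver

driver = webdriver.Firefox()
for i, url in enumerate(urls, start=1):   # urls: hypothetical page list
    driver.get(url)
    scrape_page(driver.page_source)       # hypothetical parsing helper
    time.sleep(random.uniform(3, 5))      # randomized 3-5 second delay
    if i % 300 == 0:
        time.sleep(300)                   # long pause on every 300th request
    if i % 100 == 0:
        driver.quit()                     # close and reopen the browser session
        driver = webdriver.Firefox()
driver.quit()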
Answer 1:
If you switched to the Scrapy web-scraping framework, you could reuse a number of things that were built to prevent and deal with banning:

- the built-in AutoThrottle extension (a settings sketch follows this list):
  This is an extension for automatically throttling crawling speed based on the load of both the Scrapy server and the website you are crawling.
- rotating user agents with the scrapy-fake-useragent middleware:
  Use a random User-Agent provided by fake-useragent for every request.
- rotating IP addresses:
  - setting the Scrapy proxy middleware to rotate on each request
  - scrapy-proxies

You can also run it via a local proxy & TOR:
- Scrapy: Run Using TOR and Multiple Agents
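A minimal settings.py sketch wiring up AutoThrottle together with scrapy-fake-useragent (the delay numbers and the middleware priority of 400 are common starting values, not requirements; tune them for your site):

# settings.py -- a sketch, not a drop-in config

# AutoThrottle: adapt crawl speed to how fast the server responds
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5.0         # initial delay between requests, seconds
AUTOTHROTTLE_MAX_DELAY = 60.0          # upper bound when the server slows down
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average parallel requests per remote server

# scrapy-fake-useragent: pick a random User-Agent for every request
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,  # disable the default
    'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
}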
Answer 2:
You could use proxies.

You can buy several hundred IPs very cheaply and use Selenium as you have done so far. I also suggest varying the browser you use and other user-agent parameters.

You could iterate over the proxies, using each single IP address to load only x pages and stopping before you get banned (see the rotation sketch after the function below).
from selenium import webdriver

def load_proxy(PROXY_HOST, PROXY_PORT):
    # Firefox profile that sends HTTP traffic through the given proxy
    fp = webdriver.FirefoxProfile()
    fp.set_preference("network.proxy.type", 1)  # 1 = manual proxy configuration
    fp.set_preference("network.proxy.http", PROXY_HOST)
    fp.set_preference("network.proxy.http_port", int(PROXY_PORT))
    fp.set_preference("general.useragent.override", "whatever_useragent")  # your UA string
    fp.update_preferences()
    return webdriver.Firefox(firefox_profile=fp)
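A possible way to use it, rotating through a purchased proxy list (proxies as a list of (host, port) pairs and urls as the page list are assumptions here; keep PAGES_PER_PROXY below the ban threshold you observed):

PAGES_PER_PROXY = 100  # stay under the per-IP limit you measured

for i, (host, port) in enumerate(proxies):   # proxies: assumed list of (host, port)
    driver = load_proxy(host, port)
    for url in urls[i * PAGES_PER_PROXY:(i + 1) * PAGES_PER_PROXY]:
        driver.get(url)
        # ... parse driver.page_source here ...
    driver.quit()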
Answer 3:
I had this problem too. I used urllib with Tor in Python 3.
- Download and install the Tor browser.
- Test Tor: open a terminal and type:

  curl --socks5-hostname localhost:9050 http://site-that-blocked-you.com

  If you see a result, it worked.
- Now we should test it in Python. Run this code:
import socks
import socket
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

# Set a SOCKS5 proxy so every socket goes through Tor
# (the socks module comes from the PySocks package)
socks.set_default_proxy(socks.SOCKS5, "localhost", 9050)
socket.socket = socks.socksocket

req = Request('http://check.torproject.org', headers={'User-Agent': 'Mozilla/5.0'})
html = urlopen(req).read()
soup = BeautifulSoup(html, 'html.parser')
print(soup('title')[0].get_text())
If you see

Congratulations. This browser is configured to use Tor.

it worked in Python too, which means you are using Tor for web scraping.
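Since the question mentions Requests: it can go through the same Tor SOCKS port without monkey-patching sockets, if you install the requests[socks] extra (the socks5h scheme also resolves DNS through Tor). A sketch:

import requests

# Requires: pip install requests[socks]
proxies = {
    'http': 'socks5h://localhost:9050',
    'https': 'socks5h://localhost:9050',
}
resp = requests.get('http://check.torproject.org',
                    proxies=proxies,
                    headers={'User-Agent': 'Mozilla/5.0'})
print('Congratulations' in resp.text)  # True if the request went through Tor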
Source: https://stackoverflow.com/questions/35133200/scraping-in-python-preventing-ip-ban