I am trying to scrape a website and I am getting a 403 Forbidden error no matter what I try:
First, note that the site does not like web scraping. As @KeepCalmAndCarryOn pointed out in a comment this site has a /robots.txt where it explicitly asks bots to not crawl specific parts of the site, including the parts you want to scrape. While not legally binding a good citizen will adhere to such request.
Additionally the site seems to employ explicit protection against scraping and tries to make sure that this is really a browser. It looks like the site is behind the Akamai CDN, so maybe the anti-scraping protection is from this CDN.
But I've took the request sent by Firefox (which worked) and then tried to simplify it as much as possible. The following works currently for me, but might of course fail if the site updates its browser detection:
use strict;
use warnings;
use IO::Socket::SSL;
(my $rq = <<'RQ') =~s{\r?\n}{\r\n}g;
GET /productResults.aspx?allCategories=true&N=1318723&isrc=vitacostbrands%3aquadblock%3asupplements&scrolling=true&No=40&_=151047598285 HTTP/1.1
Host: www.vitacost.com
Accept: */*
Accept-Language: en-US
Connection: keep-alive
RQ
my $cl = IO::Socket::SSL->new('www.vitacost.com:443') or die;
print $cl $rq;
my $hdr = '';
while (<$cl>) {
$hdr .= $_;
last if $_ eq "\r\n";
}
warn "[header done]\n";
my $len = $hdr =~m{^Content-length:\s*(\d+)}mi && $1 or die "no length";
read($cl,my $buf,$len);
print $buf;
Interestingly, if I remove the Accept
header I get a 403 Forbidden. If I instead remove the Accept-Language
it simply hangs. And also interestingly it does not seem to need a User-Agent header.
EDIT: it looks like the bot-detection also uses the source IP of the sender as feature. While the code above works for me from two different systems it fails to work for a third system (hosted at Digitalocean) and just hangs.