Scraping attempts getting 403 error

暖寄归人 2020-12-20 09:09

I am trying to scrape a website and I am getting a 403 Forbidden error no matter what I try:

  1. wget
  2. CURL (command line and PHP)
  3. Perl WWW::Mecha
1 Answer
  • 2020-12-20 09:29

    First, note that the site does not welcome web scraping. As @KeepCalmAndCarryOn pointed out in a comment, the site has a /robots.txt in which it explicitly asks bots not to crawl specific parts of the site, including the parts you want to scrape. While this is not legally binding, a good citizen will adhere to such a request.
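
    As a rough illustration of honoring such a request, the sketch below checks a path against the Disallow rules of the "User-agent: *" group. It is a simplified reading of robots.txt (no Allow rules, no wildcards, no per-bot groups), the helper names disallowed_prefixes and is_allowed are made up for this example, and the sample rules are hypothetical, not the real file of any site:

```perl
use strict;
use warnings;

# Collect the Disallow prefixes that apply to every crawler ("User-agent: *").
# Simplified on purpose: no Allow rules, no wildcards, no per-bot groups.
sub disallowed_prefixes {
    my ($robots_txt) = @_;
    my (@prefixes, $applies, $in_agents);
    for my $line (split /\r?\n/, $robots_txt) {
        $line =~ s/#.*//;    # strip comments
        next unless $line =~ /^\s*([A-Za-z-]+)\s*:\s*(.*?)\s*$/;
        my ($field, $value) = (lc $1, $2);
        if ($field eq 'user-agent') {
            # Consecutive User-agent lines belong to the same group.
            $applies   = 0 unless $in_agents;
            $applies   = 1 if $value eq '*';
            $in_agents = 1;
        }
        else {
            $in_agents = 0;
            push @prefixes, $value
                if $applies && $field eq 'disallow' && length $value;
        }
    }
    return @prefixes;
}

# True if no collected Disallow prefix matches the start of $path.
sub is_allowed {
    my ($robots_txt, $path) = @_;
    for my $prefix (disallowed_prefixes($robots_txt)) {
        return 0 if index($path, $prefix) == 0;
    }
    return 1;
}

# Hypothetical robots.txt content, used only for demonstration.
my $sample = join "\n",
    'User-agent: *',
    'Disallow: /search',
    'Disallow: /private/',
    '';

print is_allowed($sample, '/search?q=vitamins') ? "allowed\n" : "disallowed\n";
print is_allowed($sample, '/about')             ? "allowed\n" : "disallowed\n";
```

    For real crawling, a complete parser that follows the Robots Exclusion Protocol is a safer choice than this prefix check.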

    Additionally, the site seems to employ explicit protection against scraping and tries to make sure that the client is really a browser. It looks like the site is behind the Akamai CDN, so the anti-scraping protection may come from this CDN.

    But I took the request sent by Firefox (which worked) and then simplified it as much as possible. The following currently works for me, but it might of course fail if the site updates its browser detection:

    use strict;
    use warnings;
    use IO::Socket::SSL;
    
    (my $rq = <<'RQ') =~s{\r?\n}{\r\n}g;
    GET /productResults.aspx?allCategories=true&N=1318723&isrc=vitacostbrands%3aquadblock%3asupplements&scrolling=true&No=40&_=151047598285 HTTP/1.1
    Host: www.vitacost.com
    Accept: */*
    Accept-Language: en-US
    Connection: keep-alive
    
    RQ
    
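    # Connect over TLS; $SSL_ERROR is exported by IO::Socket::SSL.
    my $cl = IO::Socket::SSL->new('www.vitacost.com:443')
        or die "connect failed: $!, $SSL_ERROR";
    print $cl $rq;

    # Read the response header line by line, up to the empty line that ends it.
    my $hdr = '';
    while (<$cl>) {
        $hdr .= $_;
        last if $_ eq "\r\n";
    }
    warn "[header done]\n";

    # Read the body, whose size is given by Content-Length (assumes no chunked encoding).
    my $len = $hdr =~m{^Content-length:\s*(\d+)}mi && $1 or die "no length";
    read($cl,my $buf,$len);
    print $buf;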
    

    Interestingly, if I remove the Accept header I get a 403 Forbidden. If I instead remove the Accept-Language header, the request simply hangs. Also interestingly, the site does not seem to need a User-Agent header.
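
    This kind of header-by-header experiment can be made systematic. The following is only a sketch: build_request and ablation_requests are names invented here, and '/' is a placeholder path rather than the URL used above. It generates one raw request per header, each with exactly that header left out, so every variant can be sent in turn and the responses compared:

```perl
use strict;
use warnings;

# Build a raw HTTP/1.1 GET request with CRLF line endings.
# $headers is an array ref of [name, value] pairs so the order is preserved.
sub build_request {
    my ($host, $path, $headers) = @_;
    my @lines = ("GET $path HTTP/1.1", "Host: $host");
    push @lines, "$_->[0]: $_->[1]" for @$headers;
    return join("\r\n", @lines) . "\r\n\r\n";
}

# Return a hash mapping each header name to a request without that header.
sub ablation_requests {
    my ($host, $path, $headers) = @_;
    my %variants;
    for my $skip (map { $_->[0] } @$headers) {
        my @kept = grep { $_->[0] ne $skip } @$headers;
        $variants{$skip} = build_request($host, $path, \@kept);
    }
    return %variants;
}

my @headers = (
    [ 'Accept'          => '*/*'        ],
    [ 'Accept-Language' => 'en-US'      ],
    [ 'Connection'      => 'keep-alive' ],
);

# '/' is a placeholder path, not the URL from the answer.
my %variants = ablation_requests('www.vitacost.com', '/', \@headers);
for my $dropped (sort keys %variants) {
    print "--- without $dropped ---\n$variants{$dropped}";
}
```

    Each generated string can be written to the socket in place of $rq. Since the detection may also depend on other signals, the outcome of such a test is only a hint about which headers matter.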

    EDIT: it looks like the bot detection also uses the sender's source IP as a feature. While the code above works for me from two different systems, it fails on a third system (hosted at DigitalOcean) and just hangs.
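
    Because a blocked request may hang rather than fail, it can help to wait for the response with a timeout. The sketch below uses IO::Select from core Perl; wait_readable is a name made up for this example, and it is demonstrated on a local socket pair instead of the real server:

```perl
use strict;
use warnings;
use IO::Select;
use Socket qw(AF_UNIX SOCK_STREAM PF_UNSPEC);

# Return true if $fh becomes readable within $timeout seconds, false otherwise.
sub wait_readable {
    my ($fh, $timeout) = @_;
    my @ready = IO::Select->new($fh)->can_read($timeout);
    return scalar @ready;
}

# Demonstration on a local socket pair standing in for client and server.
socketpair(my $reader, my $writer, AF_UNIX, SOCK_STREAM, PF_UNSPEC)
    or die "socketpair failed: $!";

# The peer is silent, so this wait runs into the timeout.
print wait_readable($reader, 0.2) ? "data\n" : "timed out\n";

# Once the peer has written something, the handle is readable.
syswrite($writer, "HTTP/1.1 200 OK\r\n");
print wait_readable($reader, 0.2) ? "data\n" : "timed out\n";
```

    With an IO::Socket::SSL handle, already decrypted data can sit in the TLS layer without the underlying socket being readable, so checking $cl->pending before waiting is advisable.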
