Facebook crawler is hitting my server hard and ignoring directives. Accessing same resources multiple times

盖世英雄少女心 2021-02-05 05:52

The Facebook Crawler is hitting my servers multiple times every second and it seems to be ignoring both the Expires header and the og:ttl property.

In some cases, it is requesting the same resources multiple times.

8 Answers
  • 2021-02-05 06:21

    The Facebook documentation specifically states: "Images are cached based on the URL and won't be updated unless the URL changes." This means it doesn't matter which headers or meta tags you add to your page; the bot is supposed to cache the image anyway.

    This made me think:

    1. Does each user share a slightly different URL for your page? This would cause the share image to be re-cached each time.
    2. Is your share image accessed using a slightly different URL each time?
    3. Maybe the image is linked differently somewhere?

    I'd monitor the page logs and see exactly what happens - if the page URL or the image URL is even slightly different, the caching mechanism won't work. Luckily, this doesn't seem like a headers/tags type of issue.
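
    To make that check easier, here is a minimal PHP sketch (the log path and the Apache "combined" log format are assumptions) that counts how often facebookexternalhit requested each URL, so slightly differing page or image URLs stand out:

    <?php

    // Hypothetical access-log location and format (Apache "combined"); adjust to your setup.
    $log = '/var/log/apache2/access.log';

    $counts = [];
    foreach (file($log) as $line) {
        // Only count requests made by the Facebook crawler.
        if (strpos($line, 'facebookexternalhit') === false) {
            continue;
        }
        // In the combined format the request line is quoted: "GET /path HTTP/1.1"
        if (preg_match('/"(?:GET|HEAD) (\S+)/', $line, $m)) {
            $counts[$m[1]] = ($counts[$m[1]] ?? 0) + 1;
        }
    }

    // Most-requested URLs first; a long tail of similar-but-not-identical URLs
    // means the URL-based cache is being defeated.
    arsort($counts);
    foreach (array_slice($counts, 0, 20, true) as $url => $hits) {
        echo $hits . "  " . $url . PHP_EOL;
    }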

  • 2021-02-05 06:22

    If the FB crawler ignores your cache headers, you could add an "ETag" header to return correct 304 responses and reduce the load on your server.

    The first time you generate an image, calculate its hash (for example with md5) and send it as the "ETag" response header. If your server then receives a request with an "If-None-Match" header, check whether you have already returned that hash. If you have, return a 304 response. If not, generate the image.

    Checking whether you have already returned a given hash (while avoiding generating the image again) means you'll need to store the hash somewhere... Maybe save the images in a tmp folder and use the hash as the file name?

    More info about "ETag" + "If-None-Match" headers.
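
    As a minimal sketch of that idea (the image path and the generate_image() helper are hypothetical placeholders for however you actually produce your images):

    <?php

    // Hypothetical cache location for the generated share image.
    $imagePath = sys_get_temp_dir() . '/share-image.png';

    if (!is_readable($imagePath)) {
        generate_image($imagePath); // placeholder for however you build the image
    }

    // Use the file's hash as a strong ETag, as described above.
    $etag = '"' . md5_file($imagePath) . '"';
    header('ETag: ' . $etag);

    // If the crawler sends the same hash back, reply with a bodyless 304.
    $ifNoneMatch = trim($_SERVER['HTTP_IF_NONE_MATCH'] ?? '');
    if ($ifNoneMatch === $etag) {
        http_response_code(304);
        exit;
    }

    header('Content-Type: image/png');
    header('Content-Length: ' . filesize($imagePath));
    readfile($imagePath);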

  • 2021-02-05 06:26

    According to the Facebook documentation, only the Facebot crawler respects crawling directives. However, they also suggest this:

    You can target one of these user agents to serve the crawler a nonpublic version of your page that has only metadata and no actual content. This helps optimize performance and is useful for keeping paywalled content secure.

    Some people suggest rate limiting access for facebookexternalhit, but I doubt that is a good idea since it may prevent the crawler from updating the content.

    Seeing multiple hits from different IPs but the same bot may be acceptable, depending on their architecture. You should check how often the same resource gets crawled. og:ttl is what the documentation recommends and should help.
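
    A rough sketch of what that user-agent targeting could look like in PHP (render_full_page() and the og:* values are placeholders, not anything from your site):

    <?php

    $ua = $_SERVER['HTTP_USER_AGENT'] ?? '';

    // Serve Facebook's crawlers a stripped page containing only the Open Graph
    // metadata; og:ttl (in seconds) asks the crawler to re-scrape less often.
    if (preg_match('/facebookexternalhit|Facebot/i', $ua)) {
        header('Content-Type: text/html; charset=utf-8');
        echo '<!DOCTYPE html><html><head>'
            . '<meta property="og:title" content="Example title" />'
            . '<meta property="og:image" content="https://example.com/images/share.png" />'
            . '<meta property="og:url" content="https://example.com/page" />'
            . '<meta property="og:ttl" content="2419200" />'
            . '</head><body></body></html>';
        exit;
    }

    render_full_page(); // placeholder: the normal page served to human visitors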

  • 2021-02-05 06:31

    It would appear that Facebook's crawlers aren't always that respectful. In the past we've implemented the suggestion from: excessive traffic from facebookexternalhit bot.

    It's not the best solution, as it would be nicer if Facebook limited its request rate, but clearly they don't do that.

  • 2021-02-05 06:32

    After I had tried almost everything else with caching, headers and what not, the only thing that saved our servers from the "overly enthusiastic" Facebook crawler (user agent facebookexternalhit) was simply denying access and sending back an HTTP/1.1 429 Too Many Requests response when the crawler "crawled too much".

    Admittedly, we had thousands of images we wanted the crawler to crawl, but the Facebook crawler was practically DDoSing our server with tens of thousands of requests per hour (yes, the same URLs over and over). At one point I remember it was 40 000 requests per hour from different Facebook IP addresses, all using the facebookexternalhit user agent.

    We did not want to block the crawler entirely, and blocking by IP address was also not an option. We only needed the FB crawler to back off (quite) a bit.

    This is a piece of PHP code we used to do it:

    .../images/index.php

    <?php

    // Number of requests permitted for the Facebook crawler per second.
    const FACEBOOK_REQUEST_THROTTLE = 5;
    const FACEBOOK_REQUESTS_JAR = __DIR__ . '/.fb_requests';
    const FACEBOOK_REQUESTS_LOCK = __DIR__ . '/.fb_requests.lock';

    // Acquire an exclusive lock and return the handle. The handle must stay
    // in scope for the rest of the request, otherwise PHP releases the lock
    // as soon as the handle is garbage-collected.
    function handle_lock($lockfile) {
        $handle = fopen($lockfile, 'w');
        flock($handle, LOCK_EX);
        return $handle;
    }

    $ua = $_SERVER['HTTP_USER_AGENT'] ?? false;
    if ($ua && strpos($ua, 'facebookexternalhit') !== false) {

        // Serialize concurrent crawler requests so the counter file isn't corrupted.
        $lock = handle_lock(FACEBOOK_REQUESTS_LOCK);

        // The "jar" holds the current second and how many crawler requests
        // we have seen during that second.
        $jar = @file(FACEBOOK_REQUESTS_JAR);
        $currentTime = time();
        $timestamp = (int) ($jar[0] ?? $currentTime);
        $count = (int) ($jar[1] ?? 0);

        if ($timestamp === $currentTime) {
            $count++;
        } else {
            $count = 0;
        }

        file_put_contents(FACEBOOK_REQUESTS_JAR, "$currentTime\n$count");

        if ($count >= FACEBOOK_REQUEST_THROTTLE) {
            header("HTTP/1.1 429 Too Many Requests", true, 429);
            header("Retry-After: 60");
            die;
        }

    }

    // Everything under this comment happens only if the request is "legit".

    $filePath = $_SERVER['DOCUMENT_ROOT'] . $_SERVER['REQUEST_URI'];
    if (is_readable($filePath)) {
        header("Content-Type: image/png");
        readfile($filePath);
    }
    

    You also need to configure URL rewriting so that all requests for your images are passed to this PHP script:

    .../images/.htaccess (if you're using Apache)

    RewriteEngine On
    RewriteRule .* index.php [L] 
    

    The crawler seems to have "understood" this approach, and it effectively reduced the request rate from tens of thousands of requests per hour to hundreds or thousands of requests per hour.

  • 2021-02-05 06:36

    @Nico suggests

    We had the same problems on our website/server. The problem was the og:url meta tag. After removing it, the problem was solved for most facebookexternalhit calls.

    So you could try removing that tag and see if it fixes the problem.

    0 讨论(0)