excessive traffic from facebookexternalhit bot

灰色年华 2020-12-03 06:01

Does anyone know how to tell the 'facebookexternalhit' bot to spread its traffic?

Our website gets hammered every 45-60 minutes with spikes of approximately 400 requests.

3 Answers
  • 2020-12-03 06:17

    I know it's an old, but unanswered, question. I hope this answer helps someone.

    There's an Open Graph tag named og:ttl that allows you to slow down the requests made by the Facebook crawler: (reference)

    "Crawler rate limiting: You can label pages and objects to change how long Facebook's crawler will wait to check them for new content. Use the og:ttl object property to limit crawler access if our crawler is being too aggressive."

    The documentation on checking object properties for og:ttl states that the default TTL is 30 days for each canonical URL shared. So setting this meta tag will only slow requests down if you have accumulated a very large number of shared objects over time.
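
    For illustration, a minimal sketch of how such a tag could be emitted from a PHP template; og:ttl is specified in seconds, and the 60-day value below is only an illustrative choice because it is longer than the 30-day default, not a recommendation:

    $ogTtlSeconds = 60 * 86400; // 60 days in seconds -- illustrative value, anything above the 30-day default slows re-scraping
    echo '<meta property="og:ttl" content="' . $ogTtlSeconds . '" />';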

    But, if you're being reached by Facebook's crawler because of actual live traffic (users sharing a lot of your stories at the same time), this will of course not work.

    Another possible reason you are getting too many crawler requests is that your stories are not being shared with a correct canonical URL (og:url) tag. Say your users can reach a certain article on your site from several different sources (they see and share the same article, but under different URLs). If you don't set the same og:url tag for all of them, Facebook will treat each URL as a different article and, over time, generate crawler requests to all of them instead of just the one canonical URL. More info here.
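
    As a sketch, the article template can always emit one canonical og:url, no matter which URL variant the visitor actually used to reach the page (the domain and slug below are illustrative placeholders):

    $articleSlug  = 'some-article'; // illustrative slug for the article being rendered
    $canonicalUrl = 'https://www.example.com/articles/' . $articleSlug; // one canonical address per article
    echo '<meta property="og:url" content="' . htmlspecialchars( $canonicalUrl ) . '" />';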

    Hope it helps.

  • 2020-12-03 06:21

    Per other answers, the semi-official word from Facebook is "suck it". It boggles my mind that they cannot honor Crawl-delay (yes, I know it's not a "crawler", however GET'ing 100 pages in a few seconds is a crawl, whatever you want to call it).

    Since one cannot appeal to their hubris, and DROP'ing their IP block is pretty draconian, here is my technical solution.

    In PHP, execute the following code as early as possible in every request.

    define( 'FACEBOOK_REQUEST_THROTTLE', 2.0 ); // Minimum number of seconds permitted between hits from facebookexternalhit
    
    if( !empty( $_SERVER['HTTP_USER_AGENT'] ) && strpos( $_SERVER['HTTP_USER_AGENT'], 'facebookexternalhit' ) === 0 ) {
        $fbTmpFile = sys_get_temp_dir().'/facebookexternalhit.txt';
        if( $fh = fopen( $fbTmpFile, 'c+' ) ) {
            // read the time of the last Facebook hit (0.0 when the file is new or empty)
            $lastTime = (float) fread( $fh, 100 );
            $microTime = microtime( TRUE );
            // compare current time with the time of the last access
            if( $microTime - $lastTime < FACEBOOK_REQUEST_THROTTLE ) {
                // bail if requests are coming too quickly, with HTTP 503 Service Unavailable
                header( $_SERVER["SERVER_PROTOCOL"].' 503 Service Unavailable' );
                die;
            } else {
                // record the time of this access for the next check
                rewind( $fh );
                ftruncate( $fh, 0 );
                fwrite( $fh, $microTime );
            }
            fclose( $fh );
        } else {
            // could not open the throttle file; refuse the request rather than let the blast through
            header( $_SERVER["SERVER_PROTOCOL"].' 429 Too Many Requests' );
            die;
        }
    }
    

    You can test this from a command line with something like:

    $ rm index.html*; wget -U "facebookexternalhit/1.0 (+http://www.facebook.com/externalhit_uatext.php)" http://www.foobar.com/; less index.html
    

    Improvement suggestions are welcome... I would guess there might be some concurrency issues with a huge blast; a locked variant is sketched below.
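
    For what it's worth, a minimal sketch of the same idea that serializes the read/compare/write with an exclusive flock(), so concurrent requests cannot interleave. The constant, file name and status codes mirror the block above; this is an assumed variant, not a drop-in tested fix:

    define( 'FACEBOOK_REQUEST_THROTTLE', 2.0 ); // Minimum number of seconds permitted between hits from facebookexternalhit
    
    if( !empty( $_SERVER['HTTP_USER_AGENT'] ) && strpos( $_SERVER['HTTP_USER_AGENT'], 'facebookexternalhit' ) === 0 ) {
        $fbTmpFile = sys_get_temp_dir().'/facebookexternalhit.txt';
        $fh = fopen( $fbTmpFile, 'c+' );
        if( $fh && flock( $fh, LOCK_EX ) ) { // block until this request owns the file
            $lastTime = (float) fread( $fh, 100 ); // 0.0 when the file is new or empty
            $microTime = microtime( TRUE );
            if( $microTime - $lastTime < FACEBOOK_REQUEST_THROTTLE ) {
                // too soon since the last Facebook hit; release the lock and refuse
                flock( $fh, LOCK_UN );
                fclose( $fh );
                header( $_SERVER["SERVER_PROTOCOL"].' 503 Service Unavailable' );
                die;
            }
            // record the time of this hit for the next check
            rewind( $fh );
            ftruncate( $fh, 0 );
            fwrite( $fh, $microTime );
            flock( $fh, LOCK_UN );
            fclose( $fh );
        }
    }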

  • 2020-12-03 06:33

    We had the same problem on our website/server. The cause was the og:url meta tag. After removing it, the problem was solved for most facebookexternalhit calls.

    Another problem was that some pictures we specified in the og:image tag did not exist. So the facebookexternalhit scraper requested every image listed for the URL on each call of the URL.
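
    A cheap guard, as a sketch, is to emit og:image only when the file actually exists on disk, so the scraper is never pointed at images that 404 (both paths below are illustrative placeholders):

    $imagePath = '/var/www/html/images/article-cover.jpg'; // local path of the image (illustrative)
    $imageUrl  = 'https://www.example.com/images/article-cover.jpg'; // public URL of the same image (illustrative)
    if( is_file( $imagePath ) ) {
        echo '<meta property="og:image" content="' . htmlspecialchars( $imageUrl ) . '" />';
    }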
