Facebook crawler is hitting my server hard and ignoring directives. Accessing same resources multiple times

盖世英雄少女心 2021-02-05 05:52

The Facebook Crawler is hitting my servers multiple times every second and it seems to be ignoring both the Expires header and the og:ttl property.

In some cases, it is

8 Answers
  • 2021-02-05 06:38

    I received word back from the Facebook team themselves. Hopefully, it brings some clarification to how the crawler treats image URLs.

    Here it goes:

    The Crawler treats image URLs differently than other URLs.

    We scrape images multiple times because we have different physical regions, each of which needs to fetch the image. Since we have around 20 different regions, the developer should expect ~20 calls for each image. Once we make these requests, they stay in our cache for around a month - we need to rescrape these images frequently to prevent abuse on the platform (a malicious actor could get us to scrape a benign image and then replace it with an offensive one).

    So basically, you should expect that the image specified in og:image will be hit 20 times after it has been shared. Then, a month later, it will be scraped again.

  • 2021-02-05 06:48

    Blindly sending a 304 Not Modified response does not make much sense and can confuse Facebook's crawler even more. If you really decide to block some requests, consider responding with 429 Too Many Requests instead - it at least clearly indicates what the problem is.
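
    If you go that route, below is a minimal sketch of such a limiter in PHP, assuming the APCu extension is available; the 10-requests-per-minute threshold and the cache key are arbitrary choices, not anything Facebook documents.

    <?php
    // Rate-limit the Facebook crawler per client IP within a 60-second window
    // and answer 429 Too Many Requests once the limit is exceeded.
    $userAgent = $_SERVER['HTTP_USER_AGENT'] ?? '';

    if (stripos($userAgent, 'facebookexternalhit') !== false) {
        $key = 'fb_crawler_' . ($_SERVER['REMOTE_ADDR'] ?? 'unknown');

        // Create the counter with a 60-second TTL if it does not exist yet,
        // then increment it for this request.
        apcu_add($key, 0, 60);
        $hits = apcu_inc($key);

        if ($hits !== false && $hits > 10) {   // arbitrary limit: 10 hits per minute
            http_response_code(429);
            header('Retry-After: 60');         // hint when the crawler may retry
            exit;
        }
    }

    // ... continue generating the image as usual ...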

    As a gentler solution, you may try:

    • Add a Last-Modified header with a static value. Facebook's crawler may be clever enough to ignore the Expires header for constantly changing content, but not clever enough to handle a missing header properly.
    • Add an ETag header with proper 304 Not Modified support (see the sketch after this list).
    • Change the Cache-Control header to max-age=315360000, public, immutable if the image is static.
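
    A minimal PHP sketch of the ETag / 304 Not Modified approach, assuming the image already exists on disk at a hypothetical path:

    <?php
    // Serve an image with ETag, Last-Modified and long-lived Cache-Control,
    // and answer 304 Not Modified when the client sends the same ETag back.
    $imagePath = '/var/cache/images/123790824792439.jpg';   // assumed location

    $etag = '"' . md5_file($imagePath) . '"';
    $lastModified = gmdate('D, d M Y H:i:s', filemtime($imagePath)) . ' GMT';

    header('ETag: ' . $etag);
    header('Last-Modified: ' . $lastModified);
    header('Cache-Control: max-age=315360000, public, immutable');

    if (isset($_SERVER['HTTP_IF_NONE_MATCH']) &&
        trim($_SERVER['HTTP_IF_NONE_MATCH']) === $etag) {
        http_response_code(304);
        exit;
    }

    header('Content-Type: image/jpeg');
    header('Content-Length: ' . filesize($imagePath));
    readfile($imagePath);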

    You may also consider saving the cached image and serving it via the webserver without involving PHP. If you change the URLs to something like http://fb.example.com/img/image/123790824792439jikfio09248384790283940829044, you can create a fallback for nonexistent files with rewrite rules:

    RewriteEngine On
    # Only rewrite when no matching file or directory exists on disk
    RewriteCond %{REQUEST_FILENAME} !-f
    RewriteCond %{REQUEST_FILENAME} !-d
    # Fall back to the PHP script that generates and caches the image
    RewriteRule ^img/image/([0-9a-z]+)$ img/image.php?id=$1 [L]
    

    Only the first request should be handled by PHP, which saves a cache file for the requested URL (for example in /img/image/123790824792439jikfio09248384790283940829044). For all further requests the webserver serves the content from the cached file, sending proper headers and handling 304 Not Modified. You may also configure nginx for rate limiting - it should be more efficient than delegating image serving to PHP.
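
    A rough sketch of what such an img/image.php could look like, assuming a hypothetical render_image() helper that produces the image binary for a given id; paths and headers are illustrative only:

    <?php
    // Sketch of img/image.php for the rewrite rules above: on the first request
    // it renders the image, writes it to img/image/<id> (the path Apache checks
    // with the !-f condition), and serves it. Later requests never reach PHP
    // because the file now exists on disk.
    $id = $_GET['id'] ?? '';

    // Only accept ids matching the pattern from the RewriteRule.
    if (!preg_match('/^[0-9a-z]+$/', $id)) {
        http_response_code(404);
        exit;
    }

    $cacheFile = __DIR__ . '/image/' . $id;   // assumed cache location

    if (is_file($cacheFile)) {
        $binary = file_get_contents($cacheFile);
    } else {
        // render_image() is a hypothetical helper, not a real library function.
        $binary = render_image($id);
        file_put_contents($cacheFile, $binary, LOCK_EX);
    }

    header('Content-Type: image/jpeg');
    header('Cache-Control: max-age=315360000, public, immutable');
    header('Content-Length: ' . strlen($binary));
    echo $binary;

    Note that when Apache later serves the cached file directly, it will also need matching Content-Type and Cache-Control settings (for example via a <FilesMatch> or <Directory> block), which are omitted here.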
