How to recognize bots with PHP?

囚心锁ツ 2020-12-24 14:33

I am building stats for my users and don't want visits from bots to be counted.

Right now I have a basic PHP script with MySQL that increments a counter by 1 each time the page is called.

7 answers
  • 2020-12-24 15:08

    Have you tried identifying them by their user-agent information? A simple Google search should give you the user agents used by Google and the other major crawlers.

    This, of course, is not foolproof, but most crawlers by major companies supply a distinct user-agent.

    EDIT: This assumes you do not want to restrict the bots' access, only to leave their visits out of your statistics.
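
A minimal sketch of the user-agent check suggested above. The token list and the function name `looksLikeBot` are illustrative assumptions, not something from the answer; the tokens are substrings that the named crawlers include in their user-agent strings, and the list is deliberately short.

```php
<?php
// Hypothetical, deliberately short list of tokens that well-known crawlers
// include in their user-agent strings; extend it for your own traffic.
const KNOWN_BOT_TOKENS = ['Googlebot', 'bingbot', 'Slurp', 'DuckDuckBot', 'Baiduspider', 'YandexBot'];

/**
 * Returns true when the user agent contains one of the known bot tokens.
 */
function looksLikeBot(string $userAgent): bool
{
    foreach (KNOWN_BOT_TOKENS as $token) {
        // stripos() is case-insensitive and returns false when nothing matches.
        if (stripos($userAgent, $token) !== false) {
            return true;
        }
    }
    return false;
}

// The header can be missing entirely, so fall back to an empty string.
$userAgent = $_SERVER['HTTP_USER_AGENT'] ?? '';

if (!looksLikeBot($userAgent)) {
    // Count the visit here, for example by running the existing UPDATE query.
}
```

As the answer says, this is not foolproof: a client that sends a browser-like user agent is still counted.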

  • 2020-12-24 15:13

    A working bot detector. I use it on my website to detect robots, crawlers, spiders, and site copiers.

    function isBotDetected() {
    
        if ( preg_match('/abacho|accona|AddThis|AdsBot|ahoy|AhrefsBot|AISearchBot|alexa|altavista|anthill|appie|applebot|arale|araneo|AraybOt|ariadne|arks|aspseek|ATN_Worldwide|Atomz|baiduspider|baidu|bbot|bingbot|bing|Bjaaland|BlackWidow|BotLink|bot|boxseabot|bspider|calif|CCBot|ChinaClaw|christcrawler|CMC\/0\.01|combine|confuzzledbot|contaxe|CoolBot|cosmos|crawler|crawlpaper|crawl|curl|cusco|cyberspyder|cydralspider|dataprovider|digger|DIIbot|DotBot|downloadexpress|DragonBot|DuckDuckBot|dwcp|EasouSpider|ebiness|ecollector|elfinbot|esculapio|ESI|esther|eStyle|Ezooms|facebookexternalhit|facebook|facebot|fastcrawler|FatBot|FDSE|FELIX IDE|fetch|fido|find|Firefly|fouineur|Freecrawl|froogle|gammaSpider|gazz|gcreep|geona|Getterrobo-Plus|get|girafabot|golem|googlebot|\-google|grabber|GrabNet|griffon|Gromit|gulliver|gulper|hambot|havIndex|hotwired|htdig|HTTrack|ia_archiver|iajabot|IDBot|Informant|InfoSeek|InfoSpiders|INGRID\/0\.1|inktomi|inspectorwww|Internet Cruiser Robot|irobot|Iron33|JBot|jcrawler|Jeeves|jobo|KDD\-Explorer|KIT\-Fireball|ko_yappo_robot|label\-grabber|larbin|legs|libwww-perl|linkedin|Linkidator|linkwalker|Lockon|logo_gif_crawler|Lycos|m2e|majesticsEO|marvin|mattie|mediafox|mediapartners|MerzScope|MindCrawler|MJ12bot|mod_pagespeed|moget|Motor|msnbot|muncher|muninn|MuscatFerret|MwdSearch|NationalDirectory|naverbot|NEC\-MeshExplorer|NetcraftSurveyAgent|NetScoop|NetSeer|newscan\-online|nil|none|Nutch|ObjectsSearch|Occam|openstat\.ru\/Bot|packrat|pageboy|ParaSite|patric|pegasus|perlcrawler|phpdig|piltdownman|Pimptrain|pingdom|pinterest|pjspider|PlumtreeWebAccessor|PortalBSpider|psbot|rambler|Raven|RHCS|RixBot|roadrunner|Robbie|robi|RoboCrawl|robofox|Scooter|Scrubby|Search\-AU|searchprocess|search|SemrushBot|Senrigan|seznambot|Shagseeker|sharp\-info\-agent|sift|SimBot|Site Valet|SiteSucker|skymob|SLCrawler\/2\.0|slurp|snooper|solbot|speedy|spider_monkey|SpiderBot\/1\.0|spiderline|spider|suke|tach_bw|TechBOT|TechnoratiSnoop|templeton|teoma|titin|topiclink|twitterbot|twitter|UdmSearch|Ukonline|UnwindFetchor|URL_Spider_SQL|urlck|urlresolver|Valkyrie libwww\-perl|verticrawl|Victoria|void\-bot|Voyager|VWbot_K|wapspider|WebBandit\/1\.0|webcatcher|WebCopier|WebFindBot|WebLeacher|WebMechanic|WebMoose|webquest|webreaper|webspider|webs|WebWalker|WebZip|wget|whowhere|winona|wlm|WOLP|woriobot|WWWC|XGET|xing|yahoo|YandexBot|YandexMobileBot|yandex|yeti|Zeus/i', $_SERVER['HTTP_USER_AGENT'] ?? '')
        ) {
            return true; // 'Above given bots detected'
        }
    
        return false;
    
    } // End :: isBotDetected()
    
  • 2020-12-24 15:17

    We have a similar use case, and one option we've recently found quite helpful is the UASParser class from user-agent-string.info.

    It's a PHP class which pulls the latest set of user agent string definitions and caches them locally. The class can be configured to pull the definitions as often or as rarely as you deem fit. Automatically fetching them like this means that you don't have to keep on top of the various changes to bot user agents or new ones coming on the market, although you are relying on UAS.info to do this accurately.

    When the class is called, it parses the current visitor's user agent and returns an associative array breaking out the constituent parts, e.g.

    Array
    (
        [typ] => browser
        [ua_family] => Firefox
        [ua_name] => Firefox 3.0.8
        [ua_url] => http://www.mozilla.org/products/firefox/
        [ua_company] => Mozilla Foundation
        ........
        [os_company] => Microsoft Corporation.
        [os_company_url] => http://www.microsoft.com/
        [os_icon] => windowsxp.png
    )
    

    The field typ is set to browser when the UA is identified as likely belonging to a human visitor, in which case you can update your stats.

    A couple of caveats:

    • You're relying on UAS.info to keep the user agent strings accurate and up to date.
    • Bots like Google and Yahoo declare themselves in their user agent strings, but this method will still count visits from bots that pretend to be human visitors by sending spoofed UAs.
    • As @amdfan mentioned above, blocking bots via robots.txt should stop most of them from reaching your page. If you need the content to be indexed but don't want the stats incremented, robots.txt won't be a realistic option.
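
The check on the `typ` field described above could look roughly like this. The sketch does not call the parser itself, since its exact API is not shown in the answer; it only assumes an array shaped like the example output. The function name `shouldCountVisit` and the sample arrays are hypothetical.

```php
<?php
/**
 * Decides whether a visit should be counted, given the array the parser
 * returned. Only the 'typ' key from the example output is used.
 */
function shouldCountVisit(array $parsedAgent): bool
{
    // Anything not explicitly classified as a browser (robots, unknown
    // agents, a missing 'typ' key) is left out of the statistics.
    return ($parsedAgent['typ'] ?? null) === 'browser';
}

// Sample data shaped like the example output; in real use the array would
// come from the parser. The non-browser value below is illustrative only.
$human = ['typ' => 'browser', 'ua_family' => 'Firefox', 'ua_name' => 'Firefox 3.0.8'];
$other = ['typ' => 'Robot', 'ua_family' => 'Googlebot'];

if (shouldCountVisit($human)) {
    // Update the stats here.
}
```

Treating every non-browser value as "do not count" is a design choice: it errs on the side of under-counting when the parser cannot classify an agent.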
  • 2020-12-24 15:17

    Check the user agent before incrementing the page view count, but remember that it can be spoofed. PHP exposes the user agent in $_SERVER['HTTP_USER_AGENT'], provided the client sent a User-Agent header and the web server passed it on. More information about $_SERVER can be found at http://www.php.net/manual/en/reserved.variables.server.php.

    You can find a list of user agents at http://www.user-agents.org; Googling will also provide the names of those belonging to the major providers. A third possible source would be your web server's access logs, if you can aggregate them.
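
For the access-log route mentioned above, a rough sketch of tallying user agents. It assumes the common "combined" log format, in which the user agent is the last double-quoted field on each line. The function name `tallyUserAgents` and the sample lines are made up for illustration, and user agents containing escaped quotes are not handled.

```php
<?php
/**
 * Tallies user agents from access-log lines in the "combined" format.
 *
 * @param iterable<string> $lines
 * @return array<string,int> user agent => request count, most frequent first
 */
function tallyUserAgents(iterable $lines): array
{
    $counts = [];
    foreach ($lines as $line) {
        // Capture the final "..." field at the end of the line.
        if (preg_match('/"([^"]*)"\s*$/', $line, $m)) {
            $counts[$m[1]] = ($counts[$m[1]] ?? 0) + 1;
        }
    }
    arsort($counts); // highest counts first, keys preserved
    return $counts;
}

// Made-up sample lines; in practice, read them from your server's log file.
$sample = [
    '203.0.113.5 - - [24/Dec/2020:14:33:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '203.0.113.5 - - [24/Dec/2020:14:34:10 +0000] "GET /about HTTP/1.1" 200 734 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '198.51.100.7 - - [24/Dec/2020:14:35:42 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (X11; Linux x86_64; rv:84.0) Gecko/20100101 Firefox/84.0"',
];

$tally = tallyUserAgents($sample);
```

Agents that show up with high request counts and no matching browser pattern are candidates for your bot list.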

  • 2020-12-24 15:17

    This function worked for me; I found it at https://www.cult-f.net/detect-crawlers-with-php/:

    <?php
      $crawlers = array(
        'Google'=>'Google',
        'MSN' => 'msnbot',
        'Rambler'=>'Rambler',
        'Yahoo'=> 'Yahoo',
        'AbachoBOT'=> 'AbachoBOT',
        'accoona'=> 'Accoona',
        'AcoiRobot'=> 'AcoiRobot',
        'ASPSeek'=> 'ASPSeek',
        'CrocCrawler'=> 'CrocCrawler',
        'Dumbot'=> 'Dumbot',
        'FAST-WebCrawler'=> 'FAST-WebCrawler',
        'GeonaBot'=> 'GeonaBot',
        'Gigabot'=> 'Gigabot',
        'Lycos spider'=> 'Lycos',
        'MSRBOT'=> 'MSRBOT',
        'Altavista robot'=> 'Scooter',
        'AltaVista robot'=> 'Altavista',
        'ID-Search Bot'=> 'IDBot',
        'eStyle Bot'=> 'eStyle',
        'Scrubby robot'=> 'Scrubby',
        );
     
    function crawlerDetect($USER_AGENT)
    {
        // Tokens to look for, joined into one regex alternation. The string is
        // hard-coded so it is not rebuilt on every call; to derive it from the
        // array instead, use: $crawlers_agents = implode('|', $crawlers);
        $crawlers_agents = 'Google|msnbot|Rambler|Yahoo|AbachoBOT|Accoona|AcoiRobot|ASPSeek|CrocCrawler|Dumbot|FAST-WebCrawler|GeonaBot|Gigabot|Lycos|MSRBOT|Scooter|AltaVista|IDBot|eStyle|Scrubby';
     
        // Search the user agent for any of the tokens, ignoring case.
        if ( !preg_match('/' . $crawlers_agents . '/i', (string) $USER_AGENT, $match) )
           return false; // no known crawler token found
     
        // Crawler detected: return its name from $crawlers, or the matched
        // token itself if the array has no entry for it.
        global $crawlers;
        foreach ((array) $crawlers as $name => $token) {
           if ( strcasecmp($token, $match[0]) === 0 )
              return $name;
        }
        return $match[0];
    }
     
    // example
     
    $crawler = crawlerDetect($_SERVER['HTTP_USER_AGENT'] ?? '');
     
    if ($crawler )
    {
       // it is a crawler; its name is in the $crawler variable
    }
    else
    {
       // usual visitor
    }
    
    
  • 2020-12-24 15:18

    You should filter by user-agent strings. You can find a list of about 300 common user agents used by bots here: http://www.robotstxt.org/db.html. Running through that list and ignoring bot user agents before you run your SQL statement should solve your problem for all practical purposes.

    If you don't want the search engines to even reach the page, use a basic robots.txt file to block them.
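
One way to put "ignore bot user agents before you run your SQL statement" into code, as a sketch only. The table `page_hits(page, hits)`, the function name `recordVisit`, and the short token list are assumptions for illustration; in real use the tokens would come from a list such as the one linked above. The demo uses an in-memory SQLite database through PDO so it runs without a MySQL server, and the same prepared statement should work with a MySQL connection.

```php
<?php
/**
 * Increments the hit counter for a page unless the user agent matches a
 * bot token. Returns true when the visit was counted.
 */
function recordVisit(PDO $pdo, string $page, string $userAgent, array $botTokens): bool
{
    foreach ($botTokens as $token) {
        if ($token !== '' && stripos($userAgent, $token) !== false) {
            return false; // bot: skip the SQL statement entirely
        }
    }
    // Hypothetical schema: page_hits(page, hits).
    $stmt = $pdo->prepare('UPDATE page_hits SET hits = hits + 1 WHERE page = ?');
    $stmt->execute([$page]);
    return true;
}

// Demo setup with an in-memory SQLite database (requires the pdo_sqlite extension).
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE page_hits (page TEXT PRIMARY KEY, hits INTEGER NOT NULL DEFAULT 0)');
$pdo->exec("INSERT INTO page_hits (page, hits) VALUES ('/index.php', 0)");

$botTokens = ['Googlebot', 'bingbot', 'Slurp']; // illustrative subset

$countedBot   = recordVisit($pdo, '/index.php', 'Mozilla/5.0 (compatible; Googlebot/2.1)', $botTokens);
$countedHuman = recordVisit($pdo, '/index.php', 'Mozilla/5.0 (X11; Linux x86_64; rv:84.0) Gecko/20100101 Firefox/84.0', $botTokens);

$hits = (int) $pdo->query("SELECT hits FROM page_hits WHERE page = '/index.php'")->fetchColumn();
```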
