PHP/MySQL - an array filter for bots

隐瞒了意图╮ 2020-12-17 07:19

I'm making a hit counter. I have a database in which I store the IP and $_SERVER['HTTP_USER_AGENT'] of each visitor. Now I need to add a filter, so I can put awa

4 answers
  • 2020-12-17 07:54

    Some systems, such as CubeCart and, earlier, osCommerce, try to maintain a semi-current database of known bot strings. They use it to provide a boolean function that distinguishes users from bots in real time by comparing the user agent string against the entries in a file called spiders.txt. Once a bot is detected, they disable the shopping basket, login functionality, and so on.

    Here are the latest spiders.txt contents:

    abacho abcdatos abcsearch acoon adsarobot aesop ah-ha alkalinebot almaden altavista antibot anzwerscrawl aol search appie arachnoidea araneo architext ariadne arianna ask jeeves aspseek asterias astraspider atomz augurfind backrub baiduspider bannana_bot bbot bdcindexer blindekuh boitho boito borg-bot bsdseek christcrawler computer_and_automation_research_institute_crawler coolbot cosmos crawler crawler@fast crawlerboy cruiser cusco cyveillance deepindex denmex dittospyder docomo dogpile dtsearch elfinbot entire web esismartspider exalead excite ezresult fast fast-webcrawler fdse felix fido findwhat finnish firefly firstgov fluffy freecrawl frooglebot galaxy gaisbot geckobot gencrawler geobot gigabot girafa goclick goliat googlebot griffon gromit grub-client gulliver gulper henrythemiragorobot hometown hotbot htdig hubater ia_archiver ibm_planetwide iitrovatore-setaccio incywincy incrawler indy infonavirobot infoseek ingrid inspectorwww intelliseek internetseer ip3000.com-crawler iron33 jcrawler jeeves jubii kanoodle kapito kit_fireball kit-fireball ko_yappo_robot kototoi lachesis larbin legs linkwalker lnspiderguy look.com lycos mantraagent markwatch maxbot mercator merzscope meshexplorer metacrawler mirago mnogosearch moget motor muscatferret nameprotect nationaldirectory naverrobot nazilla ncsa beta netnose netresearchserver ng/1.0 northerlights npbot nttdirectory_robot nutchorg nzexplorer odp openbot openfind osis-project overture perlcrawler phpdig pjspide polybot pompos poppi portalb psbot quepasacreep rabot raven rhcs robi robocrawl robozilla roverbot scooter scrubby search.ch search.com.ua searchfeed searchspider searchuk seventwentyfour sidewinder sightquestbot skymob sleek slider_search slurp solbot speedfind speedy spida spider_monkey spiderku stackrambler steeler suchbot suchknecht.at-robot suntek szukacz surferf3 surfnomore surveybot suzuran synobot tarantula teomaagent teradex t-h-u-n-d-e-r-s-t-o-n-e tigersuche topiclink toutatis tracerlock 
turnitinbot tutorgig uaportal uasearch.kiev.ua uksearcher ultraseek unitek vagabondo verygoodsearch vivisimo voilabot voyager vscooter w3index w3c_validator wapspider wdg_validator webcrawler webmasterresourcesdirectory webmoose websearchbench webspinne whatuseek whizbanglab winona wire wotbox wscbot www.webwombat.com.au xenu link sleuth xyro yahoobot yahoo! slurp yandex yellopet-spider zao/0 zealbot zippy zyborg

    As long as you don't use this kind of detection for cloaking, you're OK.
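
    A check of this kind might look roughly like the sketch below. The helper names parseSpiders() and isSpider() are made up for illustration, and the one-signature-per-line file layout is an assumption; this is not the actual CubeCart or osCommerce code. Keeping one signature per line lets multi-word entries such as "ask jeeves" survive as a single string.

```php
<?php
// Hypothetical helper: turn signature text (one entry per line) into a list.
function parseSpiders(string $contents): array
{
    // Split on any kind of line break, after lowercasing the whole text.
    $lines = preg_split('/\R/', strtolower($contents));
    // Trim each line and drop the empty ones.
    return array_values(array_filter(array_map('trim', $lines), 'strlen'));
}

// Hypothetical helper: true if any signature occurs in the user agent.
function isSpider(string $userAgent, array $spiders): bool
{
    $userAgent = strtolower($userAgent);
    foreach ($spiders as $spider) {
        // strpos() returns an int offset (possibly 0) or false, so compare with !== false.
        if (strpos($userAgent, $spider) !== false) {
            return true;
        }
    }
    return false;
}

// In real use the text would come from something like file_get_contents('spiders.txt').
$spiders = parseSpiders("googlebot\nask jeeves\nyahoo! slurp\n");

var_dump(isSpider('Mozilla/5.0 (compatible; Googlebot/2.1)', $spiders));
var_dump(isSpider('Mozilla/5.0 (Windows NT 10.0) Firefox/84.0', $spiders));
```

    The first call reports a match and the second does not; the two user agent strings are shortened samples, not complete real-world headers.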

  • 2020-12-17 07:56

    Using Dimitar Christoff's list, I ended up with this script:

    function isBot($user_agent){
        // Lowercase substrings that identify known bots. Multi-word entries from
        // spiders.txt (e.g. 'ask jeeves') are kept whole, so that fragments such
        // as 'web' or 'search' do not match ordinary browsers.
        $bots = array('bingbot', 'msn', 'abacho', 'abcdatos', 'abcsearch', 'acoon', 'adsarobot', 'aesop', 'ah-ha',
                'alkalinebot', 'almaden', 'altavista', 'antibot', 'anzwerscrawl', 'aol search', 'appie', 'arachnoidea',
                'araneo', 'architext', 'ariadne', 'arianna', 'ask jeeves', 'aspseek', 'asterias', 'astraspider', 'atomz',
                'augurfind', 'backrub', 'baiduspider', 'bannana_bot', 'bbot', 'bdcindexer', 'blindekuh', 'boitho', 'boito',
                'borg-bot', 'bsdseek', 'christcrawler', 'computer_and_automation_research_institute_crawler', 'coolbot',
                'cosmos', 'crawler', 'crawler@fast', 'crawlerboy', 'cruiser', 'cusco', 'cyveillance', 'deepindex', 'denmex',
                'dittospyder', 'docomo', 'dogpile', 'dtsearch', 'elfinbot', 'entire web', 'esismartspider', 'exalead',
                'excite', 'ezresult', 'fast', 'fast-webcrawler', 'fdse', 'felix', 'fido', 'findwhat', 'finnish', 'firefly',
                'firstgov', 'fluffy', 'freecrawl', 'frooglebot', 'galaxy', 'gaisbot', 'geckobot', 'gencrawler', 'geobot',
                'gigabot', 'girafa', 'goclick', 'goliat', 'googlebot', 'griffon', 'gromit', 'grub-client', 'gulliver',
                'gulper', 'henrythemiragorobot', 'hometown', 'hotbot', 'htdig', 'hubater', 'ia_archiver', 'ibm_planetwide',
                'iitrovatore-setaccio', 'incywincy', 'incrawler', 'indy', 'infonavirobot', 'infoseek', 'ingrid', 'inspectorwww',
                'intelliseek', 'internetseer', 'ip3000.com-crawler', 'iron33', 'jcrawler', 'jeeves', 'jubii', 'kanoodle',
                'kapito', 'kit_fireball', 'kit-fireball', 'ko_yappo_robot', 'kototoi', 'lachesis', 'larbin', 'legs',
                'linkwalker', 'lnspiderguy', 'look.com', 'lycos', 'mantraagent', 'markwatch', 'maxbot', 'mercator', 'merzscope',
                'meshexplorer', 'metacrawler', 'mirago', 'mnogosearch', 'moget', 'motor', 'muscatferret', 'nameprotect',
                'nationaldirectory', 'naverrobot', 'nazilla', 'ncsa beta', 'netnose', 'netresearchserver', 'ng/1.0',
                'northerlights', 'npbot', 'nttdirectory_robot', 'nutchorg', 'nzexplorer', 'odp', 'openbot', 'openfind',
                'osis-project', 'overture', 'perlcrawler', 'phpdig', 'pjspide', 'polybot', 'pompos', 'poppi', 'portalb',
                'psbot', 'quepasacreep', 'rabot', 'raven', 'rhcs', 'robi', 'robocrawl', 'robozilla', 'roverbot', 'scooter',
                'scrubby', 'search.ch', 'search.com.ua', 'searchfeed', 'searchspider', 'searchuk', 'seventwentyfour',
                'sidewinder', 'sightquestbot', 'skymob', 'sleek', 'slider_search', 'slurp', 'solbot', 'speedfind', 'speedy',
                'spida', 'spider_monkey', 'spiderku', 'stackrambler', 'steeler', 'suchbot', 'suchknecht.at-robot', 'suntek',
                'szukacz', 'surferf3', 'surfnomore', 'surveybot', 'suzuran', 'synobot', 'tarantula', 'teomaagent', 'teradex',
                't-h-u-n-d-e-r-s-t-o-n-e', 'tigersuche', 'topiclink', 'toutatis', 'tracerlock', 'turnitinbot', 'tutorgig',
                'uaportal', 'uasearch.kiev.ua', 'uksearcher', 'ultraseek', 'unitek', 'vagabondo', 'verygoodsearch', 'vivisimo',
                'voilabot', 'voyager', 'vscooter', 'w3index', 'w3c_validator', 'wapspider', 'wdg_validator', 'webcrawler',
                'webmasterresourcesdirectory', 'webmoose', 'websearchbench', 'webspinne', 'whatuseek', 'whizbanglab', 'winona',
                'wire', 'wotbox', 'wscbot', 'www.webwombat.com.au', 'xenu link sleuth', 'xyro', 'yahoobot', 'yahoo! slurp',
                'yandex', 'yellopet-spider', 'zao/0', 'zealbot', 'zippy', 'zyborg', 'mediapartners-google'
                    );
        // Lowercase the user agent so the comparison is case-insensitive.
        $user_agent = strtolower($user_agent);
        foreach($bots as $bot){
            // strpos() returns an integer offset (possibly 0) or false, never true,
            // so the result must be compared with !== false.
            if(strpos($user_agent, $bot) !== false){
                return true;
            }
        }
        return false;
    }
    
  • 2020-12-17 08:06

    And why not?

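    $bots = array('google', 'bing', 'yahoo', 'bot', 'crawler', 'baiduspider');
    // Escape each entry (including the "/" delimiter) so it is matched literally.
    $quoted = array_map(function ($bot) { return preg_quote($bot, '/'); }, $bots);
    // A missing User-Agent header is treated as an empty string.
    if (!preg_match('/(' . implode('|', $quoted) . ')/i', $_SERVER['HTTP_USER_AGENT'] ?? '')) {
        // record to db code
    }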
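
    To get a feel for how broad this pattern is, the sketch below runs the same kind of regex against a few sample user agent strings. The function name looksLikeBot() and the sample strings are assumptions for illustration. Note that a short entry like 'bot' also matches device names that merely contain those letters.

```php
<?php
// Hypothetical helper wrapping the regex check from the answer above.
function looksLikeBot(string $userAgent, array $bots): bool
{
    // Escape each entry so characters like "." or "/" are matched literally.
    $quoted = array_map(function ($bot) {
        return preg_quote($bot, '/');
    }, $bots);
    // One case-insensitive alternation over all entries.
    return preg_match('/(' . implode('|', $quoted) . ')/i', $userAgent) === 1;
}

$bots = ['google', 'bing', 'yahoo', 'bot', 'crawler', 'baiduspider'];

// Illustrative sample strings, not an exhaustive or authoritative list.
$crawler = 'Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)';
$browser = 'Mozilla/5.0 (X11; Linux x86_64; rv:84.0) Gecko/20100101 Firefox/84.0';
$phone   = 'Mozilla/5.0 (Linux; Android 9; CUBOT X19)';

var_dump(looksLikeBot($crawler, $bots)); // matches "bing" and "bot"
var_dump(looksLikeBot($browser, $bots)); // no entry occurs in this string
var_dump(looksLikeBot($phone, $bots));   // false positive: "CUBOT" contains "bot"
```

    Whether that trade-off is acceptable depends on the counter: a short list is easy to maintain, but it will undercount some real visitors.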
    
  • 2020-12-17 08:08

    Loop through the array of words with foreach and check if the current word exists in the UA string using strpos():

    foreach ($words as $word) {
        if (strpos($row['user_agent'], $word) !== FALSE) {
            // word exists in string
        }
    }
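
    Since the title asks for an array filter, the same loop can be wrapped in array_filter() to drop bot rows from a result set. This is only a sketch: the shape of $rows (with a user_agent key, as in the snippet above), the contents of $words, and the helper name containsAny() are assumptions. It uses stripos() instead of strpos() so that the match ignores case.

```php
<?php
// Hypothetical helper: true if any of the words occurs in the string (case-insensitive).
function containsAny(string $haystack, array $words): bool
{
    foreach ($words as $word) {
        if (stripos($haystack, $word) !== false) {
            return true;
        }
    }
    return false;
}

// Assumed word list and rows; in practice the rows would come from the database.
$words = ['googlebot', 'bingbot', 'baiduspider'];
$rows = [
    ['ip' => '203.0.113.5', 'user_agent' => 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'],
    ['ip' => '203.0.113.9', 'user_agent' => 'Mozilla/5.0 (X11; Linux x86_64; rv:84.0) Gecko/20100101 Firefox/84.0'],
];

// Keep only the rows whose user agent contains none of the words.
$humans = array_values(array_filter($rows, function (array $row) use ($words): bool {
    return !containsAny($row['user_agent'], $words);
}));

foreach ($humans as $row) {
    echo $row['ip'], "\n";
}
```

    array_values() re-indexes the result, because array_filter() preserves the original keys.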
    