How do I ignore file types in a web crawler?

隐瞒了意图╮ 2021-01-17 05:22

I'm writing a web crawler and want to ignore URLs that link to binary files:

$exclude = %w(flv swf png jpg gif asx zip rar tar 7z gz jar js css dtd xsd ico         


        
3 Answers
  •  广开言路
    2021-01-17 05:45

    You can strip off the URL's file extension with a regular expression or with split, then use Array#include? to check for membership. I've shown split here, but beware that it will also match some malformed URLs, such as http://foo.exe, where the text after the last dot is part of the host name rather than a file extension:

    @url = URI.parse(url) unless $exclude.include?(url.split('.').last)
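
If the http://foo.exe case matters, one possible variant is to look only at the path component of the URL instead of the whole string. The sketch below is not from the original answer: the constant EXCLUDED_EXTENSIONS and the helper excluded_url? are hypothetical names, and the list is only a subset of the extensions in the question, which is cut off there. It relies on the standard library calls URI.parse and File.extname.

```ruby
require 'uri'

# Hypothetical subset of the question's extension list (the original is truncated).
EXCLUDED_EXTENSIONS = %w(flv swf png jpg gif zip rar tar gz jar js css ico).freeze

# Returns true when the path of the URL ends in an excluded extension.
# Host names and query strings are ignored; unparseable URLs are not excluded.
def excluded_url?(url, excluded = EXCLUDED_EXTENSIONS)
  path = URI.parse(url).path.to_s                       # path may be nil for opaque URIs
  ext = File.extname(path).delete_prefix('.').downcase  # ".PNG" becomes "png"
  excluded.include?(ext)
rescue URI::InvalidURIError
  false
end

# Usage in the spirit of the one-liner above:
url = 'http://example.com/index.html'
parsed = URI.parse(url) unless excluded_url?(url)
```

With this approach http://foo.exe is not excluded, because its path is empty, and the comparison is case-insensitive. Note that String#delete_prefix requires Ruby 2.5 or later.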
    
