How do I ignore file types in a web crawler?

隐瞒了意图╮ 2021-01-17 05:22

I'm writing a web crawler and want to ignore URLs that link to binary files:

$exclude = %w(flv swf png jpg gif asx zip rar tar 7z gz jar js css dtd xsd ico         


        
3 Answers
  • 2021-01-17 05:45

    You can extract the URL's file extension with a regular expression or with split. I've shown the latter here, but beware that it splits the whole URL rather than just the path, so it also matches URLs whose host name ends in an excluded extension, such as http://foo.exe. Then use Array#include? to check for membership:

    @url = URI.parse(url) unless $exclude.include?(url.split('.').last)
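
One way to sidestep the host-name problem is to look only at the path component. The following is a minimal sketch, not the answerer's code: `EXCLUDED_EXTENSIONS`, `path_extension`, and `excluded?` are hypothetical names, the list is a shortened stand-in for the truncated one in the question, and `String#delete_prefix` assumes Ruby 2.5 or later.

```ruby
require 'uri'

# Hypothetical, shortened exclusion list; the list in the question is truncated.
EXCLUDED_EXTENSIONS = %w[flv swf png jpg gif zip].freeze

# Returns the lowercase extension of the URL's path without the leading dot,
# or an empty string when the path has no extension.
def path_extension(url)
  path = URI.parse(url).path.to_s
  File.extname(path).delete_prefix('.').downcase
end

# True when the URL's path ends in one of the excluded extensions.
def excluded?(url)
  EXCLUDED_EXTENSIONS.include?(path_extension(url))
end
```

With this, `excluded?('http://foo.exe')` is false because the path is empty, and a query string such as `?file=a.zip` is ignored because it is not part of the path.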
    
  • 2021-01-17 05:49

    Ruby lacks a really useful module that Perl has, called Regexp::Assemble. Ruby's Regexp.union comes nowhere near it. Here's how to use Regexp::Assemble, and its result:

    use Regexp::Assemble;
    
    my @extensions = sort qw(flv swf png jpg gif asx zip rar tar 7z gz jar js css dtd xsd ico raw mp3 mp4 wav wmv ape aac ac3 wma aiff mpg mpeg avi mov ogg mkv mka asx asf mp2 m1v m3u f4v pdf doc xls ppt pps bin exe rss xml);
    
    my $ra = Regexp::Assemble->new;
    $ra->add(@extensions);
    
    print $ra->re, "\n";
    

    Which outputs:

    (?-xism:(?:m(?:p(?:[234]|e?g)|[1o]v|k[av]|3u)|a(?:s[fx]|iff|ac|c3|pe|vi)|p(?:p[st]|df|ng)|r(?:a[rw]|ss)|w(?:m[av]|av)|x(?:ls|ml|sd)|j(?:ar|pg|s)|d(?:oc|td)|g(?:if|z)|f[4l]v|bin|css|exe|ico|ogg|swf|tar|zip|7z))
    

    Perl supports the s flag and Ruby doesn't, so it has to be removed from ?-xism. We also want to ignore character case, so the i has to move in front of the hyphen, which gives ?i-xm.

    Plug that into a Ruby script as the regular expression:

    REGEX = /(?i-xm:(?:m(?:p(?:[234]|e?g)|[1o]v|k[av]|3u)|a(?:s[fx]|iff|ac|c3|pe|vi)|p(?:p[st]|df|ng)|r(?:a[rw]|ss)|w(?:m[av]|av)|x(?:ls|ml|sd)|j(?:ar|pg|s)|d(?:oc|td)|g(?:if|z)|f[4l]v|bin|css|exe|ico|ogg|swf|tar|zip|7z))/
    
    @url = URI.parse(url)
    
    puts @url.path[REGEX]
    
    uri = URI.parse('http://foo.com/bar.jpg')
    uri.path        # => "/bar.jpg"
    uri.path[REGEX] # => "jpg"
    

    See "Is there an efficient way to perform hundreds of text substitutions in Ruby?" for more about using Regexp::Assemble from Ruby.
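
Note that the pattern above is unanchored, so it matches the letters anywhere in the path (for example the "jpg" in /jpg/index.html). If the compact output of Regexp::Assemble is not needed, a plain alternation built in Ruby may be enough. This is a rough sketch under that assumption: `EXTENSIONS`, `EXTENSION_REGEX`, and `binary_path?` are hypothetical names, the list is shortened, and `match?` assumes Ruby 2.4 or later.

```ruby
require 'uri'

# Hypothetical, shortened list; substitute the full extension list from above.
EXTENSIONS = %w[flv swf png jpg gif zip rar tar 7z gz].freeze

# Regexp.union escapes each string and joins them with "|".
# Only its source is interpolated, because embedding the Regexp object itself
# would carry its own (case-sensitive) flags into the new pattern.
# The result is anchored to a literal dot and the end of the string (\z).
EXTENSION_REGEX = /\.(?:#{Regexp.union(EXTENSIONS).source})\z/i

# True when the URL's path ends in one of the listed extensions.
def binary_path?(url)
  URI.parse(url).path.to_s.match?(EXTENSION_REGEX)
end
```

The alternation is not optimized the way the assembled pattern is, but for a few dozen short extensions that is unlikely to matter.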

  • 2021-01-17 06:00

    Use URI#path:

    unless URI.parse(url).path =~ /\.(\w+)$/ && $exclude.include?($1)
      puts "downloading #{url}..."
    end
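
A crawler will also meet links that URI.parse rejects, and extensions in mixed case. The sketch below wraps the same check in a helper that handles both. It is an illustration only: `EXCLUDE` and `downloadable?` are hypothetical names, the list is shortened, and treating unparseable URLs as not downloadable is an assumption about what the crawler wants.

```ruby
require 'uri'
require 'set'

# Hypothetical, shortened exclusion list stored in a Set for fast lookup.
EXCLUDE = Set.new(%w[flv swf png jpg gif zip]).freeze

# Returns true when the URL parses and its path does not end in an
# excluded extension; returns false for excluded or unparseable URLs.
def downloadable?(url)
  path = URI.parse(url).path.to_s            # path is nil for opaque URIs such as mailto:
  match = path.match(/\.(\w+)\z/)            # capture the trailing extension, if any
  match.nil? || !EXCLUDE.include?(match[1].downcase)
rescue URI::InvalidURIError
  false                                      # assumption: skip links that cannot be parsed
end
```

A call site might then read `puts "downloading #{url}..." if downloadable?(url)`.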
    