How do I ignore file types in a web crawler?

隐瞒了意图╮ 2021-01-17 05:22

I'm writing a web crawler and want to ignore URLs which link to binary files:

$exclude = %w(flv swf png jpg gif asx zip rar tar 7z gz jar js css dtd xsd ico

3 Answers
  •  终归单人心
    2021-01-17 06:00

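    Use URI#path, which returns only the path component of the URL, so a query string cannot hide or mimic the file extension: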

    unless URI.parse(url).path =~ /\.(\w+)$/ && $exclude.include?($1)
      puts "downloading #{url}..."
    end
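
A minimal sketch of the same idea wrapped in a reusable predicate. The method name `skip_url?`, the constant `EXCLUDED_EXTENSIONS`, and the shortened extension list are assumptions made for illustration and are not from the question. It compares extensions case-insensitively and, as a further assumption, skips URLs that fail to parse.

```ruby
require 'uri'

# Hypothetical, shortened list for illustration; substitute your own $exclude list.
EXCLUDED_EXTENSIONS = %w[flv swf png jpg gif zip rar tar gz ico].freeze

# Returns true when the URL's path ends in one of the excluded extensions.
def skip_url?(url, excluded = EXCLUDED_EXTENSIONS)
  # URI#path can be nil for opaque URIs such as mailto:, hence to_s.
  path = URI.parse(url).path.to_s
  # File.extname returns e.g. ".PNG"; normalize it to "png" before the lookup.
  ext = File.extname(path).delete_prefix('.').downcase
  excluded.include?(ext)
rescue URI::InvalidURIError
  true # assumption: an unparseable URL is not worth downloading
end

%w[
  http://example.com/index.html
  http://example.com/logo.PNG?size=large
  http://example.com/page.html?file=a.zip
].each do |url|
  puts "downloading #{url}..." unless skip_url?(url)
end
```

Because only the path is inspected, `page.html?file=a.zip` is still downloaded, while `logo.PNG?size=large` is skipped despite the trailing query string.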
    
