anemone

Skipping web-pages with extension pdf, zip from crawling in Anemone

点点圈 submitted on 2019-12-12 17:18:53
Question: I am developing a crawler using the anemone gem (Ruby 1.8.7 and Rails 3.1.1). How should I skip web pages with extensions such as pdf, doc, zip, etc. when crawling/downloading?

Answer 1:

    ext = %w(flv swf png jpg gif asx zip rar tar 7z gz jar js css dtd xsd ico raw mp3 mp4 wav wmv ape aac ac3 wma aiff mpg mpeg avi mov ogg mkv mka asx asf mp2 m1v m3u f4v pdf doc xls ppt pps bin exe rss xml)
    Anemone.crawl(url) do |anemone|
      # Group the alternatives so the leading dot and the end anchor apply to every extension.
      anemone.skip_links_like(/\.(#{ext.join('|')})$/)
      ...
    end

Source: https://stackoverflow.com
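One detail worth checking in any pattern built by joining extensions with `|`: alternation has the lowest precedence in a regex, so without a group the leading `\.` binds only to the first extension and the trailing `$` only to the last. The sketch below compares both forms on plain path strings, without running a crawl. The shortened extension list, the sample paths, and the `skipped?` helper are assumptions made for illustration.

```ruby
# Hypothetical, shortened extension list for illustration.
EXTENSIONS = %w(pdf doc zip png)

# Without a group the pattern reads: "\.pdf" OR "doc" OR "zip" OR "png$".
UNGROUPED = /\.#{EXTENSIONS.join('|')}$/

# With a group, the dot and the end anchor apply to every extension.
GROUPED = /\.(#{EXTENSIONS.join('|')})$/i

# Hypothetical helper: true when the pattern matches anywhere in the path.
def skipped?(path, pattern)
  !(path =~ pattern).nil?
end

puts skipped?('/docs/index.html', UNGROUPED)   # "doc" matches inside "/docs"
puts skipped?('/docs/index.html', GROUPED)     # no listed extension at the end
puts skipped?('/files/report.pdf', GROUPED)    # ends in ".pdf"
```

With the ungrouped form, an ordinary HTML page under `/docs/` would be skipped by accident, which is easy to miss because the intended files are still skipped as well.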

Ruby scraper. How to export to CSV?

。_饼干妹妹 submitted on 2019-12-11 05:25:43
Question: I wrote this Ruby script to scrape product info from the manufacturer's website. Scraping and storing the product objects in an array works, but I can't figure out how to export the array data to a CSV file. This error is thrown:

    scraper.rb:45: undefined method `send_data' for main:Object (NoMethodError)

I do not understand this piece of code. What is it doing, and why isn't it working?

    send_data csv_data, :type => 'text/csv; charset=iso-8859-1; header=present', :disposition
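For context, `send_data` is a Rails controller method that streams data back to a browser, so it is not defined in a standalone script, which is consistent with the `NoMethodError` on `main:Object`. A minimal sketch of writing the array straight to a file with Ruby's standard `csv` library follows. It assumes Ruby 1.9 or later; the `Product` struct, its fields, the `export_products` helper, and the output file name are all hypothetical, since the original script's classes are not shown.

```ruby
require 'csv'
require 'tmpdir'

# Hypothetical product record; the original script's class is not shown.
Product = Struct.new(:name, :sku, :price)

# Write one header row, then one row per product.
def export_products(products, path)
  CSV.open(path, 'w') do |csv|
    csv << %w(name sku price)
    products.each do |product|
      csv << [product.name, product.sku, product.price]
    end
  end
end

products = [
  Product.new('Widget', 'W-1', 9.99),
  Product.new('Gadget, large', 'G-2', 19.5)
]

# Hypothetical output location, placed in the system temp directory.
out_path = File.join(Dir.tmpdir, 'products.csv')
export_products(products, out_path)
puts File.read(out_path)
```

`CSV` takes care of quoting, so a value containing a comma (such as the second product name) is wrapped in double quotes instead of breaking the column layout.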

Ruby+Anemone Web Crawler: regex to match URLs ending in a series of digits

一笑奈何 submitted on 2019-12-06 07:15:34
Question: Suppose I am trying to crawl a website and skip a page whose URL ends like this: http://HIDDENWEBSITE.com/anonimize/index.php?page=press_and_news&subpage=20060117 I am currently using the Anemone gem in Ruby to build the crawler. I am using the skip_links_like method, but my pattern never seems to match. I am trying to make this as generic as possible, so that it does not depend on subpage but only on =2105925 (the digits). I have tried /=\d+$/ and /\?.*\d+$/, but neither seems to work. This similar to

Ruby+Anemone Web Crawler: regex to match URLs ending in a series of digits

纵然是瞬间 submitted on 2019-12-04 13:28:28
Suppose I am trying to crawl a website and skip a page whose URL ends like this: http://HIDDENWEBSITE.com/anonimize/index.php?page=press_and_news&subpage=20060117 I am currently using the Anemone gem in Ruby to build the crawler. I am using the skip_links_like method, but my pattern never seems to match. I am trying to make this as generic as possible, so that it does not depend on subpage but only on =2105925 (the digits). I have tried /=\d+$/ and /\?.*\d+$/, but neither seems to work. This is similar to "Skipping web-pages with extension pdf, zip from crawling in Anemone", but I can't make it work with digits.
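A possible explanation, stated as an assumption rather than a confirmed diagnosis: as far as I can tell from Anemone's source, skip_links_like tests each pattern against the link's path only, and the digits in this URL live in the query string, which is not part of the path. The sketch below uses only the standard `uri` library to show the difference and to test the pattern against the full URL instead. The `ends_in_digits?` helper is hypothetical.

```ruby
require 'uri'

url = 'http://HIDDENWEBSITE.com/anonimize/index.php?page=press_and_news&subpage=20060117'
uri = URI.parse(url)

puts uri.path    # the part before "?"
puts uri.query   # the part after "?", where the digits are

# Matches any URL whose final component is "=" followed only by digits.
ENDS_IN_DIGITS = /=\d+\z/

# Hypothetical helper: test the pattern against the full URL, not just the path.
def ends_in_digits?(link)
  !(link.to_s =~ ENDS_IN_DIGITS).nil?
end

puts ends_in_digits?(uri)        # full URL, query string included
puts ends_in_digits?(uri.path)   # path alone has no digits to match

# Inside a crawl, the same check could be applied to full URLs with Anemone's
# focus_crawl block, which returns the links to follow (untested sketch):
#   anemone.focus_crawl do |page|
#     page.links.reject { |link| ends_in_digits?(link) }
#   end
```

If the path-only assumption holds, the patterns /=\d+$/ and /\?.*\d+$/ are fine as regexes; they simply never see the text they are meant to match.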