Suppose I am trying to crawl a website and skip any page whose URL ends like so:
http://HIDDENWEBSITE.com/anonimize/index.php?page=press_and_news&subpage=20060117
I am currently using the Anemone gem in Ruby to build the crawler. I am using the skip_links_like method, but my pattern never seems to match. I am trying to make this as generic as possible, so that it doesn't depend on subpage, just on the trailing =2105925 (the digits).
I have tried /=\d+$/
and /\?.*\d+$/
but it doesn't seem to be working.
This is similar to "Skipping web-pages with extension pdf, zip from crawling in Anemone", but I can't make it work with digits instead of extensions.
Also, testing on http://regexpal.com/ with the pattern =\d+$
will successfully match http://misc.com/test/index.php?page=news&subpage=20060118
EDIT:
Here is the entirety of my code. I wonder if anyone can see exactly what's wrong.
require 'anemone'
...
Anemone.crawl(url, :depth_limit => 3, :obey_robots_txt => true) do |anemone|
  anemone.skip_links_like /\?.*\d+$/
  anemone.on_every_page do |page|
    pURL = page.url.to_s
    puts "Now checking: " + pURL
    bestGuess[pURL] = match_freq( manList, page.doc.inner_text )
    puts "Successfully checked"
  end
end
My output is something like this:
...
Now checking: http://MISC.com/about_us/index.php?page=press_and_news&subpage=20110711
Successfully checked
...
Anemone.crawl(url, :depth_limit => 3, :obey_robots_txt => true, :skip_query_strings => true) do |anemone|
  anemone.on_every_page do |page|
    pURL = page.url.to_s
    puts "Now checking: " + pURL
    bestGuess[pURL] = match_freq( manList, page.doc.inner_text )
    puts "Successfully checked"
  end
end
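Note that :skip_query_strings drops every link that carries a query string, which may be broader than what the question asks for. A narrower alternative, sketched here on the assumption that Anemone's focus_crawl block receives a page and returns the links to follow, is to filter on the full URL string. The helper names digit_suffixed? and links_to_follow are hypothetical, and the sample URLs are only for illustration.

```ruby
require 'uri'

# Matches a URL whose very last characters are "=<digits>".
DIGIT_SUFFIX = /=\d+\z/

# Hypothetical helper: true when the full URL ends in "=" followed by digits.
def digit_suffixed?(link)
  !!(link.to_s =~ DIGIT_SUFFIX)
end

# Hypothetical helper: keep only the links that should still be crawled.
def links_to_follow(links)
  links.reject { |link| digit_suffixed?(link) }
end

links = [
  URI('http://misc.com/test/index.php?page=news&subpage=20060118'),
  URI('http://misc.com/test/index.php?page=news'),
  URI('http://misc.com/about_us/')
]

# Only the two links without a trailing "=<digits>" remain.
puts links_to_follow(links).map(&:to_s)
```

Inside the crawl block this could be wired up roughly as anemone.focus_crawl { |page| links_to_follow(page.links) }, which keeps other query-string pages crawlable.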
Actually the /\?.*\d+$/
works:
~> irb
> all systems are go wirble/hirb/ap/show <
ruby-1.9.2-p180 :001 > "http://hiddenwebsite.com/anonimize/index.php?page=press_and_news&subpage=20060117".match /\?.*\d+$/
=> #<MatchData "?page=press_and_news&subpage=20060117">
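So the pattern itself is fine when applied to the full URL string. A likely explanation for the mismatch inside the crawler, as far as I can tell from the Anemone source, is that skip_links_like patterns are tested against the link's path only, which excludes the query string. Treat that as an assumption to verify against your installed version. The sketch below uses only Ruby's standard URI library to show the difference between the two strings.

```ruby
require 'uri'

url = URI('http://hiddenwebsite.com/anonimize/index.php?page=press_and_news&subpage=20060117')
pattern = /\?.*\d+$/

# The full URL string includes the query, so the pattern matches.
full_match = !!(url.to_s =~ pattern)

# URI#path stops before the "?", so the same pattern cannot match it.
path_match = !!(url.path =~ pattern)

# The query string is exposed separately, without the leading "?".
query_match = !!(url.query =~ /=\d+\z/)

puts "path:           #{url.path}"
puts "full URL match: #{full_match}"
puts "path match:     #{path_match}"
puts "query match:    #{query_match}"
```

If the path-only assumption holds, any pattern that relies on the "?" or on the query parameters will never fire in skip_links_like, which would be consistent with the output shown in the question.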
Source: https://stackoverflow.com/questions/8349599/rubyanemone-web-crawler-regex-to-match-urls-ending-in-a-series-of-digits