Getting all links of a webpage using Ruby

盖世英雄少女心 · 2021-02-08 06:36

I'm trying to retrieve every external link of a webpage using Ruby. I'm using String.scan with this regex:

/href=\"https?:[^\"]*|href=\'https?:[^\         


        
5 Answers
  •  逝去的感伤 · 2021-02-08 07:17

    Using regular expressions is fine for a quick-and-dirty script, but Nokogiri is very simple to use:

    require 'nokogiri'
    require 'open-uri'
    
    fail("Usage: extract_links URL [URL ...]") if ARGV.empty?
    
    ARGV.each do |url|
      # URI.open is the non-deprecated open-uri entry point
      # (plain Kernel#open no longer fetches URLs on Ruby 3.0+).
      doc = Nokogiri::HTML(URI.open(url))
      hrefs = doc.css("a").map do |link|
        if (href = link.attr("href")) && !href.empty?
          # Resolve relative hrefs against the page URL.
          URI.join(url, href)
        end
      end.compact.uniq
      STDOUT.puts(hrefs.join("\n"))
    end
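
    Saved as, say, extract_links.rb (the filename is just illustrative), the script takes one or more page URLs as arguments; because of URI.join, relative hrefs come out as absolute URLs:

    ruby extract_links.rb https://example.com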
    

    If you just want the method, refactor it a bit to fit your needs:

    def get_links(url)
      # URI.open avoids the deprecated Kernel#open URL behaviour.
      Nokogiri::HTML(URI.open(url).read).css("a").map do |link|
        # Keep only absolute http(s) links, i.e. the external ones.
        if (href = link.attr("href")) && href.match(/^https?:/)
          href
        end
      end.compact
    end
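
    A quick usage sketch (example.com is a placeholder):

    require 'nokogiri'
    require 'open-uri'
    
    get_links("https://example.com").each { |link| puts link }

    Note that this variant keeps only hrefs that already start with http or https, so relative links are dropped rather than resolved; use the URI.join approach above if you need those as well.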
    
