How to extract URLs from text

有刺的猬 2020-12-03 02:56

How do I extract all URLs from a plain text file in Ruby?

I tried some libraries, but they fail in some cases. What's the best way?

6 answers
  • 2020-12-03 03:30

    If your input looks similar to this:

    "http://i.imgur.com/c31IkbM.gifv;http://i.imgur.com/c31IkbM.gifvhttp://i.imgur.com/c31IkbM.gifv"
    

    That is, if the URLs are not necessarily surrounded by whitespace, may be separated by any delimiter, or may have no delimiter between them at all, you can use the following approach:

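    def process_images(raw_input)
      return [] if raw_input.nil?
      # Split on "http"; each piece loses that prefix, which is restored below.
      urls = raw_input.split('http')
      # Discard whatever precedes the first URL (an empty string if the input starts with one).
      urls.shift
      # Re-add the prefix and cut each URL at the first whitespace, comma, or semicolon.
      urls.map { |url| "http#{url}".strip.split(/[\s,;]/)[0] }
    end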
    

    Hope it helps!
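
    A regex-based variant of the same idea, as a minimal sketch: it scans for http:// or https:// and stops each match at whitespace, a comma, a semicolon, or the start of the next URL. The method name extract_packed_urls and the delimiter set are assumptions for illustration, not part of the answer above.

    ```ruby
    # Hypothetical helper: the name and the delimiter set are assumptions.
    def extract_packed_urls(raw_input)
      return [] if raw_input.nil?
      # Lazily extend each match until a delimiter, the next "http(s)://", or end of input.
      raw_input.scan(%r{https?://[^\s,;]+?(?=https?://|[\s,;]|\z)})
    end

    input = "http://i.imgur.com/c31IkbM.gifv;http://i.imgur.com/c31IkbM.gifvhttp://i.imgur.com/c31IkbM.gifv"
    p extract_packed_urls(input)
    ```

    Because it splits only on a full "http://" or "https://", a URL that merely contains the letters "http" in its path stays in one piece.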

  • 2020-12-03 03:34

    You can use a regex with .scan():

    string.scan(/(https?:\/\/([-\w\.]+)+(:\d+)?(\/([\w\/_\.]*(\?\S+)?)?)?)/)
    

    You can get started with that regex and adjust it according to your needs.
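
    One thing to watch with String#scan: when the pattern contains capture groups, it returns an array of group arrays rather than the whole matches. A minimal sketch of two ways around that, using a made-up sample string; the non-capturing rewrite is a simplification of the regex above, offered as an assumption rather than a definitive pattern:

    ```ruby
    text = "Docs at http://example.com/a/b.html?x=1 and https://example.org:8080/path end"

    # The regex from above: each scan result is an array of captures; the first is the full URL.
    with_groups = /(https?:\/\/([-\w\.]+)+(:\d+)?(\/([\w\/_\.]*(\?\S+)?)?)?)/
    urls_a = text.scan(with_groups).map(&:first)

    # Same pattern with non-capturing groups, so scan returns the matched strings directly.
    without_groups = %r{https?://[-\w.]+(?::\d+)?(?:/[\w/.]*(?:\?\S+)?)?}
    urls_b = text.scan(without_groups)

    p urls_a
    p urls_b
    ```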

  • 2020-12-03 03:40
    require 'uri'
    # foo is a URI::HTTP object, e.g. as returned by a parser or scraping library
    foo = URI.parse("http://foobar/00u0u_gKHnmtWe0Jk_600x450.jpg")
    foo.to_s
    # => "http://foobar/00u0u_gKHnmtWe0Jk_600x450.jpg"
    

    Edit (explanation):

    For those having problems parsing URIs from JSON responses or with a scraping tool like Nokogiri or Mechanize, this solution worked for me.

  • 2020-12-03 03:43

    If you like using what's already provided for you in Ruby:

    require "uri"
    URI.extract("text here http://foo.example.org/bla and here mailto:test@example.com and here also.")
    # => ["http://foo.example.org/bla", "mailto:test@example.com"]
    

    Read more: http://railsapi.com/doc/ruby-v1.8/classes/URI.html#M004495
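
    URI.extract also takes an optional list of schemes, which helps when only web links are wanted. A minimal sketch, reusing the sample string from above; note that recent Ruby versions mark URI.extract as obsolete (it warns in verbose mode), so treat this as a convenience rather than a long-term API:

    ```ruby
    require "uri"

    text = "text here http://foo.example.org/bla and here mailto:test@example.com and here also."
    # Restrict extraction to http and https; the mailto: address is skipped.
    web_urls = URI.extract(text, %w[http https])
    p web_urls
    ```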

  • 2020-12-03 03:51

    What cases are failing?

    According to the library regexpert, you can use

    regexp = /(^$)|(^(http|https):\/\/[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(([0-9]{1,5})?\/.*)?$)/ix
    

    and then perform a scan on the text.

    EDIT: It seems the regexp also matches the empty string. Just remove the initial (^$) alternative and you're done.
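
    Because the pattern is anchored with ^ and $ (which match at line boundaries in Ruby), scanning with it finds only lines that consist entirely of a URL, not URLs embedded in a sentence. A minimal sketch with a made-up sample text; the (^$) alternative is dropped, and the groups are made non-capturing (an assumption on my part) so that scan returns strings:

    ```ruby
    url_line = /^(?:http|https):\/\/[a-z0-9]+(?:[\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(?:(?:[0-9]{1,5})?\/.*)?$/ix

    text = "http://example.com/a\nnot a url\nsee http://example.org here\nhttps://foo.example.net"
    # Only the first and last lines are URLs on their own; the embedded one is not matched.
    matches = text.scan(url_line)
    p matches
    ```

    If the URLs can appear mid-sentence, the anchors have to go as well, and the trailing \/.* then needs a tighter character class so it does not swallow the rest of the line.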

  • 2020-12-03 03:51

    I've used the twitter-text gem:

    require "twitter-text"
    class UrlParser
      include Twitter::Extractor
    end
    
    urls = UrlParser.new.extract_urls("http://stackoverflow.com")
    puts urls.inspect
    