Ruby Regex to capture everything between two strings (inclusive)

Submitted by 为君一笑 on 2019-12-02 01:45:09

I believe you're looking for a non-greedy regex, like this:

/<div class="the_class">(.*?)<\/div>/m

Note the added ?. Now the capturing group will capture as little as possible (non-greedy), instead of as much as possible (greedy).
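To see the difference, it can help to run both forms against a string that contains two matching blocks. The following is only a sketch; the sample string and variable names are made up for illustration:

```ruby
# Hypothetical sample input with two sibling blocks.
html = '<div class="the_class">one</div><div class="the_class">two</div>'

greedy     = /<div class="the_class">(.*)<\/div>/m
non_greedy = /<div class="the_class">(.*?)<\/div>/m

# Greedy: .* runs to the LAST </div>, swallowing both blocks.
greedy_capture = html[greedy, 1]

# Non-greedy: .*? stops at the FIRST </div>.
non_greedy_capture = html[non_greedy, 1]

# With no group index, String#[] returns the whole match,
# i.e. the text between the two strings with the delimiters included.
inclusive_match = html[non_greedy]

puts greedy_capture      # one</div><div class="the_class">two
puts non_greedy_capture  # one
puts inclusive_match     # <div class="the_class">one</div>
```

Use the whole match when you want the start and end strings included, and capture group 1 when you only want what lies between them.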

Because it adds another dependency and slows my work down. Makes things more complicated. Plus, this solution is applicable to more than just HTML tags. My start and end strings can be anything.

I used to think the same way until I got a job writing spiders and web-site analytics, and then a big RSS-aggregation system. A parser was the only way out of that madness. Without it the work would never have been finished.

Yes, regexes are good and useful, but there are dragons waiting for you. For instance, this common string will cause problems:

'<div class="the_class"><div class="inner_div">foo</div></div>'

The regex /<div class="the_class">(.*?)<\/div>/m stops at the first </div>, so the match loses the outer closing tag:

"<div class=\"the_class\"><div class=\"inner_div\">foo</div>"

This malformed, but renderable HTML:

<div class="the_class"><div class="inner_div">foo

is even worse:

'<div class="the_class"><div class="inner_div">foo'[/<div class="the_class">(.*?)<\/div>/m]
=> nil

Whereas, a parser can deal with both:

require 'nokogiri'
[
  '<div class="the_class"><div class="inner_div">foo</div></div>',
  '<div class="the_class"><div class="inner_div">foo'
].each do |html|
  doc = Nokogiri.HTML(html)
  puts doc.at('div.the_class').text
end

Outputs:

foo
foo

Yes, your start and end strings could be anything, but there are well-recognized tools for parsing HTML/XML, and as your task grows the weaknesses in using regex will become more apparent.
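If the delimiters really are arbitrary plain strings rather than markup, a regex can still be a reasonable fit. Here is a minimal sketch, assuming a hypothetical helper named between_inclusive and made-up delimiters; like the regex above, it knows nothing about nesting:

```ruby
# Hypothetical helper: returns the first span running from start_str
# through end_str, delimiters included, or nil if no such span exists.
def between_inclusive(text, start_str, end_str)
  # Regexp.escape neutralizes regex metacharacters ([, ], ., * and so on)
  # so the delimiters are matched literally.
  pattern = /#{Regexp.escape(start_str)}.*?#{Regexp.escape(end_str)}/m
  text[pattern]
end

sample = "noise [[BEGIN]] line one\nline two [[END]] trailing [[END]]"

result  = between_inclusive(sample, '[[BEGIN]]', '[[END]]')
missing = between_inclusive(sample, '<<', '>>')

puts result.inspect   # "[[BEGIN]] line one\nline two [[END]]"
puts missing.inspect  # nil
```

The /m flag lets . match newlines, so the span may cross lines, and the non-greedy .*? stops at the first end string rather than the last.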

And, yes, it's possible to have a parser fail. I've had to process RSS feeds that were so badly malformed the parser blew up, but a bit of pre-processing fixed the problem.
