Ruby Regex to capture everything between two strings (inclusive)


情话喂你 2021-01-07 04:05

I'm trying to sanitize some HTML and just remove a single tag (and I'd really like to avoid using nokogiri, etc). So I've got the following string appearing I want to get

2 Answers

花落未央 (original poster)

2021-01-07 04:43
> Because it adds another dependency and slows my work down. Makes things more complicated. Plus, this solution is applicable to more than just HTML tags. My start and end strings can be anything.

I used to think the same way until I got a job writing spiders and web-site analytics, and then a big RSS-aggregation system. A parser was the only way out of that madness; without it the work would never have been finished.

Yes, regexes are good and useful, but there are dragons waiting for you. For instance, this common string will cause problems:
```
'foo'
```
The regex /(.*?)<\/div>/m will return:
```
"foo"
```
This malformed, but renderable HTML:
```
foo
```
is even worse:
```
'foo'[/(.*?)<\/div>/m]
=> nil
```
Whereas, a parser can deal with both:
```
require 'nokogiri'
[
  'foo',
  'foo'
].each do |html|
  doc = Nokogiri.HTML(html)
  puts doc.at('div.the_class').text
end
```
Outputs:
```
foo
foo
```
Yes, your start and end strings could be anything, but there are well-recognized tools for parsing HTML/XML, and as your task grows the weaknesses in using regex will become more apparent.
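For the case where the delimiters really are arbitrary, non-nesting strings, something along these lines may be enough. This is a sketch under assumptions: `between_inclusive` and `strip_between` are hypothetical helper names, not part of any library, and the nesting caveat above still applies.

```ruby
# Hypothetical helper: returns the first span of +text+ that begins with
# +start_str+ and ends with the nearest following +end_str+, delimiters
# included, or nil when no such span exists.
def between_inclusive(text, start_str, end_str)
  # Regexp.escape neutralises regex metacharacters in the delimiters;
  # the /m flag lets "." match newlines as well.
  pattern = /#{Regexp.escape(start_str)}.*?#{Regexp.escape(end_str)}/m
  text[pattern]
end

# Hypothetical helper: returns +text+ with every such span removed.
def strip_between(text, start_str, end_str)
  pattern = /#{Regexp.escape(start_str)}.*?#{Regexp.escape(end_str)}/m
  text.gsub(pattern, '')
end

sample = "keep [[drop\nthis]] keep"
p between_inclusive(sample, '[[', ']]')  # => "[[drop\nthis]]"
p strip_between(sample, '[[', ']]')      # => "keep  keep"
```

Escaping the delimiters matters because strings such as `[[` or `(` would otherwise be read as regex syntax instead of literal text.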

And, yes, it's possible to have a parser fail. I've had to process RSS feeds that were so badly malformed the parser blew up, but a bit of pre-processing fixed the problem.