How do I remove white space between HTML nodes?

Submitted by 瘦欲@ on 2019-12-23 17:15:31

Question


I'm trying to remove the whitespace between <p> tags in an HTML fragment:

<p>Foo Bar</p> <p>bar bar bar</p> <p>bla</p>

As you can see, there is always a blank space between a closing </p> tag and the next <p> tag.

The problem is that the blank spaces create <br> tags when saving the string into my database. Methods like strip or gsub only remove the whitespace in the nodes, resulting in:

<p>FooBar</p> <p>barbarbar</p> <p>bla</p>

whereas I'd like to have:

<p>Foo Bar</p><p>bar bar bar</p><p>bla</p>

I'm using:

  • Nokogiri 1.5.6
  • Ruby 1.9.3
  • Rails

UPDATE:

Occasionally the <p> tags have child nodes that cause the same problem: whitespace between the nodes.

Sample Code

Note: the code is normally on one line; I reformatted it because it would be unreadable otherwise.

<p>
  <p>
    <strong>Selling an Appartment</strong>
  </p>
  <ul>
    <li>
      <p>beautiful apartment!</p>
    </li>
    <li>
      <p>near the train station</p>
    </li>
    .
    .
    .
  </ul>
  <ul>
    <li> 
      <p>10 minutes away from a shopping mall </p>
    </li>
    <li>
      <p>nice view</p>
    </li>
  </ul>
  .
  .
  .
</p>

How would I strip that whitespace as well?

SOLUTION

It turns out that I messed up using the gsub method and didn't further investigate the possibility of using gsub with regex...

The simple solution was adding

data = data.gsub(/>\s+</, "><")

It deleted whitespace between all different kinds of nodes... Regex ftw!
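For anyone who wants to try it, here is a minimal sketch of that one-liner applied to both the flat and the nested markup. The helper name strip_between_tags and the shortened nested sample are made up for this example; only the regex comes from the solution above.

```ruby
# Hypothetical helper wrapping the gsub call from the solution above.
def strip_between_tags(html)
  # Match a ">" followed by one or more whitespace characters and a "<",
  # and replace the whole match with "><".
  html.gsub(/>\s+</, "><")
end

flat   = "<p>Foo Bar</p> <p>bar bar bar</p> <p>bla</p>"
nested = "<ul>\n  <li>\n    <p>nice view</p>\n  </li>\n</ul>"

puts strip_between_tags(flat)    # <p>Foo Bar</p><p>bar bar bar</p><p>bla</p>
puts strip_between_tags(nested)  # <ul><li><p>nice view</p></li></ul>
```

Whitespace inside the text ("Foo Bar") is untouched because it never sits directly between a ">" and a "<".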


Answer 1:


This is how I'd write the code:

require 'nokogiri'

doc = Nokogiri::HTML::DocumentFragment.parse(<<EOT)
<p>Foo Bar</p> <p>bar bar bar</p> <p>bla</p>
EOT

doc.search('p, ul, li').each do |node|
  # Drop the sibling right after this node when it contains only whitespace.
  next_node = node.next_sibling
  next_node.remove if next_node && next_node.text.strip == ''
end

puts doc.to_html

It results in:

<p>Foo Bar</p><p>bar bar bar</p><p>bla</p>

Breaking it down:

doc.search('p, ul, li')

looks for only the <p>, <ul>, and <li> nodes in the document. Nokogiri returns a NodeSet from search, which is empty if nothing matched. The code loops over the NodeSet, looking at each node in turn.

next_node = node.next_sibling

gets the pointer to the node that follows the current node.

next_node.remove if next_node && next_node.text.strip == ''

next_node.remove removes next_node from the DOM if it isn't nil and its text is empty once stripped; in other words, if the node contains only whitespace.

There are other techniques to locate only the TextNodes if all of them should be stripped from the document. That's risky, because it can end up deleting all blanks between tags, causing run-on sentences and joined words, which probably isn't what you want.
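The run-on problem is easy to reproduce without Nokogiri. This is only a sketch using the regex from the question's solution on a made-up sample with inline tags; the second gsub is a crude tag stripper used purely to make the joined words visible, not a real HTML-to-text conversion.

```ruby
# Made-up sample: the space between </b> and <i> is meaningful here.
html = "<p>Hello <b>big</b> <i>world</i></p>"

# Blindly removing every whitespace run between tags...
stripped = html.gsub(/>\s+</, "><")
puts stripped  # <p>Hello <b>big</b><i>world</i></p>

# ...joins the two words once the markup is rendered.
# Crude text-only view (assumption: no ">" inside attribute values).
text = stripped.gsub(/<[^>]+>/, "")
puts text      # Hello bigworld
```

That is why the code above only looks at the siblings of <p>, <ul>, and <li> instead of every text node.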




Answer 2:


A first solution is to remove empty text nodes. A quick way to do this for your exact case:

require 'nokogiri'
doc = Nokogiri::HTML("<p>Foo Bar</p> <p>bar bar bar</p> <p>bla</p>")
doc.css('body').first.children.map{|node| node.to_s.strip}.compact.join

This won't work for nested elements as-is, but it should give you a good starting point.

UPDATE:

You can actually optimise a little with:

require 'nokogiri'
doc = Nokogiri::HTML::DocumentFragment.parse("<p>Foo Bar</p> <p>bar bar bar</p> <p>bla</p>")
doc.children.map{|node| node.to_s.strip}.compact.join



Answer 3:


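data.squish (from ActiveSupport, so available in Rails) is way more readable, but it is not quite the same thing: it trims the ends and collapses each run of whitespace into a single space, so a single space between two tags stays where it is.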
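To check what squish-style normalization does to the markup from the question without loading Rails, here is a rough sketch. The squish_like method is a hypothetical pure-Ruby stand-in written for this example and is assumed to behave like String#squish for plain whitespace; it is not ActiveSupport's own code.

```ruby
# Hypothetical stand-in for ActiveSupport's String#squish (an assumption):
# collapse every run of whitespace into one space, then trim both ends.
def squish_like(str)
  str.gsub(/[[:space:]]+/, " ").strip
end

data = "  <p>Foo   Bar</p> <p>bar bar bar</p>\n<p>bla</p>  "

squished = squish_like(data)
puts squished
# <p>Foo Bar</p> <p>bar bar bar</p> <p>bla</p>

# The space between the tags survives, so the regex from the question's
# solution is still needed to get the desired output.
tight = squished.gsub(/>\s+</, "><")
puts tight
# <p>Foo Bar</p><p>bar bar bar</p><p>bla</p>
```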



Source: https://stackoverflow.com/questions/16417292/how-do-i-remove-white-space-between-html-nodes
