Question

I'm trying to remove the whitespace between <p> tags in an HTML fragment:

<p>Foo Bar</p> <p>bar bar bar</p> <p>bla</p>

As you can see, there is always a blank space between the <p> tags.

The problem is that the blank spaces create extra tags when the string is saved to my database. Methods like strip or gsub, the way I used them, only remove the whitespace inside the nodes, resulting in:

<p>FooBar</p> <p>barbarbar</p> <p>bla</p>

whereas I'd like to have:

<p>Foo Bar</p><p>bar bar bar</p><p>bla</p>

I'm using:

Nokogiri 1.5.6
Ruby 1.9.3
Rails

UPDATE:

Occasionally the tags have child nodes that cause the same problem: whitespace between tags.

Sample Code

Note: the code is normally on a single line; I reformatted it because it would be unreadable otherwise.

<p>
  <p>
    <strong>Selling an Apartment</strong>
  </p>
  <ul>
    <li>
      <p>beautiful apartment!</p>
    </li>
    <li>
      <p>near the train station</p>
    </li>
    .
    .
    .
  </ul>
  <ul>
    <li> 
      <p>10 minutes away from a shopping mall </p>
    </li>
    <li>
      <p>nice view</p>
    </li>
  </ul>
  .
  .
  .
</p>

How would I strip that whitespace as well?

SOLUTION

It turns out that I had misused the gsub method and hadn't looked into using gsub with a regex.

The simple solution was adding

data = data.gsub(/>\s+</, "><")

It deleted whitespace between all different kinds of nodes... Regex ftw!
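
As a minimal sketch of that regex applied to a nested fragment like the sample above (the helper name squeeze_between_tags is made up for illustration, and the input strings are shortened versions of the samples):

```ruby
# Hypothetical helper name; the regex is the one shown above.
def squeeze_between_tags(html)
  # Delete any run of whitespace (spaces, tabs, newlines) that sits
  # directly between a ">" and the next "<".
  html.gsub(/>\s+</, "><")
end

nested = "<ul>\n  <li>\n    <p>nice view</p>\n  </li>\n</ul>"
puts squeeze_between_tags(nested)
# prints: <ul><li><p>nice view</p></li></ul>

# Whitespace inside the text itself is left alone:
puts squeeze_between_tags("<p>Foo Bar</p> <p>bla</p>")
# prints: <p>Foo Bar</p><p>bla</p>
```

Note that the regex does not distinguish block elements from inline ones, so a fragment such as "<b>Foo</b> <i>Bar</i>" would also lose its space.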

Answer 1:

This is how I'd write the code:

require 'nokogiri'

doc = Nokogiri::HTML::DocumentFragment.parse(<<EOT)
<p>Foo Bar</p> <p>bar bar bar</p> <p>bla</p>
EOT

doc.search('p, ul, li').each do |node|
  next_node = node.next_sibling
  next_node.remove if next_node && next_node.text.strip == ''
end

puts doc.to_html

It results in:

<p>Foo Bar</p><p>bar bar bar</p><p>bla</p>

Breaking it down:

doc.search('p, ul, li')

looks for only the <p>, <ul>, and <li> nodes in the document. Nokogiri returns a NodeSet from search; the NodeSet is empty if nothing matched. The code loops over the NodeSet, looking at each node in turn.

next_node = node.next_sibling

gets the node that follows the current node, or nil if there is none.

next_node.remove if next_node && next_node.text.strip == ''

next_node.remove removes next_node from the DOM if it isn't nil and its text is empty once stripped; in other words, if the node contains only whitespace.

There are other techniques to locate only the TextNodes if all of them should be stripped from the document. That's risky, because it can end up deleting all blanks between tags, causing run-on sentences and joined words, which probably isn't what you want.

Answer 2:

A first solution is to remove the empty text nodes. A quick way to do this for your exact case is:

require 'nokogiri'
doc = Nokogiri::HTML("<p>Foo Bar</p> <p>bar bar bar</p> <p>bla</p>")
doc.css('body').first.children.map { |node| node.to_s.strip }.join

This won't work for nested elements as-is, but it should give you a good starting point.

UPDATE:

You can actually optimise a little with:

require 'nokogiri'
doc = Nokogiri::HTML::DocumentFragment.parse("<p>Foo Bar</p> <p>bar bar bar</p> <p>bla</p>")
doc.children.map { |node| node.to_s.strip }.join