Question
I'm trying to remove the whitespace between the <p> tags in an HTML fragment:
<p>Foo Bar</p> <p>bar bar bar</p> <p>bla</p>
As you can see, there is always a blank space between the </p> and <p> tags.
The problem is that these blank spaces create <br> tags when the string is saved into my database.
Methods like strip or gsub only remove the whitespace inside the nodes, resulting in:
<p>FooBar</p> <p>barbarbar</p> <p>bla</p>
whereas I'd like to have:
<p>Foo Bar</p><p>bar bar bar</p><p>bla</p>
I'm using:
- Nokogiri 1.5.6
- Ruby 1.9.3
- Rails
UPDATE:
Occasionally the <p> tags have child nodes that cause the same problem: whitespace between the tags.
Sample Code
Note: the code is normally on one line; I reformatted it because it would be unreadable otherwise.
<p>
<p>
<strong>Selling an Appartment</strong>
</p>
<ul>
<li>
<p>beautiful apartment!</p>
</li>
<li>
<p>near the train station</p>
</li>
.
.
.
</ul>
<ul>
<li>
<p>10 minutes away from a shopping mall </p>
</li>
<li>
<p>nice view</p>
</li>
</ul>
.
.
.
</p>
How would I strip that whitespace as well?
SOLUTION
It turns out that I messed up my use of the gsub method and never looked further into using gsub with a regex.
The simple solution was adding:
data = data.gsub(/>\s+</, "><")
It deleted whitespace between all different kinds of nodes... Regex ftw!
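As a minimal sketch of why that one-liner works, the snippet below applies the same regex to the flat fragment and to a shortened version of the nested sample. The input strings are retyped here for illustration, and the variable names are assumptions, not part of the original code.

```ruby
# Flat fragment from the question, with whitespace between sibling tags.
flat = "<p>Foo Bar</p> <p>bar bar bar</p>\n  <p>bla</p>"

# Shortened stand-in for the nested sample from the update.
nested = "<ul> <li> <p>nice view</p> </li> </ul>"

# />\s+</ matches a ">" followed by one or more whitespace characters and a "<",
# so only whitespace sitting directly between two tags is removed.
# Spaces inside the text of a node ("Foo Bar") are not touched.
flat_cleaned   = flat.gsub(/>\s+</, "><")
nested_cleaned = nested.gsub(/>\s+</, "><")

puts flat_cleaned    # <p>Foo Bar</p><p>bar bar bar</p><p>bla</p>
puts nested_cleaned  # <ul><li><p>nice view</p></li></ul>
```

One thing to keep in mind: the regex does not know which tags are block-level, so it would also remove a meaningful space between inline elements such as </strong> <em>.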
Answer 1:
This is how I'd write the code:
require 'nokogiri'
doc = Nokogiri::HTML::DocumentFragment.parse(<<EOT)
<p>Foo Bar</p> <p>bar bar bar</p> <p>bla</p>
EOT
doc.search('p, ul, li').each { |node|
next_node = node.next_sibling
next_node.remove if next_node && next_node.text.strip == ''
}
puts doc.to_html
It results in:
<p>Foo Bar</p><p>bar bar bar</p><p>bla</p>
Breaking it down:
doc.search('p, ul, li')
looks for the <p>, <ul>, and <li> nodes in the document. Nokogiri returns a NodeSet from search, which is empty if nothing matched. The code loops over the NodeSet, looking at each node in turn.
next_node = node.next_sibling
gets the node that follows the current node, or nil if there is none.
next_node.remove if next_node && next_node.text.strip == ''
next_node.remove removes next_node from the DOM if it isn't nil and its text is empty when stripped; in other words, if the node contains only whitespace.
There are other techniques to locate only the TextNodes if all of them should be stripped from the document. That's risky, because it can end up deleting all blanks between tags, causing run-on sentences and joined words, which probably isn't what you want.
Answer 2:
A first solution is to remove the empty text nodes. A quick way to do this for your exact case is:
require 'nokogiri'
doc = Nokogiri::HTML("<p>Foo Bar</p> <p>bar bar bar</p> <p>bla</p>")
doc.css('body').first.children.map { |node| node.to_s.strip }.reject(&:empty?).join
This won't work for nested elements as-is, but it should give you a good starting point.
UPDATE:
You can actually optimise a little with:
require 'nokogiri'
doc = Nokogiri::HTML::DocumentFragment.parse("<p>Foo Bar</p> <p>bar bar bar</p> <p>bla</p>")
doc.children.map { |node| node.to_s.strip }.reject(&:empty?).join
Answer 3:
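data.squish (String#squish from Rails' ActiveSupport) is more readable, but it collapses each run of whitespace into a single space rather than removing it, so one space is still left between the tags.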
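For comparison, the sketch below puts squish-style normalization next to the gsub from the question. So that it runs without Rails, squish_like is a hypothetical plain-Ruby approximation of String#squish (trim the ends, collapse whitespace runs to one space), not the ActiveSupport method itself.

```ruby
# Hypothetical stand-in for ActiveSupport's String#squish, written in plain
# Ruby: collapse every run of whitespace to one space and trim both ends.
def squish_like(str)
  str.gsub(/[[:space:]]+/, " ").strip
end

data = "<p>Foo Bar</p>   <p>bar bar bar</p>\n<p>bla</p>"

squished = squish_like(data)
stripped = data.gsub(/>\s+</, "><")

puts squished  # <p>Foo Bar</p> <p>bar bar bar</p> <p>bla</p>
puts stripped  # <p>Foo Bar</p><p>bar bar bar</p><p>bla</p>
```

The squish-style result still has one space between the tags, whereas the gsub removes it entirely.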
Source: https://stackoverflow.com/questions/16417292/how-do-i-remove-white-space-between-html-nodes