问题
I have the following HTML in a variable named html_data
where I wish to replace <img>
tags with <a>
tags and the src
parameters of the "img" tags becomes href
of the "a" tags.
Existing HTML:
<!DOCTYPE html>
<html>
<head>
<title>Learning Nokogiri</title>
</head>
<body marginwidth="6">
<div valign="top">
<div class="some_class">
<div class="test">
<img src="apple.png" alt="Apple" height="42" width="42">
<div style="white-space: pre-wrap;"></div>
</div>
</div>
</div>
</body>
</html>
This is my solution A:
nokogiri_html = Nokogiri::HTML(html_data)
nokogiri_html("img").each { |tag|
a_tag = Nokogiri::XML::Node.new("a", nokogiri_html)
a_tag["href"] = tag["src"]
tag.add_next_sibling(a_tag)
tag.remove()
}
puts 'nokogiri_html is', nokogiri_html
This is my solution B:
nokogiri_html = Nokogiri::HTML(html_data)
nokogiri_html("img").each { |tag|
tag.name= "a";
tag.set_attribute("href" , tag["src"])
}
puts 'nokogiri_html is', nokogiri_html
While solution A works fine, I am looking if there is a quicker/direct way to replace the tags using Nokogiri. With solution B, my "img" tag does get replaced with the "a" tag, but the properties of the "img" tag still remains inside the "a" tag. Below is the result of Solution B:
<!DOCTYPE html>
<html>
<body>
<p>["\n", "\n", " </p>
\n", "
<title>Learning Nokogiri</title>
\n", " \n", " \n", "
<div valign='\"top\"'>
\n", "
<div class='\"some_class\"'>
\n", "
<div class='\"test\"'>
\n", " <a src="%5C%22apple.png%5C%22" alt='\"Apple\"' height='\"42\"' width='\"42\"' href="%5C%22apple.png%5C%22"></a>\n", "
<div style='\"white-space:' pre-wrap></div>
\n", "
</div>
\n", "
</div>
\n", "
</div>
\n", " \n", ""]
</body>
</html>
Is there a way to replace the tags faster in HTML using Nokogiri? Also how can remove the "\n"s am getting in the result?
回答1:
First, please strip your sample data (HTML) to the barest amount necessary to demonstrate the problem.
Here's the basics of doing what you want:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<!DOCTYPE html>
<html>
<body>
<img src="apple.png" alt="Apple" height="42" width="42">
</body>
</html>
EOT
doc.search('img').each do |img|
src, alt = %w[src alt].map{ |p| img[p] }
img.replace("<a href='#{ src }'>#{ alt }</a>")
end
doc.to_html
# => "<!DOCTYPE html>\n<html>\n <body>\n <a href=\"apple.png\">Apple</a>\n </body>\n</html>\n"
puts doc.to_html
# >> <!DOCTYPE html>
# >> <html>
# >> <body>
# >> <a href="apple.png">Apple</a>
# >> </body>
# >> </html>
Doing it this way allows Nokogiri to replace nodes cleanly.
It's not necessary to do all this rigamarole:
a_tag = Nokogiri::XML::Node.new("a", nokogiri_html)
a_tag["href"] = tag["src"]
tag.add_next_sibling(a_tag)
tag.remove()
Instead, create a string that is the tag you want to use and let Nokogiri convert the string to a node and replace the old node:
src, alt = %w[src alt].map{ |p| img[p] }
img.replace("<a href='#{ src }'>#{ alt }</a>")
It's not necessary to strip extraneous whitespace between nodes. It can affect the look of the HTML but browsers will gobble that extra whitespace and not display it.
Nokogiri can be told to not output the inter-node whitespace, resulting in a compressed/fugly output, but how to do that is a separate question.
来源:https://stackoverflow.com/questions/29656223/replace-tags-using-nokogiri-quicker-way