Replace tags using Nokogiri - quicker way?

Submitted by 时光怂恿深爱的人放手 on 2020-01-07 05:07:05

Question


I have the following HTML in a variable named html_data. I want to replace each <img> tag with an <a> tag, so that the src attribute of the "img" tag becomes the href of the "a" tag.

Existing HTML:

<!DOCTYPE html>
<html>
   <head>
      <title>Learning Nokogiri</title>
   </head>
   <body marginwidth="6">
      <div valign="top">
         <div class="some_class">
            <div class="test">
               <img src="apple.png" alt="Apple" height="42" width="42">
               <div style="white-space: pre-wrap;"></div>
            </div>
         </div>
      </div>
   </body>
</html>

This is my solution A:

require 'nokogiri'

nokogiri_html = Nokogiri::HTML(html_data)
nokogiri_html.css("img").each do |tag|
  # Build a new <a> node, copy src into href, then swap it in for the <img>
  a_tag = Nokogiri::XML::Node.new("a", nokogiri_html)
  a_tag["href"] = tag["src"]
  tag.add_next_sibling(a_tag)
  tag.remove
end

puts 'nokogiri_html is', nokogiri_html

This is my solution B:

nokogiri_html = Nokogiri::HTML(html_data)
nokogiri_html.css("img").each do |tag|
  # Rename the element in place; its existing attributes are kept
  tag.name = "a"
  tag.set_attribute("href", tag["src"])
end

puts 'nokogiri_html is', nokogiri_html

While solution A works fine, I would like to know whether there is a quicker, more direct way to replace the tags using Nokogiri. With solution B, the "img" tag does get replaced with an "a" tag, but the attributes of the "img" tag still remain on the "a" tag. Below is the result of solution B:

<!DOCTYPE html>
<html>
   <body>
      <p>["\n", "\n", "   </p>
      \n", "      
      <title>Learning Nokogiri</title>
      \n", "   \n", "   \n", "      
      <div valign='\"top\"'>
         \n", "         
         <div class='\"some_class\"'>
            \n", "            
            <div class='\"test\"'>
               \n", "               <a src="%5C%22apple.png%5C%22" alt='\"Apple\"' height='\"42\"' width='\"42\"' href="%5C%22apple.png%5C%22"></a>\n", "               
               <div style='\"white-space:' pre-wrap></div>
               \n", "            
            </div>
            \n", "         
         </div>
         \n", "      
      </div>
      \n", "   \n", ""]
   </body>
</html>

Is there a faster way to replace tags in HTML using Nokogiri? Also, how can I remove the "\n"s I am getting in the result?


Answer 1:


First, please strip your sample data (HTML) to the barest amount necessary to demonstrate the problem.

Here are the basics of doing what you want:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<!DOCTYPE html>
<html>
   <body>
     <img src="apple.png" alt="Apple" height="42" width="42">
   </body>
</html>
EOT

doc.search('img').each do |img|
  src, alt = %w[src alt].map{ |p| img[p] }
  img.replace("<a href='#{ src }'>#{ alt }</a>")
end

doc.to_html
# => "<!DOCTYPE html>\n<html>\n   <body>\n     <a href=\"apple.png\">Apple</a>\n   </body>\n</html>\n"

puts doc.to_html
# >> <!DOCTYPE html>
# >> <html>
# >>    <body>
# >>      <a href="apple.png">Apple</a>
# >>    </body>
# >> </html>

Doing it this way allows Nokogiri to replace nodes cleanly.

It's not necessary to do all this rigamarole:

a_tag = Nokogiri::XML::Node.new("a", nokogiri_html)
a_tag["href"] = tag["src"]
tag.add_next_sibling(a_tag)
tag.remove()

Instead, create a string that is the tag you want to use and let Nokogiri convert the string to a node and replace the old node:

src, alt = %w[src alt].map{ |p| img[p] }
img.replace("<a href='#{ src }'>#{ alt }</a>")

It's not necessary to strip extraneous whitespace between nodes. It can affect the look of the HTML but browsers will gobble that extra whitespace and not display it.

Nokogiri can be told to not output the inter-node whitespace, resulting in a compressed/fugly output, but how to do that is a separate question.



Source: https://stackoverflow.com/questions/29656223/replace-tags-using-nokogiri-quicker-way
