Is it possible to convert HTML to plain text with Nokogiri? I also want `<br />` tags to be kept as line breaks.
For example, given this HTML:
Nothing like this exists by default, but you can easily hack something together that comes close to the desired output:

    require 'nokogiri'

    def render_to_ascii(node)
      blocks = %w[p div address]                      # els to put newlines after
      swaps  = { "br"=>"\n", "hr"=>"\n#{'-'*70}\n" }  # content to swap out
      dup    = node.dup                               # don't munge the original

      # Get rid of superfluous whitespace in the source
      dup.xpath('.//text()').each{ |t| t.content=t.text.gsub(/\s+/,' ') }

      # Swap out the swaps
      dup.css(swaps.keys.join(',')).each{ |n| n.replace( swaps[n.name] ) }

      # Slap a couple newlines after each block level element
      dup.css(blocks.join(',')).each{ |n| n.after("\n\n") }

      # Return the modified text content
      dup.text
    end

    # Sample input: a paragraph containing a <br>, then a div containing an <hr>
    frag = Nokogiri::HTML.fragment "<p>It is the end of the world
      as we
      know it<br>and I feel
      fine.</p><div>Capische<hr>Buddy?</div>"

    puts render_to_ascii(frag)
    #=> It is the end of the world as we know it
    #=> and I feel fine.
    #=>
    #=> Capische
    #=> ----------------------------------------------------------------------
    #=> Buddy?