Convert HTML to plain text (with inclusion of <br>s)

-上瘾入骨i 2021-02-06 01:39

Is it possible to convert HTML to plain text with Nokogiri? I also want to include the <br> tag.

For example, given this HTML:

5 answers
  •  心在旅途
    2021-02-06 02:30

    Nothing like this exists by default, but you can easily hack something together that comes close to the desired output:

    require 'nokogiri'
    def render_to_ascii(node)
      blocks = %w[p div address]                      # els to put newlines after
      swaps  = { "br"=>"\n", "hr"=>"\n#{'-'*70}\n" }  # content to swap out
      dup = node.dup                                  # don't munge the original
    
      # Get rid of superfluous whitespace in the source
      dup.xpath('.//text()').each{ |t| t.content=t.text.gsub(/\s+/,' ') }
    
      # Swap out the swaps
      dup.css(swaps.keys.join(',')).each{ |n| n.replace( swaps[n.name] ) }
    
      # Slap a couple newlines after each block level element
      dup.css(blocks.join(',')).each{ |n| n.after("\n\n") }
    
      # Return the modified text content
      dup.text
    end
    
    # Sample markup; the tags are inferred from the expected output below.
    frag = Nokogiri::HTML.fragment(
      "<p>It is the end of the world as we know it<br>and I feel fine.</p>" \
      "<p>Capische<hr>Buddy?</p>"
    )

    puts render_to_ascii(frag)
    #=> It is the end of the world as we know it
    #=> and I feel fine.
    #=>
    #=> Capische
    #=> ----------------------------------------------------------------------
    #=> Buddy?
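The three steps in render_to_ascii (collapse whitespace, swap out br/hr, add blank lines after block elements) can also be tried without Nokogiri. The following is only a rough, stdlib-only sketch: rough_html_to_text, SWAPS and BLOCKS are hypothetical names, the regexes are not a real HTML parser, and entities such as &amp; are left undecoded. For anything beyond simple, well-formed markup the Nokogiri version above is the safer choice.

```ruby
# Hypothetical helper (not part of Nokogiri): a regex-only approximation
# of the same three steps, suitable only for simple, well-formed markup.
SWAPS  = { "br" => "\n", "hr" => "\n#{'-' * 70}\n" }.freeze  # content to swap in
BLOCKS = %w[p div address].freeze                             # blank line after these

def rough_html_to_text(html)
  # 1. Collapse runs of whitespace in the source to a single space
  text = html.gsub(/\s+/, " ")

  # 2. Replace <br>, <br />, <hr> and similar with their plain-text stand-ins
  SWAPS.each do |tag, replacement|
    text = text.gsub(/<#{tag}\b[^>]*>/i, replacement)
  end

  # 3. Turn the closing tag of each block-level element into a blank line
  BLOCKS.each do |tag|
    text = text.gsub(%r{</#{tag}\s*>}i, "\n\n")
  end

  # 4. Drop any remaining tags, trim spaces around newlines, trim the ends
  text.gsub(/<[^>]+>/, "").gsub(/ *\n */, "\n").strip
end

puts rough_html_to_text("<p>One<br>two</p>\n<p>Three<hr />four</p>")
```

Doing the whitespace collapse before the swaps matters here for the same reason as in the Nokogiri version: the newlines inserted for br and hr must not be flattened back into spaces.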
