Convert HTML to plain text (with inclusion of <br>s)

-上瘾入骨i 2021-02-06 01:39

Is it possible to convert HTML with Nokogiri to plain text? I also want to include the <br /> tags.

For example, given this HTML:

5 Answers
  • 2021-02-06 02:10

    Try

    Nokogiri::HTML(my_html.gsub('<br />',"\n")).text
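
The one-liner above only replaces the literal string `<br />`, so `<br>`, `<BR/>`, or a `<br>` with attributes would slip through. A minimal sketch of a more tolerant pre-processing step follows; the helper name `normalize_breaks` and the constant `BR_TAG` are assumptions made for illustration and are not part of Nokogiri.

```ruby
# Hypothetical helper (not part of Nokogiri): turn every <br> variant into a newline.
# \b keeps the pattern from matching other tags that merely start with "br".
BR_TAG = /<br\b[^>]*>/i

def normalize_breaks(html)
  html.gsub(BR_TAG, "\n")
end

sample = "one<br>two<BR/>three<br />four<br class=\"x\">five"
puts normalize_breaks(sample)

# The result can then go through the one-liner from the answer:
#   Nokogiri::HTML(normalize_breaks(my_html)).text
```

This stays a regex pass over the raw string, so it shares the usual limits of that approach (for instance, it would also rewrite a `<br>` that appears inside a comment or a script).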
    
  • 2021-02-06 02:12

    Instead of writing a complex regexp, I used Nokogiri.

    Working solution (K.I.S.S!):

    require 'nokogiri'

    # Convert an HTML string to plain text, turning each <br> into a newline.
    def strip_html(str)
      document = Nokogiri::HTML.parse(str)
      document.css("br").each { |node| node.replace("\n") }
      document.text
    end
    
  • 2021-02-06 02:24

    If you use HAML, you can handle the HTML by passing it through the 'raw' helper, which outputs it unescaped, e.g.

          = raw @product.short_description
    
  • 2021-02-06 02:28

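    Nokogiri's text output keeps only the link text and drops the URL, so I run this first to preserve links in the text version:

    # Non-greedy, so several links on one line are rewritten separately; also matches https.
    html_version.gsub!(/<a\s[^>]*?href\s*=\s*["'](https?:[^"']+)["'][^>]*>(.*?)<\/a>/im) { "#{$2}\n#{$1}" }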
    

    That will turn this:

    <a href = "http://google.com">link to google</a>
    

    into this:

    link to google
    http://google.com
    
  • 2021-02-06 02:30

    Nothing like this exists by default, but you can easily hack something together that comes close to the desired output:

    require 'nokogiri'
    def render_to_ascii(node)
      blocks = %w[p div address]                      # els to put newlines after
      swaps  = { "br"=>"\n", "hr"=>"\n#{'-'*70}\n" }  # content to swap out
      dup = node.dup                                  # don't munge the original
    
      # Get rid of superfluous whitespace in the source
      dup.xpath('.//text()').each{ |t| t.content=t.text.gsub(/\s+/,' ') }
    
      # Swap out the swaps
      dup.css(swaps.keys.join(',')).each{ |n| n.replace( swaps[n.name] ) }
    
      # Slap a couple newlines after each block level element
      dup.css(blocks.join(',')).each{ |n| n.after("\n\n") }
    
      # Return the modified text content
      dup.text
    end
    
    frag = Nokogiri::HTML.fragment "<p>It is the end of the world
      as         we
      know it<br>and <i>I</i> <strong>feel</strong>
      <a href='blah'>fine</a>.</p><div>Capische<hr>Buddy?</div>"
    
    puts render_to_ascii(frag)
    #=> It is the end of the world as we know it
    #=> and I feel fine.
    #=> 
    #=> Capische
    #=> ----------------------------------------------------------------------
    #=> Buddy?
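
Because `render_to_ascii` appends "\n\n" after every block element, the returned string ends in blank lines, and nested blocks can stack up several in a row. A small sketch of a clean-up pass that could be applied to the result; the name `tidy_text` is an assumption made here and is not part of the answer or of Nokogiri.

```ruby
# Hypothetical post-processing step for the string returned by render_to_ascii.
def tidy_text(text)
  text
    .gsub(/[ \t]+$/, "")      # drop trailing spaces/tabs on every line
    .gsub(/\n{3,}/, "\n\n")   # collapse runs of blank lines into a single one
    .strip                    # trim leading/trailing whitespace
end

raw = "First paragraph.  \n\n\n\nSecond paragraph.\n\n"
puts tidy_text(raw)
#=> First paragraph.
#=>
#=> Second paragraph.
```

Usage would be along the lines of `puts tidy_text(render_to_ascii(frag))`.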
    