How to avoid joining all text from Nodes when scraping

生来不讨喜 2020-11-22 12:38

When I scrape several related nodes from HTML or XML to extract the text, all the text is joined into one long string, making it impossible to recover the individual text st

1 answer
  • 2020-11-22 13:18

    This is an easily solved problem that results from not reading the documentation about how text behaves when used on a NodeSet versus a Node (or Element).

    The NodeSet documentation says text will:

    Get the inner text of all contained Node objects

    Which is what we're seeing happen with:

    require 'nokogiri'

    doc = Nokogiri::HTML(<<EOT)
    <html>
      <body>
        <p>foo</p>
        <p>bar</p>
        <p>baz</p>
      </body>
    </html>
    EOT
    
    doc.search('p').text # => "foobarbaz"
    

    because:

    doc.search('p').class # => Nokogiri::XML::NodeSet
    

    Instead, we want to get each Node and extract its text:

    doc.search('p').first.class # => Nokogiri::XML::Element
    doc.search('p').first.text # => "foo"
    

    which can be done using map:

    doc.search('p').map { |node| node.text } # => ["foo", "bar", "baz"]
    

    Ruby allows us to write that more concisely using:

    doc.search('p').map(&:text) # => ["foo", "bar", "baz"]
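
    If a single string is still wanted, one option is to join the mapped array with an explicit delimiter so the pieces can be split apart again later. This is only a sketch: the newline delimiter and the variable names texts and joined are assumptions for illustration, and it only round-trips cleanly if the node text never contains the delimiter itself.

    ```ruby
    require 'nokogiri'

    doc = Nokogiri::HTML(<<~HTML)
      <p>foo</p>
      <p>bar</p>
      <p>baz</p>
    HTML

    # Extract the text of each node separately, then join with a delimiter.
    texts  = doc.search('p').map(&:text) # => ["foo", "bar", "baz"]
    joined = texts.join("\n")            # => "foo\nbar\nbaz"

    # The individual strings can be recovered by splitting on the same delimiter.
    joined.split("\n")                   # => ["foo", "bar", "baz"]
    ```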
    

    The same applies whether we're working with HTML or XML, because Nokogiri parses both into the same Node and NodeSet classes, as the Nokogiri::XML class names returned for the HTML document above show.
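
    As a quick check of the XML case, here is the same pattern against a small XML document. The root and item element names are made up for this sketch.

    ```ruby
    require 'nokogiri'

    # Hypothetical XML sample; the element names are assumptions.
    xml = Nokogiri::XML(<<~XML)
      <root>
        <item>foo</item>
        <item>bar</item>
      </root>
    XML

    items = xml.search('item')
    items.class                  # => Nokogiri::XML::NodeSet
    items.text                   # => "foobar"

    # Mapping over the NodeSet keeps each node's text separate.
    xml_texts = items.map(&:text) # => ["foo", "bar"]
    ```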

    A Node has several aliased methods for getting at its embedded text. From the documentation:

    #content ⇒ Object

    Also known as: text, inner_text

    Returns the contents for this Node.
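
    A short sketch showing that the three names return the same string for a single node. The sample markup and the variable name node are assumptions for illustration.

    ```ruby
    require 'nokogiri'

    # at returns the first matching node rather than a NodeSet.
    node = Nokogiri::HTML('<p>foo <b>bar</b></p>').at('p')

    node.content    # => "foo bar"
    node.text       # => "foo bar"
    node.inner_text # => "foo bar"
    ```

    Note that the text of nested elements is included, so a single Node still concatenates the text of its own descendants.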
