Parse table using Nokogiri

一向 2021-01-06 13:16

I would like to parse a table using Nokogiri. I'm doing it this way:

def parse_table_nokogiri(html)

    doc = Nokogiri::HTML(html)

    doc.search('table > tr …

3 Answers
  • 2021-01-06 14:14

    Use:

    td//text()[normalize-space()]
    

    This selects every text node that is not whitespace-only and is a descendant of any td child of the current node (the tr already selected in your code).

    Or, if you want to select all text-node descendants regardless of whether they are whitespace-only:

    td//text()
    

    UPDATE:

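    The OP has signaled in a comment that he is getting an unwanted td whose content is just a &nbsp; (a non-breaking space, U+00A0).

    To also exclude text nodes that consist only of (one or more) non-breaking spaces, use:

    td//text()[translate(normalize-space(), '&#xA0;', '')]

    Here '&#xA0;' stands for a literal U+00A0 character. XPath 1.0 has no escape syntax of its own, so outside an XML file the actual character has to be placed in the string (in Ruby, for example, by interpolating "\u00A0").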
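
    As a minimal sketch of how the two predicates differ, the snippet below runs both against a small, made-up table. The sample markup and the variable names (html, nbsp, kept, strict) are assumptions for illustration, not part of the original question.

    ```ruby
    require 'nokogiri'

    # Hypothetical sample markup; the third cell holds only a non-breaking space.
    html = <<~HTML
      <table><tr>
        <td>foo</td>
        <td><font>bar</font></td>
        <td>&nbsp;</td>
      </tr></table>
    HTML

    doc  = Nokogiri::HTML(html)
    nbsp = "\u00A0" # the literal character the XPath string must contain
    tr   = doc.at_xpath('//table/tr')

    # normalize-space() trims only ordinary whitespace, so the nbsp cell survives.
    kept = tr.xpath('td//text()[normalize-space()]').map(&:text)

    # translate() deletes U+00A0 first; the nbsp-only node becomes empty and is dropped.
    strict = tr.xpath("td//text()[translate(normalize-space(), '#{nbsp}', '')]").map(&:text)

    p kept
    p strict
    ```

    With this input, kept should still contain the non-breaking-space node after "foo" and "bar", while strict should contain only "foo" and "bar".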
    
  • 2021-01-06 14:14

    Simple:

    doc.search('//td').each do |cell|
      puts cell.content
    end
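
    If the goal is one array per table row rather than a flat list of cells, the same idea can be nested. This is only a sketch: the sample markup and the name rows are made up for illustration, and the question does not say which structure it needs.

    ```ruby
    require 'nokogiri'

    # Hypothetical two-row table used only for this example.
    html = <<~HTML
      <table>
        <tr><td>foo</td><td><font>bar</font></td></tr>
        <tr><td> baz </td><td>qux</td></tr>
      </table>
    HTML

    doc = Nokogiri::HTML(html)

    # For every row, collect the stripped text content of each of its cells.
    rows = doc.search('//tr').map do |row|
      row.search('td').map { |cell| cell.content.strip }
    end

    p rows
    ```

    Note that content returns the text of all descendants, so the cell wrapped in a font element needs no special handling.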
    
  • 2021-01-06 14:19

    Simple (but not DRY) way of using alternation:

    require 'nokogiri'
    
    doc = Nokogiri::HTML <<ENDHTML
    <body><table><thead><tr><td>NOT THIS</td></tr></thead><tr>
      <td>foo</td>
      <td><font>bar</font></td>
    </tr></table></body>
    ENDHTML
    
    p doc.xpath( '//table/tr/td/text()|//table/tr/td/font/text()' )
    #=> [#<Nokogiri::XML::Text:0x80428814 "foo">,
    #=>  #<Nokogiri::XML::Text:0x804286fc "bar">]
    

    See "XPath with optional element in hierarchy" for a more DRY answer.

    In this case, however, you can simply do:

    p doc.xpath( '//table/tr/td//text()' )
    #=> [#<Nokogiri::XML::Text:0x80428814 "foo">,
    #=>  #<Nokogiri::XML::Text:0x804286fc "bar">]
    

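    Note that your table structure (and mine above) has no explicit tbody element. That is allowed, but HTML5-compliant parsers (browsers, for instance) insert an implied tbody around such rows, whereas Nokogiri::HTML does not, as the output above shows. A path like table/tr therefore depends on which parser built the tree. Given your explicit table > tr above, however, I assume that you have a reason for this.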
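
    If the markup may or may not carry an explicit tbody, one hedged option is to list both paths in a union, in the same spirit as the alternation shown earlier. The two sample documents and the names with_tbody, without_tbody, path and results are assumptions made for this sketch.

    ```ruby
    require 'nokogiri'

    # Two hypothetical variants of the same table, with and without an explicit tbody.
    with_tbody = Nokogiri::HTML(
      '<table><thead><tr><td>NOT THIS</td></tr></thead>' \
      '<tbody><tr><td>foo</td></tr></tbody></table>'
    )
    without_tbody = Nokogiri::HTML(
      '<table><thead><tr><td>NOT THIS</td></tr></thead>' \
      '<tr><td>foo</td></tr></table>'
    )

    # The union covers rows directly under table and rows under tbody, but never thead.
    path = '//table/tr/td//text() | //table/tbody/tr/td//text()'

    results = [with_tbody, without_tbody].map { |doc| doc.xpath(path).map(&:text) }
    p results
    ```

    Writing //table//tr instead would be shorter, but it would also match the row inside thead.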
