How to navigate the DOM using Nokogiri

Submitted by 元气小坏坏 on 2020-01-01 09:24:16

Question


I'm trying to fill the variables parent_element_h1 and parent_element_h2. Can anyone help me use Nokogiri to get the information I need into those variables?

require 'rubygems'
require 'nokogiri'

value = Nokogiri::HTML.parse(<<-HTML_END)
  <html>
    <body>
      <p id='para-1'>A</p>
      <div class='block' id='X1'>
        <h1>Foo</h1>
        <p id='para-2'>B</p>
      </div>
      <p id='para-3'>C</p>
      <h2>Bar</h2>
      <p id='para-4'>D</p>
      <p id='para-5'>E</p>
      <div class='block' id='X2'>
        <p id='para-6'>F</p>
      </div>
    </body>
  </html>
HTML_END

parent = value.css('body').first

# start_here is given: a Nokogiri::XML::Element of the <div> with the id 'X2'
start_here = parent.at('div.block#X2')

# this should be a Nokogiri::XML::Element of the nearest, previous h1.
# in this example it's the one with the value 'Foo'
parent_element_h1 = 

# this should be a Nokogiri::XML::Element of the nearest, previous h2. 
# in this example it's the one with the value 'Bar'
parent_element_h2 =

Please note: the start_here element could be anywhere inside the document; the HTML data is just an example. The headers <h1> and <h2> could each be a sibling of start_here or a child of a sibling of start_here.

The following recursive method is a good starting point, but it doesn't work on <h1> because it's a child of a sibling of start_here:

def search_element(_block,_style)
  unless _block.nil?
    if _block.name == _style
      return _block
    else
      search_element(_block.previous,_style)
    end
  else
    return false
  end
end

parent_element_h1 = search_element(start_here,'h1')
parent_element_h2 = search_element(start_here,'h2')

After accepting an answer, I came up with my own solution. It works like a charm and I think it's pretty cool.


Answer 1:


I came across this a few years too late I suppose, but felt compelled to post because all the other solutions are way too complicated.

It's a single statement with XPath:

start = doc.at('div.block#X2')

start.at_xpath('(preceding-sibling::h1 | preceding-sibling::*//h1)[last()]')
#=> <h1>Foo</h1>

start.at_xpath('(preceding-sibling::h2 | preceding-sibling::*//h2)[last()]')
#=> <h2>Bar</h2>

This handles headers that are either direct previous siblings or descendants of previous siblings. Because the union is evaluated in document order, the last() predicate selects the closest previous match regardless of which branch matched.




Answer 2:


The approach I would take (if I am understanding your problem) is to use XPath or CSS to search for your "start_here" element and the parent element that you want to search under. Then, recursively walk the tree starting at parent, stopping when you hit the "start_here" element, and holding onto the last element that matches your style along the way.

Something like:

parent = value.search("//body").first
div = value.search("//div[@id = 'X2']").first

find = FindPriorTo.new(div)

assert_equal('Foo', find.find_from(parent, 'h1').text)
assert_equal('Bar', find.find_from(parent, 'h2').text) 

Where FindPriorTo is a simple class to handle the recursion:

class FindPriorTo
  def initialize(stop_element)
    @stop_element = stop_element
  end

  def find_from(parent, style)
    @should_stop = nil
    @last_style  = nil

    recursive_search(parent, style)
  end

  def recursive_search(parent, style)
    parent.children.each do |ch|
      recursive_search(ch, style)
      return @last_style if @should_stop

      @should_stop = (ch == @stop_element)
      @last_style = ch if ch.name == style
    end

    @last_style    
  end

end

If this approach isn't scalable enough, then you might be able to optimize things by rewriting the recursive_search to not use recursion, and also pass in both of the styles you are looking for and keep track of last found, so you don't have to traverse the tree an extra time.

I'd also say try monkey patching Node to hook on when the document is getting parsed, but it looks like all of that is written in C. Perhaps you might be better served using something other than Nokogiri that has a native Ruby SAX parser (maybe REXML), or if speed is your real concern, do the search portion in C/C++ using Xerces or similar. I don't know how well these will deal with parsing HTML though.




Answer 3:


Maybe this will do it. I'm not sure about the performance, or whether there are cases I haven't thought of.

def find(root, start, tag)
    ps, res = start, nil
    until res or (ps == root)
        ps  = ps.previous || ps.parent
        res = ps.css(tag).last
        res ||= ps.name == tag ? ps : nil
    end
    res || "Not found!"
end

parent_element_h1 =  find(parent, start_here, 'h1')



Answer 4:


This is my own solution (kudos to my co-worker for helping me on this one!) using a recursive method to parse all elements regardless of being a sibling or a child of another sibling.

require 'rubygems'
require 'nokogiri'

value = Nokogiri::HTML.parse(<<-HTML_END)
  <html>
    <body>
      <p id='para-1'>A</p>
      <div class='block' id='X1'>
        <h1>Foo</h1>
        <p id='para-2'>B</p>
      </div>
      <p id='para-3'>C</p>
      <h2>Bar</h2>
      <p id='para-4'>D</p>
      <p id='para-5'>E</p>
      <div class='block' id='X2'>
        <p id='para-6'>F</p>
      </div>
    </body>
  </html>
HTML_END

parent = value.css('body').first

# start_here is given: a Nokogiri::XML::Element of the <div> with the id 'X2'
@start_here = parent.at('div.block#X2')

# Search for parent elements of kind "_style" starting from _start_element
def search_for_parent_element(_start_element, _style)
  unless _start_element.nil?
    # have we already found what we're looking for?
    if _start_element.name == _style
      return _start_element
    end
    # _start_element is a div.block and not the _start_element itself
    if _start_element[:class] == "block" && _start_element[:id] != @start_here[:id]
      # begin recursion with last child inside div.block
      from_child = search_for_parent_element(_start_element.children.last, _style)
      if(from_child)
        return from_child
      end
    end
    # begin recursion with previous element
    from_child = search_for_parent_element(_start_element.previous, _style) 
    return from_child ? from_child : false
  else
    return false
  end
end

# this should be a Nokogiri::XML::Element of the nearest, previous h1.
# in this example it's the one with the value 'Foo'
puts parent_element_h1 = search_for_parent_element(@start_here,"h1")

# this should be a Nokogiri::XML::Element of the nearest, previous h2. 
# in this example it's the one with the value 'Bar'
puts parent_element_h2 = search_for_parent_element(@start_here,"h2")

You can copy and paste it and run it as-is as a Ruby script.




Answer 5:


If you don't know the relationship between elements, you can search for them this way (anywhere in the document):


# html code
text = "insert your html here"
# get doc object
doc = Nokogiri::HTML(text)
# get elements with the specified tag
elements = doc.search("//your_tag")

If, however, you need to submit a form, you should use Mechanize:


require 'mechanize'

# create mech object (older releases used the WWW::Mechanize namespace)
mech = Mechanize.new
# load site
mech.get("address")
# select a form, in this case, I select the first form. You can select the one you need 
# from the array
form = mech.page.forms.first
# you fill the fields like this: form.name_of_the_field
form.element_name  = value
form.other_element = other_value



Answer 6:


You can search the descendants of a Nokogiri::XML::Element using CSS selectors. You can move up to its ancestors with the .parent method.

parent_element_h1 = value.css("h1").first.parent
parent_element_h2 = value.css("h2").first.parent


Source: https://stackoverflow.com/questions/657468/how-to-navigate-the-dom-using-nokogiri
