Question
I'm trying to fill the variables parent_element_h1 and parent_element_h2. Can anyone help me use Nokogiri to get the information I need into those variables?
require 'rubygems'
require 'nokogiri'
value = Nokogiri::HTML.parse(<<-HTML_END)
<html>
<body>
<p id='para-1'>A</p>
<div class='block' id='X1'>
<h1>Foo</h1>
<p id='para-2'>B</p>
</div>
<p id='para-3'>C</p>
<h2>Bar</h2>
<p id='para-4'>D</p>
<p id='para-5'>E</p>
<div class='block' id='X2'>
<p id='para-6'>F</p>
</div>
</body>
</html>
HTML_END
parent = value.css('body').first
# start_here is given: a Nokogiri::XML::Element of the <div> with the id 'X2'
start_here = parent.at('div.block#X2')
# this should be a Nokogiri::XML::Element of the nearest, previous h1.
# in this example it's the one with the value 'Foo'
parent_element_h1 =
# this should be a Nokogiri::XML::Element of the nearest, previous h2.
# in this example it's the one with the value 'Bar'
parent_element_h2 =
Please note: The start_here element could be anywhere inside the document. The HTML data is just an example. That said, the headers <h1> and <h2> could be a sibling of start_here or a child of a sibling of start_here.
The following recursive method is a good starting point, but it doesn't work on <h1> because it's a child of a sibling of start_here:
def search_element(_block,_style)
  unless _block.nil?
    if _block.name == _style
      return _block
    else
      search_element(_block.previous,_style)
    end
  else
    return false
  end
end
parent_element_h1 = search_element(start_here,'h1')
parent_element_h2 = search_element(start_here,'h2')
After accepting an answer, I came up with my own solution. It works like a charm and I think it's pretty cool.
Answer 1:
I came across this a few years too late I suppose, but felt compelled to post because all the other solutions are way too complicated.
It's a single statement with XPath:
start = doc.at('div.block#X2')
start.at_xpath('(preceding-sibling::h1 | preceding-sibling::*//h1)[last()]')
#=> <h1>Foo</h1>
start.at_xpath('(preceding-sibling::h2 | preceding-sibling::*//h2)[last()]')
#=> <h2>Bar</h2>
This accommodates either direct previous siblings or children of previous siblings. Regardless of which one matches, the last() predicate ensures that you get the closest previous match.
Answer 2:
The approach I would take (if I am understanding your problem) is to use XPath or CSS to search for your "start_here" element and the parent element that you want to search under. Then, recursively walk the tree starting at parent, stopping when you hit the "start_here" element, and holding onto the last element that matches your style along the way.
Something like:
parent = value.search("//body").first
div = value.search("//div[@id = 'X2']").first
find = FindPriorTo.new(div)
assert_equal('Foo', find.find_from(parent, 'h1').text)
assert_equal('Bar', find.find_from(parent, 'h2').text)
Where FindPriorTo is a simple class to handle the recursion:
class FindPriorTo
  def initialize(stop_element)
    @stop_element = stop_element
  end

  def find_from(parent, style)
    @should_stop = nil
    @last_style = nil
    recursive_search(parent, style)
  end

  def recursive_search(parent, style)
    parent.children.each do |ch|
      recursive_search(ch, style)
      return @last_style if @should_stop
      @should_stop = (ch == @stop_element)
      @last_style = ch if ch.name == style
    end
    @last_style
  end
end
If this approach isn't scalable enough, then you might be able to optimize things by rewriting recursive_search to not use recursion, and also pass in both of the styles you are looking for and keep track of the last one found, so you don't have to traverse the tree an extra time.
I'd also say try monkey patching Node to hook on when the document is getting parsed, but it looks like all of that is written in C. Perhaps you might be better served using something other than Nokogiri that has a native Ruby SAX parser (maybe REXML), or if speed is your real concern, do the search portion in C/C++ using Xerces or similar. I don't know how well these will deal with parsing HTML though.
Answer 3:
Maybe this will do it. I'm not sure about the performance and if there might be some cases that I haven't thought of.
def find(root, start, tag)
  ps, res = start, nil
  until res or (ps == root)
    ps = ps.previous || ps.parent
    res = ps.css(tag).last
    res ||= ps.name == tag ? ps : nil
  end
  res || "Not found!"
end
parent_element_h1 = find(parent, start_here, 'h1')
Answer 4:
This is my own solution (kudos to my co-worker for helping me on this one!) using a recursive method to parse all elements regardless of being a sibling or a child of another sibling.
require 'rubygems'
require 'nokogiri'
value = Nokogiri::HTML.parse(<<-HTML_END)
<html>
<body>
<p id='para-1'>A</p>
<div class='block' id='X1'>
<h1>Foo</h1>
<p id='para-2'>B</p>
</div>
<p id='para-3'>C</p>
<h2>Bar</h2>
<p id='para-4'>D</p>
<p id='para-5'>E</p>
<div class='block' id='X2'>
<p id='para-6'>F</p>
</div>
</body>
</html>
HTML_END
parent = value.css('body').first
# start_here is given: a Nokogiri::XML::Element of the <div> with the id 'X2'
@start_here = parent.at('div.block#X2')
# Search for parent elements of kind "_style" starting from _start_element
def search_for_parent_element(_start_element, _style)
  unless _start_element.nil?
    # have we already found what we're looking for?
    if _start_element.name == _style
      return _start_element
    end
    # _start_element is a div.block and not the _start_element itself
    if _start_element[:class] == "block" && _start_element[:id] != @start_here[:id]
      # begin recursion with last child inside div.block
      from_child = search_for_parent_element(_start_element.children.last, _style)
      if(from_child)
        return from_child
      end
    end
    # begin recursion with previous element
    from_child = search_for_parent_element(_start_element.previous, _style)
    return from_child ? from_child : false
  else
    return false
  end
end
# this should be a Nokogiri::XML::Element of the nearest, previous h1.
# in this example it's the one with the value 'Foo'
puts parent_element_h1 = search_for_parent_element(@start_here,"h1")
# this should be a Nokogiri::XML::Element of the nearest, previous h2.
# in this example it's the one with the value 'Bar'
puts parent_element_h2 = search_for_parent_element(@start_here,"h2")
You can copy/paste it and run it as is as a Ruby script.
Answer 5:
If you don't know the relationship between elements, you can search for them anywhere in the document this way:
# html code
text = "insert your html here"
# get doc object
doc = Nokogiri::HTML(text)
# get elements with the specified tag
elements = doc.search("//your_tag")
If, however, you need to submit a form, you should use mechanize:
# create mech object
mech = WWW::Mechanize.new
# load site
mech.get("address")
# select a form, in this case, I select the first form. You can select the one you need
# from the array
form = mech.page.forms.first
# you fill the fields like this: form.name_of_the_field
form.element_name = value
form.other_element = other_value
Answer 6:
You can search the descendants of a Nokogiri::XML::Element using CSS selectors. You can traverse ancestors with the .parent method.
parent_element_h1 = value.css("h1").first.parent
parent_element_h2 = value.css("h2").first.parent
Source: https://stackoverflow.com/questions/657468/how-to-navigate-the-dom-using-nokogiri