How to parse consecutive tags with Nokogiri?

Submitted by 你离开我真会死。 on 2019-12-30 02:33:08

Question


I have HTML code like this:

<div id="first">
<dt>Label1</dt>
<dd>Value1</dd>

<dt>Label2</dt>
<dd>Value2</dd>

...
</div>

My code does not work.

doc.css("first").each do |item|
  label = item.css("dt")
  value = item.css("dd")
end

It shows all the <dt> tags first and then all the <dd> tags, but I need each pair printed as "label: value".


Answer 1:


First of all, your HTML should have the <dt> and <dd> elements inside a <dl>:

<div id="first">
    <dl>
        <dt>Label1</dt>
        <dd>Value1</dd>
        <dt>Label2</dt>
        <dd>Value2</dd>
        ...
    </dl>
</div>

but that won't change how you parse it. You want to find the <dt>s and iterate over them, then at each <dt> you can use next_element to get the <dd>; something like this:

doc = Nokogiri::HTML('<div id="first"><dl>...')
doc.css('#first').search('dt').each do |node|
    puts "#{node.text}: #{node.next_element.text}"
end

That should work as long as the structure matches your example.




Answer 2:


Under the assumption that some <dt> may have multiple <dd>, you want to find all <dt> and then (for each) find the following <dd> before the next <dt>. This is pretty easy to do in pure Ruby, but more fun to do in just XPath. ;)

Given this setup:

require 'nokogiri'   
html = '<dl id="first">
  <dt>Label1</dt><dd>Value1</dd>
  <dt>Label2</dt><dd>Value2</dd>
  <dt>Label3</dt><dd>Value3a</dd><dd>Value3b</dd>
  <dt>Label4</dt><dd>Value4</dd>
</dl>'    
doc = Nokogiri.HTML(html)

Using no XPath:

doc.css('dt').each do |dt|
  dds = []
  n = dt.next_element
  # Collect consecutive <dd> siblings; stop at the next <dt> or the end of the list
  while n && n.name == 'dd'
    dds << n
    n = n.next_element
  end
  p [dt.text, dds.map(&:text)]
end
#=> ["Label1", ["Value1"]]
#=> ["Label2", ["Value2"]]
#=> ["Label3", ["Value3a", "Value3b"]]
#=> ["Label4", ["Value4"]]

Using a Little XPath:

doc.css('dt').each do |dt|
  dds = dt.xpath('following-sibling::*').chunk{ |n| n.name }.first.last
  p [dt.text,dds.map(&:text)]
end
#=> ["Label1", ["Value1"]]
#=> ["Label2", ["Value2"]]
#=> ["Label3", ["Value3a", "Value3b"]]
#=> ["Label4", ["Value4"]]

Using Lotsa XPath:

doc.css('dt').each do |dt|
  ct = dt.xpath('count(following-sibling::dt)')
  dds = dt.xpath("following-sibling::dd[count(following-sibling::dt)=#{ct}]")
  p [dt.text,dds.map(&:text)]
end
#=> ["Label1", ["Value1"]]
#=> ["Label2", ["Value2"]]
#=> ["Label3", ["Value3a", "Value3b"]]
#=> ["Label4", ["Value4"]]



Answer 3:


After looking at the other answer, here is a less efficient way of doing the same thing.

require 'nokogiri'
a = Nokogiri::HTML('<div id="first"><dt>Label1</dt><dd>Value1</dd><dt>Label2</dt><dd>Value2</dd></div>')

dt = []
dd = []

a.css("#first").each do |item|
  item.css("dt").each {|t| dt << t.text}
  item.css("dd").each {|t| dd << t.text}
end

dt.each_index do |i|
  puts dt[i] + ': ' + dd[i]
end
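
The index-based loop could also be written with Array#zip, which pairs elements by position. This is a small sketch on hypothetical sample arrays (`labels` and `values` stand in for the collected texts) and is not part of the original answer.

```ruby
# Hypothetical sample data standing in for the texts collected above
labels = ['Label1', 'Label2']
values = ['Value1', 'Value2']

# zip pairs elements by position, giving one [label, value] array per index
lines = labels.zip(values).map { |label, value| "#{label}: #{value}" }

puts lines
```

Like the loop it replaces, this pairing is purely positional, so it goes out of step as soon as a <dt> has no <dd> or more than one.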

In CSS selectors, an ID must be prefixed with the # symbol and a class with the . symbol, so the selector in the question should be "#first" rather than "first".



Source: https://stackoverflow.com/questions/8482739/how-to-parse-consecutive-tags-with-nokogiri
