How to access multiple <p> tags one at a time

Question

I have the following HTML:

<div id="test_id">
    <p>Some words.</p>
    <p>Some more words.</p>
    <p>Even more words.</p>
</div>

If I parse the HTML using:

doc = Nokogiri::HTML(open("http://my_url"))

and run

doc.css('#test_id').text

in the console I get:

=> "Some words.\nSome more words.\nEven more words"

How do I get the first <p> element only?

I think I figured it out with .children

doc.css('#test_id').children[0].text

Is this the correct way to do this?

Answer 1:

The problem is that you're not using text on the right type of object.

If you're looking at a NodeSet the text documentation says:

Get the inner text of all contained Node objects

If you're looking at a Node AKA Element, it says:

Returns the content for this Node

Here's the difference:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<div id="test_id">
    <p>Some words.</p>
    <p>Some more words.</p>
    <p>Even more words.</p>
</div>
EOT

doc.search('p').class  # => Nokogiri::XML::NodeSet
doc.search('p').text  # => "Some words.Some more words.Even more words."

doc.at('p').class  # => Nokogiri::XML::Element
doc.at('p').text  # => "Some words."

at is like search(...).first.

Typically, if we want the text of a NodeSet we'd use:

doc.search('p').map(&:text)  # => ["Some words.", "Some more words.", "Even more words."]

which makes it easy to pick the text of a specific node.

See "How to avoid joining all text from Nodes when scraping" also.

doc.css('#test_id').children[0].text

Well, yeah, you can do that, but children isn't going to do the same thing:

doc.search('#test_id').children
# => [#<Nokogiri::XML::Text:0x3fc31580ca24 "\n    ">, #<Nokogiri::XML::Element:0x3fc315103714 name="p" children=[#<Nokogiri::XML::Text:0x3fc31580d5a0 "Some words.">]>, #<Nokogiri::XML::Text:0x3fc315107f44 "\n    ">, #<Nokogiri::XML::Element:0x3fc3151036ec name="p" children=[#<Nokogiri::XML::Text:0x3fc315107cc4 "Some more words.">]>, #<Nokogiri::XML::Text:0x3fc315107b20 "\n    ">, #<Nokogiri::XML::Element:0x3fc3151036c4 name="p" children=[#<Nokogiri::XML::Text:0x3fc3151078a0 "Even more words.">]>, #<Nokogiri::XML::Text:0x3fc3151076fc "\n">]
doc.search('#test_id').children[0] # => #<Nokogiri::XML::Text:0x3fc31580ca24 "\n    ">
doc.search('#test_id').children[1] # => #<Nokogiri::XML::Element:0x3fc315103714 name="p" children=[#<Nokogiri::XML::Text:0x3fc31580d5a0 "Some words.">]>

versus:

doc.search('#test_id p')
# => [#<Nokogiri::XML::Element:0x3fc315103714 name="p" children=[#<Nokogiri::XML::Text:0x3fc31580d5a0 "Some words.">]>, #<Nokogiri::XML::Element:0x3fc3151036ec name="p" children=[#<Nokogiri::XML::Text:0x3fc315107cc4 "Some more words.">]>, #<Nokogiri::XML::Element:0x3fc3151036c4 name="p" children=[#<Nokogiri::XML::Text:0x3fc3151078a0 "Even more words.">]>]
doc.search('#test_id p')[0] # => #<Nokogiri::XML::Element:0x3fc315103714 name="p" children=[#<Nokogiri::XML::Text:0x3fc31580d5a0 "Some words.">]>
doc.search('#test_id p')[1] # => #<Nokogiri::XML::Element:0x3fc3151036ec name="p" children=[#<Nokogiri::XML::Text:0x3fc315107cc4 "Some more words.">]>

Notice how children also returns the whitespace text nodes that sit between the tags and exist only to format the HTML. Be aware that children returns every node directly below the selected tag, text nodes included, not just elements. This is useful sometimes, but for general text retrieval it's probably not what you want.

Instead, use the more selective '#test_id p' selector and iterate over the returned NodeSet. That avoids the formatting text nodes, so you don't have to account for them when slicing or indexing into the NodeSet.

Answer 2:

You can also try this.

$("p:first-child").text();

This jQuery selector matches every <p> that is the first child of its parent, whatever that parent is, so it should work for your example.

Source: https://stackoverflow.com/questions/40853554/how-to-access-multiple-p-tags-one-at-a-time

Tags

ruby-on-rails

ruby

nokogiri
