Question
I have the following HTML:
<div id="test_id">
<p>Some words.</p>
<p>Some more words.</p>
<p>Even more words.</p>
</div>
If I parse the HTML using:
doc = Nokogiri::HTML(open("http://my_url"))
and run
doc.css('#test_id').text
in the console I get:
=> "Some words.\nSome more words.\nEven more words"
How do I get the first <p> element only?
I think I figured it out with .children
doc.css('#test_id').children[0].text
Is this the correct way to do this?
Answer 1:
The problem is that you're not using text on the right type of object.
If you're looking at a NodeSet, the text documentation says:
Get the inner text of all contained Node objects
If you're looking at a Node AKA Element, it says:
Returns the content for this Node
Here's the difference:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<div id="test_id">
<p>Some words.</p>
<p>Some more words.</p>
<p>Even more words.</p>
</div>
EOT
doc.search('p').class # => Nokogiri::XML::NodeSet
doc.search('p').text # => "Some words.Some more words.Even more words."
doc.at('p').class # => Nokogiri::XML::Element
doc.at('p').text # => "Some words."
at is like search(...).first.
Typically, if we want the text of a NodeSet we'd use:
doc.search('p').map(&:text) # => ["Some words.", "Some more words.", "Even more words."]
which makes it easy to pick the text of a specific node.
See "How to avoid joining all text from Nodes when scraping" also.
doc.css('#test_id').children[0].text
Well, yeah, you can do that, but children isn't going to do the same thing:
doc.search('#test_id').children
# => [#<Nokogiri::XML::Text:0x3fc31580ca24 "\n ">, #<Nokogiri::XML::Element:0x3fc315103714 name="p" children=[#<Nokogiri::XML::Text:0x3fc31580d5a0 "Some words.">]>, #<Nokogiri::XML::Text:0x3fc315107f44 "\n ">, #<Nokogiri::XML::Element:0x3fc3151036ec name="p" children=[#<Nokogiri::XML::Text:0x3fc315107cc4 "Some more words.">]>, #<Nokogiri::XML::Text:0x3fc315107b20 "\n ">, #<Nokogiri::XML::Element:0x3fc3151036c4 name="p" children=[#<Nokogiri::XML::Text:0x3fc3151078a0 "Even more words.">]>, #<Nokogiri::XML::Text:0x3fc3151076fc "\n">]
doc.search('#test_id').children[0] # => #<Nokogiri::XML::Text:0x3fc31580ca24 "\n ">
doc.search('#test_id').children[1] # => #<Nokogiri::XML::Element:0x3fc315103714 name="p" children=[#<Nokogiri::XML::Text:0x3fc31580d5a0 "Some words.">]>
versus:
doc.search('#test_id p')
# => [#<Nokogiri::XML::Element:0x3fc315103714 name="p" children=[#<Nokogiri::XML::Text:0x3fc31580d5a0 "Some words.">]>, #<Nokogiri::XML::Element:0x3fc3151036ec name="p" children=[#<Nokogiri::XML::Text:0x3fc315107cc4 "Some more words.">]>, #<Nokogiri::XML::Element:0x3fc3151036c4 name="p" children=[#<Nokogiri::XML::Text:0x3fc3151078a0 "Even more words.">]>]
doc.search('#test_id p')[0] # => #<Nokogiri::XML::Element:0x3fc315103714 name="p" children=[#<Nokogiri::XML::Text:0x3fc31580d5a0 "Some words.">]>
doc.search('#test_id p')[1] # => #<Nokogiri::XML::Element:0x3fc3151036ec name="p" children=[#<Nokogiri::XML::Text:0x3fc315107cc4 "Some more words.">]>
Notice how children is returning the text nodes between the tags, which are the whitespace used to format the HTML. You have to be aware that children returns every node directly below the selected tag, text nodes included. This is useful sometimes, but for general text retrieval it's probably not what you want.
Instead, use the more selective '#test_id p' selector and iterate over the returned NodeSet. That way you'll avoid the formatting text nodes and won't have to account for them when slicing or indexing into the NodeSet.
Answer 2:
You can also try this.
$("p:first-child").text();
This selects every <p> that is the first child of its parent, whatever that parent element is, so it should work for your example.
Source: https://stackoverflow.com/questions/40853554/how-to-access-multiple-p-tags-one-at-a-time