How to access multiple <p> tags one at a time

Posted by 这一生的挚爱 on 2019-12-25 08:23:54

Question


I have the following HTML:

<div id="test_id">
    <p>Some words.</p>
    <p>Some more words.</p>
    <p>Even more words.</p>
</div>

If I parse the HTML using:

doc = Nokogiri::HTML(URI.open("http://my_url"))  # needs require 'nokogiri' and require 'open-uri'

and run

doc.css('#test_id').text

in the console I get:

=> "Some words.\nSome more words.\nEven more words"

How do I get the first <p> element only?


I think I figured it out with .children

doc.css('#test_id').children[0].text

Is this the correct way to do this?


Answer 1:


The problem is that you're not using text on the right type of object.

If you're looking at a NodeSet, the documentation for text says:

Get the inner text of all contained Node objects

If you're looking at a Node, such as an Element, it says:

Returns the content for this Node

Here's the difference:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<div id="test_id">
    <p>Some words.</p>
    <p>Some more words.</p>
    <p>Even more words.</p>
</div>
EOT

doc.search('p').class  # => Nokogiri::XML::NodeSet
doc.search('p').text  # => "Some words.Some more words.Even more words."

doc.at('p').class  # => Nokogiri::XML::Element
doc.at('p').text  # => "Some words."

at is like search(...).first.

Typically, if we want the text of a NodeSet we'd use:

doc.search('p').map(&:text)  # => ["Some words.", "Some more words.", "Even more words."]

which makes it easy to pick the text of a specific node.

See "How to avoid joining all text from Nodes when scraping" also.

doc.css('#test_id').children[0].text

You can do that, but children doesn't return the same nodes as selecting the <p> tags directly:

doc.search('#test_id').children
# => [#<Nokogiri::XML::Text:0x3fc31580ca24 "\n    ">, #<Nokogiri::XML::Element:0x3fc315103714 name="p" children=[#<Nokogiri::XML::Text:0x3fc31580d5a0 "Some words.">]>, #<Nokogiri::XML::Text:0x3fc315107f44 "\n    ">, #<Nokogiri::XML::Element:0x3fc3151036ec name="p" children=[#<Nokogiri::XML::Text:0x3fc315107cc4 "Some more words.">]>, #<Nokogiri::XML::Text:0x3fc315107b20 "\n    ">, #<Nokogiri::XML::Element:0x3fc3151036c4 name="p" children=[#<Nokogiri::XML::Text:0x3fc3151078a0 "Even more words.">]>, #<Nokogiri::XML::Text:0x3fc3151076fc "\n">]
doc.search('#test_id').children[0] # => #<Nokogiri::XML::Text:0x3fc31580ca24 "\n    ">
doc.search('#test_id').children[1] # => #<Nokogiri::XML::Element:0x3fc315103714 name="p" children=[#<Nokogiri::XML::Text:0x3fc31580d5a0 "Some words.">]>

versus:

doc.search('#test_id p')
# => [#<Nokogiri::XML::Element:0x3fc315103714 name="p" children=[#<Nokogiri::XML::Text:0x3fc31580d5a0 "Some words.">]>, #<Nokogiri::XML::Element:0x3fc3151036ec name="p" children=[#<Nokogiri::XML::Text:0x3fc315107cc4 "Some more words.">]>, #<Nokogiri::XML::Element:0x3fc3151036c4 name="p" children=[#<Nokogiri::XML::Text:0x3fc3151078a0 "Even more words.">]>]
doc.search('#test_id p')[0] # => #<Nokogiri::XML::Element:0x3fc315103714 name="p" children=[#<Nokogiri::XML::Text:0x3fc31580d5a0 "Some words.">]>
doc.search('#test_id p')[1] # => #<Nokogiri::XML::Element:0x3fc3151036ec name="p" children=[#<Nokogiri::XML::Text:0x3fc315107cc4 "Some more words.">]>

Notice how children also returns the whitespace text nodes that sit between the tags and exist only to format the HTML. Be aware that children returns every child node of the selected tag, text nodes included, not just the elements. That is sometimes useful, but for general text retrieval it's probably not what you want.

Instead, use the more selective '#test_id p' selector and iterate over the returned NodeSet. That avoids the formatting text nodes, so you don't have to account for them when slicing or indexing into the NodeSet.




Answer 2:


You can also try this.

$("p:first-child").text();

This jQuery selector matches every <p> that is the first child of its parent, whatever that parent is, so it should work for your example.



Source: https://stackoverflow.com/questions/40853554/how-to-access-multiple-p-tags-one-at-a-time
