Question
I have data that looks like:
<release>
<artists>
<artist>
<name>Johnny Mnemonic</name>
</artist>
<artist>
<name>Constantine</name>
</artist>
</artists>
</release>
<release>
<artists>
<artist>
<name>Speed</name>
</artist>
<artist>
<name>The Matrix</name>
</artist>
</artists>
</release>
...and so on.
For each release I want only the data from the first <artist> tag. I tried the following code, but it pulls all the text from the artists:
page = Nokogiri::XML(open("37.xml"))
page.xpath("//artists[1]").each do |el|
  File.open("#{LOCAL_DIR}/37.txt", 'a') { |f| f.write(el) }
end
Answer 1:
Nokogiri supports two main types of searches, search and at. search returns a NodeSet, which you should think of like an array. at returns a Node. Either can take a CSS or XPath expression. I prefer CSS selectors since they're more readable, but sometimes you can't easily get where you want to be with one, so try the other.
For your question, it's important to specify the node you want to extract the text from, using text. If your result is too broad, you'll get the text from between tags in addition to the text inside the tag you want. To avoid that, drill down to the node closest to the text you're trying to read:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<release>
<artists>
<artist>
<name>Johnny Mnemonic</name>
</artist>
<artist>
<name>Constantine</name>
</artist>
</artists>
</release>
EOT
Because these look for the name node specifically, the desired text is easy to get without garbage:
doc.at('name').text # => "Johnny Mnemonic"
doc.at('artist name').text # => "Johnny Mnemonic"
doc.at('artists artist name').text # => "Johnny Mnemonic"
These are looser searches so more junk is returned:
doc.at('artist').text # => "\n Johnny Mnemonic\n "
doc.at('artists').text # => "\n \n Johnny Mnemonic\n \n \n Constantine\n \n \n\n"
Using search returns multiple nodes:
doc.search('name').map(&:text)
[
[0] "Johnny Mnemonic",
[1] "Constantine"
]
doc.search('artist').map(&:text)
[
[0] "\n Johnny Mnemonic\n ",
[1] "\n Constantine\n "
]
The only real difference between search and at is that at is like search(...).first.
See "How to avoid joining all text from Nodes when scraping" also.
Nokogiri has some additional aliases for convenience: at_css and css, and at_xpath and xpath.
Here are alternate ways, using CSS and XPath accessors to get at the names, clipped from Pry:
[5] (pry) main: 0> # using CSS with Ruby
[6] (pry) main: 0> artists = doc.search('release').map{ |release| release.at('artist').text.strip }
[
[0] "Johnny Mnemonic",
[1] "Speed"
]
[7] (pry) main: 0> # using CSS with less Ruby
[8] (pry) main: 0> artists = doc.search('release artists artist:nth-child(1) name').map{ |n| n.text }
[
[0] "Johnny Mnemonic",
[1] "Speed"
]
[9] (pry) main: 0>
[10] (pry) main: 0> # using XPath
[11] (pry) main: 0> artists = doc.search('release/artists/artist[1]/name').map{ |t| t.content }
[
[0] "Johnny Mnemonic",
[1] "Speed"
]
[12] (pry) main: 0> # using more XPath
[13] (pry) main: 0> artists = doc.search('release/artists/artist[1]/name/text()').map{ |t| t.content }
[
[0] "Johnny Mnemonic",
[1] "Speed"
]
Answer 2:
Your XPath expression selects the <artists> elements, not each <artist> tag as you seem to expect. Try this:
doc.search('artists artist').map(&:text)
Your expression "//artists" retrieves all 'artists' tags, and the [1] predicate then keeps each <artists> element that is the first <artists> child of its parent. It does not select the first element inside the tag itself.
Source: https://stackoverflow.com/questions/15485940/how-to-collect-the-first-of-several-elements-of-a-node-in-nokogiri