Using Nokogiri to parse Google Picasa API XML - namespacing issue?

Question

I am trying to extract some data from Google Picasa XML and am having a bit of trouble.

Here is the actual XML (containing just one entry): http://pastie.org/1736008

Basically, I would like to collect a few of the gphoto attributes, so ideally I would like to do this:

doc.xpath('//entry').map do |entry|
  {:id => entry.children['gphoto:id'],
   :thumb => entry.children['gphoto:thumbnail'],
   :name => entry.children['gphoto:name'],
   :count => entry.children['gphoto:numphotos']}
end

However, this does not work. In fact, when I examine the children of entry, I do not see any 'gphoto:xxx' attributes at all, so I am quite confused about how to find them.

Thanks!

Answer 1:

Here's some working code that uses Nokogiri to extract the gphoto elements from your example XML.

#!/usr/bin/env ruby
require 'rubygems'
require 'nokogiri'
content = File.read('input.xml')
doc = Nokogiri::XML(content) do |config|
  config.options = Nokogiri::XML::ParseOptions::STRICT
end

hashes = doc.xpath('//xmlns:entry').map do |entry|
  {
    :id => entry.xpath('gphoto:id').inner_text,
    :thumb => entry.parent.xpath('gphoto:thumbnail').inner_text,
    :name => entry.xpath('gphoto:name').inner_text,
    :count => entry.xpath('gphoto:numphotos').inner_text
  }
end

puts hashes.inspect

# yields: 
#
# [{:count=>"37", :name=>"Melody19Months", :thumb=>"http://lh3.ggpht.com/_Viv8WkAChHU/AAAAAAAAAAA/AAAAAAAAAAA/pNuu5PgnP1Y/s64-c/soopingsaw.jpg", :id=>"5582695833628950881"}]

Notes:

The sample XML in your paste needed a closing "feed" tag. Fixed here.
In the XPath expression that finds the entry elements we must use a namespace prefix, so "xmlns:entry", not just "entry". The latter (used in your original code) finds no elements because it looks for elements in the null namespace, whereas in your example every element inherits the default namespace declared on the feed element. Aaron Patterson wrote a (Nokogiri-centric) introduction to the problem, here, and there's another here.
The element gphoto:thumbnail is a child of the feed element, not of each entry. I have made a small (hacky) adjustment for that, keeping to the design of your original example, but it would be far better to look up the value of this element only once per feed (perhaps populating the entry hashes afterwards if each really needs to keep a copy).
Configuring Nokogiri to be strict is not actually necessary, but it's nice to get a little help in spotting problems early.

Answer 2:

You can search for the entry nodes, then look inside each one to extract the gphoto namespaced nodes:

require 'nokogiri'

doc = Nokogiri::XML(File.read('./test.xml'))
hashes = doc.search('//xmlns:entry').map do |entry|
  h = {}
  entry.search("*[namespace-uri()='http://schemas.google.com/photos/2007']").each do |gphoto|
    h[gphoto.name] = gphoto.text
  end
  h
end

require 'ap'
ap hashes
# >> [
# >>     [0] {
# >>                        "id" => "5582695833628950881",
# >>                      "name" => "Melody19Months",
# >>                  "location" => "",
# >>                    "access" => "public",
# >>                 "timestamp" => "1299649559000",
# >>                 "numphotos" => "37",
# >>                      "user" => "soopingsaw",
# >>                  "nickname" => "sooping",
# >>         "commentingEnabled" => "true",
# >>              "commentCount" => "0"
# >>     }
# >> ]

That returns all the //entry/gphoto:* nodes. If you want only certain ones, you can filter for them:

require 'nokogiri'

doc = Nokogiri::XML(File.read('./test.xml'))
hashes = doc.search('//xmlns:entry').map do |entry|
  h = {}
  entry.search("*[namespace-uri()='http://schemas.google.com/photos/2007']").each do |gphoto|
    h[gphoto.name] = gphoto.text if (%w[id thumbnail name numphotos].include?(gphoto.name))
  end
  h
end

require 'ap'
ap hashes

# >> [
# >>     [0] {
# >>                "id" => "5582695833628950881",
# >>              "name" => "Melody19Months",
# >>         "numphotos" => "37"
# >>     }
# >> ]

Notice that the original question tries to access gphoto:thumbnail; however, there is no node matching //entry/gphoto:thumbnail, so it can't be found.

Another way to write the search using the namespace is:

require 'nokogiri'

doc = Nokogiri::XML(File.read('./test.xml'))
hashes = doc.search('//xmlns:entry').map do |entry|
  h = {}
  entry.search("*").each do |gphoto|
    h[gphoto.name] = gphoto.text if (
      (gphoto.namespace.prefix=='gphoto') && 
      (%w[id thumbnail name numphotos].include?(gphoto.name))
    )
  end
  h
end

require 'ap'
ap hashes

# >> [
# >>     [0] {
# >>                "id" => "5582695833628950881",
# >>              "name" => "Melody19Months",
# >>         "numphotos" => "37"
# >>     }
# >> ]

Rather than filtering by namespace inside the XPath expression, this asks Nokogiri for each node's namespace object and compares its prefix.

Source: https://stackoverflow.com/questions/5490640/using-nokogiri-to-parse-google-picasa-api-xml-namespacing-issue

Tags

xml

google-api

nokogiri