xml_find_all function from xml2 package (R) does not find relevant nodes

Question

I am using the xml2 package in R to access XML data, and found that it behaves differently on different XML documents.

On this pet example

library(xml2)
doc <- read_xml( "<MEMBERS>
                      <CUSTOMER>
                         <ID>178</ID>
                         <FIRST.NAME>Alvaro</FIRST.NAME>
                         <LAST.NAME>Juarez</LAST.NAME>
                         <ADDRESS>123 Park Ave</ADDRESS>
                         <ZIP>57701</ZIP>
                      </CUSTOMER>
                      <CUSTOMER>
                         <ID>934</ID>
                         <FIRST.NAME>Janette</FIRST.NAME>
                         <LAST.NAME>Johnson</LAST.NAME>
                         <ADDRESS>456 Candy Ln</ADDRESS>
                         <ZIP>57701</ZIP>
                      </CUSTOMER>  
                   </MEMBERS>")
doc
{xml_document}
<MEMBERS>
[1] <CUSTOMER>\n  <ID>178</ID>\n  <FIRST.NAME>Alvaro</FIRST.NAME>\n  <LAST.NAME>Juarez</LAST.NAME>\n  <ADDRESS>12 ...
[2] <CUSTOMER>\n  <ID>934</ID>\n  <FIRST.NAME>Janette</FIRST.NAME>\n  <LAST.NAME>Johnson</LAST.NAME>\n  <ADDRESS> ...

I can run the following code

xml_find_all(doc, "//FIRST.NAME")
{xml_nodeset (2)}
[1] <FIRST.NAME>Alvaro</FIRST.NAME>
[2] <FIRST.NAME>Janette</FIRST.NAME>

giving me the expected output (finding all nodes with 'FIRST.NAME' tags).

However, if I perform the same action on this xml file:

example <- read_xml(file.path("~/Downloads", "uniprot_subset.xml"))
example
{xml_document}
<uniprot>
 [1] <entry xmlns="http://uniprot.org/uniprot" dataset="Swiss-Prot" created="2011-06-28" modified="2019-01-16" version="35">\n  <accession>Q6GZX4</accession>\n  <name>001R_FRG3G</name>\n  <protein>\n    <recommendedName>\n      <fullName>Putative tr ...
 [2] <entry xmlns="http://uniprot.org/uniprot" dataset="Swiss-Prot" created="2011-06-28" modified="2019-01-16" version="36">\n  <accession>Q6GZX3</accession>\n  <name>002L_FRG3G</name>\n  <protein>\n    <recommendedName>\n      <fullName>Uncharacter ...
 [3] <entry xmlns="http://uniprot.org/uniprot" dataset="Swiss-Prot" created="2009-06-16" modified="2018-06-20" version="22">\n  <accession>Q197F8</accession>\n  <name>002R_IIV3</name>\n  <protein>\n    <recommendedName>\n      <fullName>Uncharacteri ...
 [4] <entry xmlns="http://uniprot.org/uniprot" dataset="Swiss-Prot" created="2009-06-16" modified="2017-09-27" version="18">\n  <accession>Q197F7</accession>\n  <name>003L_IIV3</name>\n  <protein>\n    <recommendedName>\n      <fullName>Uncharacteri ...
 [5] <entry xmlns="http://uniprot.org/uniprot" dataset="Swiss-Prot" created="2011-06-28" modified="2019-01-16" version="31">\n  <accession>Q6GZX2</accession>\n  <name>003R_FRG3G</name>\n  <protein>\n    <recommendedName>\n      <fullName>Uncharacter ...
 [6] <entry xmlns="http://uniprot.org/uniprot" dataset="Swiss-Prot" created="2011-06-28" modified="2017-09-27" version="29">\n  <accession>Q6GZX1</accession>\n  <name>004R_FRG3G</name>\n  <protein>\n    <recommendedName>\n      <fullName>Uncharacter ...
 [7] <entry xmlns="http://uniprot.org/uniprot" dataset="Swiss-Prot" created="2009-06-16" modified="2017-09-27" version="24">\n  <accession>Q197F5</accession>\n  <name>005L_IIV3</name>\n  <protein>\n    <recommendedName>\n      <fullName>Uncharacteri ...
 [8] <entry xmlns="http://uniprot.org/uniprot" dataset="Swiss-Prot" created="2011-06-28" modified="2019-01-16" version="38">\n  <accession>Q6GZX0</accession>\n  <name>005R_FRG3G</name>\n  <protein>\n    <recommendedName>\n      <fullName>Uncharacter ...
 [9] <entry xmlns="http://uniprot.org/uniprot" dataset="Swiss-Prot" created="2009-06-16" modified="2019-01-16" version="44">\n  <accession>Q91G88</accession>\n  <name>006L_IIV6</name>\n  <protein>\n    <recommendedName>\n      <fullName>Putative Kil ...
[10] <entry xmlns="http://uniprot.org/uniprot" dataset="Swiss-Prot" created="2011-06-28" modified="2017-09-27" version="27">\n  <accession>Q6GZW9</accession>\n  <name>006R_FRG3G</name>\n  <protein>\n    <recommendedName>\n      <fullName>Uncharacter ...

it behaves differently

xml_find_all(example, "//accession")
{xml_nodeset (0)}

Basically, it does not find any nodes with the 'accession' tag, even though they exist and can be accessed with other functions, for instance:

xml_children(xml_children(example)[1])[1]
{xml_nodeset (1)}
[1] <accession>Q6GZX4</accession>

Can anyone tell me why the xml_find_all function does not find any nodes in the latter example?

Answer 1:

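This happens because your pet example does not contain namespaces, but the second XML file does: each entry element declares the default namespace http://uniprot.org/uniprot, and an unprefixed name in an XPath expression such as //accession only matches elements that are in no namespace. You can list the namespaces of a document with xml_ns():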

library(magrittr)  # provides the %>% pipe used in this answer
example %>% xml_ns()

d1  <-> http://uniprot.org/uniprot
d2  <-> http://uniprot.org/uniprot
d3  <-> http://uniprot.org/uniprot
d4  <-> http://uniprot.org/uniprot
d5  <-> http://uniprot.org/uniprot
d6  <-> http://uniprot.org/uniprot
d7  <-> http://uniprot.org/uniprot
d8  <-> http://uniprot.org/uniprot
d9  <-> http://uniprot.org/uniprot
d10 <-> http://uniprot.org/uniprot
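The effect can be reproduced without the UniProt file. The following is a minimal sketch on a hypothetical two-entry document (the inline XML and the variable names are made up for illustration) in which each entry declares the same default namespace, as in the printout in the question:

```r
library(xml2)

# Hypothetical stand-in for uniprot_subset.xml: every <entry> declares
# the default namespace, the root <uniprot> does not.
ns_doc <- read_xml('<uniprot>
  <entry xmlns="http://uniprot.org/uniprot"><accession>Q6GZX4</accession></entry>
  <entry xmlns="http://uniprot.org/uniprot"><accession>Q6GZX3</accession></entry>
</uniprot>')

# One auto-generated prefix per declaration, all pointing at the same URI.
xml_ns(ns_doc)

# Unprefixed name: only matches elements in no namespace, so nothing is found.
unqualified <- xml_find_all(ns_doc, "//accession")
length(unqualified)

# Prefixed name: d1 resolves to the UniProt URI, so both accessions match.
qualified <- xml_find_all(ns_doc, "//d1:accession")
length(qualified)
```

The first query should return an empty nodeset and the second one both accession nodes.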

Since each entry has the same namespace, in this case the simplest approach is probably to strip (remove) the namespaces:

example %>% xml_ns_strip()

Because xml_ns_strip() modifies the document in place, xml_find_all should now work as expected:

example %>% xml_find_all("//accession")

{xml_nodeset (10)}
 [1] <accession>Q6GZX4</accession>
 [2] <accession>Q6GZX3</accession>
 [3] <accession>Q197F8</accession>
 [4] <accession>Q197F7</accession>
 [5] <accession>Q6GZX2</accession>
 [6] <accession>Q6GZX1</accession>
 [7] <accession>Q197F5</accession>
 [8] <accession>Q6GZX0</accession>
 [9] <accession>Q91G88</accession>
[10] <accession>Q6GZW9</accession>
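To try the stripping step without the original file, here is a small sketch on a hypothetical two-entry document (the inline XML and variable names are assumptions for illustration). It runs the same query before and after stripping and then pulls out the accession strings with xml_text():

```r
library(xml2)

# Hypothetical stand-in for the UniProt file.
strip_doc <- read_xml('<uniprot>
  <entry xmlns="http://uniprot.org/uniprot"><accession>Q6GZX4</accession></entry>
  <entry xmlns="http://uniprot.org/uniprot"><accession>Q6GZX3</accession></entry>
</uniprot>')

# Before stripping, the unprefixed query finds nothing.
n_before <- length(xml_find_all(strip_doc, "//accession"))

# xml_ns_strip() changes strip_doc itself; no reassignment is needed.
xml_ns_strip(strip_doc)

# After stripping, the same query matches, and xml_text() extracts the values.
accessions <- xml_text(xml_find_all(strip_doc, "//accession"))
accessions
```

Since the original namespace declarations are gone after this call, re-read the file if the namespaced version is needed later.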

If you want to retain the namespaces, you can query the accessions with a namespace prefix instead:

example %>% xml_find_all("//d1:accession")

which works in this case because the default name d1 given to the namespace for the first entry maps to the same namespace for all entries.
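Relying on the auto-generated name d1 can be a little opaque. Two possible alternatives that also keep the namespaces are sketched below on a hypothetical two-entry document: renaming the prefix with xml_ns_rename() (the prefix "up" is an arbitrary choice, not something defined by UniProt or xml2), or matching on the local element name with the XPath function local-name():

```r
library(xml2)

# Hypothetical stand-in for the UniProt file.
keep_doc <- read_xml('<uniprot>
  <entry xmlns="http://uniprot.org/uniprot"><accession>Q6GZX4</accession></entry>
  <entry xmlns="http://uniprot.org/uniprot"><accession>Q6GZX3</accession></entry>
</uniprot>')

# Option 1: give the auto-generated prefix d1 a descriptive name and pass
# the renamed namespace mapping to xml_find_all().
ns <- xml_ns_rename(xml_ns(keep_doc), d1 = "up")
by_prefix <- xml_find_all(keep_doc, "//up:accession", ns)

# Option 2: ignore namespaces in the XPath itself by testing the local name.
by_local <- xml_find_all(keep_doc, "//*[local-name() = 'accession']")

xml_text(by_prefix)
xml_text(by_local)
```

Both queries should return the same two accession nodes. The local-name() form also matches an accession element from any other namespace, so the prefixed form is the stricter of the two.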

Source: https://stackoverflow.com/questions/55727236/xml-find-all-function-from-xml2-package-r-does-not-find-relevant-nodes

Tags

xml

xml2