Suppose I have an example XML file and want the XPath of each element in it.
You could do this pretty easily with XSLT. Looking at your examples, it seems like you only want the XPath of elements that contain text. If that's not the case, let me know and I can update the XSLT.
I created a new input example to show how it handles siblings with the same name. In this case, <article>.
XML Input
<ns1:create xmlns:ns1='http://predic8.com/wsdl/material/ArticleService/1/'>
  <article xmlns:ns1='http://predic8.com/material/1/'>
    <name xmlns:ns1='http://predic8.com/material/1/'>foo</name>
    <description xmlns:ns1='http://predic8.com/material/1/'>bar</description>
    <price xmlns:ns1='http://predic8.com/common/1/'>
      <amount xmlns:ns1='http://predic8.com/common/1/'>00.00</amount>
      <currency xmlns:ns1='http://predic8.com/common/1/'>USD</currency>
    </price>
    <id xmlns:ns1='http://predic8.com/material/1/'>1</id>
  </article>
  <article xmlns:ns1='http://predic8.com/material/2/'>
    <name xmlns:ns1='http://predic8.com/material/2/'>some name</name>
    <description xmlns:ns1='http://predic8.com/material/2/'>some description</description>
    <price xmlns:ns1='http://predic8.com/common/2/'>
      <amount xmlns:ns1='http://predic8.com/common/2/'>00.01</amount>
      <currency xmlns:ns1='http://predic8.com/common/2/'>USD</currency>
    </price>
    <id xmlns:ns1='http://predic8.com/material/2/'>2</id>
  </article>
</ns1:create>
XSLT 1.0
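<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:strip-space elements="*"/>

  <!-- Suppress the built-in rule that copies text nodes to the output. -->
  <xsl:template match="text()"/>

  <!-- Emit a path for every element that has a text node child. -->
  <xsl:template match="*[text()]">
    <xsl:call-template name="genPath"/>
    <!-- Select node() only: attributes would fall through to the built-in
         rule and their values would leak into the output. -->
    <xsl:apply-templates select="node()"/>
  </xsl:template>

  <!-- Walk up from the context element, prepending one positional step per ancestor. -->
  <xsl:template name="genPath">
    <xsl:param name="prevPath"/>
    <xsl:variable name="currPath" select="concat('/',local-name(),'[',
      count(preceding-sibling::*[name() = name(current())])+1,']',$prevPath)"/>
    <xsl:for-each select="parent::*">
      <xsl:call-template name="genPath">
        <xsl:with-param name="prevPath" select="$currPath"/>
      </xsl:call-template>
    </xsl:for-each>
    <!-- At the document element the path is complete, so write it out. -->
    <xsl:if test="not(parent::*)">
      <xsl:value-of select="$currPath"/>
      <xsl:text>&#xA;</xsl:text>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>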
Output
/create[1]/article[1]/name[1]
/create[1]/article[1]/description[1]
/create[1]/article[1]/price[1]/amount[1]
/create[1]/article[1]/price[1]/currency[1]
/create[1]/article[1]/id[1]
/create[1]/article[2]/name[1]
/create[1]/article[2]/description[1]
/create[1]/article[2]/price[1]/amount[1]
/create[1]/article[2]/price[1]/currency[1]
/create[1]/article[2]/id[1]
UPDATE
For the XSLT to work for all elements, simply remove the [text()] predicate from match="*[text()]". This will output the path for every element. If you don't want the path output for elements that contain other elements (like create, article, and price), add the predicate [not(*)]. Here's an updated example:
New XML Input
<ns1:create xmlns:ns1='http://predic8.com/wsdl/material/ArticleService/1/'>
  <article xmlns:ns1='http://predic8.com/material/1/'>
    <name />
    <description />
    <price xmlns:ns1='http://predic8.com/common/1/'>
      <amount />
      <currency xmlns:ns1='http://predic8.com/common/1/'></currency>
    </price>
    <id xmlns:ns1='http://predic8.com/material/1/'></id>
  </article>
  <article xmlns:ns1='http://predic8.com/material/2/'>
    <name xmlns:ns1='http://predic8.com/material/2/'>some name</name>
    <description xmlns:ns1='http://predic8.com/material/2/'>some description</description>
    <price xmlns:ns1='http://predic8.com/common/2/'>
      <amount xmlns:ns1='http://predic8.com/common/2/'>00.01</amount>
      <currency xmlns:ns1='http://predic8.com/common/2/'>USD</currency>
    </price>
    <id xmlns:ns1='http://predic8.com/material/2/'>2</id>
  </article>
</ns1:create>
XSLT 1.0
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:strip-space elements="*"/>

  <!-- Suppress the built-in rule that copies text nodes to the output. -->
  <xsl:template match="text()"/>

  <!-- Emit a path for every element that has no child elements, empty or not. -->
  <xsl:template match="*[not(*)]">
    <xsl:call-template name="genPath"/>
    <xsl:apply-templates select="node()"/>
  </xsl:template>

  <!-- Walk up from the context element, prepending one positional step per ancestor. -->
  <xsl:template name="genPath">
    <xsl:param name="prevPath"/>
    <xsl:variable name="currPath" select="concat('/',local-name(),'[',
      count(preceding-sibling::*[name() = name(current())])+1,']',$prevPath)"/>
    <xsl:for-each select="parent::*">
      <xsl:call-template name="genPath">
        <xsl:with-param name="prevPath" select="$currPath"/>
      </xsl:call-template>
    </xsl:for-each>
    <!-- At the document element the path is complete, so write it out. -->
    <xsl:if test="not(parent::*)">
      <xsl:value-of select="$currPath"/>
      <xsl:text>&#xA;</xsl:text>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>
Output
/create[1]/article[1]/name[1]
/create[1]/article[1]/description[1]
/create[1]/article[1]/price[1]/amount[1]
/create[1]/article[1]/price[1]/currency[1]
/create[1]/article[1]/id[1]
/create[1]/article[2]/name[1]
/create[1]/article[2]/description[1]
/create[1]/article[2]/price[1]/amount[1]
/create[1]/article[2]/price[1]/currency[1]
/create[1]/article[2]/id[1]
If you remove the [not(*)] predicate, this is what the output looks like (a path is output for every element):
/create[1]
/create[1]/article[1]
/create[1]/article[1]/name[1]
/create[1]/article[1]/description[1]
/create[1]/article[1]/price[1]
/create[1]/article[1]/price[1]/amount[1]
/create[1]/article[1]/price[1]/currency[1]
/create[1]/article[1]/id[1]
/create[1]/article[2]
/create[1]/article[2]/name[1]
/create[1]/article[2]/description[1]
/create[1]/article[2]/price[1]
/create[1]/article[2]/price[1]/amount[1]
/create[1]/article[2]/price[1]/currency[1]
/create[1]/article[2]/id[1]
Here's another version of the XSLT that is about 65% faster:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:strip-space elements="*"/>

  <xsl:template match="text()"/>

  <!-- For each leaf element, write one positional step per ancestor, outermost first. -->
  <xsl:template match="*[not(*)]">
    <xsl:for-each select="ancestor-or-self::*">
      <xsl:value-of select="concat('/',local-name(),'[',
        count(preceding-sibling::*[local-name()=local-name(current())])+1,']')"/>
    </xsl:for-each>
    <xsl:text>&#xA;</xsl:text>
    <xsl:apply-templates select="node()"/>
  </xsl:template>
</xsl:stylesheet>
My recommendation is to use a SAX parser (see the Wikipedia entry for SAX), for example Xerces, a SAX parser for Java by Apache.
On each start element, add the name of the element to the end of a list. On each end element, remove the last list entry. When you run into content and want to output your XPath, you can build it by iterating over the list.
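The list-based approach described above can be sketched roughly as follows. This is only an illustration, written with Python's standard xml.sax module rather than Xerces. The names PathCollector and text_paths are made up for this example. The [n] sibling index is an addition that is not part of the description above; it is there so the paths look like the XSLT output. Prefixes are dropped by simple string splitting, which assumes the parser's default (non-namespace) mode.

```python
import xml.sax
from collections import Counter


class PathCollector(xml.sax.ContentHandler):
    """Keeps a list of open elements and records the path when text is seen."""

    def __init__(self):
        super().__init__()
        self.steps = []            # one "name[n]" step per currently open element
        self.seen = [Counter()]    # sibling-name counters, one per open level
        self.emitted = []          # True once the open element's path was recorded
        self.paths = []            # collected result

    def startElement(self, name, attrs):
        local = name.rpartition(":")[2]      # "ns1:create" -> "create"
        siblings = self.seen[-1]
        siblings[local] += 1                 # position among same-named siblings
        self.steps.append(f"{local}[{siblings[local]}]")
        self.seen.append(Counter())          # fresh counters for this element's children
        self.emitted.append(False)

    def endElement(self, name):
        # Leaving the element: drop its step and its per-level bookkeeping.
        self.steps.pop()
        self.seen.pop()
        self.emitted.pop()

    def characters(self, content):
        # The parser may deliver text in several chunks; record the path once,
        # and ignore whitespace-only chunks (indentation between elements).
        if content.strip() and self.emitted and not self.emitted[-1]:
            self.emitted[-1] = True
            self.paths.append("/" + "/".join(self.steps))


def text_paths(xml_text):
    """Return the positional path of every element that contains text."""
    handler = PathCollector()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.paths


if __name__ == "__main__":
    sample = "<a><b>x</b><b>y</b></a>"
    print("\n".join(text_paths(sample)))
```

Run as a script, this prints /a[1]/b[1] and /a[1]/b[2]. A path is recorded when the first text of an element is read, so for mixed content the order can differ from the XSLT versions, which write the path when the element starts. The same three callbacks (startElement, endElement, characters) exist on the Java SAX ContentHandler, so the structure should carry over to Xerces.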