I\'m interested in advice/pseudocode code/explanation rather than actual implementation.
Here is how this can be done with SAX:
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
public class FragmentContentHandler extends DefaultHandler {
private String xPath = "/";
private XMLReader xmlReader;
private FragmentContentHandler parent;
private StringBuilder characters = new StringBuilder();
private Map<String, Integer> elementNameCount = new HashMap<String, Integer>();
public FragmentContentHandler(XMLReader xmlReader) {
this.xmlReader = xmlReader;
}
private FragmentContentHandler(String xPath, XMLReader xmlReader, FragmentContentHandler parent) {
this(xmlReader);
this.xPath = xPath;
this.parent = parent;
}
@Override
public void startElement(String uri, String localName, String qName, Attributes atts) throws SAXException {
Integer count = elementNameCount.get(qName);
if(null == count) {
count = 1;
} else {
count++;
}
elementNameCount.put(qName, count);
String childXPath = xPath + "/" + qName + "[" + count + "]";
int attsLength = atts.getLength();
for(int x=0; x<attsLength; x++) {
System.out.println(childXPath + "[@" + atts.getQName(x) + "='" + atts.getValue(x) + ']');
}
FragmentContentHandler child = new FragmentContentHandler(childXPath, xmlReader, this);
xmlReader.setContentHandler(child);
}
@Override
public void endElement(String uri, String localName, String qName) throws SAXException {
String value = characters.toString().trim();
if(value.length() > 0) {
System.out.println(xPath + "='" + characters.toString() + "'");
}
xmlReader.setContentHandler(parent);
}
@Override
public void characters(char[] ch, int start, int length) throws SAXException {
characters.append(ch, start, length);
}
}
It can be tested with:
import java.io.FileInputStream;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
public class Demo {
public static void main(String[] args) throws Exception {
SAXParserFactory spf = SAXParserFactory.newInstance();
SAXParser sp = spf.newSAXParser();
XMLReader xr = sp.getXMLReader();
xr.setContentHandler(new FragmentContentHandler(xr));
xr.parse(new InputSource(new FileInputStream("input.xml")));
}
}
This will produce the desired output:
//root[1]/elemA[1]='one'
//root[1]/elemA[2][@attribute1='first]
//root[1]/elemA[2][@attribute2='second]
//root[1]/elemA[2]='two'
//root[1]/elemB[1]='three'
//root[1]/elemA[3]='four'
//root[1]/elemC[1]/elemB[1]='five'
I've done a similar task once. The main idea used was that you can use indexes of the element in the xpath. For example in the following xml
<root>
<el />
<something />
<el />
</root>
xpath to the second <el/>
will be /root[1]/el[2]
(xpath indexes are 1-based). This reads as "take the first root, then take the second one from all elements with the name el". So element something
does not affect indexing of elements el
. So you can in theory create an xpath for each specific element in your xml. In practice I've accomplished this by walking the tree recursevely and remembering information about elements and their indexes along the way.
Creating xpath referencing specific attribute of the element then was just adding '/@attrName' to element's xpath.
Update:
@c0mrade has updated his question. Here is a solution to it:
This XSLT transformation:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:strip-space elements="*"/>
<xsl:variable name="vApos">'</xsl:variable>
<xsl:template match="*[@* or not(*)] ">
<xsl:if test="not(*)">
<xsl:apply-templates select="ancestor-or-self::*" mode="path"/>
<xsl:value-of select="concat('=',$vApos,.,$vApos)"/>
<xsl:text>
</xsl:text>
</xsl:if>
<xsl:apply-templates select="@*|*"/>
</xsl:template>
<xsl:template match="*" mode="path">
<xsl:value-of select="concat('/',name())"/>
<xsl:variable name="vnumPrecSiblings" select=
"count(preceding-sibling::*[name()=name(current())])"/>
<xsl:if test="$vnumPrecSiblings">
<xsl:value-of select="concat('[', $vnumPrecSiblings +1, ']')"/>
</xsl:if>
</xsl:template>
<xsl:template match="@*">
<xsl:apply-templates select="../ancestor-or-self::*" mode="path"/>
<xsl:value-of select="concat('[@',name(), '=',$vApos,.,$vApos,']')"/>
<xsl:text>
</xsl:text>
</xsl:template>
</xsl:stylesheet>
when applied on the provided XML document:
<root>
<elemA>one</elemA>
<elemA attribute1='first' attribute2='second'>two</elemA>
<elemB>three</elemB>
<elemA>four</elemA>
<elemC>
<elemB>five</elemB>
</elemC>
</root>
produces exactly the wanted, correct result:
/root/elemA='one'
/root/elemA[2]='two'
/root/elemA[2][@attribute1='first']
/root/elemA[2][@attribute2='second']
/root/elemB='three'
/root/elemA[3]='four'
/root/elemC/elemB='five'
When applied to the newly-provided document by @c0mrade:
<root>
<elemX serial="kefw90234kf2esda9231">
<id>89734</id>
</elemX>
</root>
again the correct result is produced:
/root/elemX='89734'
/root/elemX[@serial='kefw90234kf2esda9231']
Explanation:
Only elements that have no children elements, or have attributes are matched and processed.
For any such element, if it doesn't have children-elements all of its ancestor-or self elements are processed in a specific mode, named 'path'
. Then the "='theValue'"
part is output and then a NL character.
All attributes of the matched element are then processed.
Then finally, templates are applied to all children-elements.
Processing an element in the 'path'
mode is simple: A /
character and the name of the element are output. Then, if there are preceding siblings with the same name, a "[numPrecSiblings+1]` part is output.
Processing of attributes is simple: First all ancestor-or-self::
elements of its parent are processed in 'path'
mode, then the [attrName=attrValue] part is output, followed by a NL character.
Do note:
Names that are in a namespace are displayed without any problem and in their initial readable form.
To aid readability, an index of [1]
is never displayed.
Below is my initial answer (may be ignored)
Here is a pure XSLT 1.0 solution:
Below is a sample xml document and a stylesheet that takes a node-set parameter and produces one valid XPath expression for every member-node.
stylesheet (buildPath.xsl):
<xsl:stylesheet version='1.0'
xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
xmlns:msxsl="urn:schemas-microsoft-com:xslt"
>
<xsl:output method="text"/>
<xsl:variable name="theParmNodes" select="//namespace::*[local-name() =
'myNamespace']"/>
<xsl:template match="/">
<xsl:variable name="theResult">
<xsl:for-each select="$theParmNodes">
<xsl:variable name="theNode" select="."/>
<xsl:for-each select="$theNode |
$theNode/ancestor-or-self::node()[..]">
<xsl:element name="slash">/</xsl:element>
<xsl:choose>
<xsl:when test="self::*">
<xsl:element name="nodeName">
<xsl:value-of select="name()"/>
<xsl:variable name="thisPosition"
select="count(preceding-sibling::*[name(current()) =
name()])"/>
<xsl:variable name="numFollowing"
select="count(following-sibling::*[name(current()) =
name()])"/>
<xsl:if test="$thisPosition + $numFollowing > 0">
<xsl:value-of select="concat('[', $thisPosition +
1, ']')"/>
</xsl:if>
</xsl:element>
</xsl:when>
<xsl:otherwise> <!-- This node is not an element -->
<xsl:choose>
<xsl:when test="count(. | ../@*) = count(../@*)">
<!-- Attribute -->
<xsl:element name="nodeName">
<xsl:value-of select="concat('@',name())"/>
</xsl:element>
</xsl:when>
<xsl:when test="self::text()"> <!-- Text -->
<xsl:element name="nodeName">
<xsl:value-of select="'text()'"/>
<xsl:variable name="thisPosition"
select="count(preceding-sibling::text())"/>
<xsl:variable name="numFollowing"
select="count(following-sibling::text())"/>
<xsl:if test="$thisPosition + $numFollowing > 0">
<xsl:value-of select="concat('[', $thisPosition +
1, ']')"/>
</xsl:if>
</xsl:element>
</xsl:when>
<xsl:when test="self::processing-instruction()">
<!-- Processing Instruction -->
<xsl:element name="nodeName">
<xsl:value-of select="'processing-instruction()'"/>
<xsl:variable name="thisPosition"
select="count(preceding-sibling::processing-instruction())"/>
<xsl:variable name="numFollowing"
select="count(following-sibling::processing-instruction())"/>
<xsl:if test="$thisPosition + $numFollowing > 0">
<xsl:value-of select="concat('[', $thisPosition +
1, ']')"/>
</xsl:if>
</xsl:element>
</xsl:when>
<xsl:when test="self::comment()"> <!-- Comment -->
<xsl:element name="nodeName">
<xsl:value-of select="'comment()'"/>
<xsl:variable name="thisPosition"
select="count(preceding-sibling::comment())"/>
<xsl:variable name="numFollowing"
select="count(following-sibling::comment())"/>
<xsl:if test="$thisPosition + $numFollowing > 0">
<xsl:value-of select="concat('[', $thisPosition +
1, ']')"/>
</xsl:if>
</xsl:element>
</xsl:when>
<!-- Namespace: -->
<xsl:when test="count(. | ../namespace::*) =
count(../namespace::*)">
<xsl:variable name="apos">'</xsl:variable>
<xsl:element name="nodeName">
<xsl:value-of select="concat('namespace::*',
'[local-name() = ', $apos, local-name(), $apos, ']')"/>
</xsl:element>
</xsl:when>
</xsl:choose>
</xsl:otherwise>
</xsl:choose>
</xsl:for-each>
<xsl:text>
</xsl:text>
</xsl:for-each>
</xsl:variable>
<xsl:value-of select="msxsl:node-set($theResult)"/>
</xsl:template>
</xsl:stylesheet>
xml source (buildPath.xml):
<!-- top level Comment -->
<root>
<nodeA>textA</nodeA>
<nodeA id="nodeA-2">
<?myProc ?>
xxxxxxxx
<nodeB/>
<nodeB xmlns:myNamespace="myTestNamespace">
<!-- Comment within /root/nodeA[2]/nodeB[2] -->
<nodeC/>
<!-- 2nd Comment within /root/nodeA[2]/nodeB[2] -->
</nodeB>
yyyyyyy
<nodeB/>
<?myProc2 ?>
</nodeA>
</root>
<!-- top level Comment -->
Result:
/root/nodeA[2]/nodeB[2]/namespace::*[local-name() = 'myNamespace']
/root/nodeA[2]/nodeB[2]/nodeC/namespace::*[local-name() =
'myNamespace']
With jOOX (a jquery API port to Java, disclaimer - I work for the company behind the library), you can almost achieve what you want in a single statement:
// I'm assuming this:
import static org.joox.JOOX.$;
// And then...
List<String> coolList = $(document).xpath("//*[not(*)]").map(
context -> $(context).xpath() + "='" + $(context).text() + "'"
);
If document is your sample document:
<root>
<elemA>one</elemA>
<elemA attribute1='first' attribute2='second'>two</elemA>
<elemB>three</elemB>
<elemA>four</elemA>
<elemC>
<elemB>five</elemB>
</elemC>
</root>
This will produce
/root[1]/elemA[1]='one'
/root[1]/elemA[2]='two'
/root[1]/elemB[1]='three'
/root[1]/elemA[3]='four'
/root[1]/elemC[1]/elemB[1]='five'
By "almost", I mean that jOOX does not (yet) support matching/mapping attributes. Hence, your attributes will not produce any output. This will be implemented in the near future, though.
private static void buildEntryList( List<String> entries, String parentXPath, Element parent ) {
NamedNodeMap attrs = parent.getAttributes();
for( int i = 0; i < attrs.getLength(); i++ ) {
Attr attr = (Attr)attrs.item( i );
//TODO: escape attr value
entries.add( parentXPath+"[@"+attr.getName()+"='"+attr.getValue()+"']");
}
HashMap<String, Integer> nameMap = new HashMap<String, Integer>();
NodeList children = parent.getChildNodes();
for( int i = 0; i < children.getLength(); i++ ) {
Node child = children.item( i );
if( child instanceof Text ) {
//TODO: escape child value
entries.add( parentXPath+"='"+((Text)child).getData()+"'" );
} else if( child instanceof Element ) {
String childName = child.getNodeName();
Integer nameCount = nameMap.get( childName );
nameCount = nameCount == null ? 1 : nameCount + 1;
nameMap.put( child.getNodeName(), nameCount );
buildEntryList( entries, parentXPath+"/"+childName+"["+nameCount+"]", (Element)child);
}
}
}
public static List<String> getEntryList( Document doc ) {
ArrayList<String> entries = new ArrayList<String>();
Element root = doc.getDocumentElement();
buildEntryList(entries, "/"+root.getNodeName()+"[1]", root );
return entries;
}
This code works with two assumptions: you aren't using namespaces and there are no mixed content elements. The namespace limitation isn't a serious one, but it'd make your XPath expression much harder to read, as every element would be something like *:<name>[namespace-uri()='<nsuri>'][<index>]
, but otherwise it's easy to implement. Mixed content on the other hand would make the use of xpath very tedious, as you'd have to be able to individually address the second, third and so on text node within an element.
I have written a method to return the absolute path of an element in the Practical XML library. To give you an idea of how it works, here's an extract form one of the unit tests:
assertEquals("/root/wargle[2]/zargle",
DomUtil.getAbsolutePath(child3a));
So, you could recurse through the document, apply your tests, and use this to return the XPath. Or, what is probably better, is that you could use the XPath-based assertions from that same library.