Open source command line tool for Linux to diff XML files ignoring element order

走了就别回头了 2021-02-04 09:57

Is there an open source command-line tool (for Linux) to diff XML files which ignores the element order?

Example input file a.xml:



        
6 answers
  • 2021-02-04 10:31

    I had a similar problem and I eventually found: https://superuser.com/questions/79920/how-can-i-diff-two-xml-files

    That post suggests converting both files to canonical XML (C14N) and then diffing the results. Since you are on Linux, this should work cleanly for you. It worked for me on my Mac, and it should work on Windows too with something like Cygwin installed:

    $ xmllint --c14n a.xml > sortedA.xml
    $ xmllint --c14n b.xml > sortedB.xml
    $ diff sortedA.xml sortedB.xml
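
    One caveat worth checking before relying on this: canonical XML normalizes attribute order and tag syntax, but it keeps sibling elements in document order. A minimal Python sketch, assuming Python 3.8 or later for `xml.etree.ElementTree.canonicalize` (the sample strings are made up for illustration and are not the files from the question):

```python
from xml.etree.ElementTree import canonicalize

# Hypothetical one-line documents, not the files from the question.
attrs_swapped_1 = '<root><attr name="a" value="3"/></root>'
attrs_swapped_2 = '<root><attr value="3" name="a"/></root>'
elems_swapped_1 = '<root><attr name="a"/><attr name="b"/></root>'
elems_swapped_2 = '<root><attr name="b"/><attr name="a"/></root>'

# Attribute order is normalized away by canonicalization...
same_after_c14n = canonicalize(attrs_swapped_1) == canonicalize(attrs_swapped_2)
# ...but the order of sibling elements is preserved.
still_different = canonicalize(elems_swapped_1) != canonicalize(elems_swapped_2)

print(same_after_c14n, still_different)
```

    So for files that differ in element order, canonicalization alone is likely not enough and the elements still need to be sorted first.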
    
  • 2021-02-04 10:41

    First, your XML examples are not well-formed, because they lack a root element. I added a root element. This is a.xml:

    <?xml version="1.0" encoding="UTF-8"?>
    <root>
        <tag name="AAA">
            <attr name="b" value="1"/>
            <attr name="c" value="2"/>
            <attr name="a" value="3"/>
        </tag>
        <tag name="BBB">
            <attr name="x" value="111"/>
            <attr name="z" value="222"/>
        </tag>
        <tag name="BBB">
            <attr name="x" value="333"/>
            <attr name="z" value="444"/>
        </tag>
    </root>
    

    And this is b.xml:

    <?xml version="1.0" encoding="UTF-8"?>
    <root>
        <tag name="AAA">
            <attr name="a" value="3"/>
            <attr name="b" value="1"/>
            <attr name="c" value="2"/>
        </tag>
        <tag name="BBB">
            <attr name="z" value="444"/>
            <attr name="x" value="333"/>
        </tag>
        <tag name="BBB">
            <attr name="x" value="111"/>
            <attr name="z" value="222"/>
        </tag>
    </root>
    

    You can create a canonical form for the comparison by merging the sibling elements that have the same name attribute and sorting their children by name and value.

    To merge the sibling elements with the same name, skip every element whose name attribute equals that of a preceding sibling and keep the rest. On the second element level, this XPath does that:

    *[not(@name = preceding-sibling::*/@name)]
    

    Then use the name of each remaining element to select all child elements whose parent has that name, and sort them by name and value. This makes it possible to transform both files into this canonical form:

    <?xml version="1.0" encoding="WINDOWS-1252"?>
    <root>
        <tag name="AAA">
            <attr name="a" value="3"/>
            <attr name="b" value="1"/>
            <attr name="c" value="2"/>
        </tag>
        <tag name="BBB">
            <attr name="x" value="111"/>
            <attr name="x" value="333"/>
            <attr name="z" value="222"/>
            <attr name="z" value="444"/>
        </tag>
    </root>
    

    This will do the transformation:

    <?xml version="1.0" encoding="UTF-8"?>
    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
        <xsl:output method="xml" encoding="WINDOWS-1252" omit-xml-declaration="no" indent="yes"/>
        <xsl:strip-space elements="*"/>
        <xsl:template match="/root">
            <xsl:copy>
                    <xsl:copy-of select="@*"/>
                    <xsl:for-each select="*[not(@name = preceding-sibling::*/@name)]">
                        <xsl:variable name="name" select="@name"/>
                        <xsl:copy>
                            <xsl:copy-of select="@*"/>
                            <xsl:for-each select="../*[@name = $name]/*">
                                <xsl:sort select="@name"/>
                                <xsl:sort select="@value"/>
                                <xsl:copy>
                                    <xsl:copy-of select="@*"/>
                                </xsl:copy>
                            </xsl:for-each>
                        </xsl:copy>
                    </xsl:for-each>
            </xsl:copy>
        </xsl:template>
    </xsl:stylesheet>
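
    For readers who prefer to see the merge-and-sort idea outside XSLT, here is a rough Python equivalent. It is a sketch under assumptions, not a port of the stylesheet: `canonical_form` is a made-up name, the result is a dictionary rather than an XML file, and the sample documents are shortened versions of a.xml and b.xml.

```python
import xml.etree.ElementTree as ET

def canonical_form(xml_text):
    """Merge same-named children of the root, then sort the merged grandchildren."""
    root = ET.fromstring(xml_text)
    merged = {}  # @name of a <tag> -> list of (@name, @value) pairs of its <attr> children
    for tag in root:
        group = merged.setdefault(tag.get("name", ""), [])
        group.extend((a.get("name", ""), a.get("value", "")) for a in tag)
    # Sort by name, then value, mirroring the two xsl:sort keys.
    return {name: sorted(pairs) for name, pairs in merged.items()}

A_XML = """<root>
  <tag name="AAA"><attr name="b" value="1"/><attr name="a" value="3"/></tag>
  <tag name="BBB"><attr name="x" value="111"/><attr name="z" value="222"/></tag>
  <tag name="BBB"><attr name="x" value="333"/><attr name="z" value="444"/></tag>
</root>"""

B_XML = """<root>
  <tag name="AAA"><attr name="a" value="3"/><attr name="b" value="1"/></tag>
  <tag name="BBB"><attr name="z" value="444"/><attr name="x" value="333"/></tag>
  <tag name="BBB"><attr name="x" value="111"/><attr name="z" value="222"/></tag>
</root>"""

print(canonical_form(A_XML) == canonical_form(B_XML))
```

    Note that merging discards the information about which attr came from which BBB element, so two files that swap values between the two BBB elements would also compare as equal.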
    
  • 2021-02-04 10:48

    From your example, it looks like you only care about reordering the attr elements within the tag elements, but not reordering the tag elements themselves. If so, then (as a previous respondent said) you need to use sort, but on the attr elements, not the tag elements nor the attributes.

    Many would find it confusing to have XML elements named "tag" and/or "attr", since those are terms with specific meanings already in XML -- possibly that contributed to trying to sort by "@*" instead of sorting elements?

    If your structure is really just like your example, a much more "XML-ish" representation would be:

    <AAA b="1" c="2" a="3" />
    <BBB x="111" z="222" />
    <BBB x="333" z="444" />
    

    This is much more compact, avoids the terminology conflict, and makes the attributes order-independent by definition. That means any off-the-shelf XML diff utility will give the effect you seem to want, or you could just convert to canonical XML and use regular diff.
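
    To illustrate the point about order independence, here is a small Python sketch that converts one tag element into the compact form and compares the results. `to_compact` is a hypothetical helper written for this example, and it assumes that each attr name occurs at most once per tag, since XML does not allow duplicate attributes.

```python
import xml.etree.ElementTree as ET

def to_compact(tag):
    """Turn <tag name="AAA"><attr name="b" value="1"/>...</tag> into <AAA b="1" .../>."""
    compact = ET.Element(tag.get("name"))
    for attr in tag.findall("attr"):
        compact.set(attr.get("name"), attr.get("value"))
    return compact

first = ET.fromstring(
    '<tag name="AAA"><attr name="b" value="1"/><attr name="a" value="3"/></tag>'
)
second = ET.fromstring(
    '<tag name="AAA"><attr name="a" value="3"/><attr name="b" value="1"/></tag>'
)

a, b = to_compact(first), to_compact(second)
# Attributes form a name -> value mapping, so their order cannot matter.
print(a.tag == b.tag and a.attrib == b.attrib)
```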

  • 2021-02-04 10:49

    If you wanted to take this to an arbitrary degree, you could implement something that walks the two trees together and decides along the way which elements "match" between the two documents. That would let you implement the matching logic any way you want. Here is an example in XSLT 2.0:

    <xsl:stylesheet version="2.0"
                    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    
                    xmlns:set="http://exslt.org/sets"
    
                    xmlns:primary="primary"
                    xmlns:control="control"
    
                    xmlns:util="util"
    
                    exclude-result-prefixes="xsl xs set primary control">
    
        <xsl:output method="text"/>
    
        <xsl:strip-space elements="*"/>
    
        <xsl:template match="/">
            <xsl:call-template name="compare">
                <xsl:with-param name="primary" select="*/*[1]"/><!-- first child of root element, for example -->
                <xsl:with-param name="control" select="*/*[2]"/><!-- second child of root element, for example --> 
            </xsl:call-template>
        </xsl:template>
    
        <!-- YOUR SPECIFIC OVERRIDES -->
    
        <xsl:template match="attr" mode="find-match" as="element()?">
            <xsl:param name="candidates" as="element()*"/>
            <!-- attr matches by @name and @value -->
            <xsl:sequence select="$candidates[@name = current()/@name][@value = current()/@value][1]"/>
        </xsl:template>
    
        <xsl:template match="tag" mode="find-match" as="element()?">
            <xsl:param name="candidates" as="element()*"/>
            <xsl:variable name="attrs" select="attr"/>
            <!-- tag matches if @name matches and attr counts (matched and unmatched) match -->
            <xsl:sequence select="$candidates[@name = current()/@name]
                                             [count($attrs) = count(util:find-match($attrs, attr))]
                                             [count($attrs) = count(attr)][1]"/>
        </xsl:template>
    
        <xsl:function name="util:find-match">
            <xsl:param name="this"/>
            <xsl:param name="candidates"/>
            <xsl:apply-templates select="$this" mode="find-match">
                <xsl:with-param name="candidates" select="$candidates"/>
            </xsl:apply-templates>
        </xsl:function>
    
        <!-- END SPECIFIC OVERRIDES -->
    
        <!-- compare "primary" and "control" elements -->
        <xsl:template name="compare">
            <xsl:param name="primary"/>
            <xsl:param name="control"/>
    
            <xsl:variable name="diff">
                <xsl:call-template name="match-children">
                    <xsl:with-param name="primary" select="$primary"/>
                    <xsl:with-param name="control" select="$control"/>
                </xsl:call-template>
            </xsl:variable>
    
            <xsl:choose>
                <xsl:when test="$diff//*[self::primary:* | self::control:*]">
                    <xsl:text>FAIL</xsl:text><!-- or do something more sophisticated with $diff... -->
                </xsl:when>
                <xsl:otherwise>
                    <xsl:text>PASS</xsl:text>
                </xsl:otherwise>
            </xsl:choose>
    
        </xsl:template>
    
        <!-- default matching template for elements
    
             for context node (from "primary"), choose from among $candidates (from "control") which one matches
    
             (for "complex" elements, name has to match, for "simple" elements, name and value do)
    
             (override with more specific match pattern if desired)
             -->
        <xsl:template match="*" mode="find-match" as="element()?">
            <xsl:param name="candidates" as="element()*"/>
            <xsl:choose>
                <xsl:when test="text() and count(node()) = 1">
                    <xsl:sequence select="$candidates[node-name(.) = node-name(current())][text() and count(node()) = 1][. = current()][1]"/>
                </xsl:when>
                <xsl:when test="not(node())">
                    <xsl:sequence select="$candidates[node-name(.) = node-name(current())][not(node())][1]"/>
                </xsl:when>
                <xsl:otherwise>
                    <xsl:sequence select="$candidates[node-name(.) = node-name(current())][1]"/>
                </xsl:otherwise>
            </xsl:choose>
        </xsl:template>
    
        <!-- default matching template for attributes
    
             for context attr (from "primary"), choose from among $candidates (from "control") which one matches
    
             (name and value have to match)
    
             (override with more specific match pattern if desired)
             -->
        <xsl:template match="@*" mode="find-match" as="attribute()?">
            <xsl:param name="candidates" as="attribute()*"/>
            <xsl:sequence select="$candidates[. = current()][node-name(.) = node-name(current())][1]"/>
        </xsl:template>
    
        <!-- default primary-only template (override with more specific match pattern if desired) -->
        <xsl:template match="@* | *" mode="primary-only">
            <xsl:apply-templates select="." mode="illegal-primary-only"/>
        </xsl:template>
    
        <!-- write out a primary-only diff -->
        <xsl:template match="@* | *" mode="illegal-primary-only">
            <primary:only>
                <xsl:copy-of select="."/>
            </primary:only>
        </xsl:template>
    
        <!-- default control-only template (override with more specific match pattern if desired) -->
        <xsl:template match="@* | *" mode="control-only">
            <xsl:apply-templates select="." mode="illegal-control-only"/>
        </xsl:template>
    
        <!-- write out a control-only diff -->
        <xsl:template match="@* | *" mode="illegal-control-only">
            <control:only>
                <xsl:copy-of select="."/>
            </control:only>
        </xsl:template>
    
        <!-- assume primary (context) element and control element match, so render the "common" element and recurse -->
        <xsl:template match="*" mode="common">
            <xsl:param name="control"/>
    
            <xsl:copy>
                <xsl:call-template name="match-attributes">
                    <xsl:with-param name="primary" select="@*"/>
                    <xsl:with-param name="control" select="$control/@*"/>
                </xsl:call-template>
    
                <xsl:choose>
                    <xsl:when test="text() and count(node()) = 1">
                        <xsl:value-of select="."/>
                    </xsl:when>
                    <xsl:otherwise>
                        <xsl:call-template name="match-children">
                            <xsl:with-param name="primary" select="*"/>
                            <xsl:with-param name="control" select="$control/*"/>
                        </xsl:call-template>
                    </xsl:otherwise>
                </xsl:choose>
            </xsl:copy>
    
        </xsl:template>
    
        <!-- find matches between collections of attributes in primary vs control -->
        <xsl:template name="match-attributes">
            <xsl:param name="primary" as="attribute()*"/>
            <xsl:param name="control" as="attribute()*"/>
            <xsl:param name="primaryCollecting" as="attribute()*"/>
    
            <xsl:choose>
                <xsl:when test="$primary and $control">
                    <xsl:variable name="this" select="$primary[1]"/>
                    <xsl:variable name="match" as="attribute()?">
                        <xsl:apply-templates select="$this" mode="find-match">
                            <xsl:with-param name="candidates" select="$control"/>
                        </xsl:apply-templates>
                    </xsl:variable>
    
                    <xsl:choose>
                        <xsl:when test="$match">
                            <xsl:copy-of select="$this"/>
                            <xsl:call-template name="match-attributes">
                                <xsl:with-param name="primary" select="subsequence($primary, 2)"/>
                                <xsl:with-param name="control" select="remove($control, 1 + count(set:leading($control, $match)))"/>
                                <xsl:with-param name="primaryCollecting" select="$primaryCollecting"/>
                            </xsl:call-template>
                        </xsl:when>
                        <xsl:otherwise>
                            <xsl:call-template name="match-attributes">
                                <xsl:with-param name="primary" select="subsequence($primary, 2)"/>
                                <xsl:with-param name="control" select="$control"/>
                                <xsl:with-param name="primaryCollecting" select="$primaryCollecting | $this"/>
                            </xsl:call-template>
                        </xsl:otherwise>
                    </xsl:choose>
    
                </xsl:when>
                <xsl:otherwise>
                    <xsl:if test="$primaryCollecting | $primary">
                        <xsl:apply-templates select="$primaryCollecting | $primary" mode="primary-only"/>
                    </xsl:if>
                    <xsl:if test="$control">
                        <xsl:apply-templates select="$control" mode="control-only"/>
                    </xsl:if>
                </xsl:otherwise>
            </xsl:choose>
    
        </xsl:template>
    
        <!-- find matches between collections of elements in primary vs control -->
        <xsl:template name="match-children">
            <xsl:param name="primary" as="node()*"/>
            <xsl:param name="control" as="element()*"/>
    
            <xsl:variable name="this" select="$primary[1]" as="node()?"/>
    
            <xsl:choose>
                <xsl:when test="$primary and $control">
                    <xsl:variable name="match" as="element()?">
                        <xsl:apply-templates select="$this" mode="find-match">
                            <xsl:with-param name="candidates" select="$control"/>
                        </xsl:apply-templates>
                    </xsl:variable>
    
                    <xsl:choose>
                        <xsl:when test="$match">
                            <xsl:apply-templates select="$this" mode="common">
                                <xsl:with-param name="control" select="$match"/>
                            </xsl:apply-templates>
                        </xsl:when>
                        <xsl:otherwise>
                            <xsl:apply-templates select="$this" mode="primary-only"/>
                        </xsl:otherwise>
                    </xsl:choose>
                    <xsl:call-template name="match-children">
                        <xsl:with-param name="primary" select="subsequence($primary, 2)"/>
                        <xsl:with-param name="control" select="if (not($match)) then $control else remove($control, 1 + count(set:leading($control, $match)))"/>
                    </xsl:call-template>
                </xsl:when>
                <xsl:when test="$primary">
                    <xsl:apply-templates select="$primary" mode="primary-only"/>
                </xsl:when>
                <xsl:when test="$control">
                    <xsl:apply-templates select="$control" mode="control-only"/>
                </xsl:when>
            </xsl:choose>
    
        </xsl:template>
    
    </xsl:stylesheet>
    

    Applied to this document (based on your test case), the result is PASS:

    <test>
      <root>
        <tag name="AAA">
          <attr name="b" value="1"/>
          <attr name="c" value="2"/>
          <attr name="a" value="3"/>
        </tag>
        <tag name="BBB">
          <attr name="x" value="111"/>
          <attr name="z" value="222"/>
        </tag>
        <tag name="BBB">
          <attr name="x" value="333"/>
          <attr name="z" value="444"/>
        </tag>
      </root>
      <root>
        <tag name="AAA">
          <attr name="a" value="3"/>
          <attr name="b" value="1"/>
          <attr name="c" value="2"/>
        </tag>
        <tag name="BBB">
          <attr name="z" value="444"/>
          <attr name="x" value="333"/>
        </tag>
        <tag name="BBB">
          <attr name="x" value="111"/>
          <attr name="z" value="222"/>
        </tag>
      </root>
    </test>
    
  • 2021-02-04 10:50

    You'd have to write your own interpreter to preprocess. XSLT is one way to do it ... maybe; I'm not an expert in XSLT and I'm not sure you can sort things with it.

    Here is a quick and dirty Perl script that can do what you want. Note that it's far, far wiser to use a real XML parser. I'm not familiar with any, so I'm exposing you to my terrible practice of writing them myself. Note the comments; you have been warned.

    #!/usr/bin/perl
    
    use strict;
    use warnings;
    
    # NOTE: general wisdom - do not use simple homebrewed XML parsers like this one!
    #
    # This makes sweeping assumptions that are not production grade.  Including:
    #   1. Assumption of one XML tag per line
    #   2. Assumption that no XML tag contains a greater-than character
    #      like <foo bar="<oops>" />
    #   3. Assumes the XML is well-formed, nothing like <foo><bar>baz</foo></bar>
    
    # recursive function to parse each tag.
    sub parse_tag {
      my $tag_name = shift;
      my @level = (); # LOCAL: each recursive call has its OWN distinct @level
      while(<>) {
        chomp;
    
        # new open tag:  match new tag name, parse in recursive call
        if (m"<\s*([^\s/>]+)[^/>]*>") {
          push (@level, "$_\n" . parse_tag($1) );
    
        # close tag, verified by name, or else last line of input
        } elsif (m"<\s*/\s*\Q$tag_name\E[\s>]"i or eof()) {
          # return all children, sorted and concatenated, then the end tag
          return join("\n", sort @level) . "\n$_";
    
        } else {
          push (@level, $_);
        }
      }
      return join("\n", sort @level);
    }
    
    # start with an impossible tag in case there is no root
    print parse_tag("<root>");
    

    Save that as xml_diff_prep.pl and then run this:

    $ diff -sq <(perl xml_diff_prep.pl a.xml) <(perl xml_diff_prep.pl b.xml)
    Files /proc/self/fd/11 and /proc/self/fd/12 are identical
    

    (I used the -s and -q flags to be explicit. You can use gvimdiff or whatever other utility or flags you like. Note that diff identifies the files by file descriptor; that's because I used bash process substitution to run the preprocessor command on each input. They'll be in the same order you specified. Note that the contents may be in unexpected locations due to the sorting requested by this question.)

    To satisfy your "Open Source" "command line tool" request, I hereby release this code as Open Source under the Beerware License (BSD 2-clause; if you think it's worthwhile, you are welcome to buy me a beer).
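
    Following the advice above to prefer a real XML parser, the same preprocessing step might look like this in Python with the standard library's `xml.etree.ElementTree`. The function names are made up for this sketch, `ET.indent` needs Python 3.9 or later, and whitespace-only text between elements is assumed to be insignificant.

```python
import xml.etree.ElementTree as ET

def sort_key(elem):
    # Serialize the (already normalized) subtree so siblings order deterministically.
    return ET.tostring(elem, encoding="unicode")

def sort_children(elem):
    """Recursively sort attributes and child elements, deepest level first."""
    for child in elem:
        sort_children(child)
        if not (child.tail or "").strip():
            child.tail = None  # drop whitespace-only text after the child
    if not (elem.text or "").strip():
        elem.text = None       # drop whitespace-only text before the first child
    items = sorted(elem.attrib.items())
    elem.attrib.clear()
    elem.attrib.update(items)  # attributes now serialize in sorted order
    elem[:] = sorted(elem, key=sort_key)

def normalize(xml_text):
    root = ET.fromstring(xml_text)
    sort_children(root)
    ET.indent(root)            # one element per line, so diff output stays readable
    return ET.tostring(root, encoding="unicode")

A_XML = """<root>
  <tag name="AAA"><attr name="b" value="1"/><attr name="a" value="3"/></tag>
  <tag name="BBB"><attr name="x" value="111"/><attr name="z" value="222"/></tag>
  <tag name="BBB"><attr name="x" value="333"/><attr name="z" value="444"/></tag>
</root>"""

B_XML = """<root>
  <tag name="BBB"><attr value="444" name="z"/><attr name="x" value="333"/></tag>
  <tag name="BBB"><attr name="x" value="111"/><attr name="z" value="222"/></tag>
  <tag name="AAA"><attr name="a" value="3"/><attr name="b" value="1"/></tag>
</root>"""

print(normalize(A_XML) == normalize(B_XML))
```

    Wrapped in a small script that prints `normalize(...)` for the file named on its command line, this could stand in for the Perl preprocessor in the diff command shown earlier.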

  • 2021-02-04 10:51

    You're requesting a sort based on the sequence of attributes in the elements being sorted. But your top-level tag elements here have only one attribute: name. If you want multiple tag elements with name="BBB" to sort differently, you need to give them distinct sort keys.

    In your example, I'd try something like select="concat(name(), @name, name(*[1]), *[1]/@name)" -- but this is a very shallow key. It uses values from the first child in the input, but the children may shift position during the process. You may be able (knowing your data better than I do) to calculate a good key for each element in a single pass, or you may just need several passes.
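
    To make the weakness of such a shallow key concrete, here is a rough Python analogue. `shallow_key` is a hypothetical function that returns a tuple instead of the concatenated string, and the sample data is the pair of BBB elements from a.xml and b.xml.

```python
import xml.etree.ElementTree as ET

def shallow_key(elem):
    """Rough analogue of concat(name(), @name, name(*[1]), *[1]/@name)."""
    first = elem[0] if len(elem) else None
    return (
        elem.tag,
        elem.get("name", ""),
        "" if first is None else first.tag,
        "" if first is None else first.get("name", ""),
    )

a_tags = ET.fromstring(
    '<root>'
    '<tag name="BBB"><attr name="x" value="111"/><attr name="z" value="222"/></tag>'
    '<tag name="BBB"><attr name="x" value="333"/><attr name="z" value="444"/></tag>'
    '</root>'
)
b_tags = ET.fromstring(
    '<root>'
    '<tag name="BBB"><attr name="z" value="444"/><attr name="x" value="333"/></tag>'
    '<tag name="BBB"><attr name="x" value="111"/><attr name="z" value="222"/></tag>'
    '</root>'
)

keys_a = [shallow_key(t) for t in a_tags]
keys_b = [shallow_key(t) for t in b_tags]
print(keys_a[0] == keys_a[1])  # True: both BBB elements start with attr "x"
print(keys_b[0] == keys_b[1])  # False: one BBB element lists attr "z" first
```

    In a.xml the two BBB elements get identical keys, while in b.xml the keys differ only because of child order, which is why the children need to be sorted (or a deeper key computed) before the parents are.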
