How to verify that one XSD schema is a subset of another XSD schema?

轮回少年 2021-02-04 15:43

How can I verify that one XSD schema is a subset of another XSD schema?

We are creating a system-of-systems application using a collection of "blueprint" XSD schemas (

4 answers
  • Thank you, @13ren, for your “beep” :)

    This is a long comment rather than an answer. I'll start from my earlier exchange with 13ren, more precisely from the remark that it provides to a user all that is needed to define such an analysis model. What I meant is that QTAssistant has an XSD compare function. Being XSD-aware, it already does many things that a text or XML-aware diff tool can't (e.g. it doesn’t care how many XSD files there are or how their layout changes between versions). For the provided UI, the diff engine works against the source model rather than the PSVI one. We could customize it to use the PSVI instead, since the latter is one step closer to what you actually need. We could also add the ability to have a custom rule set augment the compare between “base” and “revision”, in other words to let the user override the “=” operator we currently use.

    I recognize that we have nothing out of the box that allows overriding the comparison of xsd:pattern facets, nor anything for cases that a human recognizes easily, such as xsd:positiveInteger vs. xsd:integer with xsd:minInclusive=1, or an xsd:all compared to an xsd:choice or xsd:sequence. Likewise, we don’t parse out selectors and fields for XSD constraints which, much like regular expressions, wouldn't be easy to deal with.
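
    To illustrate the xsd:positiveInteger vs. xsd:integer + xsd:minInclusive=1 case, here is a minimal Python sketch. It is not QTAssistant code; the names `BUILTIN_RANGES`, `effective_range` and `is_subrange` are made up for this example, and it assumes integer-based types with only min/max facets and a non-empty resulting range.

```python
import math

# Value ranges implied by a few built-in integer types (XML Schema Part 2).
BUILTIN_RANGES = {
    "xsd:integer": (-math.inf, math.inf),
    "xsd:nonNegativeInteger": (0, math.inf),
    "xsd:positiveInteger": (1, math.inf),
    "xsd:nonPositiveInteger": (-math.inf, 0),
    "xsd:negativeInteger": (-math.inf, -1),
}


def effective_range(base, facets=None):
    """Fold min/max facets into the range implied by the built-in base type."""
    low, high = BUILTIN_RANGES[base]
    facets = facets or {}
    if "minInclusive" in facets:
        low = max(low, facets["minInclusive"])
    if "minExclusive" in facets:
        low = max(low, facets["minExclusive"] + 1)  # valid for integers only
    if "maxInclusive" in facets:
        high = min(high, facets["maxInclusive"])
    if "maxExclusive" in facets:
        high = min(high, facets["maxExclusive"] - 1)  # valid for integers only
    return low, high


def is_subrange(candidate, reference):
    """True if every value in candidate also lies in reference."""
    return reference[0] <= candidate[0] and candidate[1] <= reference[1]


positive = effective_range("xsd:positiveInteger")
at_least_one = effective_range("xsd:integer", {"minInclusive": 1})
print(positive == at_least_one)                               # True
print(is_subrange(positive, effective_range("xsd:integer")))  # True
```

    Comparing the normalized ranges instead of the type names is what makes the two spellings come out equal. Pattern facets and non-integer types would need a different treatment.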

    Assuming that the goal is to find as many “discrepancies” as possible, as opposed to ruling them out entirely, QTAssistant has three more features that are helpful:

    • For a given root element, it creates the complete list of simple XPaths. This can be applied as a quick way to spot "rogue" data. Out of the box, this method of comparing doesn’t take structural patterns into account (e.g. if XPath1 and XPath2 denote two siblings in an instance XML, that XPath1 must precede XPath2).
    • It comes with a built-in Query XSD Analyser. SQL can be used to query the XSD metamodel of a schema set to spot things that a compare tool may be designed to ignore (for feasibility) and that would therefore need to be reported to a human to decide.
    • XSD Refactoring (XSR). It is the only engine in the industry (that I know of, at least) that is built from the ground up with XSD refactoring and analysis in mind. I would think that if you can rule out xsi:type and, ideally, the use of substitution groups as well (on this one I still have to think), we could make available what we call the “canonicalization transformation”, a fancy term for converting a schema set to a Russian Doll design style by relying on the PSVI model instead. There are many things that could be at play here: the use of the id attribute, the collapse of superfluous sequences, the replacement of single-option xsd:choices, etc., which is why we have it in development but not published yet.
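
    The first feature, listing simple XPaths for a root element and comparing the lists, can be approximated with Python's standard library. This is only a sketch of the idea, not how QTAssistant does it: `particles` and `simple_paths` are hypothetical helpers, and the code assumes a single schema document with no namespaces, no `ref=` elements, no imports and no substitution groups.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"


def particles(node):
    """Yield element declarations below node without entering nested elements."""
    for child in node:
        if child.tag == XS + "element":
            yield child
        else:
            yield from particles(child)


def simple_paths(xsd_text, root_name):
    """Return the set of slash-separated element paths reachable from root_name."""
    schema = ET.fromstring(xsd_text)
    named_types = {t.get("name"): t for t in schema.findall(XS + "complexType")}
    global_elements = {e.get("name"): e for e in schema.findall(XS + "element")}
    paths = set()

    def walk(decl, prefix, active):
        path = prefix + "/" + decl.get("name")
        paths.add(path)
        type_name = decl.get("type")
        if type_name is not None and type_name in active:
            return  # recursive named type: stop expanding
        if type_name is not None:
            ctype = named_types.get(type_name)
        else:
            ctype = decl.find(XS + "complexType")
        if ctype is None:
            return  # simple or built-in type, nothing below it
        nested = active | {type_name} if type_name is not None else active
        for child in particles(ctype):
            walk(child, path, nested)

    walk(global_elements[root_name], "", frozenset())
    return paths


BLUEPRINT = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="root" type="rootType"/>
  <xs:complexType name="rootType"><xs:sequence>
    <xs:element name="child1" minOccurs="0"/>
    <xs:element name="child2" minOccurs="0"/>
    <xs:element name="child3" minOccurs="0"/>
  </xs:sequence></xs:complexType>
</xs:schema>"""

CHILD = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="root" type="rootType"/>
  <xs:complexType name="rootType"><xs:sequence>
    <xs:element name="child2"/>
  </xs:sequence></xs:complexType>
</xs:schema>"""

# Paths that exist in the child but not in the blueprint are "rogue".
print(simple_paths(CHILD, "root") <= simple_paths(BLUEPRINT, "root"))  # True
```

    As noted above, a path-set comparison says nothing about ordering or occurrence constraints, so it can only reveal discrepancies, not rule them out.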

    Another thing we had to provide for in our compare (and that you may want to consider) is equivalence not only of the XSD/XML, but also of the artifacts generated from the XSD (e.g. Java classes through JAXB) and, to top it off, of extensibility patterns, i.e. those that make use of wildcards (xsd:any and xsd:anyAttribute).

    We (QTAssistant) are currently interested in working with you on some more specific requirements (we would need to start with an exchange of representative XSDs, NDAs I would assume, etc.), out of band, to see if we could indeed make it work. If you wish to proceed, feel free to contact me through the support address of the website associated with my SO profile.

  • 2021-02-04 16:20

    Tools that validate XML against a schema already know how to do this, because in the case of an <xs:complexContent><xs:restriction>, the newly defined type is required to be a subset of the type being restricted.

    If you'd like to tap into this functionality, you can have the child schemas define complex types that restrict types in your blueprint schema.

    If the child schemas are created without this in mind, however, this could possibly still be accomplished by modifying the child schemas to match the pattern below, then sending them through a schema processor for verification.

    Example blueprint schema, blueprintschema.xsd:

    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
      <xs:element name="root" type="rootType"/>
      <xs:complexType name="rootType">
        <xs:sequence>
          <xs:element name="child1" minOccurs="0"/>
          <xs:element name="child2" minOccurs="0"/>
          <xs:element name="child3" minOccurs="0"/>
        </xs:sequence>
      </xs:complexType>
    </xs:schema>
    

    Example child schema that is a subset of blueprint schema:

    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
      <xs:element name="root" type="rootType"/>
      <xs:complexType name="rootType">
        <xs:sequence>
          <xs:element name="child2"/>
        </xs:sequence>
      </xs:complexType>
    </xs:schema>
    

    Example child schema after transformation into a redefine construct:

    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
      <xs:redefine schemaLocation="blueprintschema.xsd">
        <xs:complexType name="rootType">
          <xs:complexContent>
            <xs:restriction base="rootType">
              <xs:sequence>
                <xs:element name="child2"/>
              </xs:sequence>
            </xs:restriction>
          </xs:complexContent>
        </xs:complexType>
      </xs:redefine>
      <xs:element name="root" type="rootType"/>
    </xs:schema>
    

    A schema processor will then tell you whether the redefined "rootType" is actually a subset of the original blueprint "rootType".

    Since a schema is just XML, the transformation can be done using normal XML processing tools.
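
    As one possible way to do that transformation, here is a sketch using Python's `xml.etree.ElementTree`. The function name `to_redefine` is made up for this example, and the code assumes that every named top-level complex type in the child schema is meant to restrict the blueprint type of the same name, and that no prefixes other than `xs` appear in attribute values (ElementTree does not preserve namespace declarations that are used only there).

```python
import copy
import xml.etree.ElementTree as ET

XS_NS = "http://www.w3.org/2001/XMLSchema"
XS = "{" + XS_NS + "}"
ET.register_namespace("xs", XS_NS)  # keep the familiar xs: prefix on output


def to_redefine(child_xsd, blueprint_location):
    """Wrap each named top-level complexType in a redefine/restriction."""
    child = ET.fromstring(child_xsd)
    out = ET.Element(XS + "schema", dict(child.attrib))
    # xs:redefine must come before the other schema components.
    redefine = ET.SubElement(out, XS + "redefine", schemaLocation=blueprint_location)
    for node in child:
        if node.tag == XS + "complexType" and node.get("name"):
            name = node.get("name")
            ctype = ET.SubElement(redefine, XS + "complexType", name=name)
            content = ET.SubElement(ctype, XS + "complexContent")
            restriction = ET.SubElement(content, XS + "restriction", base=name)
            restriction.extend(copy.deepcopy(list(node)))  # original content model
        else:
            out.append(copy.deepcopy(node))  # elements etc. are kept as they are
    return ET.tostring(out, encoding="unicode")


CHILD_XSD = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="root" type="rootType"/>
  <xs:complexType name="rootType">
    <xs:sequence>
      <xs:element name="child2"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>"""

print(to_redefine(CHILD_XSD, "blueprintschema.xsd"))
```

    The output still has to be loaded by a schema processor of your choice to get the actual verdict; Python's standard library does not include one, so that step is left out here.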

  • 2021-02-04 16:25

    The simplest way to ensure the relationship you want is to derive the types of the subset schemas by restriction from the types of the blueprint schema. It sounds as if that boat has already sailed, though.

    Like others here, I am not aware of any tools that do this out of the box (although if Petru Gardea says QT Assistant can, it's worth following up).

    One complication is that there are two different ways to view the subset/superset relation you want to verify: (1) every document (or element) accepted as valid by schema 1 is also accepted as valid by schema 2 (without reference to the type assignments made), or (2) the typed documents produced by validation (in what the spec calls the post-schema-validation infoset) against schemas 1 and 2 stand in an appropriate relation to each other: if an element or attribute is valid in tree 1, it's valid in tree 2; the type assigned to it in tree 1 is a restriction of the type assigned to it in tree 2; etc. If schemas 1 and 2 were developed independently, the chances that their types are related by derivation are poor, so I guess you have the first approach to the question in mind.

    The problem, though, is definitely decidable, in either form. For any schema (I'm using the term carefully) there are by definition a finite number of types and a finite number of element names declared; it follows that there is a finite number (possibly large) of element name / type pairs.

    The algorithm can go something like this.

    1. Start with the expected root element. (If there are multiple possible root elements, then in the general case you'll need to run this check for each of them.) If the expected root element is E, with type T1 in schema 1 and type T2 in schema 2, then place the task "Compare type T1 and T2" in a queue of open tasks. The list of tasks already completed will be empty.

    2. To compare two complex types T1 and T2:

      • Check the sets of attributes declared for T1 and T2 for a subset/superset relation between their names. Make sure no attribute required in the intended superset is absent or optional in the intended subset.

      • Each attribute A declared for both T1 and T2 will be assigned a type (call them ST1 and ST2). If ST1 = ST2, do nothing; otherwise, add the task "Compare simple types ST1 and ST2" to the queue of open tasks, unless it's on the list of comparisons already completed.

      • Now check the sequences of children that are possible in T1 and T2 -- as 13ren suggests in a comment, this is tractable since content models are essentially regular expressions which use the set of element names as their alphabet; the languages they define are therefore regular, and the subset/superset relation is decidable for regular languages.

      • Each possible child element C is assigned both an element declaration and a type definition by the parent types T1 and T2. Let us call them ED1, ED2, CT1, and CT2. Every child of the same name will have the same type, but different children may match different element declarations. So for any possible name, there will be just one pair of types CT1 and CT2, but there may be multiple pairs ED1 and ED2 (and the analysis will need to be careful to make sure they are matched up correctly; that might be hard to automate).

      • If CT1 = CT2, do nothing, otherwise put "Compare types CT1 and CT2" onto the open task queue, unless the comparison has already been performed.

      • If ED1 and ED2 are structurally identical, do nothing; otherwise put the task of comparing them into the task queue (unless it's already been done).

    3. To compare two simple types ST1 and ST2, compare either their lexical spaces (if you want the first definition of the subset/superset relation on schemas) or their value spaces (if you want the second). If ST1 and ST2 are both restrictions of the same primitive type, you may be able to compare the set of effective facet-based restrictions on them easily. The pattern facet may complicate matters, but because each pattern is a regular expression and therefore defines a regular language, the subset/superset relation is decidable for it.

    4. To compare two element declarations, you need to compare each of the properties for the element declaration and check for the desired subset/superset relation.
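
    The task-queue bookkeeping in steps 1 and 2 can be sketched in Python as follows. The schema representation is a hypothetical toy (a dict from type name to declared attributes and child types), `compare_types` and `check_subset` are names invented here, and the content-model ordering check, the element-declaration comparison and the real simple-type comparison are all left out.

```python
from collections import deque


def compare_types(sub, sup, t1, t2):
    """Compare one pair of types; return (problems, follow-up tasks)."""
    a, b = sub.get(t1), sup.get(t2)
    if a is None or b is None:
        # Simple types are not modelled in this toy; just flag any difference.
        return ([] if t1 == t2 else [f"check simple types {t1} vs {t2}"]), []
    problems, tasks = [], []
    for name, required in b["attributes"].items():
        if required and not a["attributes"].get(name, False):
            problems.append(f"{t1}: attribute '{name}' must be required")
    for name in a["attributes"]:
        if name not in b["attributes"]:
            problems.append(f"{t1}: attribute '{name}' unknown to {t2}")
    for child, ct1 in a["children"].items():
        ct2 = b["children"].get(child)
        if ct2 is None:
            problems.append(f"{t1}: child '{child}' unknown to {t2}")
        else:
            tasks.append((ct1, ct2))  # compare the child types later
    return problems, tasks


def check_subset(sub, sup, root1, root2):
    """Drain the open-task queue, skipping comparisons already completed."""
    open_tasks, done, problems = deque([(root1, root2)]), set(), []
    while open_tasks:
        task = open_tasks.popleft()
        if task in done:
            continue
        done.add(task)
        found, follow_ups = compare_types(sub, sup, *task)
        problems.extend(found)
        open_tasks.extend(follow_ups)
    return problems


# A recursive type: the "done" set is what keeps the loop from running forever.
TREE = {"nodeType": {"attributes": {"id": True}, "children": {"node": "nodeType"}}}
print(check_subset(TREE, TREE, "nodeType", "nodeType"))  # []
```

    Because the number of type pairs is finite, the queue is guaranteed to drain, which is the termination argument made above.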

    As you can see, it's complex and tedious enough that you really want to automate this analysis, and it's also complex enough that it's easy to see why it's not widely offered as an out-of-the-box function. But it would certainly be interesting to code.
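
    For the content-model check in step 2, one way to decide inclusion between two regular languages is to walk pairs of Brzozowski derivatives. The Python sketch below is an illustration under assumptions, not a full XSD implementation: content models are hand-built with the made-up constructors `sym`, `seq`, `alt`, `opt` and `star`, and xs:all, wildcards and numeric occurrence bounds other than optional/unbounded are not covered.

```python
EMPTY, EPS = ("empty",), ("eps",)  # no words at all / only the empty word


def sym(name):
    return ("sym", name)


def alt(*parts):
    """Choice, kept flat and duplicate-free so the set of derivatives stays finite."""
    flat = set()
    for p in parts:
        flat |= p[1] if p[0] == "alt" else {p}
    flat.discard(EMPTY)
    if not flat:
        return EMPTY
    return next(iter(flat)) if len(flat) == 1 else ("alt", frozenset(flat))


def seq(*parts):
    """Sequence, nested to the right, with trivial cases simplified away."""
    result = EPS
    for p in reversed(parts):
        if p == EMPTY or result == EMPTY:
            result = EMPTY
        elif result == EPS:
            result = p
        elif p != EPS:
            result = ("seq", p, result)
    return result


def opt(r):
    return alt(EPS, r)


def star(r):
    return EPS if r in (EMPTY, EPS) else r if r[0] == "star" else ("star", r)


def nullable(r):
    """True if the model accepts an empty list of children."""
    if r[0] in ("eps", "star"):
        return True
    if r[0] == "seq":
        return nullable(r[1]) and nullable(r[2])
    if r[0] == "alt":
        return any(nullable(p) for p in r[1])
    return False


def deriv(r, name):
    """The model for what may follow after one child called name."""
    if r[0] == "sym":
        return EPS if r[1] == name else EMPTY
    if r[0] == "seq":
        first = seq(deriv(r[1], name), r[2])
        return alt(first, deriv(r[2], name)) if nullable(r[1]) else first
    if r[0] == "alt":
        return alt(*(deriv(p, name) for p in r[1]))
    if r[0] == "star":
        return seq(deriv(r[1], name), r)
    return EMPTY


def names(r):
    if r[0] == "sym":
        return {r[1]}
    if r[0] == "seq":
        return names(r[1]) | names(r[2])
    if r[0] == "alt":
        return set().union(*(names(p) for p in r[1]))
    return names(r[1]) if r[0] == "star" else set()


def is_subset(r, s):
    """True if every child sequence accepted by r is also accepted by s."""
    alphabet, seen, todo = names(r), set(), [(r, s)]
    while todo:
        pair = todo.pop()
        if pair in seen or pair[0] == EMPTY:
            continue
        seen.add(pair)
        if nullable(pair[0]) and not nullable(pair[1]):
            return False  # r accepts a sequence here that s rejects
        todo.extend((deriv(pair[0], n), deriv(pair[1], n)) for n in alphabet)
    return True


blueprint = seq(opt(sym("child1")), opt(sym("child2")), opt(sym("child3")))
child = sym("child2")
print(is_subset(child, blueprint), is_subset(blueprint, child))  # True False
```

    Converting both models to automata and checking inclusion on those would work just as well; derivatives are used here only because they keep the example short.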

  • 2021-02-04 16:26

    Since there is currently no available solution for validating/checking one schema against another, it looks like we have to use workarounds. Below is my attempt.

    Restating the problem:

    Identify whether all the data types defined in a subset schema exist in, and are within the (less strict) bounds of, what is defined in a "blueprint" schema.

    A possible solution:

    1. The first (obvious) piece of information is that a schema allows us to create XML instances (what does that even mean?! See 2.).
    2. Another (not so obvious) piece of information is that an XML schema can itself be a "subset instance". What I mean by that is: if you were to reverse-engineer a schema from your XML instance, you would get a subset schema. (This statement is not always true: if you do not have "lists", choice elements, or other “confines”, then the reverse-engineered subset schema will map exactly to the "blueprint" schema.)

    So, using this knowledge, we can construct a solution:

    Create all possible XML instances of a subset schema (doing this step programmatically can be a challenge), and validate these XML instances against the "blueprint" schema.

    But how do you know that the subset schema is a subset of the "blueprint" schema? Well, since you created all possible XML instances of the subset schema, they cover all the data types in the subset schema. Validating these XML instances against the "blueprint" schema inherently checks whether the data types exist and are within the bounds of what is defined in the "blueprint" schema, and hence identifies whether your subset schema is indeed a subset of the "blueprint" schema.
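
    To make the idea concrete, here is a toy Python sketch. It does not use real XSD files or a real validator: a schema is reduced to a hypothetical flat list of `(name, minOccurs, maxOccurs)` particles in one xs:sequence, all bounds are assumed finite, and adjacent particles are assumed to have different names so that the greedy check in `is_valid` is correct.

```python
from itertools import product

# Hypothetical toy model: one flat xs:sequence as (name, minOccurs, maxOccurs).
BLUEPRINT = [("child1", 0, 1), ("child2", 0, 1), ("child3", 0, 1)]
SUBSET = [("child2", 1, 1)]


def instances(model):
    """Yield every list of child names that the flat sequence model allows."""
    ranges = [range(low, high + 1) for _, low, high in model]
    for counts in product(*ranges):
        yield [name for (name, _, _), n in zip(model, counts) for _ in range(n)]


def is_valid(children, model):
    """Check a list of child names against a flat sequence model."""
    pos = 0
    for name, low, high in model:
        count = 0
        while pos < len(children) and children[pos] == name and count < high:
            pos += 1
            count += 1
        if count < low:
            return False
    return pos == len(children)


def counterexamples(subset, blueprint):
    """Instances of the subset model that the blueprint model rejects."""
    return [inst for inst in instances(subset) if not is_valid(inst, blueprint)]


print(counterexamples(SUBSET, BLUEPRINT))  # []
```

    An empty result means no generated instance was rejected. With maxOccurs="unbounded" or open value spaces the set of instances is infinite, so in practice the enumeration would have to be cut off at some bound, and the result would then be evidence rather than proof.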

    I know it's not an ideal solution, but I hope this helps, given that there is no simple way available to do what you are asking.
