What is the regular expression for the set of strings that validate exactly the same for xsd:token and xsd:string?

*爱你&永不变心* submitted on 2019-12-08 04:21:43

Question


I want to write an XSD that restricts the content of XML elements of type xsd:token so that, at validation, they are indistinguishable from the same content typed as xsd:string.

That is, they do not contain the carriage return (#xD), line feed (#xA), or tab (#x9) characters, do not begin or end with a space (#x20) character, and do not include a sequence of two or more adjacent space characters.

I think the regular expression to use is this:

\S+( \S+)*

(one or more non-whitespace characters, then optionally any number of groups made of a single space followed by one or more non-whitespace characters, so the string always ends in non-whitespace)

This works in various regex testing tools, but I can't confirm it in oXygen XML Editor: XML instances whose strings contain double spaces, leading or trailing spaces, tabs, or line breaks still pass validation.

Here's the XSD implementation:

<xs:simpleType name="Tokenized500Type">
  <xs:restriction base="xs:token">
    <xs:maxLength value="500"/>
    <xs:minLength value="1"/>
    <xs:pattern value="\S+( \S+)*"/>
  </xs:restriction>
</xs:simpleType>

Is there some feature of XML, XSD, or oXygen XML Editor that prevents this from working?


Answer 1:


Your original ([^\s])+( [^\s]+)*([^\s])* regex contains some redundant patterns: it matches and captures each iteration of 1+ non-whitespaces, then matches 0+ sequences of space and 1+ non-whitespaces, and then again tries to match and capture each iteration of a non-whitespace.

You may use a similar, but shorter

\S+( \S+)*

Since XML Schema regex is anchored by default, the expression matches:

  • \S+ - one or more chars other than whitespace, specifically &#x20; (space), \t (tab), \n (newline) and \r (return)
  • ( \S+)* - zero or more sequences of a single space followed by 1+ non-whitespace chars.

This expression disallows consecutive spaces as well as any leading or trailing space.
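As a rough way to try the pattern outside a schema validator, here is a minimal Python sketch. It is only an approximation of XSD behaviour: Python's \S excludes more characters than XSD's, so the character class is spelled out explicitly, and re.fullmatch stands in for XSD's implicit anchoring. The name is_token_like is made up for this illustration.

```python
import re

# XSD's \S means "anything except space, tab, LF, CR". Python's \s is broader,
# so the class is written out. fullmatch() emulates XSD's implicit anchoring.
XSD_NON_WS = r"[^ \t\n\r]"
TOKEN_LIKE = re.compile(rf"{XSD_NON_WS}+( {XSD_NON_WS}+)*")

def is_token_like(value: str) -> bool:
    """True if value has no tab/LF/CR, no edge spaces, and no double spaces."""
    return TOKEN_LIKE.fullmatch(value) is not None

# One accepted string, then one sample per rejected case.
samples = ["a b c", " a", "a ", "a  b", "a\tb", "a\nb", ""]
for s in samples:
    print(repr(s), is_token_like(s))
```

Only the first sample is accepted; each of the others breaks exactly one of the rules listed above.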

Here is how the regex should be used:

<xs:simpleType name="Tokenized500Type">
  <xs:restriction base="xs:string">
    <xs:pattern value="\S+( \S+)*"/>
    <xs:maxLength value="500"/>
    <xs:minLength value="1"/>
  </xs:restriction>
</xs:simpleType>



Answer 2:


The base type needs to be xsd:string.

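Using xsd:token normalizes the input first (tabs and line breaks become spaces, runs of spaces are collapsed, leading and trailing spaces are trimmed), and only THEN checks the result against the pattern. That makes the check redundant: the pattern never sees the stray whitespace it is meant to reject.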
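The ordering of normalization and pattern checking can be sketched in Python. This is an emulation under stated assumptions, not what a real validator runs: collapse imitates the whiteSpace="collapse" behaviour of xsd:token, while xsd:string is treated as leaving the value untouched. The helper names collapse and validates are hypothetical.

```python
import re

# Same pattern as in the schema, with XSD's \S written out explicitly.
PATTERN = re.compile(r"[^ \t\n\r]+( [^ \t\n\r]+)*")

def collapse(value: str) -> str:
    """Emulate whiteSpace="collapse": tab/LF/CR -> space, squeeze runs, trim."""
    replaced = re.sub(r"[\t\n\r]", " ", value)
    return re.sub(r" +", " ", replaced).strip(" ")

def validates(value: str, base: str) -> bool:
    """Hypothetical helper: normalize as the base type would, then match."""
    normalized = collapse(value) if base == "token" else value
    return PATTERN.fullmatch(normalized) is not None

raw = "  two  spaces\tand a tab "
print(validates(raw, "token"))   # pattern is applied to the collapsed value
print(validates(raw, "string"))  # pattern is applied to the raw value
```

With the token base the messy input is cleaned up before the pattern runs, so it passes; with the string base the same input is rejected, which is the behaviour the question is after.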



Source: https://stackoverflow.com/questions/40346316/what-is-the-regular-expression-for-the-set-of-strings-that-validate-exactly-the
