Assuming Unicode and case-insensitivity, should the pattern “..” match “FfIsS”?

梦谈多话 2021-02-12 19:54

It sounds like a joke, but I can sort of prove it.

Assumptions:

  • Dot matches any single character.
  • A case-insensitive pattern matches s
2 Answers
  •  醉酒成梦
    2021-02-12 20:31

    As maaartinus pointed out in his comment, Java provides (at least in theory) Unicode support for case-insensitive reg-exp matching. The Java API documentation says that matching is done "in a manner consistent with the Unicode Standard". The problem, however, is that the Unicode Standard defines different levels of support for case conversion and case-insensitive matching, and the API documentation does not specify which level the Java language supports.

    Although not documented, at least in Oracle's Java VM the reg-exp implementation is limited to so-called simple case-insensitive matching. The limiting factors relevant to your example data are that the matching algorithm only works as expected if the case folding (conversion) results in the same number of characters, and that sets (e.g. ".") are limited to matching exactly one character in the input string. The first limitation even means that "ß" does not match "SS", which you might also have expected to match.
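The behaviour described above can be checked with a small sketch along these lines. The class name `SimpleCaseFoldingDemo` and the helper `matchesIgnoreCase` are made up for illustration; only `java.util.regex.Pattern` and its two flags come from the JDK. The results in the comments are what the one-character-to-one-character limitation implies.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original answer.
class SimpleCaseFoldingDemo {

    // Full-string match with Unicode-aware case-insensitivity enabled.
    static boolean matchesIgnoreCase(String regex, String input) {
        int flags = Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Ordinary 1:1 case pairs work as expected.
        System.out.println(matchesIgnoreCase("s", "S"));                // true

        // U+00DF (ß) folds to two characters under full case folding,
        // so simple matching never equates it with "SS".
        System.out.println(matchesIgnoreCase("\u00DF", "SS"));          // false

        // ".." consumes exactly two characters: it matches U+FB03 U+00DF
        // (the ffi ligature followed by ß) ...
        System.out.println(matchesIgnoreCase("..", "\uFB03\u00DF"));    // true

        // ... but not the five-character string from the question.
        System.out.println(matchesIgnoreCase("..", "FfIsS"));           // false
    }
}
```

So the answer to the title question, as far as `java.util.regex` is concerned, is "no": each dot is matched against a single input character before any case folding could expand it.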

    To get support for full case-insensitive matching between string literals, you can use the reg-exp implementation in the ICU4J library, so that at least "ß" and "SS" match. AFAIK, however, there are no reg-exp implementations for Java with full support for groups, sets and wildcards.
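For plain string literals (no patterns involved), a rough stand-in for full case folding is possible with the JDK alone, because `String.toUpperCase` applies the one-to-many mappings that the regex engine skips. The following is only a sketch under that assumption: `ApproxFullCaseFold`, `fold` and `equalsFullIgnoreCase` are hypothetical names, and upper-casing then lower-casing is an approximation of Unicode full case folding, not an implementation of it.

```java
import java.util.Locale;

// Hypothetical helper; approximates full case folding with JDK case mappings.
class ApproxFullCaseFold {

    // Upper-case first (expands ß to "SS", the ffi ligature to "FFI"),
    // then lower-case to get a common comparison form. Locale.ROOT avoids
    // locale-specific rules such as the Turkish dotless i.
    static String fold(String s) {
        return s.toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT);
    }

    static boolean equalsFullIgnoreCase(String a, String b) {
        return fold(a).equals(fold(b));
    }

    public static void main(String[] args) {
        // The built-in comparison works char by char, so lengths must agree.
        System.out.println("\u00DF".equalsIgnoreCase("SS"));                 // false

        // With the approximate fold, ß and "SS" compare equal ...
        System.out.println(equalsFullIgnoreCase("\u00DF", "SS"));            // true

        // ... and so do U+FB03 U+00DF and the string from the question.
        System.out.println(equalsFullIgnoreCase("\uFB03\u00DF", "FfIsS"));   // true
    }
}
```

This only covers whole-literal comparison. It does not help with wildcards or sets, which is exactly the gap the answer describes.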
