Java regex - expression with exactly one whitespace

攒了一身酷 2021-01-14 12:08

I want to match all expressions with exactly one whitespace. Currently, I'm using [^\\s]*\\s[^\\s]*. That doesn't seem like a very good way, though.

6 Answers
  • 2021-01-14 12:36

    Another way to do it, if you don't want to go the regex way (possible performance increase):

    String s = "one whitespace";

    // Returns true only if s contains exactly one ' ' (U+0020) character.
    public boolean hasOneWhitespace(String s) {
       int count = 0;
       for (int i = 0; i < s.length(); i++) {
          if (s.charAt(i) == ' ') {
             count++;
             if (count > 1) return false;   // second space found: stop early
          }
       }
       return count == 1;
    }
    

    Of course, this works only if you consider " " to be the sole whitespace character; tabs and newlines are not counted.
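
    To count tabs and newlines as well, the same loop can test each character with Character.isWhitespace instead of comparing against ' '. This is a minimal sketch, not part of the original answer. The class name is illustrative, and Character.isWhitespace deliberately excludes the non-breaking spaces U+00A0, U+2007, and U+202F.

```java
class OneWhitespaceCheck {
    // Returns true if s contains exactly one character for which
    // Character.isWhitespace is true (space, tab, newline, and so on).
    static boolean hasOneWhitespace(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            if (Character.isWhitespace(s.charAt(i))) {
                count++;
                if (count > 1) return false; // stop early on the second hit
            }
        }
        return count == 1;
    }

    public static void main(String[] args) {
        System.out.println(hasOneWhitespace("one whitespace"));  // true
        System.out.println(hasOneWhitespace("one\twhitespace")); // true
        System.out.println(hasOneWhitespace("no_whitespace"));   // false
        System.out.println(hasOneWhitespace("a b c"));           // false
    }
}
```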

  • 2021-01-14 12:37

    I want to match all expressions with exactly one whitespace.

    The correct pattern for finding out whether exactly one whitespace character occurs in a Java string is:

    \A[^\u0009\u000A-\u000D\u0020\u0085\u00A0\u1680\u180E\u2000-\u200A\u2028\u2029\u202F\u205F\u3000]*+[\u0009\u000A-\u000D\u0020\u0085\u00A0\u1680\u180E\u2000-\u200A\u2028\u2029\u202F\u205F\u3000][^\u0009\u000A-\u000D\u0020\u0085\u00A0\u1680\u180E\u2000-\u200A\u2028\u2029\u202F\u205F\u3000]*+\z
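
    In Java source, an explicit enumeration like this is easier to read when the character-class body is written once and reused. The following is a sketch under that assumption. The class, constant, and method names are illustrative, and the pattern is the "non-whitespace, one whitespace, non-whitespace" form needed for an exactly-one test.

```java
import java.util.regex.Pattern;

class ExactlyOneUnicodeWhitespace {
    // Body of the character class: the code points enumerated in this answer.
    // The doubled backslashes pass the escapes through to the regex engine.
    private static final String WS =
        "\\u0009\\u000A-\\u000D\\u0020\\u0085\\u00A0\\u1680\\u180E"
        + "\\u2000-\\u200A\\u2028\\u2029\\u202F\\u205F\\u3000";

    // Any run of non-whitespace, one whitespace, any run of non-whitespace.
    private static final Pattern EXACTLY_ONE =
        Pattern.compile("\\A[^" + WS + "]*+[" + WS + "][^" + WS + "]*+\\z");

    static boolean hasExactlyOneWhitespace(String s) {
        return EXACTLY_ONE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(hasExactlyOneWhitespace("a bc"));      // true
        System.out.println(hasExactlyOneWhitespace("a\u00A0bc")); // true (NO-BREAK SPACE)
        System.out.println(hasExactlyOneWhitespace("abc"));       // false
        System.out.println(hasExactlyOneWhitespace("a b c"));     // false
    }
}
```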
    

    The other answers provided here do not correctly answer the question asked.

    Here are all the Unicode whitespace characters, along with their ages (meaning, which Unicode release they first appeared in) and their binary properties that are related to spacing issues.

    U+0009 CHARACTER TABULATION
        \s \h \pC \p{Cc}
        Age=1.1 HorizSpace Pattern_White_Space Space White_Space
    U+000A LINE FEED (LF)
        \s \v \R \pC \p{Cc}
        Age=1.1 Pattern_White_Space Space VertSpace White_Space
    U+000B LINE TABULATION
        \v \R \pC \p{Cc}
        Age=1.1 Pattern_White_Space Space VertSpace White_Space
    U+000C FORM FEED (FF)
        \s \v \R \pC \p{Cc}
        Age=1.1 Pattern_White_Space Space VertSpace White_Space
    U+000D CARRIAGE RETURN (CR)
        \s \v \R \pC \p{Cc}
        Age=1.1 Pattern_White_Space Space VertSpace White_Space
    U+0020 SPACE
        \s \h \pZ \p{Zs}
        Age=1.1 HorizSpace Pattern_White_Space Space Space_Separator White_Space
    U+0085 NEXT LINE (NEL)
        \s \v \R \pC \p{Cc}
        Age=1.1 Pattern_White_Space Space VertSpace White_Space
    U+00A0 NO-BREAK SPACE
        \s \h \pZ \p{Zs}
        Age=1.1 HorizSpace Space Space_Separator White_Space
    U+1680 OGHAM SPACE MARK
        \s \h \pZ \p{Zs}
        Age=3.0 HorizSpace Space Space_Separator White_Space
    U+180E MONGOLIAN VOWEL SEPARATOR
        \s \h \pZ \p{Zs}
        Age=3.0 HorizSpace Space Space_Separator White_Space
    U+2000 EN QUAD
        \s \h \pZ \p{Zs}
        Age=1.1 HorizSpace Space Space_Separator White_Space
    U+2001 EM QUAD
        \s \h \pZ \p{Zs}
        Age=1.1 HorizSpace Space Space_Separator White_Space
    U+2002 EN SPACE
        \s \h \pZ \p{Zs}
        Age=1.1 HorizSpace Space Space_Separator White_Space
    U+2003 EM SPACE
        \s \h \pZ \p{Zs}
        Age=1.1 HorizSpace Space Space_Separator White_Space
    U+2004 THREE-PER-EM SPACE
        \s \h \pZ \p{Zs}
        Age=1.1 HorizSpace Space Space_Separator White_Space
    U+2005 FOUR-PER-EM SPACE
        \s \h \pZ \p{Zs}
        Age=1.1 HorizSpace Space Space_Separator White_Space
    U+2006 SIX-PER-EM SPACE
        \s \h \pZ \p{Zs}
        Age=1.1 HorizSpace Space Space_Separator White_Space
    U+2007 FIGURE SPACE
        \s \h \pZ \p{Zs}
        Age=1.1 HorizSpace Space Space_Separator White_Space
    U+2008 PUNCTUATION SPACE
        \s \h \pZ \p{Zs}
        Age=1.1 HorizSpace Space Space_Separator White_Space
    U+2009 THIN SPACE
        \s \h \pZ \p{Zs}
        Age=1.1 HorizSpace Space Space_Separator White_Space
    U+200A HAIR SPACE
        \s \h \pZ \p{Zs}
        Age=1.1 HorizSpace Space Space_Separator White_Space
    U+2028 LINE SEPARATOR
        \s \v \R \pZ \p{Zl}
        Age=1.1 Pattern_White_Space Space VertSpace White_Space
    U+2029 PARAGRAPH SEPARATOR
        \s \v \R \pZ \p{Zp}
        Age=1.1 Pattern_White_Space Space VertSpace White_Space
    U+202F NARROW NO-BREAK SPACE
        \s \h \pZ \p{Zs}
        Age=3.0 HorizSpace Space Space_Separator White_Space
    U+205F MEDIUM MATHEMATICAL SPACE
        \s \h \pZ \p{Zs}
        Age=3.2 HorizSpace Space Space_Separator White_Space
    U+3000 IDEOGRAPHIC SPACE
        \s \h \pZ \p{Zs}
        Age=1.1 HorizSpace Space Space_Separator White_Space
    

    Note that all but four were present ever since way way way back in Unicode 1.1. U+1680 OGHAM SPACE MARK, U+180E MONGOLIAN VOWEL SEPARATOR, and U+202F NARROW NO-BREAK SPACE entered The Unicode Standard with release 3.0, and U+205F MEDIUM MATHEMATICAL SPACE first appeared with the 3.2 release. There have been no more added since that time.

    The \p{Whitespace} property is required for compliance with UTS#18 RL1.2 “Properties”, and the \p{space} alias and the \s shortcut for whitespace are both required for compliance with UTS#18 RL1.2a “Compatibility Properties”.

    As explained in The Unicode Standard 6.0.0’s Conformance document, the White_Space property is a normative property, not an informative, contributory, or provisional property. Because it is a normative property, you are strictly required to use these values to correctly process all Unicode character data according to The Unicode Standard.

    Nothing in j.u.r.Pattern provides functionality conformant with The Unicode Standard in this regard. In fact, Java’s regexes fail to meet half the mandatory requirements necessary for even the very lowest possible level of compliance set forth in UTS #18: Unicode Regular Expressions. That minimum level is Level 1, about which is written:

    Level 1 is the minimally useful level of support for Unicode. All regex implementations dealing with Unicode should be at least at Level 1.

    Because Java’s regexes fail to meet even these very barest of minimal requirements indispensable for dealing with Unicode, Java’s regexes are not minimally useful for dealing with Unicode. You must therefore resort to such explicit enumerations as given above if you hope to produce conformant behaviour. You might care to consider using my pattern-rewriting library.
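
    On Java 7 and later, Pattern also offers the UNICODE_CHARACTER_CLASS flag, under which \s follows the Unicode White_Space property; on such a runtime the explicit enumeration may be unnecessary. The following is a hedged sketch assuming such a JDK. The class and method names are illustrative, and the exact set matched depends on the Unicode version bundled with the JDK.

```java
import java.util.regex.Pattern;

class UnicodeWhitespaceFlag {
    // Default mode: the whitespace class covers only space, tab, LF, VT, FF, and CR.
    static final Pattern DEFAULT = Pattern.compile("\\S*\\s\\S*");

    // Unicode mode: the whitespace class follows the White_Space property.
    static final Pattern UNICODE =
        Pattern.compile("\\S*\\s\\S*", Pattern.UNICODE_CHARACTER_CLASS);

    static boolean defaultMatch(String s) { return DEFAULT.matcher(s).matches(); }
    static boolean unicodeMatch(String s) { return UNICODE.matcher(s).matches(); }

    public static void main(String[] args) {
        String nbsp = "a\u00A0b"; // one NO-BREAK SPACE between two letters
        System.out.println(defaultMatch(nbsp)); // false
        System.out.println(unicodeMatch(nbsp)); // true
        System.out.println(unicodeMatch("a b c")); // false: two whitespace characters
    }
}
```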

  • 2021-01-14 12:44

    Why not? It's fine, just a bit overcomplicated:

    \\S*\\s\\S*
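
    This pattern only means "exactly one" when the whole input has to match. A small sketch of the difference between Matcher.matches() and Matcher.find(), with illustrative class and method names:

```java
import java.util.regex.Pattern;

class MatchesVersusFind {
    static final Pattern ONE_WS = Pattern.compile("\\S*\\s\\S*");

    // matches() requires the pattern to cover the entire input.
    static boolean wholeInput(String s) { return ONE_WS.matcher(s).matches(); }

    // find() accepts any substring, so it only proves "at least one".
    static boolean anywhere(String s) { return ONE_WS.matcher(s).find(); }

    public static void main(String[] args) {
        System.out.println(wholeInput("a bc"));    // true
        System.out.println(wholeInput("a b c d")); // false
        System.out.println(anywhere("a b c d"));   // true: "a b" matches on its own
        System.out.println(anywhere("abc"));       // false
    }
}
```

    With find(), or when embedding the pattern in a larger one, anchors such as ^ and $ are needed to get the exactly-one behaviour.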
    
  • 2021-01-14 12:52

    You could also check it with indexOf:

    String s = "some text";
    int indexOf = s.indexOf(' ');
    boolean isOneWhitespace = (indexOf >= 0 && indexOf == s.lastIndexOf(' '));
    
  • 2021-01-14 12:53
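    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    String[] ss = { " ", "abc", "a bc", "a b c d" };
    // matches() already requires the whole input to match, so ^ and $ are redundant but harmless.
    Matcher m = Pattern.compile("^\\S*\\s\\S*$").matcher("");
    for (String s : ss)
    {
      // Reuse the same Matcher for each input by resetting it.
      if (m.reset(s).matches())
      {
        System.out.printf("%n>>%s<< OK%n", s);
      }
    }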
    

    output:

    >> << OK
    
    >>a bc<< OK
    
  • 2021-01-14 12:59

    Use transliterate. It has to be an independent test: the regex you have above cannot be combined with a larger regex and still test for a single whitespace.

    Transliterate is 10-20 times faster than a regex for this test.
    This is a jtr example:

    String aInput = "This is a test, 123.";
    CharacterReplacer cReplacer = Perl5Parser.makeReplacer( "tr[ \\t\\r\\n\\f\\x0B][ \\t\\r\\n\\f\\x0B]" );
    String aResult = cReplacer.doReplacement( aInput );
    int nMatches = cReplacer.getMatches();
    
    if (nMatches == 1) { ... }
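
    Without the jtr dependency, the same counting idea can be sketched in plain Java. The class and method names below are illustrative and not part of jtr; the character set mirrors the tr expression above (space, tab, CR, LF, FF, VT).

```java
class TrStyleCount {
    // Characters to count: space, tab, CR, LF, form feed, vertical tab.
    private static final String WS_SET = " \t\r\n\f\u000B";

    // Counts how many characters of s belong to WS_SET.
    static long countWhitespace(String s) {
        return s.chars().filter(c -> WS_SET.indexOf(c) >= 0).count();
    }

    static boolean hasExactlyOne(String s) {
        return countWhitespace(s) == 1;
    }

    public static void main(String[] args) {
        String aInput = "This is a test, 123.";
        System.out.println(countWhitespace(aInput)); // 4
        System.out.println(hasExactlyOne(aInput));   // false
        System.out.println(hasExactlyOne("a bc"));   // true
    }
}
```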
    