Match a word using regex that also handles apostrophes

[愿得一人] 2021-01-13 18:34

I have to split a line of text into words, and I am unsure which regex to use. I have looked everywhere for a regex that matches a word and found ones similar to this

2 Answers
  • 2021-01-13 19:21

    Using WhirlWind's answer from the page linked in my comment, you can do the following:

    String candidate = "I \n"+
        "like \n"+
        "to "+
        "eat "+
        "but "+
        "I "+
        "don't "+
        "like "+
        "to "+
        "eat "+
        "everyone's "+
        "food "+
        "''  ''''  '.' ' "+
        "or "+
        "they'll "+
        "starv'e'";
    
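    // Requires java.util.regex.Matcher and java.util.regex.Pattern.
    // Alternatives, tried in order: 'word, word'word, word', word
    String regex = "('\\w+)|(\\w+'\\w+)|(\\w+')|(\\w+)";
    Matcher matcher = Pattern.compile(regex).matcher(candidate);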
    while (matcher.find()) {
      System.out.println("> matched: `" + matcher.group() + "`");
    }
    

    It will print:

    > matched: `I`
    > matched: `like`
    > matched: `to`
    > matched: `eat`
    > matched: `but`
    > matched: `I`
    > matched: `don't`
    > matched: `like`
    > matched: `to`
    > matched: `eat`
    > matched: `everyone's`
    > matched: `food`
    > matched: `or`
    > matched: `they'll`
    > matched: `starv'e`
    

    You can find a running example here: http://ideone.com/pVOmSK
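
For reuse, the matcher loop above can be wrapped in a small helper that returns the words as a list. This is only a sketch under assumptions: the class name `ApostropheWords` and the method `extract` are hypothetical and do not come from the answer; the pattern is the one shown above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative only.
final class ApostropheWords {
    // Same alternation as in the answer: 'word | word'word | word' | word
    private static final Pattern WORD =
            Pattern.compile("('\\w+)|(\\w+'\\w+)|(\\w+')|(\\w+)");

    // Collects every match of WORD in the order it appears in the text.
    static List<String> extract(String text) {
        List<String> words = new ArrayList<>();
        Matcher matcher = WORD.matcher(text);
        while (matcher.find()) {
            words.add(matcher.group());
        }
        return words;
    }

    public static void main(String[] args) {
        // Stray apostrophes are skipped; the final quote after starv'e is dropped.
        System.out.println(extract("I don't like everyone's food ''  '.' or they'll starv'e'"));
    }
}
```

Note that by default `\w` in Java covers only ASCII letters, digits and the underscore, so accented letters would split a word unless the pattern is compiled with `Pattern.UNICODE_CHARACTER_CLASS`.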

  • 2021-01-13 19:22

    The following regex, used as a split delimiter, seems to handle your sample string correctly, but it doesn't cover your scenario for the apostrophe.

    [\s,.?!"]+
    

    Java Code:

    String input = "I like to eat but I don't like to eat everyone's food, or they'll starve.";
    String[] inputWords = input.split("[\\s,.?!\"]+");
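
To see where this simple delimiter falls short, here is a minimal sketch that applies the `[\s,.?!"]+` regex to a sentence containing single-quoted text. The class name `SimpleSplitDemo` and the method `split` are hypothetical, chosen for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class; the names are illustrative only.
final class SimpleSplitDemo {
    // Splits on runs of whitespace and the punctuation , . ? ! "
    static List<String> split(String text) {
        return Arrays.asList(text.split("[\\s,.?!\"]+"));
    }

    public static void main(String[] args) {
        // The quoting apostrophes stay attached: 'the and meat'
        System.out.println(split("Hey there! Don't eat 'the mystery meat'."));
    }
}
```

Because the apostrophe is not in the delimiter class, contractions such as `Don't` survive, but so do apostrophes used as quotation marks.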
    

    If I understand correctly, the apostrophe should be left alone as long as it is after a word character. This next regex should cover the above plus the special case for the apostrophe.

    (?<!\w)'|[\s,.?"!][\s,.?"'!]*
    

    Java Code:

    String input = "I like to eat but I don't like to eat everyone's food, or they'll starve.";
    String[] inputWords = input.split("(?<!\\w)'|[\\s,.?\"!][\\s,.?\"'!]*");
    

    If I run the second regex on the string "Hey there! Don't eat 'the mystery meat'.", I get the following words in my string array:

    Hey
    there
    Don't
    eat
    the
    mystery
    meat'
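
The run above can be reproduced with a short self-contained sketch. The class name `QuoteAwareSplitDemo` and the method `split` are hypothetical; the delimiter is the second regex from this answer.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class; the names are illustrative only.
final class QuoteAwareSplitDemo {
    // An apostrophe not preceded by a word character is a delimiter on its own;
    // otherwise a delimiter starts with whitespace or , . ? " ! and may then
    // absorb further such characters, including apostrophes.
    private static final String DELIMITER = "(?<!\\w)'|[\\s,.?\"!][\\s,.?\"'!]*";

    static List<String> split(String text) {
        return Arrays.asList(text.split(DELIMITER));
    }

    public static void main(String[] args) {
        System.out.println(split("Hey there! Don't eat 'the mystery meat'."));
        System.out.println(split("I don't like everyone's food, or they'll starve."));
    }
}
```

The closing quote stays on `meat'` because that apostrophe follows a word character. A regex alone cannot tell it apart from a plural possessive such as `students'`, so whether to strip it is a judgment call for the caller.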
    