Regular Expression Wildcard Matching

甜味超标 2020-12-15 04:13

I have a list of about 120 thousand English words (basically every word in the language).

I need a regular expression that would allow searching through these words

9 answers
  • 2020-12-15 04:45

    Here is a way to transform a wildcard pattern into a regex:

    1. Prepend \ to every regex special character ([{\^-=$!|]}).+ so that it is matched literally and the user does not get surprising results. In flavors that support it (Java, Perl, PCRE), you can instead enclose the literal text between \Q (which starts the quote) and \E (which ends it). See also the paragraph about security.
    2. Replace the * wildcard with \S*
    3. Replace the ? wildcard with \S? (or with \S if ? must match exactly one character)
    4. Optionally: prepend ^ to the pattern. This anchors the match at the beginning of the string.
    5. Optionally: append $ to the pattern. This anchors the match at the end of the string.

      \S stands for any non-whitespace character; the * after it lets it occur zero or more times, and the ? lets it occur at most once.
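The steps above can be sketched in Python roughly as follows. This is a minimal illustration, not a definitive implementation: the function name `wildcard_to_regex` and its `anchor_start` / `anchor_end` parameters are hypothetical, and `re.escape` stands in for the manual escaping of step 1 (Python's `re` module does not support \Q...\E).

```python
import re

def wildcard_to_regex(wildcard, anchor_start=True, anchor_end=True):
    """Translate a wildcard string into a regex string (hypothetical helper)."""
    parts = []
    for ch in wildcard:
        if ch == "*":
            parts.append(r"\S*")         # step 2: any run of non-space characters
        elif ch == "?":
            parts.append(r"\S?")         # step 3: at most one non-space character
        else:
            parts.append(re.escape(ch))  # step 1: literal, escaped if special
    body = "".join(parts)
    # steps 4 and 5: optional anchors
    return ("^" if anchor_start else "") + body + ("$" if anchor_end else "")

words = ["bend", "bending", "band", "bnd", "abend", "a.c"]
pattern = re.compile(wildcard_to_regex("b?nd*"))
matches = [w for w in words if pattern.search(w)]
print(wildcard_to_regex("b?nd*"))  # ^b\S?nd\S*$
print(matches)                     # ['bend', 'bending', 'band', 'bnd']
```

Note that "bnd" matches because \S? also accepts zero characters; swap in \S if ? should stand for exactly one character.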

    Consider using reluctant (non-greedy) quantifiers if there are characters to match after * or +. To do this, add ? after the * or +, like this: \S*? and \S+?
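A small Python illustration of the difference between the greedy and the reluctant form; the sample string is made up for the example.

```python
import re

text = "one-two-three"

# Greedy: \S* grabs as much as it can, then backtracks to the last "-".
greedy = re.match(r"(\S*)-", text).group(1)

# Reluctant: \S*? grabs as little as it can, stopping at the first "-".
reluctant = re.match(r"(\S*?)-", text).group(1)

print(greedy)     # one-two
print(reluctant)  # one
```

When the pattern is anchored at both ends, the set of matching strings stays the same either way; the choice affects what each wildcard captures and how much backtracking the engine does.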

    Consider security: the user is effectively sending you code to run, because a regex is a kind of code and the user's string is used as the regex. Never pass an unescaped user-supplied pattern to other parts of the application; use it only to filter data retrieved by other means. Otherwise the user can slow your code down by embedding an expensive regex within the wildcard string, which could be used in DoS attacks.
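To make the escaping point concrete, here is a small Python sketch. The hostile input string is invented for the example, and `re.escape` is used as one way to neutralize metacharacters; other languages have their own quoting helpers.

```python
import re

# A hostile "wildcard": used raw as a regex, this is a nested quantifier.
user_input = "(a+)+$"

# After escaping, every metacharacter is just a literal character.
safe = re.escape(user_input)

matches_itself = bool(re.fullmatch(safe, "(a+)+$"))
matches_run_of_a = bool(re.fullmatch(safe, "aaaa"))

print(safe)              # \(a\+\)\+\$
print(matches_itself)    # True
print(matches_run_of_a)  # False
```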

    Example to show execution speeds of similar patterns:

    seq 1 50000000 > ~/1
    du -sh ~/1
    563M
    time grep -P '.*' ~/1 &>/dev/null
    6.65s
    time grep -P '.*.*.*.*.*.*.*.*' ~/1 &>/dev/null
    12.55s
    time grep -P '.*..*..*..*..*.*' ~/1 &>/dev/null
    31.14s
    time grep -P '\S*.\S*.\S*.\S*.\S*\S*' ~/1 &>/dev/null
    31.27s
    

    I'd suggest against using .* simply because it can match anything, and usually things are separated with spaces.

  • 2020-12-15 04:45

    Replace * with .* (the regex equivalent of "0 or more of any character").

  • 2020-12-15 04:47

    . is an expression that matches any one character, as you've discovered. In your hours of searching, you undoubtedly also stumbled across *, which is a repetition operator that when used after an expression matches the preceding expression zero or more times in a row.

    So the equivalent to your meaning of * is putting these two together: .*. This then means "any character zero or more times".
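Applied to a word list like the one in the question, the `.*` substitution might look like this in Python. The helper name `find_words` and the five-word sample are made up for illustration, and escaping the rest of the pattern with `re.escape` is an added precaution that this answer does not mention.

```python
import re

def find_words(wildcard, words):
    """Return the words that match `wildcard` from start to end (hypothetical helper)."""
    # Escape everything, then turn the escaped "*" back into ".*".
    regex = re.compile(re.escape(wildcard).replace(r"\*", ".*"))
    return [w for w in words if regex.fullmatch(w)]

sample = ["ben", "bend", "bending", "bent", "absent"]
print(find_words("ben*", sample))  # ['ben', 'bend', 'bending', 'bent']
print(find_words("*ent", sample))  # ['bent', 'absent']
```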

    See the Regex Tutorial on repetition operators.

  • 2020-12-15 04:54
    1. Replace all '?' characters with '\w'
    2. Replace all '*' characters with '\w*'

    The '*' operator repeats the previous item, here '\w' (any word character), 0 or more times.

    This assumes that none of the words contain '.', '*', or '?'.

    This is a good reference

    http://www.regular-expressions.info/reference.html

  • 2020-12-15 04:56

    Unless you want some funny behaviour, I would recommend you use \w instead of .

    . matches whitespace and other non-word symbols, which you might not want it to do.

    So I would replace ? with \w and replace * with \w*

    Also if you want * to match at least one character, replace it with \w+ instead. This would mean that ben* would match bend and bending but not ben - it's up to you, just depends what your requirements are.
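The difference between these choices can be checked with a short Python snippet. The word list is invented for the example; "ben-hur" is included only to show how `.` and `\w` treat a non-word character.

```python
import re

words = ["ben", "bend", "bending", "ben-hur"]

def matching(regex):
    """Return the words that match `regex` in full."""
    compiled = re.compile(regex)
    return [w for w in words if compiled.fullmatch(w)]

zero_or_more = matching(r"ben\w*")  # * as "zero or more word characters"
one_or_more = matching(r"ben\w+")   # * as "at least one word character"
any_char = matching(r"ben.*")       # * as "anything", including "-"

print(zero_or_more)  # ['ben', 'bend', 'bending']
print(one_or_more)   # ['bend', 'bending']
print(any_char)      # ['ben', 'bend', 'bending', 'ben-hur']
```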

  • 2020-12-15 04:57

    Take a look at this library: https://github.com/alenon/JWildcard

    It wraps every part of the pattern that is not a wildcard in regex quotes (\Q...\E), so special characters need no extra processing. This wildcard:

    "mywil?card*"
    

    will be converted to this regex string:

    "\Qmywil\E.\Qcard\E.*"
    

    If you wish to convert wildcard to regex string use:

    JWildcard.wildcardToRegex("mywil?card*");
    

    If you wish to check the matching directly you can use this:

    JWildcard.matches("mywild*", "mywildcard");
    

    The default wildcard rules are "?" -> "." and "*" -> ".*", but you can change the default behaviour by defining new rules.

    JWildcard.wildcardToRegex(wildcard, rules, strict);
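For readers outside Java, the rule-based idea can be sketched in Python as below. This is not JWildcard's implementation: the names `DEFAULT_RULES` and `wildcard_to_regex` are hypothetical, and `re.escape` replaces the \Q...\E quoting because Python's `re` module does not support it.

```python
import re

# Assumed defaults, mirroring the rules described above.
DEFAULT_RULES = {"?": ".", "*": ".*"}

def wildcard_to_regex(wildcard, rules=DEFAULT_RULES):
    """Quote literal text and substitute each wildcard symbol per `rules`."""
    # The capture group makes re.split keep the wildcard symbols in its output.
    splitter = "(" + "|".join(re.escape(sym) for sym in rules) + ")"
    pieces = []
    for token in re.split(splitter, wildcard):
        if token in rules:
            pieces.append(rules[token])      # wildcard symbol: apply its rule
        elif token:
            pieces.append(re.escape(token))  # literal text: quote it
    return "".join(pieces)

print(wildcard_to_regex("mywil?card*"))  # mywil.card.*
print(wildcard_to_regex("ben*", {"?": r"\w", "*": r"\w*"}))  # ben\w*
```

Passing a different `rules` mapping, as in the second call, swaps in the `\w`-based behaviour without touching the quoting logic.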
    

    You can build it from the sources or fetch it with Maven or Gradle from Bintray JCenter: https://bintray.com/yevdo/jwildcard/jwildcard

    Gradle way:

    compile 'com.yevdo:jwildcard:1.4'
    

    Maven way:

    <dependency>
      <groupId>com.yevdo</groupId>
      <artifactId>jwildcard</artifactId>
      <version>1.4</version>
    </dependency>
    