Regex that matches punctuation at the word boundary including underscore

灰色年华 2020-12-20 00:34

I am looking for a Python regex for a variable phrase with the following properties: (For the sake of example, let's assume the variable phrase here is taking the value

2 answers
  • 2020-12-20 00:48

    You may use

    r'(?<![^\W_])and(?![^\W_])'
    

    See the regex demo. Compile with the re.I flag to enable case-insensitive matching.

    Details

    • (?<![^\W_]) - the preceding char should not be a letter or digit char
    • and - some keyword
    • (?![^\W_]) - the next char cannot be a letter or digit
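
To see why the plain \b word boundary is not enough here, the following minimal sketch compares it with the lookaround pattern above. The names plain_boundary and custom_boundary are illustrative only, not part of the original answer.

```python
import re

# \b treats "_" as a word character, so it sees no boundary between "_" and "a".
plain_boundary = re.compile(r'\band\b', re.I)

# [^\W_] matches only letters and digits, so "_" counts as a separator here.
custom_boundary = re.compile(r'(?<![^\W_])and(?![^\W_])', re.I)

for s in ['this_and', 'this.and', 'land']:
    print("{}: \\b={} lookarounds={}".format(
        s, bool(plain_boundary.search(s)), bool(custom_boundary.search(s))))
```

Only 'this_and' is treated differently by the two patterns: \b rejects it, while the lookaround version accepts it.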

    Python demo:

    import re
    strs = ['this_and', 'this.and', '(and)', '[and]', 'and^', ';And', 'land', 'andy']
    phrase = "and"
    rx = re.compile(r'(?<![^\W_]){}(?![^\W_])'.format(re.escape(phrase)), re.I)
    for s in strs:
        print("{}: {}".format(s, bool(rx.search(s))))
    

    Output:

    this_and: True
    this.and: True
    (and): True
    [and]: True
    and^: True
    ;And: True
    land: False
    andy: False
    
  • 2020-12-20 01:07

    Here is a regex that might solve it:

    Regex

    (?<=[\W_]+|^)and(?=[\W_]+|$)
    

    Example

    import regex  # third-party module (pip install regex); allows variable-length lookbehind
    
    pattern = r'(?<=[\W_]+|^)and(?=[\W_]+|$)'
    
    string = 'this_And'
    test = regex.search(pattern, string.lower())
    print(test.group(0))
    # prints and
    
    # No match
    string = 'Andy'
    test = regex.search(pattern, string.lower())
    print(test)
    # prints None
    
    strings = ["this_and", "this.and", "(and)", "[and]", "and^", ";And"]
    # Search each string once and keep only the successful matches
    matches = [regex.search(pattern, s.lower()) for s in strings]
    print([m.group(0) for m in matches if m])
    # prints ['and', 'and', 'and', 'and', 'and', 'and']
    

    Explanation

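    [\W_]+ means that immediately before the keyword (inside the lookbehind (?<=...)) and immediately after it (inside the lookahead (?=...)) we accept only non-word characters, plus the underscore _, which is a word character but is allowed here as a separator. |^ and |$ allow matches to lie at the edge of the string.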

    Edit

    As mentioned in my comment, the regex module does not raise errors for variable-length lookbehinds (unlike re).

    # This works fine
    import regex  # third-party module (pip install regex)
    word = 'and'
    pattern = r'(?<=[\W_]+|^){}(?=[\W_]+|$)'.format(word.lower())
    string = 'this_And'
    regex.search(pattern, string.lower())
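
For contrast, here is a small sketch of what happens when the same pattern is handed to the standard-library re module. It assumes CPython's re, which requires lookbehinds to have a fixed width. The variable compiled_ok is only there for illustration.

```python
import re

pattern = r'(?<=[\W_]+|^)and(?=[\W_]+|$)'

try:
    re.compile(pattern)
    compiled_ok = True
except re.error as exc:
    # The lookbehind contains [\W_]+ and an alternation, so its width is not fixed.
    compiled_ok = False
    print("re rejected the pattern:", exc)
```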
    

    However, if you insist on using re, then off the top of my head I'd suggest splitting the lookbehind in two: (?<=[\W_])and(?=[\W_]+|$)|^and(?=[\W_]+|$). That way, cases where the string starts with and are captured as well.

    # This also works fine
    import re
    word = 'and'
    pattern = r'(?<=[\W_]){}(?=[\W_]+|$)|^{}(?=[\W_]+|$)'.format(word.lower(), word.lower())
    string = 'this_And'
    re.search(pattern, string.lower())
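
If the phrase can contain regex metacharacters, the split pattern above would need escaping before it is formatted in. A minimal sketch, assuming a hypothetical helper build_pattern (not part of the original answer) that applies re.escape and re.I instead of lowercasing:

```python
import re

def build_pattern(phrase):
    """Hypothetical helper: wrap an arbitrary phrase in the split-lookbehind form."""
    escaped = re.escape(phrase)  # neutralise metacharacters such as '+' or '.'
    # The trailing '+' in the original lookahead is dropped: one character is enough.
    return re.compile(
        r'(?<=[\W_]){0}(?=[\W_]|$)|^{0}(?=[\W_]|$)'.format(escaped),
        re.I,
    )

rx = build_pattern('c++')
results = {s: bool(rx.search(s)) for s in ['use_C++_here', 'c++', 'abc++']}
print(results)
```

With this sketch, 'use_C++_here' and 'c++' match, while 'abc++' does not, because the phrase is preceded by a letter.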
    