Use Regex re.sub to remove everything before and including a specified word

执笔经年 2021-01-19 00:48

I've got a string that looks like "Blah blah blah, Updated: Aug. 23, 2012", from which I want to use regex to extract just the date, Aug. 23, 2012. I found

3 Answers
  • 2021-01-19 01:45

    In this case, you can do it without regex, e.g.:

    >>> date_div = "Blah blah blah, Updated: Aug. 23, 2012"
    >>> date_div.split('Updated: ')
    ['Blah blah blah, ', 'Aug. 23, 2012']
    >>> date_div.split('Updated: ')[-1]
    'Aug. 23, 2012'
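
    Note that split(...)[-1] returns the whole string unchanged when the marker is missing. A minimal sketch of a variant that makes that case explicit, using str.partition (the helper name text_after is hypothetical, not part of the original answer):

```python
def text_after(text, marker):
    """Return the text following the first occurrence of marker, or None if absent."""
    # partition() always returns a 3-tuple; sep is '' when marker is not found.
    before, sep, after = text.partition(marker)
    return after.strip() if sep else None


date_div = "Blah blah blah, Updated: Aug. 23, 2012"
print(text_after(date_div, "Updated:"))  # Aug. 23, 2012
print(text_after(date_div, "Created:"))  # None
```

    Use str.rpartition instead if you want the text after the last occurrence of the marker.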
    
  • 2021-01-19 01:46

    With a regex, you may use two regexps depending on the occurrence of the word:

    # Remove all up to the first occurrence of the word including it (non-greedy):
    ^.*?word
    # Remove all up to the last occurrence of the word including it (greedy):
    ^.*word
    

    See the non-greedy regex demo and a greedy regex demo.

    The ^ matches the start of the string. .*? matches any zero or more characters as few times as possible, while .* matches as many as possible; pass the re.DOTALL flag so that . can also match newlines. Then word matches and consumes the word, i.e. adds it to the match and advances the regex index.

    Note the use of re.escape(up_to_word): if your up_to_word does not consist solely of alphanumeric and underscore characters, it is safer to use re.escape so that special characters like (, [, ?, etc. cannot prevent the regex from finding a valid match.

    See the Python demo:

    import re
    
    date_div = "Blah blah\nblah, Updated: Aug. 23, 2012 Blah blah Updated: Feb. 13, 2019"
    
    up_to_word = "Updated:"
    rx_to_first = r'^.*?{}'.format(re.escape(up_to_word))
    rx_to_last = r'^.*{}'.format(re.escape(up_to_word))
    
    print("Remove all up to the first occurrence of the word including it:")
    print(re.sub(rx_to_first, '', date_div, flags=re.DOTALL).strip())
    print("Remove all up to the last occurrence of the word including it:")
    print(re.sub(rx_to_last, '', date_div, flags=re.DOTALL).strip())
    

    Output:

    Remove all up to the first occurrence of the word including it:
    Aug. 23, 2012 Blah blah Updated: Feb. 13, 2019
    Remove all up to the last occurrence of the word including it:
    Feb. 13, 2019
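
    To see why re.escape matters, here is a small sketch with a word that contains regex metacharacters. The sample string and the word "(Updated)" are made up for illustration:

```python
import re

text = "Blah blah (Updated) Aug. 23, 2012"
word = "(Updated)"

# Without escaping, the parentheses form a capturing group, so the pattern
# only matches up to "Updated" and leaves the closing ")" behind.
unescaped = re.sub(r'^.*?' + word, '', text, flags=re.DOTALL)

# With re.escape, the parentheses are matched literally.
escaped = re.sub(r'^.*?' + re.escape(word), '', text, flags=re.DOTALL)

print(repr(unescaped))        # ') Aug. 23, 2012'
print(repr(escaped.strip()))  # 'Aug. 23, 2012'
```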
    
  • 2021-01-19 01:53

    You can use a lookahead:

    import re
    date_div = "Blah blah blah, Updated: Aug. 23, 2012"
    extracted_date = re.sub('^(.*)(?=Updated)',"", date_div)
    print(extracted_date)
    

    OUTPUT

    Updated: Aug. 23, 2012
    

    EDIT
    If MattDMo's comment below is correct and you want to remove the "Updated: " as well, you can do:

    extracted_date = re.sub('^(.*Updated: )',"", date_div)
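
    Since the goal is to extract the date rather than delete the prefix, another option is to capture it directly. This is only a sketch, assuming the date runs to the end of the line; re.search returns None when "Updated:" is absent, so the result is checked before use:

```python
import re

date_div = "Blah blah blah, Updated: Aug. 23, 2012"

# Capture everything after "Updated:" up to the end of the line.
match = re.search(r'Updated:\s*(.+)', date_div)
extracted_date = match.group(1).strip() if match else None

print(extracted_date)  # Aug. 23, 2012
```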
    