Python regex not to match http://

给你一囗甜甜゛ posted on 2019-12-21 12:23:52

Question


I am having trouble matching and replacing certain words only where they are not contained in an http:// URL.

Present Regex:

 http://.*?\s+

This matches URLs such as http://www.egg1.com and http://www.egg2.com.

I need a regex that matches certain words only where they appear outside an http:// URL.

Example:

"This is a sample. http://www.egg1.com and http://egg2.com. This regex will only match 
 this egg1 and egg2 and not the others contained inside http:// "

 Match: egg1 egg2

 Replaced: replaced1 replaced2

Final Output :

 "This is a sample. http://www.egg1.com and http://egg2.com. This regex will only 
  match this replaced1 and replaced2 and not the others contained inside http:// "

QUESTION: I need to match certain patterns (egg1 and egg2 in the example) unless they are part of an http:// URL. Do not match egg1 and egg2 if they appear inside an http:// URL.


Answer 1:


One solution I can think of is to form a combined pattern for HTTP-URLs and your pattern, then filter the matches accordingly:

import re

t = "http://www.egg1.com http://egg2.com egg3 egg4"

p = re.compile(r'(http://\S+)|(egg\d)')
for url, egg in p.findall(t):
  # URL matches leave the second group empty, so keep only the egg matches.
  if egg:
    print(egg)

prints:

egg3
egg4

UPDATE: To use this idiom with re.sub(), just supply a filter function:

p = re.compile(r'(http://\S+)|(egg(\d+))')

def repl(match):
    if match.group(2):
        return 'spam{0}'.format(match.group(3))
    return match.group(0)

print(p.sub(repl, t))

prints:

http://www.egg1.com http://egg2.com spam3 spam4
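
The same idiom can be applied to the sample text from the question. The following is a minimal sketch, not a definitive implementation: the `REPLACEMENTS` mapping and the `replace_outside_urls` helper are hypothetical names introduced here for illustration, and the code assumes Python 3.

```python
import re

# Hypothetical mapping, taken from the question's example.
REPLACEMENTS = {"egg1": "replaced1", "egg2": "replaced2"}

# First alternative consumes a whole URL; second captures a target word.
WORDS = "|".join(re.escape(word) for word in REPLACEMENTS)
PATTERN = re.compile(r"http://\S+|(" + WORDS + r")")


def replace_outside_urls(text):
    """Replace target words, leaving anything inside an http:// URL alone."""
    def repl(match):
        word = match.group(1)
        if word is None:
            # The URL alternative matched: return it unchanged.
            return match.group(0)
        return REPLACEMENTS[word]
    return PATTERN.sub(repl, text)


sample = ("This is a sample. http://www.egg1.com and http://egg2.com. "
          "This regex will only match this egg1 and egg2 and not the "
          "others contained inside http:// ")
print(replace_outside_urls(sample))
```

Because the URL alternative comes first and consumes the whole URL, the regex engine never gets a chance to match egg1 or egg2 inside it.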



Answer 2:


This matches http:// URLs without capturing them, so only egg1 ends up in the capture group:

(?:http://.*?\s+)|(egg1)
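
One detail worth checking with this pattern is that a URL is only consumed when whitespace follows it, so a URL at the very end of the string is not skipped. The sketch below compares the pattern with an assumed variant that uses `\S+` instead; the `words_outside_urls` helper and the sample text are made up for illustration.

```python
import re

text = "http://www.egg1.com egg1 http://egg1.org"

# Pattern from the answer: a URL must be followed by whitespace to be consumed.
original = re.compile(r"(?:http://.*?\s+)|(egg1)")
# Assumed variant: \S+ consumes the URL even at the end of the string.
variant = re.compile(r"(?:http://\S+)|(egg1)")


def words_outside_urls(pattern, s):
    # findall() returns group 1 for every match; URL matches yield ''.
    return [word for word in pattern.findall(s) if word]


print(words_outside_urls(original, text))  # ['egg1', 'egg1']
print(words_outside_urls(variant, text))   # ['egg1']
```

With the original pattern, the trailing http://egg1.org is not consumed, so the egg1 inside it is reported as well.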



Answer 3:


You need to precede your pattern by a negative lookbehind assertion:

(?<!http://)egg[0-9]

In this regular expression, every time the regex engine finds text matching egg[0-9], it looks back to verify that the text immediately before it does not match http://. A negative lookbehind assertion starts with (?<! and ends with ). Whatever is between these delimiters must not immediately precede the following pattern, and it is not included in the result.

How to use it in your case:

>>> import re
>>> regex = re.compile(r'(?<!http://)egg[0-9]')
>>> a = "Example: http://egg1.com egg2 http://egg3.com egg4foo"
>>> regex.findall(a)
['egg2', 'egg4']
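
Since the lookbehind only inspects the seven characters directly in front of the match, it may be worth checking how it behaves when something such as www. sits between http:// and the word. A small sketch, using made-up input strings:

```python
import re

# Pattern from the answer above: rejects a match only when "http://"
# sits directly in front of it.
regex = re.compile(r"(?<!http://)egg[0-9]")

direct = regex.findall("http://egg1.com egg2")
with_www = regex.findall("http://www.egg1.com egg2")

print(direct)    # ['egg2']
print(with_www)  # ['egg1', 'egg2']
```

In the second string, egg1 is preceded by www. rather than http://, so the assertion succeeds and the word inside the URL is matched.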



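Answer 4:


Extending brandizzi's answer, I would just change the regex to this:

(?<!http://[\w\._-]*)(egg1|egg2)

Note that Python's built-in re module requires lookbehind assertions to have a fixed width, so it rejects this pattern. It needs an engine that supports variable-width lookbehind, such as the third-party regex module.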
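
A quick way to see how the standard library handles a lookbehind pattern is to try compiling it. The sketch below assumes Python 3 and the built-in re module only; `compiles_with_re` is a hypothetical helper name.

```python
import re


def compiles_with_re(pattern):
    """Return True if the built-in re module accepts the pattern."""
    try:
        re.compile(pattern)
    except re.error as exc:
        print("rejected:", exc)
        return False
    return True


# Fixed-width lookbehind: every possible match of "http://" is 7 characters.
print(compiles_with_re(r"(?<!http://)egg[0-9]"))

# Variable-width lookbehind: the * makes the width unbounded.
print(compiles_with_re(r"(?<!http://[\w\._-]*)(egg1|egg2)"))
```

When a variable-width lookbehind is not available, matching the URL as a separate alternative and filtering in a replacement function stays within the standard library.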


Source: https://stackoverflow.com/questions/6859763/python-regex-not-to-match-http
