I am trying to write a regex in Python to find URLs in a Markdown text string. Once a URL is found, I want to check whether it is wrapped by a Markdown link: text I am having pro
The problem here is your regex for pulling out the URLs in the first place, which is including the closing parenthesis ) inside the URLs. This means you are looking for the closing parenthesis twice. This happens for every link except the first one (the space saves you there).
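I'm not quite sure what each part of your URL regex is trying to do, but the portion that says [$-_@.&+] contains a range: inside a character class, a hyphen between two characters means "everything from the first to the second". Here that range runs from $ (ASCII 36) to _ (ASCII 95, which is 137 in octal), so it covers a large number of characters you probably don't mean, including the closing parenthesis ).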
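To see the range concretely, here is a small sketch. The test string "a)b(c" is made up for illustration and is not from the question.

```python
import re

# Inside a character class, $-_ means "every character from $ to _".
low, high = ord("$"), ord("_")
covered = [chr(c) for c in range(low, high + 1)]
print(low, high)        # 36 95
print(")" in covered)   # True

# The unescaped range matches both parentheses in a made-up test string.
loose = re.findall(r"[$-_]", "a)b(c")
# Escaping the hyphen keeps only the literal characters $, - and _.
strict = re.findall(r"[$\-_]", "a)b(c")
print(loose, strict)    # [')', '('] []
```

If the hyphen was meant literally, escaping it (or moving it to the start or end of the class) removes the accidental range.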
Instead of looking for URLs, and then checking to see if they are in the link, why not do both at once? This way your URL regex can be looser, because the extra constraints from the surrounding link syntax make it less likely to match anything else:
import re

# text is the Markdown string from the question
# Anything that isn't a closing square bracket
name_regex = r"[^]]+"
# http:// or https:// followed by anything but a closing paren
url_regex = r"http[s]?://[^)]+"
# A Markdown link, [name](url), allowing whitespace around the URL
markup_regex = r'\[({0})]\(\s*({1})\s*\)'.format(name_regex, url_regex)
for match in re.findall(markup_regex, text):
    print(match)
Result:
('Vocoder', 'http://en.wikipedia.org/wiki/Vocoder ')
('Turing', 'http://en.wikipedia.org/wiki/Alan_Turing')
('Autotune', 'http://en.wikipedia.org/wiki/Autotune')
You could probably improve the URL regex if you need to be stricter.
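As one possible tightening, the sketch below also stops the URL at whitespace, so a space before the closing parenthesis no longer ends up in the captured URL. The sample text is hypothetical (the question's original string is not reproduced here), and the stricter pattern is only an assumption about what "stricter" might mean for your data.

```python
import re

# Hypothetical sample input, loosely modelled on the links in the result above.
text = (
    "A [Vocoder](http://en.wikipedia.org/wiki/Vocoder ) and "
    "[Turing]( http://en.wikipedia.org/wiki/Alan_Turing) walk into "
    "[Autotune](http://en.wikipedia.org/wiki/Autotune)."
)

# Anything that isn't a closing square bracket
name_regex = r"[^]]+"
# Stricter than [^)]+: stop at whitespace as well as at a closing paren
url_regex = r"https?://[^\s)]+"
markup_regex = r"\[({0})]\(\s*({1})\s*\)".format(name_regex, url_regex)

links = re.findall(markup_regex, text)
for name, url in links:
    print(name, url)
```

Note that this still rejects URLs that themselves contain parentheses, so it is a trade-off rather than a complete URL matcher.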