python regex fails to identify markdown links

Submitted by 北城以北 on 2021-02-06 13:55:00

Question


I am trying to write a regex in Python to find URLs in a Markdown text string. Once a URL is found, I want to check whether it is wrapped in markdown link syntax, i.e. [text](url). I am having problems with the latter: I am using a regex, link_exp, to search, but the results are not what I expected and I cannot get my head around it.

This is probably something simple that I am not seeing.

Here is the code, along with an explanation of the link_exp regex:

import re

text = '''
[Vocoder](http://en.wikipedia.org/wiki/Vocoder )
[Turing]( http://en.wikipedia.org/wiki/Alan_Turing)
[Autotune](http://en.wikipedia.org/wiki/Autotune)
http://en.wikipedia.org/wiki/The_Voder
'''

urls = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', text) #find all urls
for url in urls:
    url = re.escape(url)
    link_exp = re.compile('\[.*\]\(\s*{0}\s*\)'.format(url) ) # expression with url wrapped in link syntax.     
    search = re.search(link_exp, text)
    if search != None:
        print url

# The expression should translate to:
# \[   - literal [
# .*   - any characters, or none
# \]   - literal ]
# \(   - literal (
# \s*  - optional whitespace
# {0}  - the url
# \s*  - optional whitespace
# \)   - literal )
# NOTE: I am including whitespace to cover cases like [foo]( http://www.foo.sexy   )

The output I get is only:

http\:\/\/en\.wikipedia\.org\/wiki\/Vocoder

which means the expression only finds the link that has whitespace before the closing parenthesis. This is not what I want; links without whitespace before the closing parenthesis should be matched as well.

Do you think you can help me with this one?
Cheers


Answer 1:


The problem here is your regex for pulling out the URLs in the first place, which includes the closing ) inside the matched URLs. This means you end up looking for the closing parenthesis twice. This happens for every link except the first one (the space before the closing parenthesis saves you there).

I'm not quite sure what each part of your URL regex is trying to do, but the portion that says [$-_@.&+] includes a range from $ (ASCII 36) to _ (ASCII 95), which covers a huge number of characters you probably don't mean to allow, including the ).
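As a quick illustration (this snippet is not from the original answer, just a hedged check written in Python 3), you can confirm that ) falls inside that $ to _ range and therefore gets swallowed by the URL regex:

import re

# ')' is ASCII 41, which lies between '$' (36) and '_' (95),
# so the character class [$-_@.&+] matches it.
print(ord('$'), ord(')'), ord('_'))        # 36 41 95
print(bool(re.match(r'[$-_@.&+]', ')')))   # True

# Consequently the URL regex from the question keeps the trailing ')':
url_pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
print(re.findall(url_pattern, '[Autotune](http://en.wikipedia.org/wiki/Autotune)'))
# ['http://en.wikipedia.org/wiki/Autotune)']  <- note the trailing ')'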

Instead of looking for URLs, and then checking to see if they are in the link, why not do both at once? This way your URL regex can be lazier, because the extra constraints make it less likely to be anything else:

# Anything that isn't a square closing bracket
name_regex = "[^]]+"
# http:// or https:// followed by anything but a closing paren
url_regex = "http[s]?://[^)]+"

markup_regex = '\[({0})]\(\s*({1})\s*\)'.format(name_regex, url_regex)

for match in re.findall(markup_regex, text):
    print match

Result:

('Vocoder', 'http://en.wikipedia.org/wiki/Vocoder ')
('Turing', 'http://en.wikipedia.org/wiki/Alan_Turing')
('Autotune', 'http://en.wikipedia.org/wiki/Autotune')

You could probably improve the URL regex if you need to be stricter.
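For instance, here is a minimal Python 3 sketch of a stricter variant (the extra constraints are my own assumptions, not part of the original answer): it requires a non-empty link text and refuses whitespace inside the URL itself, so the captured URL no longer carries the padding.

import re

markup_regex = re.compile(
    r'\[([^\]]+)\]'        # [name]: one or more characters that are not ]
    r'\(\s*'               # opening paren plus optional padding
    r'(https?://[^\s)]+)'  # the url: no whitespace, no closing paren
    r'\s*\)'               # optional padding plus closing paren
)

text = '''
[Vocoder](http://en.wikipedia.org/wiki/Vocoder )
[Turing]( http://en.wikipedia.org/wiki/Alan_Turing)
[Autotune](http://en.wikipedia.org/wiki/Autotune)
http://en.wikipedia.org/wiki/The_Voder
'''

for name, url in markup_regex.findall(text):
    print(name, url)
# Vocoder http://en.wikipedia.org/wiki/Vocoder
# Turing http://en.wikipedia.org/wiki/Alan_Turing
# Autotune http://en.wikipedia.org/wiki/Autotune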



Source: https://stackoverflow.com/questions/23394608/python-regex-fails-to-identify-markdown-links
