Detect URLs in a string and wrap with “<a

后端 未结 2 1357
伪装坚强ぢ
伪装坚强ぢ 2021-01-04 20:24

I am looking to write something that seems like it should be easy enough, but for whatever reason I\'m having a tough time getting my head around it.

I am looking to

相关标签:
2条回答
  • 2021-01-04 20:59

    The "regex magic" you need is just sub (which does a substitution):

    def encode_string_with_links(unencoded_string):
      return URL_REGEX.sub(r'<a href="\1">\1</a>', unencoded_string)
    

    URL_REGEX could be something like:

    URL_REGEX = re.compile(r'''((?:mailto:|ftp://|http://)[^ <>'"{}|\\^`[\]]*)''')
    

    This is a pretty loose regex for URLs: it allows mailto, http and ftp schemes, and after that pretty much just keeps going until it runs into an "unsafe" character (except percent, which you want to allow for escapes). You could make it more strict if you need to. For example, you could require that percents are followed by a valid hex escape, or only allow one pound sign (for the fragment) or enforce the order between query parameters and fragments. This should be enough to get you started, though.

    0 讨论(0)
  • 2021-01-04 21:02

    Googled solutions:

    #---------- find_urls.py----------#
    # Functions to identify and extract URLs and email addresses
    
    import re
    
    def fix_urls(text):
        pat_url = re.compile(  r'''
                         (?x)( # verbose identify URLs within text
             (http|ftp|gopher) # make sure we find a resource type
                           :// # ...needs to be followed by colon-slash-slash
                (\w+[:.]?){2,} # at least two domain groups, e.g. (gnosis.)(cx)
                          (/?| # could be just the domain name (maybe w/ slash)
                    [^ \n\r"]+ # or stuff then space, newline, tab, quote
                        [\w/]) # resource name ends in alphanumeric or slash
             (?=[\s\.,>)'"\]]) # assert: followed by white or clause ending
                             ) # end of match group
                               ''')
        pat_email = re.compile(r'''
                        (?xm)  # verbose identify URLs in text (and multiline)
                     (?=^.{11} # Mail header matcher
             (?<!Message-ID:|  # rule out Message-ID's as best possible
                 In-Reply-To)) # ...and also In-Reply-To
                        (.*?)( # must grab to email to allow prior lookbehind
            ([A-Za-z0-9-]+\.)? # maybe an initial part: DAVID.mertz@gnosis.cx
                 [A-Za-z0-9-]+ # definitely some local user: MERTZ@gnosis.cx
                             @ # ...needs an at sign in the middle
                  (\w+\.?){2,} # at least two domain groups, e.g. (gnosis.)(cx)
             (?=[\s\.,>)'"\]]) # assert: followed by white or clause ending
                             ) # end of match group
                               ''')
    
        for url in re.findall(pat_url, text):
           text = text.replace(url[0], '<a href="%(url)s">%(url)s</a>' % {"url" : url[0]})
    
        for email in re.findall(pat_email, text):
           text = text.replace(email[1], '<a href="mailto:%(email)s">%(email)s</a>' % {"email" : email[1]})
    
        return text
    
    if __name__ == '__main__':
        print fix_urls("test http://google.com asdasdasd some more text")
    

    EDIT: Adjusted to your needs

    0 讨论(0)
提交回复
热议问题