Convert URLs in a string to links, except when they are inside an attribute of an HTML tag

我在风中等你 2020-12-16 01:21

I am trying to convert all URLs in a textarea input ($_POST['content']) to links.

$content = preg_replace('!(\s|^)((https?://)+[a-z0-9_./


        
4 Answers
  • 2020-12-16 01:59

    Your code as it is should not be much of a problem within iframes and so on, because in there you usually have a " in front of your URL and not a space, as your pattern requires.

    However, here is a different solution. It might not work 100% if you have a single < or > within HTML comments or something similar (and I do not know whether this is a problem for you or not). In any other case, it should serve you well. It uses a negative lookahead to make sure that there is no closing > before the next opening < (because that would mean you are inside a tag).

    $content = preg_replace('$(\s|^)(https?://[a-z0-9_./?=&-]+)(?![^<>]*>)$i', ' <a href="$2" target="_blank">$2</a> ', $content." ");
    $content = preg_replace('$(\s|^)(www\.[a-z0-9_./?=&-]+)(?![^<>]*>)$i', ' <a href="http://$2" target="_blank">$2</a> ', $content." ");
    

    In case you are not familiar with this technique, here is a bit more elaboration.

    (?!        # starts the lookahead assertion; now your pattern will only match, if this subpattern does not match
    [^<>]      # any character that is neither < nor >; the > is not strictly necessary but might help for optimization
    *          # arbitrarily many of those characters (contiguous, so no < or > in between)
    >          # the closing >
    )          # ends the lookahead subpattern
    

    Note that I changed the regex delimiters, because I am now using ! within the regex.
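
    To see the lookahead at work, here is a minimal sketch. The helper name linkify_outside_tags() and the sample input are assumptions made for illustration, and ~ is used as the delimiter instead of $; the pattern itself is the one explained above.

```php
<?php
// Hypothetical helper (not from the answer): wraps bare http(s) URLs in <a> tags,
// but skips any URL that is followed by a '>' with no '<' or '>' in between,
// i.e. a URL that sits inside a tag such as <img src="...">.
function linkify_outside_tags(string $html): string
{
    return preg_replace(
        '~(https?://[a-z0-9_./?=&#-]+)(?![^<>]*>)~i',
        '<a href="$1" target="_blank">$1</a>',
        $html
    );
}

$input = 'See http://example.com/a and <img src="http://example.com/b.png"> here';

// Only the first URL is wrapped; the one inside the src attribute is left alone.
echo linkify_outside_tags($input), "\n";
```

    The URL in the src attribute is skipped because the characters after it run straight into the tag's closing >, which is exactly what the lookahead rejects.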

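    Unless you need the first subpattern (\s|^) for the URLs outside of tags as well, you can now remove it, too (and decrease the capture numbers in the replacement). Be aware that, without it, the www pattern also matches the visible text of a link that the first call just created for a URL such as http://www.example.com, so it needs a guard of its own, for example a (?<!://) lookbehind.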

    $content = preg_replace('$(https?://[a-z0-9_./?=&-]+)(?![^<>]*>)$i', ' <a href="$1" target="_blank">$1</a> ', $content." ");
    // (?<!://) keeps this call from re-linking the "www." part of link text created by the call above
    $content = preg_replace('$(?<!://)(www\.[a-z0-9_./?=&-]+)(?![^<>]*>)$i', '<a href="http://$1" target="_blank">$1</a> ', $content." ");
    

    And lastly... do you intend not to replace URLs that contain anchors at the end? E.g. www.hello.com/index.html#section1? If you missed this by accident, add the # to your allowed URL characters:

    $content = preg_replace('$(https?://[a-z0-9_./?=&#-]+)(?![^<>]*>)$i', ' <a href="$1" target="_blank">$1</a> ', $content." ");
    // (?<!://) keeps this call from re-linking the "www." part of link text created by the call above
    $content = preg_replace('$(?<!://)(www\.[a-z0-9_./?=&#-]+)(?![^<>]*>)$i', '<a href="http://$1" target="_blank">$1</a> ', $content." ");
    

    EDIT: Also, what about + and %? A few other characters are also allowed to appear in a URL without being encoded (RFC 3986 defines the full set). END OF EDIT

    I think this should do the trick for you. However, if you could provide an example that shows working and broken URLs (with the code you have), we could actually provide solutions that are tested to work for all of your cases.

    One final thought. The proper solution would be to use a DOM parser. Then you could simply apply the regex you already have only to text nodes. However, your concern for the HTML structure is very restricted, and that makes your problem regular again (as long as you do not have unmatched '<' or '>' in HTML comments or JavaScript or CSS on the page). If you do have those special cases, you should really look into a DOM parser. None of the solutions presented here (so far) will be safe in that case.
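
    As a rough sketch of that DOM-based route (an illustration under assumptions, not a tested drop-in): the helper name linkify_text_nodes() is hypothetical, the URL pattern is the simple one from above, and PHP's DOM extension must be available. The regex is applied to text nodes only, so attribute values are never touched.

```php
<?php
// Sketch only: linkify_text_nodes() is a hypothetical helper, not part of the answer.
function linkify_text_nodes(string $html): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence warnings about sloppy markup
    // The XML declaration is a common trick to make loadHTML() read the input as UTF-8;
    // the wrapper <div> gives the fragment a single known root element.
    $doc->loadHTML('<?xml encoding="UTF-8"><div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Text nodes only, and none that are already inside a link, script or style element.
    $query = '//text()[not(ancestor::a) and not(ancestor::script) and not(ancestor::style)]';

    // Copy the node list first, because the loop below replaces nodes.
    foreach (iterator_to_array($xpath->query($query)) as $node) {
        $parts = preg_split(
            '~(https?://[a-z0-9_./?=&#-]+)~i',
            $node->nodeValue,
            -1,
            PREG_SPLIT_DELIM_CAPTURE
        );
        if (count($parts) < 2) {
            continue; // no URL in this text node
        }
        $fragment = $doc->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($i % 2 === 1) { // odd indexes hold the captured URLs
                $a = $doc->createElement('a');
                $a->setAttribute('href', $part);
                $a->setAttribute('target', '_blank');
                $a->appendChild($doc->createTextNode($part));
                $fragment->appendChild($a);
            } elseif ($part !== '') {
                $fragment->appendChild($doc->createTextNode($part));
            }
        }
        $node->parentNode->replaceChild($fragment, $node);
    }

    // Serialize the children of the wrapper <div>, not the wrapper itself.
    $out = '';
    foreach ($doc->getElementsByTagName('div')->item(0)->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo linkify_text_nodes('See http://example.com/a and <img src="http://example.com/b.png">'), "\n";
```

    Note that the parser normalizes the markup it is given, so the output may differ cosmetically from the input even where no link was added.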

  • 2020-12-16 02:00
    1. In my opinion, a URL is everything that starts with https?:// and ends at a space or at the end of the line (vertical whitespace, the so-called newline).
    2. Because of the first point, images, links etc. will not be replaced, because their URLs are always preceded by " or > (unless the link starts with a space, as in <a href=" http...">, but that is invalid HTML).
    3. The m modifier makes ^ and $ match at the start and end of every line (so that the matching described in the first point works).
    4. The nl2br() function should be called after the replacement (because of links that start at the beginning of a line).
    5. A space before and after the link is added only if it originally exists in $content (see $1 and $3 in the second parameter of the preg_replace() function).
    6. This solution supports domain names with special characters, like www.moški.si.


    Code:

    <?php
    
    $content =
        preg_replace(
            '~(\s|^)(https?://.+?)(\s|$)~im', 
            '$1<a href="$2" target="_blank">$2</a>$3', 
            $content
        );
    $content = 
        preg_replace(
            '~(\s|^)(www\..+?)(\s|$)~im', 
            '$1<a href="http://$2" target="_blank">$2</a>$3', 
            $content
        );
    $content = nl2br($content);
    


    Edit:

    Example for links without the https?:// prefix, combined into a single preg_replace() call (patterns and replacements are passed as arrays):

    $content = 
        preg_replace(
            array(
                '~(\s|^)(www\..+?)(\s|$)~im', 
                '~(\s|^)(https?://)(.+?)(\s|$)~im', 
            ),
            array(
                '$1http://$2$3', 
                '$1<a href="$2$3" target="_blank">$3</a>$4', 
            ),
            $content
        );
    $content = nl2br($content);
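
    One caveat of the patterns above is worth a small sketch: the trailing (\s|$) consumes the separator, so when two URLs are divided by a single space, the second one has no whitespace left to match its leading (\s|^). The helper names below are hypothetical, and the lookahead variant is an assumed alternative rather than part of the answer.

```php
<?php
// Hypothetical helper names; the first pattern is the one from the answer above.
function linkify_consuming(string $text): string
{
    // (\s|$) consumes the whitespace that follows the URL.
    return preg_replace(
        '~(\s|^)(https?://.+?)(\s|$)~im',
        '$1<a href="$2" target="_blank">$2</a>$3',
        $text
    );
}

function linkify_lookahead(string $text): string
{
    // (?=\s|$) only checks for the separator and leaves it in place
    // for the next match, so $3 is no longer needed in the replacement.
    return preg_replace(
        '~(\s|^)(https?://.+?)(?=\s|$)~im',
        '$1<a href="$2" target="_blank">$2</a>',
        $text
    );
}

$text = 'http://a.example http://b.example';

echo linkify_consuming($text), "\n"; // only the first URL becomes a link
echo linkify_lookahead($text), "\n"; // both URLs become links
```

    If adjacent URLs cannot occur in your input, the original patterns behave the same as the lookahead variant.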
    

  • 2020-12-16 02:12

    Let me suggest something less straightforward: split the input text into HTML and non-HTML parts, process the non-HTML parts with your regexp, and then combine the pieces back into one string. Something like:

      <?php
      $chunks = preg_split('/(<.*>)/Ums', $_POST['content'], -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
      $result = '';
      foreach ($chunks as $chunk) {
        if (substr($chunk,0,1) != '<') {
          /* do your processing on $chunk */
        }
        $result .= $chunk;
      }
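
    Filled in with a simple URL replacement, the idea could look like the sketch below. The function name linkify_chunks() and the URL pattern are assumptions for illustration; the split pattern is the one from the snippet above.

```php
<?php
// Hypothetical helper: linkifies URLs only in the chunks that are not tags.
function linkify_chunks(string $html): string
{
    // Split into tags and text; the capture group keeps the tags in the result.
    $chunks = preg_split(
        '/(<.*>)/Ums',
        $html,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );

    $result = '';
    foreach ($chunks as $chunk) {
        if (substr($chunk, 0, 1) !== '<') {
            // Text chunk: wrap every http(s) URL in a link ($0 is the whole match).
            $chunk = preg_replace(
                '~\bhttps?://[a-z0-9_./?=&#-]+~i',
                '<a href="$0" target="_blank">$0</a>',
                $chunk
            );
        }
        $result .= $chunk; // tag chunks are appended untouched
    }
    return $result;
}

echo linkify_chunks('Visit http://example.com/a <img src="http://example.com/b.png"> now'), "\n";
```

    One limitation of this sketch: the visible text of an existing <a> element is also a text chunk, so a URL written there would be wrapped a second time.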
    

    Some additional advice:

    1. Try to store the source text and do the transformation when displaying it. This lets you improve or fix your rendering code if you later find a new problem or idea.
    2. (https?://)+ doesn't need the parentheses or the +, because with them it matches "https://https://some.com". Just use https?://[a-z0-9_./?=&-]+
    3. The same goes for (www.)+ :)
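
    A quick way to check the second point is to run both patterns against the doubled scheme. The variable names are made up for this sketch, and ~ is used as the delimiter.

```php
<?php
$text = 'https://https://some.com';

// With the repeated group, the doubled scheme is accepted as one URL.
preg_match('~(https?://)+[a-z0-9_./?=&-]+~i', $text, $withRepeat);

// Without it, the match stops at the second ':' because ':' is not
// in the allowed character class.
preg_match('~https?://[a-z0-9_./?=&-]+~i', $text, $single);

echo $withRepeat[0], "\n"; // https://https://some.com
echo $single[0], "\n";     // https://https
```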
  • 2020-12-16 02:14

    This has been done hundreds of times before. On this page, both m-buettner's and glavić's answers work fine, although I prefer glavić's shorter expression.

    Here's a good PHP resource for this: http://code.iamcal.com/php/lib_autolink/

    Duplicates on Stack Overflow:

    • How do I linkify urls in a string with php?
    • PHP Linkify Links In Content

    Decent in-depth article: http://buildinternet.com/2010/05/how-to-automatically-linkify-text-with-php-regular-expressions/
