I am trying to convert all URLs in a textarea input ($_POST['content']) to links.
$content = preg_replace('!(\s|^)((https?://)+[a-z0-9_./
Your code as it is should not be much of a problem within iframes and so on, because there you usually have a " in front of your URL and not a space, as your pattern requires.
However, here is a different solution. It might not work 100% if you have a single < or > within HTML comments or something similar (and I do not know whether this is a problem for you or not). But in any other case, it should serve you well. It uses a negative lookahead to make sure that there is no closing > before any opening < (because this would mean you are inside a tag).
$content = preg_replace('$(\s|^)(https?://[a-z0-9_./?=&-]+)(?![^<>]*>)$i', ' <a href="$2" target="_blank">$2</a> ', $content." ");
$content = preg_replace('$(\s|^)(www\.[a-z0-9_./?=&-]+)(?![^<>]*>)$i', ' <a href="http://$2" target="_blank">$2</a> ', $content." ");
In case you are not familiar with this technique, here is a bit more elaboration.
(?! # starts the lookahead assertion; now your pattern will only match, if this subpattern does not match
[^<>] # any character that is neither < nor >; the > is not strictly necessary but might help for optimization
* # arbitrarily many of those characters (but in a row; so not a single < or > in between)
> # the closing >
) # ends the lookahead subpattern
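To see the lookahead in action, here is a minimal sketch. The function name linkify_outside_tags is made up for illustration, and ~ is used as the delimiter instead of $ purely for readability; the pattern itself is the one discussed above with # added to the allowed characters.

```php
<?php
// Hypothetical helper, not part of any library.
function linkify_outside_tags(string $html): string
{
    // Match a URL only if it is NOT followed by a '>' that comes
    // before the next '<', i.e. only if we are not inside a tag.
    $pattern = '~(https?://[a-z0-9_./?=&#-]+)(?![^<>]*>)~i';

    return preg_replace($pattern, '<a href="$1" target="_blank">$1</a>', $html);
}

// The URL inside the src attribute is left alone, the bare URL is linked.
$input = '<img src="http://example.com/a.png"> see http://example.com/page';
echo linkify_outside_tags($input), "\n";
```

Note that the lookahead only protects the inside of a tag. A URL that is the text of an already existing link (between <a ...> and </a>) is not inside a tag and would still be matched.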
Note that I changed the regex delimiters from ! to $, because I am now using ! within the regex.
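Unless you need the first subpattern (\s|^) for the URLs outside of tags as well, you can now remove that, too (and decrease the capture variables in the replacement). Be aware that without it the www pattern can also match the www. part of the link text that the http pattern has just generated (as in http://www.example.com), which nests one link inside another; keeping (\s|^) on the www pattern avoids that.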
$content = preg_replace('$(https?://[a-z0-9_./?=&-]+)(?![^<>]*>)$i', ' <a href="$1" target="_blank">$1</a> ', $content." ");
$content = preg_replace('$(www\.[a-z0-9_./?=&-]+)(?![^<>]*>)$i', ' <a href="http://$1" target="_blank">$1</a> ', $content." ");
And lastly... do you intend not to replace URLs that contain anchors at the end, e.g. www.hello.com/index.html#section1? If you missed this by accident, add the # to your allowed URL characters:
$content = preg_replace('$(https?://[a-z0-9_./?=&#-]+)(?![^<>]*>)$i', ' <a href="$1" target="_blank">$1</a> ', $content." ");
$content = preg_replace('$(www\.[a-z0-9_./?=&#-]+)(?![^<>]*>)$i', ' <a href="http://$1" target="_blank">$1</a> ', $content." ");
EDIT: Also, what about + and %? There are also a few other characters that are allowed to appear in a URL without being encoded. See this. END OF EDIT
I think this should do the trick for you. However, if you could provide an example that shows working and broken URLs (with the code you have), we could actually provide solutions that are tested to work for all of your cases.
One final thought. The proper solution would be to use a DOM parser. Then you could simply apply the regex you already have only to text nodes. However, your concern for the HTML structure is very restricted, and that makes your problem regular again (as long as you do not have unmatched '<' or '>' in HTML comments or JavaScript or CSS on the page). If you do have those special cases, you should really look into a DOM parser. None of the solutions presented here (so far) will be safe in that case.
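For reference, the DOM approach could look roughly like the sketch below. It assumes PHP's dom extension is available; the function name linkify_text_nodes and the wrapper id linkify-root are made up for illustration, the URL pattern is simply borrowed from above, and character encoding is not dealt with (loadHTML() has its own ideas about non-ASCII input).

```php
<?php
// Hypothetical helper: linkify URLs in text nodes only, skipping text
// that already sits inside <a>, <script> or <style>.
function linkify_text_nodes(string $html): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // keep parser warnings out of the output
    // Wrap the fragment so there is one known root element to serialize later.
    $doc->loadHTML('<div id="linkify-root">' . $html . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $root  = $xpath->query('//div[@id="linkify-root"]')->item(0);
    $query = './/text()[not(ancestor::a) and not(ancestor::script) and not(ancestor::style)]';

    // Copy the node list into an array before changing the tree.
    foreach (iterator_to_array($xpath->query($query, $root)) as $node) {
        // Odd indexes hold the captured URLs, even indexes the text in between.
        $parts = preg_split('~(https?://[a-z0-9_./?=&#-]+)~i', $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($parts) < 2) {
            continue;                   // no URL in this text node
        }
        $fragment = $doc->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($i % 2 === 1) {
                $a = $doc->createElement('a');
                $a->setAttribute('href', $part);
                $a->setAttribute('target', '_blank');
                $a->appendChild($doc->createTextNode($part));
                $fragment->appendChild($a);
            } elseif ($part !== '') {
                $fragment->appendChild($doc->createTextNode($part));
            }
        }
        $node->parentNode->replaceChild($fragment, $node);
    }

    // Serialize only the children of the wrapper, not the wrapper itself.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo linkify_text_nodes('see http://a.b/c and <a href="http://x.y/">http://x.y/</a>'), "\n";
```

Because the parser decides what is text and what is markup, stray < or > in comments, scripts or styles should no longer matter here; the price is that the HTML comes back re-serialized by libxml rather than byte-for-byte as it went in.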
- A URL starts with https?:// and ends with a space or the end of the line (vertical whitespace, the so-called newline).
- The leading whitespace is captured and put back in front of the link (otherwise <a href=" http..."> starts with the space, but this is invalid HTML).
- The m modifier makes ^ and $ match at every line (so that the matching described in the first point will work).
- nl2br() should be used after the replacement (because of the links that start at the beginning of a line).
<?php
$content =
preg_replace(
'~(\s|^)(https?://.+?)(\s|$)~im',
'$1<a href="$2" target="_blank">$2</a>$3',
$content
);
$content =
preg_replace(
'~(\s|^)(www\..+?)(\s|$)~im',
'$1<a href="http://$2" target="_blank">$2</a>$3',
$content
);
$content = nl2br($content);
Example of links without https?:// prefixes + example of a single preg_replace() call (patterns & replacements are arrays):
$content =
preg_replace(
array(
'~(\s|^)(www\..+?)(\s|$)~im',
'~(\s|^)(https?://)(.+?)(\s|$)~im',
),
array(
'$1http://$2$3',
'$1<a href="$2$3" target="_blank">$3</a>$4',
),
$content
);
$content = nl2br($content);
Let me suggest something less straightforward: split the input text into the HTML and non-HTML parts, then process the non-HTML parts with your regexp and combine the text back into one piece. Something like:
<?php
$chunks = preg_split('/(<.*>)/Ums', $_POST['content'], -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
$result = '';
foreach ($chunks as $chunk) {
    if (substr($chunk,0,1) != '<') {
        /* do your processing on $chunk */
    }
    $result .= $chunk;
}
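To make the placeholder concrete, here is one way the loop above could be filled in. This is only a sketch: process_non_tag_chunks and $linkify are names made up for illustration, and the URL pattern is borrowed from the other answers on this page.

```php
<?php
// Hypothetical helper: applies $callback to every chunk that is not a tag.
function process_non_tag_chunks(string $html, callable $callback): string
{
    // Same split as above: tags are kept as delimiters, empty pieces are dropped.
    $chunks = preg_split('/(<.*>)/Ums', $html, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
    $result = '';
    foreach ($chunks as $chunk) {
        if (substr($chunk, 0, 1) != '<') {
            $chunk = $callback($chunk);   // only plain text gets processed
        }
        $result .= $chunk;
    }
    return $result;
}

// Example processing step: wrap bare http(s) URLs in links.
$linkify = function (string $text): string {
    return preg_replace(
        '~(\s|^)(https?://[a-z0-9_./?=&#-]+)~i',
        '$1<a href="$2" target="_blank">$2</a>',
        $text
    );
};

echo process_non_tag_chunks('<img src="http://example.com/a.png"> see http://example.com/page', $linkify), "\n";
```

One thing to keep in mind: the text of an already existing link is a non-tag chunk too, so it would be processed like any other text unless the loop also remembers whether it is currently between <a ...> and </a>.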
Some additional advice:
This has been done hundreds of times before. On this page, both m-buettner's and glavić's answers work fine, although I prefer glavić's shorter expression.
Here's a good PHP resource to do it: http://code.iamcal.com/php/lib_autolink/
Repeats on Stackoverflow:
Decent in-depth article: http://buildinternet.com/2010/05/how-to-automatically-linkify-text-with-php-regular-expressions/