I\'ve been looking for a simple regex for URLs, does anybody have one handy that works well? I didn\'t find one with the zend framework validation classes and have seen sev
And there is your answer =) Try to break it, you can't!!!
function link_validate_url($text) {
$LINK_DOMAINS = 'aero|arpa|asia|biz|com|cat|coop|edu|gov|info|int|jobs|mil|museum|name|nato|net|org|pro|travel|mobi|local';
$LINK_ICHARS_DOMAIN = (string) html_entity_decode(implode("", array( // @TODO completing letters ...
"æ", // æ
"Æ", // Æ
"À", // À
"à", // à
"Á", // Á
"á", // á
"Â", // Â
"â", // â
"å", // å
"Å", // Å
"ä", // ä
"Ä", // Ä
"Ç", // Ç
"ç", // ç
"Ð", // Ð
"ð", // ð
"È", // È
"è", // è
"É", // É
"é", // é
"Ê", // Ê
"ê", // ê
"Ë", // Ë
"ë", // ë
"Î", // Î
"î", // î
"Ï", // Ï
"ï", // ï
"ø", // ø
"Ø", // Ø
"ö", // ö
"Ö", // Ö
"Ô", // Ô
"ô", // ô
"Õ", // Õ
"õ", // õ
"Œ", // Œ
"œ", // œ
"ü", // ü
"Ü", // Ü
"Ù", // Ù
"ù", // ù
"Û", // Û
"û", // û
"Ÿ", // Ÿ
"ÿ", // ÿ
"Ñ", // Ñ
"ñ", // ñ
"þ", // þ
"Þ", // Þ
"ý", // ý
"Ý", // Ý
"¿", // ¿
)), ENT_QUOTES, 'UTF-8');
$LINK_ICHARS = $LINK_ICHARS_DOMAIN . (string) html_entity_decode(implode("", array(
"ß", // ß
)), ENT_QUOTES, 'UTF-8');
$allowed_protocols = array('http', 'https', 'ftp', 'news', 'nntp', 'telnet', 'mailto', 'irc', 'ssh', 'sftp', 'webcal');
// Starting a parenthesis group with (?: means that it is grouped, but is not captured
$protocol = '((?:'. implode("|", $allowed_protocols) .'):\/\/)';
$authentication = "(?:(?:(?:[\w\.\-\+!$&'\(\)*\+,;=" . $LINK_ICHARS . "]|%[0-9a-f]{2})+(?::(?:[\w". $LINK_ICHARS ."\.\-\+%!$&'\(\)*\+,;=]|%[0-9a-f]{2})*)?)?@)";
$domain = '(?:(?:[a-z0-9' . $LINK_ICHARS_DOMAIN . ']([a-z0-9'. $LINK_ICHARS_DOMAIN . '\-_\[\]])*)(\.(([a-z0-9' . $LINK_ICHARS_DOMAIN . '\-_\[\]])+\.)*('. $LINK_DOMAINS .'|[a-z]{2}))?)';
$ipv4 = '(?:[0-9]{1,3}(\.[0-9]{1,3}){3})';
$ipv6 = '(?:[0-9a-fA-F]{1,4}(\:[0-9a-fA-F]{1,4}){7})';
$port = '(?::([0-9]{1,5}))';
// Pattern specific to external links.
$external_pattern = '/^'. $protocol .'?'. $authentication .'?('. $domain .'|'. $ipv4 .'|'. $ipv6 .' |localhost)'. $port .'?';
// Pattern specific to internal links.
$internal_pattern = "/^(?:[a-z0-9". $LINK_ICHARS ."_\-+\[\]]+)";
$internal_pattern_file = "/^(?:[a-z0-9". $LINK_ICHARS ."_\-+\[\]\.]+)$/i";
$directories = "(?:\/[a-z0-9". $LINK_ICHARS ."_\-\.~+%=&,$'#!():;*@\[\]]*)*";
// Yes, four backslashes == a single backslash.
$query = "(?:\/?\?([?a-z0-9". $LINK_ICHARS ."+_|\-\.~\/\\\\%=&,$'():;*@\[\]{} ]*))";
$anchor = "(?:#[a-z0-9". $LINK_ICHARS ."_\-\.~+%=&,$'():;*@\[\]\/\?]*)";
// The rest of the path for a standard URL.
$end = $directories .'?'. $query .'?'. $anchor .'?'.'$/i';
$message_id = '[^@].*@'. $domain;
$newsgroup_name = '(?:[0-9a-z+-]*\.)*[0-9a-z+-]*';
$news_pattern = '/^news:('. $newsgroup_name .'|'. $message_id .')$/i';
$user = '[a-zA-Z0-9'. $LINK_ICHARS .'_\-\.\+\^!#\$%&*+\/\=\?\`\|\{\}~\'\[\]]+';
$email_pattern = '/^mailto:'. $user .'@'.'(?:'. $domain .'|'. $ipv4 .'|'. $ipv6 .'|localhost)'. $query .'?$/';
if (strpos($text, '<front>') === 0) {
return false;
}
if (in_array('mailto', $allowed_protocols) && preg_match($email_pattern, $text)) {
return false;
}
if (in_array('news', $allowed_protocols) && preg_match($news_pattern, $text)) {
return false;
}
if (preg_match($internal_pattern . $end, $text)) {
return false;
}
if (preg_match($external_pattern . $end, $text)) {
return false;
}
if (preg_match($internal_pattern_file, $text)) {
return false;
}
return true;
}
function is_valid_url ($url="") {
if ($url=="") {
$url=$this->url;
}
$url = @parse_url($url);
if ( ! $url) {
return false;
}
$url = array_map('trim', $url);
$url['port'] = (!isset($url['port'])) ? 80 : (int)$url['port'];
$path = (isset($url['path'])) ? $url['path'] : '';
if ($path == '') {
$path = '/';
}
$path .= ( isset ( $url['query'] ) ) ? "?$url[query]" : '';
if ( isset ( $url['host'] ) AND $url['host'] != gethostbyname ( $url['host'] ) ) {
if ( PHP_VERSION >= 5 ) {
$headers = get_headers("$url[scheme]://$url[host]:$url[port]$path");
}
else {
$fp = fsockopen($url['host'], $url['port'], $errno, $errstr, 30);
if ( ! $fp ) {
return false;
}
fputs($fp, "HEAD $path HTTP/1.1\r\nHost: $url[host]\r\n\r\n");
$headers = fread ( $fp, 128 );
fclose ( $fp );
}
$headers = ( is_array ( $headers ) ) ? implode ( "\n", $headers ) : $headers;
return ( bool ) preg_match ( '#^HTTP/.*\s+[(200|301|302)]+\s#i', $headers );
}
return false;
}
Use the filter_var()
function to validate whether a string is URL or not:
var_dump(filter_var('example.com', FILTER_VALIDATE_URL));
It is bad practice to use regular expressions when not necessary.
EDIT: Be careful, this solution is not unicode-safe and not XSS-safe. If you need a complex validation, maybe it's better to look somewhere else.
I used this on a few projects, I don't believe I've run into issues, but I'm sure it's not exhaustive:
$text = preg_replace(
'#((https?|ftp)://(\S*?\.\S*?))([\s)\[\]{},;"\':<]|\.\s|$)#i',
"'<a href=\"$1\" target=\"_blank\">$3</a>$4'",
$text
);
Most of the random junk at the end is to deal with situations like http://domain.com.
in a sentence (to avoid matching the trailing period). I'm sure it could be cleaned up but since it worked. I've more or less just copied it over from project to project.
Here's a simple class for URL Validation using RegEx and then cross-references the domain against popular RBL (Realtime Blackhole Lists) servers:
Install:
require 'URLValidation.php';
Usage:
require 'URLValidation.php';
$urlVal = new UrlValidation(); //Create Object Instance
Add a URL as the parameter of the domain()
method and check the the return.
$urlArray = ['http://www.bokranzr.com/test.php?test=foo&test=dfdf', 'https://en-gb.facebook.com', 'https://www.google.com'];
foreach ($urlArray as $k=>$v) {
echo var_dump($urlVal->domain($v)) . ' URL: ' . $v . '<br>';
}
Output:
bool(false) URL: http://www.bokranzr.com/test.php?test=foo&test=dfdf
bool(true) URL: https://en-gb.facebook.com
bool(true) URL: https://www.google.com
As you can see above, www.bokranzr.com is listed as malicious website via an RBL so the domain was returned as false.
"/(http(s?):\/\/)([a-z0-9\-]+\.)+[a-z]{2,4}(\.[a-z]{2,4})*(\/[^ ]+)*/i"
(http(s?)://) means http:// or https://
([a-z0-9-]+.)+ => 2.0[a-z0-9-] means any a-z character or any 0-9 or (-)sign)
2.1 (+) means the character can be one or more ex: a1w,
a9-,c559s, f)
2.2 \. is (.)sign
2.3. the (+) sign after ([a-z0-9\-]+\.) mean do 2.1,2.2,2.3
at least 1 time
ex: abc.defgh0.ig, aa.b.ced.f.gh. also in case www.yyy.com
3.[a-z]{2,4} mean a-z at least 2 character but not more than
4 characters for check that there will not be
the case
ex: https://www.google.co.kr.asdsdagfsdfsf
4.(\.[a-z]{2,4})*(\/[^ ]+)* mean
4.1 \.[a-z]{2,4} means like number 3 but start with
(.)sign
4.2 * means (\.[a-z]{2,4})can be use or not use never mind
4.3 \/ means \
4.4 [^ ] means any character except blank
4.5 (+) means do 4.3,4.4,4.5 at least 1 times
4.6 (*) after (\/[^ ]+) mean use 4.3 - 4.5 or not use
no problem
use for case https://stackoverflow.com/posts/51441301/edit
5. when you use regex write in "/ /" so it come
"/(http(s?)://)([a-z0-9-]+.)+[a-z]{2,4}(.[a-z]{2,4})(/[^ ]+)/i"
6. almost forgot: letter i on the back mean ignore case of
Big letter or small letter ex: A same as a, SoRRy same
as sorry.
Note : Sorry for bad English. My country not use it well.