There is already a question with almost the same name: What is the best regular expression to check if a string is a valid URL
I don't understand this Stack Overflow question.
After reading RFC 3986, I have to say I was wrong. That regexp works fully, as far as I know. My first mistake was about the syntax of IPv6 addresses: they are enclosed in []. My second was about example.org: (note the trailing colon). But as the RFC says, a scheme can have dots in it, so that is also valid.
So that's the valid RFC way to do it, but people will usually (as I will) need to modify it to accept only some schemes.
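To narrow the scheme part to a known set, one option is to swap the generic pattern for an explicit alternation. This is a minimal sketch, assuming you only want http, https and ftp; that whitelist, the variable names and the helper isScheme are illustrative and not taken from the RFC:

```php
<?php
// Generic RFC 3986 scheme: a letter followed by letters, digits, "+", "." or "-".
$genericScheme = "[a-z][a-z0-9+.-]*";

// Hypothetical whitelist; adjust to the schemes your application accepts.
$allowedScheme = "(?:https?|ftp)";

// Helper for illustration: does $candidate match $pattern as a whole scheme?
function isScheme(string $pattern, string $candidate): bool
{
    return preg_match("#^{$pattern}$#i", $candidate) === 1;
}

var_dump(isScheme($genericScheme, "example.org")); // true: dots are legal in a scheme
var_dump(isScheme($allowedScheme, "example.org")); // false: not in the whitelist
var_dump(isScheme($allowedScheme, "HTTPS"));       // true: the i flag ignores case
```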
Well, if you look at it, the specification is broken down into "chunks". That's how I'd suggest building the regex, so that it's easier to read, maintain and understand. So, the parts of the regex are, in order: scheme, username/password, domain or IP address, port, path, query and anchor. Everything except the scheme and the host is optional.
So, we need to build a regex sub-part for each.
Scheme:
$scheme = "[a-z][a-z0-9+.-]*";
Username/Password:
$username = "([^:@/]+(:[^:@/]*)?@)?";
Domain or IP Address:
Now, we need to build up the 3 possible hosts:
Domain Name:
$segment = "([a-z]([a-z0-9-]*[a-z0-9])?)";
$domain = "({$segment}\.)*{$segment}";
IPv4:
$segment = "(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])";
$ipv4 = "({$segment}\.{$segment}\.{$segment}\.{$segment})";
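A quick way to sanity-check an octet pattern is to run it against a few addresses. The sketch below uses one possible octet pattern that accepts 0 to 255 written with one to three digits and no leading zeros; the pattern and the helper name isIpv4 are assumptions for illustration:

```php
<?php
// One possible octet pattern: 250-255, 200-249, 100-199, or 0-99.
$octet = "(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])";
$ipv4  = "(?:{$octet}\.{$octet}\.{$octet}\.{$octet})";

// Hypothetical helper: true when $candidate is a dotted-quad IPv4 address.
function isIpv4(string $pattern, string $candidate): bool
{
    return preg_match("#^{$pattern}$#", $candidate) === 1;
}

var_dump(isIpv4($ipv4, "192.168.1.1"));   // true
var_dump(isIpv4($ipv4, "255.255.255.0")); // true
var_dump(isIpv4($ipv4, "256.1.1.1"));     // false: 256 is out of range
var_dump(isIpv4($ipv4, "1.2.3"));         // false: only three octets
```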
IPv6:
$block = "([a-f0-9]{0,4})";
$rawIpv6 = "({$block}:){2,8}";
$ipv4sub = "(::ffff:{$ipv4})";
$ipv6 = "([({$rawIpv6}|{$ipv4sub})])";
Finally:
$host = "($domain|$ipv4|$ipv6)";
Port:
$port = "(:[\d]{1,5})?";
Path:
$path = "([^?;\#]*)?";
Query:
$query = "(\?[^\#;]*)?";
Anchor:
$anchor = "(\#.*)?";
And the final regex:
$regex = "#^{$scheme}://{$username}{$host}{$port}(/{$path}{$query}{$anchor}|)$#i";
Note that the / is in the regex itself, and not in the path part, since the path can be empty.
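To see how the chunks fit together, here is a self-contained sketch that assembles a pattern the same way and tries it on a few URLs. The function name buildUrlRegex is hypothetical, and the sub-patterns are simplified assumptions (in particular the IPv6 part is very loose), not a complete RFC 3986 grammar:

```php
<?php
// Build the URL pattern from named parts, mirroring the chunked approach.
// The sub-patterns are simplified assumptions, not a full RFC 3986 grammar.
function buildUrlRegex(): string
{
    $scheme   = "[a-z][a-z0-9+.-]*";
    $userinfo = "(?:[^:@/]+(?::[^:@/]*)?@)?";

    // Domain labels: start with a letter, end with a letter or digit.
    $label  = "(?:[a-z](?:[a-z0-9-]*[a-z0-9])?)";
    $domain = "(?:{$label}\.)*{$label}";

    // IPv4: four octets in the range 0-255.
    $octet = "(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])";
    $ipv4  = "(?:{$octet}\.){3}{$octet}";

    // Very loose IPv6 literal: hex digits and colons inside escaped brackets.
    $ipv6 = "\[[a-f0-9:]+\]";

    $host   = "(?:{$domain}|{$ipv4}|{$ipv6})";
    $port   = "(?::\d{1,5})?";
    $path   = "[^?\#]*";
    $query  = "(?:\?[^\#]*)?";
    $anchor = "(?:\#.*)?";

    // The "/" lives here, outside $path, because the path may be empty.
    return "#^{$scheme}://{$userinfo}{$host}{$port}(?:/{$path}{$query}{$anchor})?$#i";
}

$regex = buildUrlRegex();
var_dump(preg_match($regex, "http://user:pass@example.com:8080/path?x=1#top")); // int(1)
var_dump(preg_match($regex, "http://[::1]:80/"));                               // int(1)
var_dump(preg_match($regex, "http://192.168.1.1/"));                            // int(1)
var_dump(preg_match($regex, "not a url"));                                      // int(0)
```

Keeping each part in its own variable means a single chunk, such as the scheme or the host, can be tightened later without touching the rest.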
Also note that I have not tested this. It should work, but definitely it needs confirming that each part is correct (as for what to expect in the url).
Also note that this is only one way of doing it. You could use other tools that don't need a regexp, or a library or framework that will be easier to maintain in the long run.
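As one regex-free alternative, PHP's built-in parse_url() can split the URL so that each component is checked separately. This is a minimal sketch, assuming an http/https whitelist and a hypothetical helper name looksLikeWebUrl; keep in mind that parse_url() only splits a URL and does not validate it:

```php
<?php
// Hypothetical helper: split with parse_url(), then check the pieces we care about.
function looksLikeWebUrl(string $url): bool
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return false; // seriously malformed, or missing scheme/host
    }
    // Assumed whitelist of schemes for this example.
    return in_array(strtolower($parts['scheme']), ['http', 'https'], true);
}

var_dump(looksLikeWebUrl('https://example.com/a?b=c#d')); // true
var_dump(looksLikeWebUrl('ftp://example.com/'));          // false: scheme not allowed
var_dump(looksLikeWebUrl('example.com'));                 // false: no scheme
```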
Best of luck
Thanks ircmaxell, but I had to adjust the IPv6 regex a little for it to compile with preg_match in PHP.
I changed:
$ipv6 = "([({$rawIpv6}|{$ipv4sub})])";
To :
$ipv6 = "({$rawIpv6}|{$ipv4sub})";
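Dropping the brackets lets the pattern compile, but it also stops requiring the [ ] that RFC 3986 puts around an IPv6 host. Another option, sketched here on the assumption that the literal brackets should stay, is to escape them so PCRE does not read them as a character class. The $inner pattern is a simplified stand-in for illustration, not the sub-patterns from the answer:

```php
<?php
// Stand-in for the inner IPv6 pattern (assumption: hex digits and colons only).
$inner = "[a-f0-9:]+";

// Escaped brackets match a literal "[" and "]" around the address.
$ipv6 = "(\[{$inner}\])";

var_dump(preg_match("#^{$ipv6}$#i", "[2620:0:1cfe:face:b00c::3]")); // int(1)
var_dump(preg_match("#^{$ipv6}$#i", "2620:0:1cfe:face:b00c::3"));   // int(0): brackets missing
```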
Here's the RFC that you can study: RFC 3986 - Uniform Resource Identifier (URI): Generic Syntax. Section 3.2.2, Host, is what you're looking for.
Unfortunately, PHP's built-in function filter_var() doesn't support IPv6 syntax:
<?php
var_dump(filter_var('http://[2620:0:1cfe:face:b00c::3]:80/', FILTER_VALIDATE_URL));
// Output: boolean false
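One possible workaround, offered as a sketch rather than a drop-in fix, is to pull the host out with parse_url() and check it with FILTER_VALIDATE_IP and FILTER_FLAG_IPV6 after removing the brackets. The helper name hostIsIpv6 is hypothetical:

```php
<?php
// Hypothetical helper: true when the URL's host is an IPv6 address.
function hostIsIpv6(string $url): bool
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return false; // no host could be extracted
    }
    // Remove any surrounding brackets before validating the bare address.
    $address = trim($host, '[]');
    return filter_var($address, FILTER_VALIDATE_IP, FILTER_FLAG_IPV6) !== false;
}

var_dump(hostIsIpv6('http://[2620:0:1cfe:face:b00c::3]:80/')); // true
var_dump(hostIsIpv6('http://example.com/'));                   // false
```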