What is the RFC-compliant, working regular expression to check if a string is a valid URL?

无人及你 2020-12-06 10:04

There is already a question with almost the same name: What is the best regular expression to check if a string is a valid URL

I don't understand this stackoverflow

4 answers
  • 2020-12-06 10:26

    After reading RFC 3986, I have to say I was wrong. That regexp works fully, as far as I know. My first mistake was the syntax of IPv6 addresses: they are enclosed in []. My second was about example.org: (note the trailing colon). As the RFC says, a scheme can contain dots, so that is valid too.

    So that's the valid RFC way to do it, but people will usually (as I will) need to modify it to accept only some schemes.
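Restricting the accepted schemes, as mentioned above, could be sketched like this. The helper name `hasAllowedScheme` and the default allow-list are hypothetical, and the function checks only the scheme, not the rest of the URL.

```php
<?php
// Hypothetical helper: accept a URI only if its scheme is on an allow-list.
// Scheme syntax per RFC 3986: a letter, then letters, digits, "+", "-" or ".".
function hasAllowedScheme(string $uri, array $allowed = ['http', 'https']): bool
{
    if (!preg_match('/^([a-z][a-z0-9+.-]*):/i', $uri, $m)) {
        return false; // no syntactically valid scheme at all
    }
    // Schemes are case-insensitive, so compare in lower case.
    return in_array(strtolower($m[1]), $allowed, true);
}

var_dump(hasAllowedScheme('https://example.org/')); // bool(true)
var_dump(hasAllowedScheme('example.org:'));         // bool(false): valid scheme, but not allowed
```

Matching the scheme first and comparing it against a list keeps the allow-list out of the regular expression, so it can be changed without touching the pattern.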

  • 2020-12-06 10:29

    Well, if you look at it, the specification is broken down into "chunks". That's how I'd suggest building the regex, so that it's easier to read, maintain, and understand. The parts of the regex are (optional parts are marked):

    1. Scheme
    2. Username/Password (optional)
    3. Domain Or IP Address
    4. Port (optional)
    5. Path (optional)
    6. Query (optional)
    7. Anchor (optional)

    So, we need to build a regex sub-part for each.

    1. Scheme:

      $scheme = "[a-z][a-z0-9+.-]*";
      
    2. Username/Password:

      // One or more characters for the user name; the password may be empty.
      $username = "([^:@/]+(:[^:@/]*)?@)?";
      
    3. Domain or IP Address:

      Now, we need to build up the 3 possible hosts:

      1. Domain Name
      2. IPv4
      3. IPv6

      Domain Name:

      // A label starts with a letter and may be a single character.
      $segment = "([a-z]([a-z0-9-]*[a-z0-9])?)";
      $domain = "({$segment}\.)*{$segment}";
      

      IPv4:

      // One octet in the range 0-255.
      $segment = "(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])";
      $ipv4 = "({$segment}\.{$segment}\.{$segment}\.{$segment})";
      

      IPv6:

      $block = "([a-f0-9]{0,4})";
      // Colon-separated blocks ending in a block; empty blocks loosely allow "::".
      $rawIpv6 = "({$block}:){2,7}{$block}";
      $ipv4sub = "(::ffff:{$ipv4})";
      $ipv6 = "([({$rawIpv6}|{$ipv4sub})])";
      

      Finally:

      $host = "($domain|$ipv4|$ipv6)";
      
    4. Port:

      $port = "(:[\d]{1,5})?";
      
    5. Path:

      $path = "([^?\#]*)?";
      
    6. Query:

      $query = "(\?[^\#]*)?";
      
    7. Anchor:

      $anchor = "(\#.*)?";
      

    And the final regex:

    $regex = "#^{$scheme}://{$username}{$host}{$port}(/{$path}{$query}{$anchor}|)$#i";
    

    Note that the / is in the regex rather than in the path part, since the path can be empty.

    Also note that I have not tested this. It should work, but each part still needs to be checked against what you expect to see in the URL.
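One way to do that checking is to assemble the parts and run them against a few sample URLs. The sketch below is not the answer's original text: it assumes quantifiers on the user info, single-character domain labels, 0-255 IPv4 octets, an IPv6 pattern that ends in a block, escaped [ ] around the IPv6 host, and ";" allowed in the path and query. The IPv6 part remains a loose approximation rather than the full RFC 3986 grammar.

```php
<?php
// Sketch under the assumptions stated above; variable names follow the answer
// where possible ($label and $octet are hypothetical names).
$scheme   = '[a-z][a-z0-9+.-]*';
$username = '([^:@/]+(:[^:@/]*)?@)?';

$label  = '([a-z]([a-z0-9-]*[a-z0-9])?)';
$domain = "({$label}\\.)*{$label}";

$octet = '(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])';
$ipv4  = "({$octet}\\.{$octet}\\.{$octet}\\.{$octet})";

$block   = '([a-f0-9]{0,4})';
$rawIpv6 = "(({$block}:){2,7}{$block})";
$ipv4sub = "(::ffff:{$ipv4})";
$ipv6    = "(\\[({$rawIpv6}|{$ipv4sub})\\])"; // escaped brackets are literal

$host   = "({$domain}|{$ipv4}|{$ipv6})";
$port   = '(:\d{1,5})?';
$path   = '([^?\#]*)?';
$query  = '(\?[^\#]*)?';
$anchor = '(\#.*)?';

$regex = "#^{$scheme}://{$username}{$host}{$port}(/{$path}{$query}{$anchor}|)$#i";

$samples = [
    'http://example.com/',
    'http://user:pass@example.com:8080/a/b?x=1#frag',
    'http://[2620:0:1cfe:face:b00c::3]:80/',
    'http://999.1.1.1/',
];
foreach ($samples as $url) {
    printf("%-50s %s\n", $url, preg_match($regex, $url) ? 'match' : 'no match');
}
```

With these assumptions the first three samples should match and the last should not, because 999 is not a valid octet and the domain labels here must start with a letter.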

    Finally, note that this is only one way of doing it. You could use other tools that don't need a regexp, or a library or framework, which will be easier to maintain in the long run.
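As an illustration of the non-regexp route, PHP's built-in parse_url() can split a URL into components. It does not validate anything by itself, so this is only a sketch of a starting point; the helper name `describeUrl` is hypothetical.

```php
<?php
// Sketch: split a URL with parse_url() instead of a hand-written regex.
// parse_url() only breaks the string into components; each component
// still needs its own check if strict validation is required.
function describeUrl(string $url): ?array
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return null; // seriously malformed, or missing scheme/host
    }
    return $parts;
}

print_r(describeUrl('http://user:pass@example.com:8080/a/b?x=1#frag'));
var_dump(describeUrl('/just/a/path')); // NULL: no scheme or host
```

The returned array uses the keys scheme, host, port, user, pass, path, query and fragment, which map onto the seven parts listed above.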

    Best of luck

  • 2020-12-06 10:38

    Thanks ircmaxell, but I had to adjust the IPv6 regex slightly so that PHP would compile it with preg_match.

    I changed:

    $ipv6 = "([({$rawIpv6}|{$ipv4sub})])";
    

    To:

    $ipv6 = "({$rawIpv6}|{$ipv4sub})";
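Removing the brackets makes the pattern compile, but RFC 3986 requires an IPv6 literal in a URL to be enclosed in [ ]. An unescaped [ opens a character class, which is why the original failed to compile. An alternative sketch keeps the brackets by escaping them. The simplified `$rawIpv6` used here is an assumption for illustration, not the pattern from the answer above.

```php
<?php
// Sketch: keep the literal [ ] around an IPv6 host by escaping them.
$block   = '[a-f0-9]{0,4}';
$rawIpv6 = "(?:{$block}:){2,7}{$block}"; // loose approximation, not the full RFC grammar
$ipv6    = "\\[(?:{$rawIpv6})\\]";       // \[ and \] match literal brackets

var_dump(preg_match("#^{$ipv6}$#i", '[2620:0:1cfe:face:b00c::3]')); // int(1)
var_dump(preg_match("#^{$ipv6}$#i", '2620:0:1cfe:face:b00c::3'));   // int(0): brackets missing
```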
    
  • 2020-12-06 10:53

    Here's the RFC that you can study: RFC 3986 - Uniform Resource Identifier (URI): Generic Syntax. Section 3.2.2, Host, is what you're looking for.

    Unfortunately, PHP's built-in function filter_var() doesn't support IPv6 syntax:

    <?php
    
    var_dump(filter_var('http://[2620:0:1cfe:face:b00c::3]:80/', FILTER_VALIDATE_URL));
    // Output: boolean false
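One possible workaround is to pull out the bracketed host yourself and validate it as an IP address. FILTER_VALIDATE_IP with FILTER_FLAG_IPV6 is a standard filter; the helper name `isBracketedIpv6` is hypothetical, and this sketch checks only the host, not the whole URL.

```php
<?php
// Hypothetical helper: validate a bracketed IPv6 host such as "[::1]".
function isBracketedIpv6(string $host): bool
{
    if (!preg_match('/^\[(.+)\]$/', $host, $m)) {
        return false; // not enclosed in [ ]
    }
    // FILTER_VALIDATE_IP returns the address on success and false on failure.
    return filter_var($m[1], FILTER_VALIDATE_IP, FILTER_FLAG_IPV6) !== false;
}

var_dump(isBracketedIpv6('[2620:0:1cfe:face:b00c::3]')); // bool(true)
var_dump(isBracketedIpv6('[not-an-address]'));           // bool(false)
```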
    