PHP validation/regex for URL

青春惊慌失措 2020-11-22 01:19

I've been looking for a simple regex for URLs. Does anybody have one handy that works well? I didn't find one in the Zend Framework validation classes and have seen sev

21 Answers
  • 2020-11-22 01:49

    And there is your answer =) Try to break it, you can't!!!

    function link_validate_url($text) {
      $LINK_DOMAINS = 'aero|arpa|asia|biz|com|cat|coop|edu|gov|info|int|jobs|mil|museum|name|nato|net|org|pro|travel|mobi|local';
      $LINK_ICHARS_DOMAIN = (string) html_entity_decode(implode("", array( // @TODO completing letters ...
        "æ", // æ
        "Æ", // Æ
        "À", // À
        "à", // à
        "Á", // Á
        "á", // á
        "Â", // Â
        "â", // â
        "å", // å
        "Å", // Å
        "ä", // ä
        "Ä", // Ä
        "Ç", // Ç
        "ç", // ç
        "Ð", // Ð
        "ð", // ð
        "È", // È
        "è", // è
        "É", // É
        "é", // é
        "Ê", // Ê
        "ê", // ê
        "Ë", // Ë
        "ë", // ë
        "Î", // Î
        "î", // î
        "Ï", // Ï
        "ï", // ï
        "ø", // ø
        "Ø", // Ø
        "ö", // ö
        "Ö", // Ö
        "Ô", // Ô
        "ô", // ô
        "Õ", // Õ
        "õ", // õ
        "Œ", // Œ
        "œ", // œ
        "ü", // ü
        "Ü", // Ü
        "Ù", // Ù
        "ù", // ù
        "Û", // Û
        "û", // û
        "Ÿ", // Ÿ
        "ÿ", // ÿ 
        "Ñ", // Ñ
        "ñ", // ñ
        "þ", // þ
        "Þ", // Þ
        "ý", // ý
        "Ý", // Ý
        "¿", // ¿
      )), ENT_QUOTES, 'UTF-8');
    
      $LINK_ICHARS = $LINK_ICHARS_DOMAIN . (string) html_entity_decode(implode("", array(
        "ß", // ß
      )), ENT_QUOTES, 'UTF-8');
      $allowed_protocols = array('http', 'https', 'ftp', 'news', 'nntp', 'telnet', 'mailto', 'irc', 'ssh', 'sftp', 'webcal');
    
      // Starting a parenthesis group with (?: means that it is grouped, but is not captured
      $protocol = '((?:'. implode("|", $allowed_protocols) .'):\/\/)';
      $authentication = "(?:(?:(?:[\w\.\-\+!$&'\(\)*\+,;=" . $LINK_ICHARS . "]|%[0-9a-f]{2})+(?::(?:[\w". $LINK_ICHARS ."\.\-\+%!$&'\(\)*\+,;=]|%[0-9a-f]{2})*)?)?@)";
      $domain = '(?:(?:[a-z0-9' . $LINK_ICHARS_DOMAIN . ']([a-z0-9'. $LINK_ICHARS_DOMAIN . '\-_\[\]])*)(\.(([a-z0-9' . $LINK_ICHARS_DOMAIN . '\-_\[\]])+\.)*('. $LINK_DOMAINS .'|[a-z]{2}))?)';
      $ipv4 = '(?:[0-9]{1,3}(\.[0-9]{1,3}){3})';
      $ipv6 = '(?:[0-9a-fA-F]{1,4}(\:[0-9a-fA-F]{1,4}){7})';
      $port = '(?::([0-9]{1,5}))';
    
      // Pattern specific to external links.
      $external_pattern = '/^'. $protocol .'?'. $authentication .'?('. $domain .'|'. $ipv4 .'|'. $ipv6 .'|localhost)'. $port .'?';
    
      // Pattern specific to internal links.
      $internal_pattern = "/^(?:[a-z0-9". $LINK_ICHARS ."_\-+\[\]]+)";
      $internal_pattern_file = "/^(?:[a-z0-9". $LINK_ICHARS ."_\-+\[\]\.]+)$/i";
    
      $directories = "(?:\/[a-z0-9". $LINK_ICHARS ."_\-\.~+%=&,$'#!():;*@\[\]]*)*";
      // Yes, four backslashes == a single backslash.
      $query = "(?:\/?\?([?a-z0-9". $LINK_ICHARS ."+_|\-\.~\/\\\\%=&,$'():;*@\[\]{} ]*))";
      $anchor = "(?:#[a-z0-9". $LINK_ICHARS ."_\-\.~+%=&,$'():;*@\[\]\/\?]*)";
    
      // The rest of the path for a standard URL.
      $end = $directories .'?'. $query .'?'. $anchor .'?'.'$/i';
    
      $message_id = '[^@].*@'. $domain;
      $newsgroup_name = '(?:[0-9a-z+-]*\.)*[0-9a-z+-]*';
      $news_pattern = '/^news:('. $newsgroup_name .'|'. $message_id .')$/i';
    
      $user = '[a-zA-Z0-9'. $LINK_ICHARS .'_\-\.\+\^!#\$%&*+\/\=\?\`\|\{\}~\'\[\]]+';
      $email_pattern = '/^mailto:'. $user .'@'.'(?:'. $domain .'|'. $ipv4 .'|'. $ipv6 .'|localhost)'. $query .'?$/';
    
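      // Note: as written, this function returns false as soon as the text matches
      // one of the recognised link forms, and true only when nothing matches.
      if (strpos($text, '<front>') === 0) {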
        return false;
      }
      if (in_array('mailto', $allowed_protocols) && preg_match($email_pattern, $text)) {
        return false;
      }
      if (in_array('news', $allowed_protocols) && preg_match($news_pattern, $text)) {
        return false;
      }
      if (preg_match($internal_pattern . $end, $text)) {
        return false;
      }
      if (preg_match($external_pattern . $end, $text)) {
        return false;
      }
      if (preg_match($internal_pattern_file, $text)) {
        return false;
      }
    
      return true;
    }
    
  • 2020-11-22 01:50
    function is_valid_url ($url="") {
    
            if ($url=="") {
                $url=$this->url;
            }
    
            $url = @parse_url($url);
    
            if ( ! $url) {
    
    
                return false;
            }
    
            $url = array_map('trim', $url);
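            // Default to 443 for https and 80 otherwise when no port is given.
            $url['port'] = isset($url['port']) ? (int)$url['port'] : ((isset($url['scheme']) && strtolower($url['scheme']) === 'https') ? 443 : 80);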
            $path = (isset($url['path'])) ? $url['path'] : '';
    
            if ($path == '') {
                $path = '/';
            }
    
            $path .= ( isset ( $url['query'] ) ) ? "?$url[query]" : '';
    
    
    
            if ( isset ( $url['host'] ) AND $url['host'] != gethostbyname ( $url['host'] ) ) {
                if ( version_compare ( PHP_VERSION, '5.0.0', '>=' ) ) {
                    $headers = get_headers("$url[scheme]://$url[host]:$url[port]$path");
                }
                else {
                    $fp = fsockopen($url['host'], $url['port'], $errno, $errstr, 30);
    
                    if ( ! $fp ) {
                        return false;
                    }
                    fputs($fp, "HEAD $path HTTP/1.1\r\nHost: $url[host]\r\n\r\n");
                    $headers = fread ( $fp, 128 );
                    fclose ( $fp );
                }
                $headers = ( is_array ( $headers ) ) ? implode ( "\n", $headers ) : $headers;
                // Accept only a 200, 301 or 302 status on the first response line.
                return ( bool ) preg_match ( '#^HTTP/\S+\s+(200|301|302)\s#i', $headers );
            }
    
            return false;
        }
    
  • 2020-11-22 01:51

    Use the filter_var() function to validate whether a string is a URL or not:

    var_dump(filter_var('example.com', FILTER_VALIDATE_URL));
    

    It is bad practice to use regular expressions when not necessary.

    EDIT: Be careful, this solution is not Unicode-safe and not XSS-safe. If you need more complex validation, it may be better to look elsewhere.
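
    One way to act on that caveat is to combine filter_var() with a scheme whitelist, since FILTER_VALIDATE_URL checks URL syntax rather than which scheme is used. This is only a sketch under that assumption: the helper name is_http_url() is hypothetical, and restricting to http/https is a choice made for illustration, not something the answer prescribes.

```php
<?php
// Hypothetical helper: accept only syntactically valid http/https URLs.
function is_http_url($url) {
    // filter_var() returns false when the string is not a valid URL.
    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        return false;
    }
    // Assumption for this sketch: only http and https are acceptable schemes.
    $scheme = strtolower((string) parse_url($url, PHP_URL_SCHEME));
    return $scheme === 'http' || $scheme === 'https';
}

var_dump(is_http_url('example.com'));           // no scheme, rejected by filter_var()
var_dump(is_http_url('http://example.com/'));   // passes both checks
var_dump(is_http_url('javascript://alert(1)')); // rejected: not http or https
```

    The value should still be escaped (for example with htmlspecialchars()) before it is written into HTML.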

  • 2020-11-22 01:52

    I've used this on a few projects. I don't believe I've run into issues, but I'm sure it's not exhaustive:

    $text = preg_replace(
      '#((https?|ftp)://(\S*?\.\S*?))([\s)\[\]{},;"\':<]|\.\s|$)#i',
      '<a href="$1" target="_blank">$3</a>$4',
      $text
    );
    

    Most of the random junk at the end is to deal with situations like http://domain.com. in a sentence (to avoid matching the trailing period). I'm sure it could be cleaned up, but since it worked, I've more or less just copied it over from project to project.
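
    To see the trailing-period handling in action, the call can be wrapped in a small helper. This is a minimal sketch: the function name linkify() is hypothetical, and the replacement uses plain $1, $3 and $4 backreferences. The input is not HTML-escaped here, so it is only suitable for trusted text.

```php
<?php
// Hypothetical wrapper around the pattern from the answer above.
function linkify($text) {
    return preg_replace(
        // $1 = full URL, $3 = URL without the scheme, $4 = the delimiter that ended the URL
        '#((https?|ftp)://(\S*?\.\S*?))([\s)\[\]{},;"\':<]|\.\s|$)#i',
        '<a href="$1" target="_blank">$3</a>$4',
        $text
    );
}

// The sentence-ending period stays outside the link.
echo linkify('See http://domain.com. Thanks'), "\n";
// prints: See <a href="http://domain.com" target="_blank">domain.com</a>. Thanks
```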

  • 2020-11-22 01:53

    Here's a simple class that validates a URL with a regex and then cross-references the domain against popular RBL (Realtime Blackhole List) servers:

    Install:

    require 'URLValidation.php';
    

    Usage:

    require 'URLValidation.php';
    $urlVal = new UrlValidation(); //Create Object Instance
    

    Pass a URL as the parameter of the domain() method and check the return value.

    $urlArray = ['http://www.bokranzr.com/test.php?test=foo&test=dfdf', 'https://en-gb.facebook.com', 'https://www.google.com'];
    foreach ($urlArray as $k=>$v) {
    
        echo var_dump($urlVal->domain($v)) . ' URL: ' . $v . '<br>';
    
    }
    

    Output:

    bool(false) URL: http://www.bokranzr.com/test.php?test=foo&test=dfdf
    bool(true) URL: https://en-gb.facebook.com
    bool(true) URL: https://www.google.com
    

    As you can see above, www.bokranzr.com is listed as a malicious website by an RBL, so domain() returned false for it.

  • 2020-11-22 01:54
    "/(http(s?):\/\/)([a-z0-9\-]+\.)+[a-z]{2,4}(\.[a-z]{2,4})*(\/[^ ]+)*/i"
    
    1. (http(s?):\/\/) matches http:// or https://.

    2. ([a-z0-9\-]+\.)+ matches one or more labels, each followed by a dot:

       2.1 [a-z0-9\-] is any letter a-z, any digit 0-9, or a hyphen (-).
       2.2 The + after the character class means one or more of those characters, e.g. a1w, a9-, c559s, f.
       2.3 \. is a literal dot (.).
       2.4 The + after the group ([a-z0-9\-]+\.) means repeat 2.1 to 2.3 at least once,
           e.g. abc.defgh0.ig., aa.b.ced.f.gh., or the www.yyy. part of www.yyy.com.

    3. [a-z]{2,4} matches at least 2 but no more than 4 letters a-z. It is meant to rule out
       cases such as https://www.google.co.kr.asdsdagfsdfsf.

    4. (\.[a-z]{2,4})*(\/[^ ]+)* matches optional extra suffixes and an optional path:

       4.1 \.[a-z]{2,4} is the same as number 3, but starting with a dot (.).
       4.2 The * means (\.[a-z]{2,4}) may appear zero or more times.
       4.3 \/ is a literal forward slash (/).
       4.4 [^ ] is any character except a space.
       4.5 The + means one or more of the characters from 4.4.
       4.6 The * after (\/[^ ]+) means the path from 4.3 to 4.5 may appear zero or more times.

       This covers cases such as https://stackoverflow.com/posts/51441301/edit.

    5. The regex is written between "/ /" delimiters, so it becomes:

      "/(http(s?):\/\/)([a-z0-9\-]+\.)+[a-z]{2,4}(\.[a-z]{2,4})*(\/[^ ]+)*/i"

    6. Almost forgot: the letter i at the end means ignore case, so A is the same as a,
       and SoRRy is the same as sorry.

    Note: Sorry for my bad English; it is not widely used in my country.
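
    The pattern can be tried out with preg_match(). This is a minimal sketch: the constant and function names are hypothetical, and the anchored variant (the same pattern wrapped in ^ and $) is an assumption added for illustration, not part of the answer. As posted, the pattern is unanchored, so it finds a URL anywhere in a string; it only rejects trailing junk after the TLD once it is anchored.

```php
<?php
// Pattern from the answer: finds a URL anywhere in the input.
const URL_SEARCH = '/(http(s?):\/\/)([a-z0-9\-]+\.)+[a-z]{2,4}(\.[a-z]{2,4})*(\/[^ ]+)*/i';
// Assumed variant: the same pattern anchored so the whole string must be a URL.
const URL_EXACT = '/^(http(s?):\/\/)([a-z0-9\-]+\.)+[a-z]{2,4}(\.[a-z]{2,4})*(\/[^ ]+)*$/i';

// True when the text contains something the pattern accepts as a URL.
function contains_url($text) {
    return preg_match(URL_SEARCH, $text) === 1;
}

// True only when the entire text is accepted by the pattern.
function is_whole_url($text) {
    return preg_match(URL_EXACT, $text) === 1;
}

var_dump(is_whole_url('https://stackoverflow.com/posts/51441301/edit')); // bool(true)
var_dump(is_whole_url('https://www.google.co.kr.asdsdagfsdfsf'));        // bool(false)
var_dump(contains_url('https://www.google.co.kr.asdsdagfsdfsf'));        // bool(true): a prefix matches
var_dump(contains_url('www.yyy.com'));                                   // bool(false): no http(s)://
```

    Note that {2,4} also rejects longer top-level domains such as .museum when the pattern is anchored.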
