Getting parts of a URL (Regex)

说谎 2020-11-22 02:13

Given the URL (single line):
http://test.example.com/dir/subdir/file.html

How can I extract the following parts using regular expressions:

  1. The Subd
26 Answers
  • 2020-11-22 02:45

    I was trying to solve this in JavaScript, where it should be handled by:

    var url = new URL('http://a:b@example.com:890/path/wah@t/foo.js?foo=bar&bingobang=&king=kong@kong.com#foobar/bing/bo@ng?bang');
    

    since (in Chrome, at least) it parses to:

    {
      "hash": "#foobar/bing/bo@ng?bang",
      "search": "?foo=bar&bingobang=&king=kong@kong.com",
      "pathname": "/path/wah@t/foo.js",
      "port": "890",
      "hostname": "example.com",
      "host": "example.com:890",
      "password": "b",
      "username": "a",
      "protocol": "http:",
      "origin": "http://example.com:890",
      "href": "http://a:b@example.com:890/path/wah@t/foo.js?foo=bar&bingobang=&king=kong@kong.com#foobar/bing/bo@ng?bang"
    }
    

    However, this isn't supported in every browser (https://developer.mozilla.org/en-US/docs/Web/API/URL), so I cobbled this regex together to pull out the same parts as above:

    ^(?:(?:(([^:\/#\?]+:)?(?:(?:\/\/)(?:(?:(?:([^:@\/#\?]+)(?:\:([^:@\/#\?]*))?)@)?(([^:\/#\?\]\[]+|\[[^\/\]@#?]+\])(?:\:([0-9]+))?))?)?)?((?:\/?(?:[^\/\?#]+\/+)*)(?:[^\?#]*)))?(\?[^#]+)?)(#.*)?
    

    Credit goes to https://gist.github.com/rpflorence, who posted this jsperf, http://jsperf.com/url-parsing (originally found here: https://gist.github.com/jlong/2428561#comment-310066), and who came up with the regex this one is based on.

    The parts are in this order:

    var keys = [
        "href",                    // http://user:pass@host.com:81/directory/file.ext?query=1#anchor
        "origin",                  // http://user:pass@host.com:81
        "protocol",                // http:
        "username",                // user
        "password",                // pass
        "host",                    // host.com:81
        "hostname",                // host.com
        "port",                    // 81
        "pathname",                // /directory/file.ext
        "search",                  // ?query=1
        "hash"                     // #anchor
    ];
    

    There is also a small library which wraps it and provides query params:

    https://github.com/sadams/lite-url (also available on bower)

    If you have an improvement, please create a pull request with more tests and I will accept and merge with thanks.
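
    As a minimal sketch of how the regex and the key list above fit together, the capture groups can be zipped with the keys by index. The helper name parseUrl and the sample URL are hypothetical, chosen only for illustration; this is not the lite-url API.

```javascript
// The regex from above, copied verbatim into a literal.
const urlPattern = /^(?:(?:(([^:\/#\?]+:)?(?:(?:\/\/)(?:(?:(?:([^:@\/#\?]+)(?:\:([^:@\/#\?]*))?)@)?(([^:\/#\?\]\[]+|\[[^\/\]@#?]+\])(?:\:([0-9]+))?))?)?)?((?:\/?(?:[^\/\?#]+\/+)*)(?:[^\?#]*)))?(\?[^#]+)?)(#.*)?/;

// Capture groups 0..10 line up with these names, in order.
const partNames = [
    "href", "origin", "protocol", "username", "password",
    "host", "hostname", "port", "pathname", "search", "hash"
];

// Hypothetical helper: returns an object keyed by the names above.
// Groups that did not take part in the match stay undefined.
function parseUrl(str) {
    const match = urlPattern.exec(str);
    if (!match) {
        return null;
    }
    const parts = {};
    partNames.forEach(function (name, i) {
        parts[name] = match[i];
    });
    return parts;
}

const sample = parseUrl("http://user:pass@host.com:81/directory/file.ext?query=1#anchor");
console.log(sample.hostname, sample.port, sample.pathname);
```

    Note that, per the key list above, the "origin" group produced by this regex includes the user:pass@ part, unlike the origin reported by the browser's URL object.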

  • 2020-11-22 02:45

    This improved version should work as reliably as a parser.

       // Applies to URI, not just URL or URN:
       //    http://en.wikipedia.org/wiki/Uniform_Resource_Identifier#Relationship_to_URL_and_URN
       //
       // http://labs.apache.org/webarch/uri/rfc/rfc3986.html#regexp
       //
       // (?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?
       //
       // http://en.wikipedia.org/wiki/URI_scheme#Generic_syntax
       //
       // $& matches the entire uri
       // $1 matches scheme (ftp, http, mailto, mshelp, ymsgr, etc)
       // $2 matches authority (host, user:pwd@host, etc)
       // $3 matches path
       // $4 matches query (http GET REST api, etc)
       // $5 matches fragment (html anchor, etc)
       //
       // Match specific schemes, non-optional authority, disallow white-space so can delimit in text, and allow 'www.' w/o scheme
       // Note the schemes must match ^[^\s|:/?#]+(?:\|[^\s|:/?#]+)*$
       //
       // (?:()(www\.[^\s/?#]+\.[^\s/?#]+)|(schemes)://([^\s/?#]*))([^\s?#]*)(?:\?([^\s#]*))?(?:#(\S*))?
       //
       // Validate the authority with an orthogonal RegExp, so the RegExp above won’t fail to match any valid urls.
       function uriRegExp( flags, schemes/* = null*/, noSubMatches/* = false*/ )
       {
          if( !schemes )
             schemes = '[^\\s:\/?#]+'
          else if( !RegExp( /^[^\s|:\/?#]+(?:\|[^\s|:\/?#]+)*$/ ).test( schemes ) )
             throw TypeError( 'expected URI schemes' )
          return noSubMatches ? new RegExp( '(?:www\\.[^\\s/?#]+\\.[^\\s/?#]+|' + schemes + '://[^\\s/?#]*)[^\\s?#]*(?:\\?[^\\s#]*)?(?:#\\S*)?', flags ) :
             new RegExp( '(?:()(www\\.[^\\s/?#]+\\.[^\\s/?#]+)|(' + schemes + ')://([^\\s/?#]*))([^\\s?#]*)(?:\\?([^\\s#]*))?(?:#(\\S*))?', flags )
       }
    
       // http://en.wikipedia.org/wiki/URI_scheme#Official_IANA-registered_schemes
       function uriSchemesRegExp()
       {
          return 'about|callto|ftp|gtalk|http|https|irc|ircs|javascript|mailto|mshelp|sftp|ssh|steam|tel|view-source|ymsgr'
       }
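
    As a rough illustration of the generic five-group expression quoted in the comments above, here is a sketch that applies it to the URL from the question. The helper name splitUri is hypothetical, and the leading ^ is an addition made for this example.

```javascript
// Generic expression from the comment above, written as a literal.
// The leading ^ is an addition made for this sketch.
const genericUri = /^(?:([^:\/?#]+):)?(?:\/\/([^\/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?/;

// Hypothetical helper that names the five components ($1..$5).
// Every part of the pattern is optional, so exec() always matches.
function splitUri(uri) {
    const m = genericUri.exec(uri);
    return {
        scheme: m[1],
        authority: m[2],
        path: m[3],
        query: m[4],
        fragment: m[5]
    };
}

const pieces = splitUri("http://test.example.com/dir/subdir/file.html");
console.log(pieces.scheme, pieces.authority, pieces.path);
```

    Components that are absent from the input, such as the query and fragment here, come back as undefined rather than as empty strings.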
    
  • 2020-11-22 02:45

    I know you're claiming language-agnostic on this, but can you tell us what you're using just so we know what regex capabilities you have?

    If you have the capabilities for non-capturing matches, you can modify hometoast's expression so that subexpressions that you aren't interested in capturing are set up like this:

    (?:SOMESTUFF)

    You'd still have to copy and paste (and slightly modify) the Regex into multiple places, but this makes sense--you're not just checking to see if the subexpression exists, but rather if it exists as part of a URL. Using the non-capturing modifier for subexpressions can give you what you need and nothing more, which, if I'm reading you correctly, is what you want.

    Just as a small, small note, hometoast's expression doesn't need to put brackets around the 's' for 'https', since he only has one character in there. Quantifiers quantify the one character (or character class or subexpression) directly preceding them. So:

    https?

    would match 'http' or 'https' just fine.
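
    To make both points concrete, here is a small sketch in JavaScript (chosen only because other answers here use it). The pattern hostOnly is a simplified, hypothetical expression written for this example, not hometoast's; it uses a non-capturing group for the scheme so that the host is the only captured group.

```javascript
// The ? quantifies only the "s" directly before it.
const schemeOnly = /^https?$/;
console.log(schemeOnly.test("http"), schemeOnly.test("https"));

// (?:...) groups without capturing, so the host is group 1
// and no slot is spent on the scheme.
const hostOnly = /^(?:https?):\/\/([^\/:?#]+)/;
const hostMatch = hostOnly.exec("http://test.example.com/dir/subdir/file.html");
console.log(hostMatch[1]);
```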

  • 2020-11-22 02:47

    I realize I'm late to the party, but there is a simple way to let the browser parse a URL for you without a regex:

    var a = document.createElement('a');
    a.href = 'http://www.example.com:123/foo/bar.html?fox=trot#foo';
    
    ['href','protocol','host','hostname','port','pathname','search','hash'].forEach(function(k) {
        console.log(k+':', a[k]);
    });
    
    /*//Output:
    href: http://www.example.com:123/foo/bar.html?fox=trot#foo
    protocol: http:
    host: www.example.com:123
    hostname: www.example.com
    port: 123
    pathname: /foo/bar.html
    search: ?fox=trot
    hash: #foo
    */
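
    Outside a page there is no document to create an anchor on. As a hedged alternative, and assuming the runtime provides the standard URL constructor (an assumption about the environment, not part of the approach above), the same properties can be read like this:

```javascript
// Assumes a runtime that provides the standard URL constructor.
const parsed = new URL('http://www.example.com:123/foo/bar.html?fox=trot#foo');

// Same property names as in the anchor-element example above.
['href', 'protocol', 'host', 'hostname', 'port', 'pathname', 'search', 'hash'].forEach(function (k) {
    console.log(k + ':', parsed[k]);
});
```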
    
  • 2020-11-22 02:48

    I tried a few of these, but they didn't cover my needs. In particular, the highest-voted answer didn't catch a URL without a path (http://example.com/).

    The lack of group names also made it unusable in Ansible (or perhaps my Jinja2 skills are lacking).

    So this is my version, slightly modified from the highest-voted answer here:

    ^((?P<protocol>http[s]?|ftp):\/)?\/?(?P<host>[^:\/\s]+)(?P<path>((\/\w+)*\/)([\w\-\.]+[^#?\s]+))*(.*)?(#[\w\-]+)?$
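
    The (?P<name>...) syntax above is the Python-style spelling of named groups. For readers following along in JavaScript, here is a sketch of the same pattern using the (?<name>...) spelling, assuming an engine with named-group support; the variable names are hypothetical.

```javascript
// Same pattern as above, with (?P<name>...) rewritten as (?<name>...).
const named = /^((?<protocol>http[s]?|ftp):\/)?\/?(?<host>[^:\/\s]+)(?<path>((\/\w+)*\/)([\w\-\.]+[^#?\s]+))*(.*)?(#[\w\-]+)?$/;

const found = named.exec('http://test.example.com/dir/subdir/file.html');
console.log(found.groups.protocol, found.groups.host, found.groups.path);

// A URL with only a trailing slash still matches and yields a host.
const bare = named.exec('http://example.com/');
console.log(bare.groups.host);
```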
    
  • 2020-11-22 02:49

    Java offers a URL class that will do this; see "Query URL Objects".

    On a side note, PHP offers parse_url().
