Getting parts of a URL (Regex)

说谎 2020-11-22 02:13

Given the URL (single line):
http://test.example.com/dir/subdir/file.html

How can I extract the following parts using regular expressions:

  1. The Subd
26 answers
  • 2020-11-22 03:00

    I found that the highest-voted answer (hometoast's) doesn't work perfectly for me. There are two problems:

    1. It cannot handle a port number.
    2. The hash part is broken.

    The following is a modified version:

    ^((http[s]?|ftp):\/)?\/?([^:\/\s]+)(:([^\/]*))?((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(\?([^#]*))?(#(.*))?$
    

    The capture-group positions of the parts are as follows:

    int SCHEMA = 2, DOMAIN = 3, PORT = 5, PATH = 6, FILE = 8, QUERYSTRING = 9, HASH = 12
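
    As a rough illustration of how those group positions can be used, here is a minimal JavaScript sketch. The helper name splitUrl and the sample URL (with a made-up port, query, and hash) are assumptions for the example and are not part of the original answer; the pattern and the indices are copied from above.

```javascript
// Group indices taken from the answer's list of positions.
const GROUPS = { SCHEMA: 2, DOMAIN: 3, PORT: 5, PATH: 6, FILE: 8, QUERYSTRING: 9, HASH: 12 };

// The modified pattern from this answer, written as a JavaScript literal.
const URL_PATTERN = /^((http[s]?|ftp):\/)?\/?([^:\/\s]+)(:([^\/]*))?((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(\?([^#]*))?(#(.*))?$/;

// splitUrl is a hypothetical helper name; it returns null when nothing matches.
function splitUrl(url) {
  const match = URL_PATTERN.exec(url);
  if (match === null) return null;
  const parts = {};
  for (const [name, index] of Object.entries(GROUPS)) {
    // The value is undefined when an optional group did not take part in the match.
    parts[name] = match[index];
  }
  return parts;
}

console.log(splitUrl("http://test.example.com:8080/dir/subdir/file.html?a=1#top"));
```

    Note that group 9 wraps the literal ?, so QUERYSTRING keeps its leading question mark (group 10 holds the query without it), while group 12 gives the hash without the #.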
    

    Edit posted by anon user:

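    function getFileName(path) {
        // Group 8 is the file name; return null when the URL does not match.
        var match = path.match(/^((http[s]?|ftp):\/)?\/?([^:\/\s]+)(:([^\/]*))?((\/[\w\/-]+)*\/)([\w\-\.]+[^#?\s]+)(\?([^#]*))?(#(.*))?$/i);
        return match ? match[8] : null;
    }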
    
  • 2020-11-22 03:00
    String s = "https://www.thomas-bayer.com/axis2/services/BLZService?wsdl";
    
    String regex = "(^http.?://)(.*?)([/\\?]{1,})(.*)";
    
    System.out.println("1: " + s.replaceAll(regex, "$1"));
    System.out.println("2: " + s.replaceAll(regex, "$2"));
    System.out.println("3: " + s.replaceAll(regex, "$3"));
    System.out.println("4: " + s.replaceAll(regex, "$4"));
    

    This produces the following output:
    1: https://
    2: www.thomas-bayer.com
    3: /
    4: axis2/services/BLZService?wsdl

    If you change the URL to
    String s = "https://www.thomas-bayer.com?wsdl=qwerwer&ttt=888";
    the output will be the following:
    1: https://
    2: www.thomas-bayer.com
    3: ?
    4: wsdl=qwerwer&ttt=888

    enjoy..
    Yosi Lev

  • 2020-11-22 03:01

    The regex to do full parsing is quite horrendous. I've included named capture groups for legibility, and broken each part onto a separate line, but it still looks like this:

    ^(?:(?P<protocol>\w+(?=:\/\/))(?::\/\/))?
    (?:(?P<host>(?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^\/?#:]+)(?::(?P<port>[0-9]+))?)\/)?
    (?:(?P<path>(?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^?#])+)\/)?
    (?P<file>(?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^?#])+)
    (?:\?(?P<querystring>(?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^#])+))?
    (?:#(?P<fragment>.*))?$
    

    It has to be this verbose because every part except the protocol and the port can contain HTML entities, which makes delimiting the fragment quite tricky: a numeric entity such as &#38; contains a # that must not be read as the start of the fragment. So for the host, path, file, and querystring, we allow either an HTML entity or any character that isn't one of that part's delimiters (for the path and file, a ? or #). The regex for an HTML entity looks like this:

    $htmlentity = "&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);"
    

    When that is extracted (I used a mustache syntax to represent it), it becomes a bit more legible:

    ^(?:(?P<protocol>(?:ht|f)tps?|\w+(?=:\/\/))(?::\/\/))?
    (?:(?P<host>(?:{{htmlentity}}|[^\/?#:])+(?::(?P<port>[0-9]+))?)\/)?
    (?:(?P<path>(?:{{htmlentity}}|[^?#])+)\/)?
    (?P<file>(?:{{htmlentity}}|[^?#])+)
    (?:\?(?P<querystring>(?:{{htmlentity}}|[^#])+))?
    (?:#(?P<fragment>.*))?$
    

    JavaScript engines older than ES2018 have no named capture groups, and JavaScript never accepts the (?P<name>...) spelling used above, so a version with plain numbered groups is:

    ^(?:(\w+(?=:\/\/))(?::\/\/))?(?:((?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^\/?#:]+)(?::([0-9]+))?)\/)?(?:((?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^?#])+)\/)?((?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^?#])+)(?:\?((?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^#])+))?(?:#(.*))?$
    

    and in each match, the protocol is \1, the host is \2, the port is \3, the path \4, the file \5, the querystring \6, and the fragment \7.
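
    On an ES2018 or later engine, the templated form can be assembled at runtime instead of being pasted as one long line. The following is only a sketch under that assumption: it uses the JavaScript (?<name>...) group syntax in place of (?P<name>...), substitutes a single {{htmlentity}} placeholder (with no trailing semicolon, since the entity fragment already ends in one), and the names HTML_ENTITY, TEMPLATE and parseUrl are made up for the example.

```javascript
// The entity fragment from the answer, kept verbatim via String.raw.
const HTML_ENTITY = String.raw`&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);`;

// The templated pattern, one part per line, with JavaScript-style named groups.
const TEMPLATE = [
  String.raw`^(?:(?<protocol>(?:ht|f)tps?|\w+(?=:\/\/))(?::\/\/))?`,
  String.raw`(?:(?<host>(?:{{htmlentity}}|[^\/?#:])+(?::(?<port>[0-9]+))?)\/)?`,
  String.raw`(?:(?<path>(?:{{htmlentity}}|[^?#])+)\/)?`,
  String.raw`(?<file>(?:{{htmlentity}}|[^?#])+)`,
  String.raw`(?:\?(?<querystring>(?:{{htmlentity}}|[^#])+))?`,
  String.raw`(?:#(?<fragment>.*))?$`,
].join("");

// split/join performs a literal substitution of every placeholder.
const URL_REGEX = new RegExp(TEMPLATE.split("{{htmlentity}}").join(HTML_ENTITY));

// parseUrl is a hypothetical helper; it returns null when the input does not match.
function parseUrl(url) {
  const match = URL_REGEX.exec(url);
  return match === null ? null : { ...match.groups };
}

console.log(parseUrl("http://test.example.com/dir/subdir/file.html"));
```

    As in the original pattern, the port group sits inside the host group, so when a port is present the host value includes the ":port" suffix.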

  • 2020-11-22 03:02

    I'm a few years late to the party, but I'm surprised no one has mentioned that the Uniform Resource Identifier specification (RFC 3986, Appendix B) has a section on parsing URIs with a regular expression. The regular expression, written by Berners-Lee, et al., is:

    ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
     12            3  4          5       6  7        8 9
    

    The numbers in the second line above are only to assist readability; they indicate the reference points for each subexpression (i.e., each paired parenthesis). We refer to the value matched for subexpression <n> as $<n>. For example, matching the above expression to

    http://www.ics.uci.edu/pub/ietf/uri/#Related

    results in the following subexpression matches:

    $1 = http:
    $2 = http
    $3 = //www.ics.uci.edu
    $4 = www.ics.uci.edu
    $5 = /pub/ietf/uri/
    $6 = <undefined>
    $7 = <undefined>
    $8 = #Related
    $9 = Related
    

    For what it's worth, I found that I had to escape the forward slashes in JavaScript:

    ^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
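
    To show how the escaped version might be used, here is a small sketch that maps the numbered groups to the component names the specification assigns them (scheme = $2, authority = $4, path = $5, query = $7, fragment = $9). The function name parseUri is an assumption made for the example.

```javascript
// RFC 3986 Appendix B pattern with the slashes escaped for a JavaScript literal.
const URI_PATTERN = /^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?/;

// parseUri is a hypothetical helper name, not part of the specification.
function parseUri(uri) {
  // Every part of the pattern is optional, so exec always finds a match
  // for a string input and m is never null here.
  const m = URI_PATTERN.exec(uri);
  return {
    scheme: m[2],     // undefined when the URI has no scheme
    authority: m[4],  // undefined when there is no "//" part
    path: m[5],       // always a string, possibly empty
    query: m[7],      // undefined when there is no "?"
    fragment: m[9],   // undefined when there is no "#"
  };
}

console.log(parseUri("http://www.ics.uci.edu/pub/ietf/uri/#Related"));
```

    Because the pattern accepts any string, it splits a URI into components but does not validate it.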

  • 2020-11-22 03:03

    Tested with http://www.fileformat.info/tool/regex.htm, hometoast's regex works great.

    But here is the deal: I want to use different regex patterns in different situations in my program.

    For example, I have this URL, and I have an enumeration that lists all the URLs my program supports. Each object in the enumeration has a method getRegexPattern that returns the regex pattern to be matched against a URL. If a particular pattern matches, then I know that the URL is supported by my program. So each enumeration constant has its own regex, depending on where it should look inside the URL.

    Hometoast's suggestion is great, but in my case, I think it wouldn't help (unless I copy paste the same regex in all enumerations).

    That is why I wanted the answer to give the regex for each situation separately. Although +1 for hometoast. ;)
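
    The setup described above could be sketched roughly as follows. This is an illustration only: the post's code is not shown, so the entries HTML_PAGE and FTP_RESOURCE, their patterns, and the helper findSupportedKind are all hypothetical; only the method name getRegexPattern comes from the description.

```javascript
// Hypothetical catalogue of supported URL kinds. Each entry exposes
// getRegexPattern(), mirroring the method named in the post.
const SupportedUrl = Object.freeze({
  HTML_PAGE: {
    // Looks at the end of the path: an http(s) URL whose file is *.htm or *.html.
    getRegexPattern: () => /^https?:\/\/[^\/?#]+\/(?:[^?#]*\/)?[^\/?#]+\.html?(?:[?#].*)?$/i,
  },
  FTP_RESOURCE: {
    // Looks only at the scheme and host.
    getRegexPattern: () => /^ftp:\/\/[^\/?#]+\//i,
  },
});

// Returns the name of the first entry whose pattern matches, or null if none does.
function findSupportedKind(url) {
  for (const [name, kind] of Object.entries(SupportedUrl)) {
    if (kind.getRegexPattern().test(url)) return name;
  }
  return null;
}

console.log(findSupportedKind("http://test.example.com/dir/subdir/file.html"));
```

    Each entry carries only the pattern for the part of the URL it cares about, which is why one combined regex would have to be duplicated across entries.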

  • 2020-11-22 03:05

    Try the following:

    ^((ht|f)tp(s?)\:\/\/|~/|/)?([\w]+:\w+@)?([a-zA-Z]{1}([\w\-]+\.)+([\w]{2,5}))(:[\d]{1,5})?((/?\w+/)+|/?)(\w+\.[\w]{3,4})?((\?\w+=\w+)?(&\w+=\w+)*)?
    

    It supports HTTP and FTP, subdomains, folders, files, etc.

    I found it through a quick Google search:

    http://geekswithblogs.net/casualjim/archive/2005/12/01/61722.aspx
