Getting parts of a URL (Regex)

说谎

Given the URL (single line):
http://test.example.com/dir/subdir/file.html

How can I extract the following parts using regular expressions:

The Subd

26 answers

佛祖请我去吃肉

2020-11-22 03:00

I found that the highest-voted answer (hometoast's) doesn't work perfectly for me. There are two problems:

It cannot handle a port number.
The hash part is broken.

The following is a modified version:

^((http[s]?|ftp):\/)?\/?([^:\/\s]+)(:([^\/]*))?((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(\?([^#]*))?(#(.*))?$

The capture-group positions of the parts are as follows:

int SCHEMA = 2, DOMAIN = 3, PORT = 5, PATH = 6, FILE = 8, QUERYSTRING = 9, HASH = 12

Edit posted by anon user:

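function getFileName(path) {
    // match() returns null when the pattern does not match, so guard before indexing.
    var m = path.match(/^((http[s]?|ftp):\/)?\/?([^:\/\s]+)(:([^\/]*))?((\/[\w\/-]+)*\/)([\w\-\.]+[^#?\s]+)(\?([^#]*))?(#(.*))?$/i);
    return m ? m[8] : null;
}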
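
As a minimal sketch of how the group positions listed above can be used, the following applies the modified regex and reads each part by index. The names `URL_PARTS` and `getUrlParts` and the sample URL with port 8080 are assumptions made for illustration; only the pattern and the index constants come from the answer.

```javascript
// Pattern and group indices as given in the answer above.
const URL_PARTS = /^((http[s]?|ftp):\/)?\/?([^:\/\s]+)(:([^\/]*))?((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(\?([^#]*))?(#(.*))?$/;
const SCHEMA = 2, DOMAIN = 3, PORT = 5, PATH = 6, FILE = 8, QUERYSTRING = 9, HASH = 12;

// Hypothetical helper: returns null when the pattern does not match.
function getUrlParts(url) {
  const m = URL_PARTS.exec(url);
  if (m === null) return null;
  return {
    schema: m[SCHEMA],
    domain: m[DOMAIN],
    port: m[PORT],               // undefined when no port is present
    path: m[PATH],
    file: m[FILE],
    queryString: m[QUERYSTRING], // group 9 keeps the leading "?"
    hash: m[HASH],               // group 12 drops the leading "#"
  };
}

console.log(getUrlParts("http://test.example.com:8080/dir/subdir/file.html?a=1#top"));
```

Note that group 9 includes the leading `?`; group 10 holds the query string without it.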


一整个雨季

2020-11-22 03:00
```
String s = "https://www.thomas-bayer.com/axis2/services/BLZService?wsdl";

String regex = "(^http.?://)(.*?)([/\\?]{1,})(.*)";

System.out.println("1: " + s.replaceAll(regex, "$1"));
System.out.println("2: " + s.replaceAll(regex, "$2"));
System.out.println("3: " + s.replaceAll(regex, "$3"));
System.out.println("4: " + s.replaceAll(regex, "$4"));
```
This produces the following output:
1: https://
2: www.thomas-bayer.com
3: /
4: axis2/services/BLZService?wsdl

If you change the URL to
String s = "https://www.thomas-bayer.com?wsdl=qwerwer&ttt=888";
the output will be the following:
1: https://
2: www.thomas-bayer.com
3: ?
4: wsdl=qwerwer&ttt=888

enjoy..
Yosi Lev

北荒

2020-11-22 03:01

The regex to do full parsing is quite horrendous. I've included named groups for legibility and broken each part onto a separate line, but it still looks like this:

^(?:(?P<protocol>\w+(?=:\/\/))(?::\/\/))?
(?:(?P<host>(?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^\/?#:]+)(?::(?P<port>[0-9]+))?)\/)?
(?:(?P<path>(?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^?#])+)\/)?
(?P<file>(?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^?#])+)
(?:\?(?P<querystring>(?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^#])+))?
(?:#(?P<fragment>.*))?$

It has to be this verbose because, apart from the protocol and the port, any of the parts can contain HTML entities, which makes delineating the fragment quite tricky. So for the host, path, file, querystring, and fragment, we allow either any HTML entity or any character that isn't a ? or #. The regex for an HTML entity looks like this:

$htmlentity = "&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);"

When that is extracted (I used a mustache syntax to represent it), it becomes a bit more legible:

^(?:(?P<protocol>(?:ht|f)tps?|\w+(?=:\/\/))(?::\/\/))?
(?:(?P<host>(?:{{htmlentity}}|[^\/?#:])+(?::(?P<port>[0-9]+))?)\/)?
(?:(?P<path>(?:{{htmlentity}}|[^?#])+)\/)?
(?P<file>(?:{{htmlentity}}|[^?#])+)
(?:\?(?P<querystring>(?:{{htmlentity}}|[^#])+))?
(?:#(?P<fragment>.*))?$

JavaScript does not accept the (?P<name>...) syntax, and before ES2018 it had no named groups at all, so with numbered groups the regex becomes

^(?:(\w+(?=:\/\/))(?::\/\/))?(?:((?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^\/?#:]+)(?::([0-9]+))?)\/)?(?:((?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^?#])+)\/)?((?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^?#])+)(?:\?((?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^#])+))?(?:#(.*))?$

and in each match, the protocol is \1, the host is \2, the port is \3, the path \4, the file \5, the querystring \6, and the fragment \7.
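
As a rough sketch of the substitution idea above, assuming an engine with ES2018 named groups (written `(?<name>...)` rather than `(?P<name>...)`), the templated pattern can be assembled from the entity fragment instead of being pasted five times. The names `htmlEntity`, `urlPattern`, and `parseUrl` are made up for illustration, and `\#` is written as a plain `#`.

```javascript
// The HTML-entity fragment from the answer, kept as a raw string so backslashes survive.
const htmlEntity = String.raw`&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);`;

// One array entry per line of the templated regex; ${htmlEntity} replaces {{htmlentity}}.
const urlPattern = new RegExp([
  String.raw`^(?:(?<protocol>\w+(?=:\/\/))(?::\/\/))?`,
  String.raw`(?:(?<host>(?:${htmlEntity}|[^\/?#:])+(?::(?<port>[0-9]+))?)\/)?`,
  String.raw`(?:(?<path>(?:${htmlEntity}|[^?#])+)\/)?`,
  String.raw`(?<file>(?:${htmlEntity}|[^?#])+)`,
  String.raw`(?:\?(?<querystring>(?:${htmlEntity}|[^#])+))?`,
  String.raw`(?:#(?<fragment>.*))?$`,
].join(""));

// Hypothetical helper: returns the named parts, or null when nothing matches.
function parseUrl(url) {
  const m = urlPattern.exec(url);
  return m ? { ...m.groups } : null;
}

console.log(parseUrl("http://test.example.com/dir/subdir/file.html?q=1&amp;r=2#frag"));
```

Because the port group sits inside the host group in this pattern, `host` includes the `:port` suffix whenever a port is present.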


谎友^

2020-11-22 03:02
I'm a few years late to the party, but I'm surprised no one has mentioned that the Uniform Resource Identifier specification (RFC 3986, Appendix B) has a section on parsing URIs with a regular expression. The regular expression, written by Berners-Lee, et al., is:
```
^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
 12            3  4          5       6  7        8 9
```
The numbers in the second line above are only to assist readability; they indicate the reference points for each subexpression (i.e., each paired parenthesis). We refer to the value matched for subexpression <n> as $<n>. For example, matching the above expression to

http://www.ics.uci.edu/pub/ietf/uri/#Related

results in the following subexpression matches:
```
$1 = http:
$2 = http
$3 = //www.ics.uci.edu
$4 = www.ics.uci.edu
$5 = /pub/ietf/uri/
$6 = <undefined>
$7 = <undefined>
$8 = #Related
$9 = Related
```
For what it's worth, I found that I had to escape the forward slashes in JavaScript:

^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
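
A small sketch of the escaped pattern in use: the specification maps $2, $4, $5, $7, and $9 to scheme, authority, path, query, and fragment. The names `URI_PATTERN` and `splitUri` are assumptions made for this example.

```javascript
// The specification's regex with forward slashes escaped for a JavaScript literal.
const URI_PATTERN = /^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?/;

// Hypothetical helper: picks out the five components by group number.
function splitUri(uri) {
  const m = URI_PATTERN.exec(uri);
  if (m === null) return null; // defensive; every part of the pattern is optional
  return {
    scheme: m[2],
    authority: m[4],
    path: m[5],
    query: m[7],    // undefined when there is no "?"
    fragment: m[9], // undefined when there is no "#"
  };
}

console.log(splitUri("http://www.ics.uci.edu/pub/ietf/uri/#Related"));
```

Unmatched optional groups come back as `undefined` in JavaScript, which corresponds to the `<undefined>` entries in the listing above.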
夕颜

2020-11-22 03:03

Using http://www.fileformat.info/tool/regex.htm, hometoast's regex works great.

But here is the deal: I want to use different regex patterns in different situations in my program.

For example, I have this URL, and I have an enumeration that lists all supported URLs in my program. Each object in the enumeration has a method getRegexPattern that returns the regex pattern, which is then matched against a URL. If a particular pattern matches, then I know that this URL is supported by my program. So each enumeration constant has its own regex, depending on where it should look inside the URL.

Hometoast's suggestion is great, but in my case, I think it wouldn't help (unless I copy paste the same regex in all enumerations).

That is why I wanted the answer to give the regex for each situation separately. Although +1 for hometoast. ;)
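
To make the setup described above concrete, here is a minimal sketch, assuming a frozen object stands in for the enumeration. The entry names, their patterns, and `findSupported` are hypothetical; only the `getRegexPattern` method name comes from the description.

```javascript
// Hypothetical stand-in for the enumeration: each entry looks at a different part of the URL.
const SupportedUrl = Object.freeze({
  // Looks only at the host.
  EXAMPLE_HOST: { getRegexPattern: () => /^https?:\/\/test\.example\.com(?:[:\/]|$)/ },
  // Looks only at the file extension at the end of the path.
  HTML_FILE: { getRegexPattern: () => /\/[^\/?#]+\.html(?:[?#]|$)/ },
});

// Returns the name of the first entry whose pattern matches, or null if none does.
function findSupported(url) {
  for (const [name, entry] of Object.entries(SupportedUrl)) {
    if (entry.getRegexPattern().test(url)) return name;
  }
  return null;
}

console.log(findSupported("http://test.example.com/dir/subdir/file.html"));
console.log(findSupported("http://other.example.org/index.html"));
console.log(findSupported("ftp://other.example.org/readme.txt"));
```

Each entry carries only the fragment of pattern it needs, so the full URL regex does not have to be copied into every constant.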

南笙

2020-11-22 03:05
Try the following:
```
^((ht|f)tp(s?)\:\/\/|~/|/)?([\w]+:\w+@)?([a-zA-Z]{1}([\w\-]+\.)+([\w]{2,5}))(:[\d]{1,5})?((/?\w+/)+|/?)(\w+\.[\w]{3,4})?((\?\w+=\w+)?(&\w+=\w+)*)?
```
It supports HTTP / FTP, subdomains, folders, files etc.

I found it through a quick Google search:

http://geekswithblogs.net/casualjim/archive/2005/12/01/61722.aspx
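
The answer does not say which capture group holds which part, so the following is only a sketch based on counting the parentheses in the pattern above: group 1 is the scheme prefix, 5 the host, 8 the port, 9 the directory path, and 11 the file. The names `LINK_PATTERN` and `splitLink` are made up for this example, and the forward slashes are escaped for a JavaScript literal.

```javascript
// The pattern from the answer, with "/" escaped so it can be written as a regex literal.
const LINK_PATTERN = /^((ht|f)tp(s?)\:\/\/|~\/|\/)?([\w]+:\w+@)?([a-zA-Z]{1}([\w\-]+\.)+([\w]{2,5}))(:[\d]{1,5})?((\/?\w+\/)+|\/?)(\w+\.[\w]{3,4})?((\?\w+=\w+)?(&\w+=\w+)*)?/;

// Hypothetical helper; the group numbers are an assumption derived from the pattern.
function splitLink(url) {
  const m = LINK_PATTERN.exec(url);
  if (m === null) return null;
  return {
    scheme: m[1], // includes the trailing "://"
    host: m[5],
    port: m[8],   // includes the leading ":" when present
    path: m[9],
    file: m[11],
  };
}

console.log(splitLink("http://test.example.com/dir/subdir/file.html"));
```

The pattern has no `$` anchor, so it can succeed on a prefix of the input; check the length of `m[0]` if the whole string must match.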