Getting parts of a URL (Regex)

后端未结

关注

 26  2225

说谎

Given the URL (single line):
http://test.example.com/dir/subdir/file.html

How can I extract the following parts using regular expressions:

The Subd

相关标签:

26条回答

萌比男神i

2020-11-22 03:06
subdomain and domain are difficult because the subdomain can have several parts, as can the top level domain, http://sub1.sub2.domain.co.uk/
```
 the path without the file : http://[^/]+/((?:[^/]+/)*(?:[^/]+$)?)  
 the file : http://[^/]+/(?:[^/]+/)*((?:[^/.]+\.)+[^/.]+)$  
 the path with the file : http://[^/]+/(.*)  
 the URL without the path : (http://[^/]+/)  
```
(Markdown isn't very friendly to regexes)
0 讨论(0)
发布评论:

提交评论
- 加载中...

独厮守ぢ

2020-11-22 03:09

Propose a much more readable solution (in Python, but applies to any regex):

def url_path_to_dict(path):
    pattern = (r'^'
               r'((?P<schema>.+?)://)?'
               r'((?P<user>.+?)(:(?P<password>.*?))?@)?'
               r'(?P<host>.*?)'
               r'(:(?P<port>\d+?))?'
               r'(?P<path>/.*?)?'
               r'(?P<query>[?].*?)?'
               r'$'
               )
    regex = re.compile(pattern)
    m = regex.match(path)
    d = m.groupdict() if m is not None else None

    return d

def main():
    print url_path_to_dict('http://example.example.com/example/example/example.html')

Prints:

{
'host': 'example.example.com', 
'user': None, 
'path': '/example/example/example.html', 
'query': None, 
'password': None, 
'port': None, 
'schema': 'http'
}

0 讨论(0)

执念已碎

2020-11-22 03:10
```
/^((?P<scheme>https?|ftp):\/)?\/?((?P<username>.*?)(:(?P<password>.*?)|)@)?(?P<hostname>[^:\/\s]+)(?P<port>:([^\/]*))?(?P<path>(\/\w+)*\/)(?P<filename>[-\w.]+[^#?\s]*)?(?P<query>\?([^#]*))?(?P<fragment>#(.*))?$/
```
From my answer on a similar question. Works better than some of the others mentioned because they had some bugs (such as not supporting username/password, not supporting single-character filenames, fragment identifiers being broken).
0 讨论(0)
发布评论:

提交评论
- 加载中...
自闭症患者

2020-11-22 03:11
A single regex to parse and breakup a full URL including query parameters and anchors e.g.

https://www.google.com/dir/1/2/search.html?arg=0-a&arg1=1-b&arg3-c#hash

^((http[s]?|ftp):\/)?\/?([^:\/\s]+)((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(.*)?(#[\w\-]+)?$

RexEx positions:

url: RegExp['$&'],

protocol:RegExp.$2,

host:RegExp.$3,

path:RegExp.$4,

file:RegExp.$6,

query:RegExp.$7,

hash:RegExp.$8

you could then further parse the host ('.' delimited) quite easily.

What I would do is use something like this:
```
/*
    ^(.*:)//([A-Za-z0-9\-\.]+)(:[0-9]+)?(.*)$
*/
proto $1
host $2
port $3
the-rest $4
```
the further parse 'the rest' to be as specific as possible. Doing it in one regex is, well, a bit crazy.
0 讨论(0)
发布评论:

提交评论
- 加载中...
后悔当初

2020-11-22 03:11

I like the regex that was published in "Javascript: The Good Parts". Its not too short and not too complex. This page on github also has the JavaScript code that uses it. But it an be adapted for any language. https://gist.github.com/voodooGQ/4057330

0 讨论(0)
发布评论:

提交评论
- 加载中...
予麋鹿

2020-11-22 03:12

I would recommend not using regex. An API call like WinHttpCrackUrl() is less error prone.

http://msdn.microsoft.com/en-us/library/aa384092%28VS.85%29.aspx

0 讨论(0)
发布评论:

提交评论
- 加载中...