Regex to parse out a part of URL

Question

I have the following data:

data
http://hsotname.com/2016/08/a-b-n-r-y-u
https://www.hostname.com/best-food-for-humans
http://www.hostname.com/wp-content/uploads/2014/07/a-w-w-2.jpg
http://www.hostname.com/a/geniusbar/
http://www.hsotname.com/m/
http://www.hsotname.com/

I want to skip the leading http:// or https://, find the last '/', and extract whatever follows it. The challenge is that a few URLs also end with a '/'. The output I want is:

parsed
a-b-n-r-y-u
best-food-for-humans
a-w-w-2.jpg
NULL
NULL 
NULL

Can anybody help me to find the last / and parse out the remaining part of the URL? I am new to regex and any help would be appreciated.

Thanks

Answer 1:

Another option is to simply split on "/" and take the last element:

"http://hsotname.com/2016/08/a-b-n-r-y-u".split("/")[-1]
# 'a-b-n-r-y-u'

"http://www.hostname.com/a/geniusbar/".split("/")[-1]
# ''
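The empty string in the second example lines up with the NULL rows in the question. A minimal sketch of turning it into None, assuming None is an acceptable stand-in for NULL (the helper name last_segment is made up for illustration):

```python
def last_segment(url):
    """Return the text after the last '/', or None if nothing follows it."""
    # str.split always returns at least one element, so [-1] never fails.
    tail = url.split("/")[-1]
    # A trailing slash leaves an empty last element; report that as None.
    return tail or None


data = [
    "http://hsotname.com/2016/08/a-b-n-r-y-u",
    "https://www.hostname.com/best-food-for-humans",
    "http://www.hostname.com/wp-content/uploads/2014/07/a-w-w-2.jpg",
    "http://www.hostname.com/a/geniusbar/",
    "http://www.hsotname.com/m/",
    "http://www.hsotname.com/",
]

parsed = [last_segment(url) for url in data]
# ['a-b-n-r-y-u', 'best-food-for-humans', 'a-w-w-2.jpg', None, None, None]
```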

Answer 2:

Regexes are probably not the right tool here. Since you tagged the question python, try this (assuming the URL is stored in a variable named url):

last_part = url.split('/')[-1]

This splits the URL into a list of the substrings between slashes and stores the last one in last_part.

If you insist on using regexes, though, matching on the end of the string is helpful here. Try /[^/]*$, which matches a slash, followed by any number of non-slashes, followed by the end of the string.

If you want to match the last non-empty part following a slash (that is, if you don't want the last three examples to return ""), you could use /[^/]*/?$, which allows but does not require a single slash at the very end. Note that for a URL with no path, such as http://www.hsotname.com/, this matches the hostname.
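As a rough sketch of how the trailing-slash-tolerant pattern behaves in Python: the capture group is my addition so that the segment can be read without its surrounding slashes, and the function name last_nonempty_segment is hypothetical.

```python
import re

# The pattern /[^/]*/?$ with a capture group added around the segment.
LAST_NONEMPTY = re.compile(r"/([^/]*)/?$")


def last_nonempty_segment(url):
    """Return the last segment, ignoring one optional trailing slash."""
    match = LAST_NONEMPTY.search(url)
    return match.group(1) if match else None


print(last_nonempty_segment("http://hsotname.com/2016/08/a-b-n-r-y-u"))  # a-b-n-r-y-u
print(last_nonempty_segment("http://www.hostname.com/a/geniusbar/"))     # geniusbar
print(last_nonempty_segment("http://www.hsotname.com/m/"))               # m
# With no path at all, the "last segment" is the hostname itself.
print(last_nonempty_segment("http://www.hsotname.com/"))                 # www.hsotname.com
```

This is different from the output requested in the question, where URLs ending in a slash should yield NULL.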

Answer 3:

I'd go with something like this:

\/([^/]*)$

It'll match the last slash, then grab anything after it (if anything) that isn't a slash.
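A minimal sketch of using this pattern from Python, assuming an empty capture should be reported as None to mirror the NULL rows in the question. The slash needs no backslash in a Python regex, so it is written unescaped here; the name after_last_slash is made up for illustration.

```python
import re

# Same pattern as above; "/" has no special meaning in Python's re module.
LAST_SLASH = re.compile(r"/([^/]*)$")


def after_last_slash(url):
    """Return what follows the last '/', or None if that is empty."""
    match = LAST_SLASH.search(url)
    # No slash at all, or nothing after the last slash: both become None.
    if match is None or match.group(1) == "":
        return None
    return match.group(1)


data = [
    "http://hsotname.com/2016/08/a-b-n-r-y-u",
    "https://www.hostname.com/best-food-for-humans",
    "http://www.hostname.com/wp-content/uploads/2014/07/a-w-w-2.jpg",
    "http://www.hostname.com/a/geniusbar/",
    "http://www.hsotname.com/m/",
    "http://www.hsotname.com/",
]

results = [after_last_slash(url) for url in data]
# ['a-b-n-r-y-u', 'best-food-for-humans', 'a-w-w-2.jpg', None, None, None]
```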

Answer 4:

Regex isn't the best tool in this case. Just use str.rfind:

[url[url.rfind('/') + 1:] for url in data]

The + 1 skips the slash itself. This gives you what you're looking for, with an empty string for the URLs that end in a slash.

Answer 5:

Possibly overkill for the example, but if you need to deal with URLs that are just a location name, where the last forward slash is part of the http:// prefix, you're probably safer using urlsplit. Splitting http://hostname.com on '/' and taking the last element gives you hostname.com, whereas urlsplit gives a path of '' instead:

>>> from urllib.parse import urlsplit
>>> urls = ['http://hsotname.com/2016/08/a-b-n-r-y-u', 'https://www.hostname.com/best-food-for-humans', 'http://www.hostname.com/wp-content/uploads/2014/07/a-w-w-2.jpg', 'http://www.hostname.com/a/geniusbar/', 'http://www.hsotname.com/m/', 'http://www.hsotname.com/']
>>> [urlsplit(url).path.rpartition('/')[2] for url in urls]
['a-b-n-r-y-u', 'best-food-for-humans', 'a-w-w-2.jpg', '', '', '']
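To make the difference concrete, here is a small sketch comparing the two approaches on a URL with no path. It assumes None is wanted in place of the empty string; the helper name last_path_segment is hypothetical.

```python
from urllib.parse import urlsplit


def last_path_segment(url):
    """Return the last segment of the URL's path, or None if it is empty."""
    # urlsplit separates scheme and host from the path, so the slashes
    # in "http://" never take part in the split.
    path = urlsplit(url).path
    # rpartition returns (head, separator, tail); the tail is index 2.
    return path.rpartition("/")[2] or None


bare = "http://hostname.com"

print(bare.split("/")[-1])      # hostname.com (the host leaks through)
print(last_path_segment(bare))  # None
print(last_path_segment("http://www.hostname.com/wp-content/uploads/2014/07/a-w-w-2.jpg"))  # a-w-w-2.jpg
```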

Answer 6:

Anchor the match at the end of the URL and match everything except /:

[^/]+?$

\b[^/]+?\b$
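A short sketch of both patterns with re.search, run against the question's sample data. Because each pattern requires at least one non-slash character right before the end of the string, a URL ending in a slash produces no match at all, which is mapped to None here. The helper name tail is made up for illustration.

```python
import re

PATTERNS = [r"[^/]+?$", r"\b[^/]+?\b$"]

data = [
    "http://hsotname.com/2016/08/a-b-n-r-y-u",
    "https://www.hostname.com/best-food-for-humans",
    "http://www.hostname.com/wp-content/uploads/2014/07/a-w-w-2.jpg",
    "http://www.hostname.com/a/geniusbar/",
    "http://www.hsotname.com/m/",
    "http://www.hsotname.com/",
]


def tail(pattern, url):
    """Return the matched text, or None when the pattern does not match."""
    match = re.search(pattern, url)
    return match.group(0) if match else None


for pattern in PATTERNS:
    print(pattern, [tail(pattern, url) for url in data])
```

One caveat: a string with no slash at all matches in full, so these patterns assume the input really is a URL.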

Source: https://stackoverflow.com/questions/39233526/regex-to-parse-out-a-part-of-url

Tags

python

regex

regex-negation