Question
I have the following data:
data
http://hsotname.com/2016/08/a-b-n-r-y-u
https://www.hostname.com/best-food-for-humans
http://www.hostname.com/wp-content/uploads/2014/07/a-w-w-2.jpg
http://www.hostname.com/a/geniusbar/
http://www.hsotname.com/m/
http://www.hsotname.com/
I want to skip the leading http:// or https://, find the last '/', and extract whatever follows it. The challenge is that a few of the URLs also end with '/'. The output I want is:
parsed
a-b-n-r-y-u
best-food-for-humans
a-w-w-2.jpg
NULL
NULL
NULL
Can anybody help me find the last / and extract the remaining part of the URL? I am new to regex, and any help would be appreciated.
Thanks
Answer 1:
Another option is to simply split on "/" and take the last element:
"http://hsotname.com/2016/08/a-b-n-r-y-u".split("/")[-1]
# 'a-b-n-r-y-u'
"http://www.hostname.com/a/geniusbar/".split("/")[-1]
# ''
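As a rough sketch of how this could be applied to the whole list from the question, the helper below (the name last_segment is made up for illustration) maps the empty string to None, on the assumption that None is an acceptable stand-in for the NULL in the desired output:

```python
def last_segment(url):
    """Return the text after the final '/', or None when nothing follows it."""
    tail = url.split("/")[-1]
    # An empty tail means the URL ended with '/'; report that as None.
    return tail or None


urls = [
    "http://hsotname.com/2016/08/a-b-n-r-y-u",
    "https://www.hostname.com/best-food-for-humans",
    "http://www.hostname.com/wp-content/uploads/2014/07/a-w-w-2.jpg",
    "http://www.hostname.com/a/geniusbar/",
    "http://www.hsotname.com/m/",
    "http://www.hsotname.com/",
]

parsed = [last_segment(url) for url in urls]
print(parsed)
```

Note that a URL with no path and no trailing slash, such as http://hostname.com, would come back as hostname.com with this approach.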
Answer 2:
Regexes are probably not the right tool here. Since you tagged the question python, try this (assuming the URL is in a variable named url):
last_part = url.split('/')[-1]
This splits the URL into a list of the substrings between slashes and stores the last one in last_part.
If you insist on using regexes, though, matching on the end of the string is helpful here. Try /[^/]*$ which matches a slash, followed by any number of non-slashes, followed by the end of the string.
If you wanted to match the last non-empty part following a slash (so that the last three examples do not return ""), you could use /[^/]*/?$ which allows but does not require a single slash at the very end.
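To see what the two patterns actually match, here is a small sketch using re.search (the names END_SEGMENT, END_SEGMENT_OPT_SLASH and match_text are illustrative, not from the answer). It returns the whole match, so the leading slash is included:

```python
import re

# Slash, then any run of non-slashes, then end of string.
END_SEGMENT = re.compile(r"/[^/]*$")
# Same, but tolerating one optional slash at the very end.
END_SEGMENT_OPT_SLASH = re.compile(r"/[^/]*/?$")


def match_text(pattern, url):
    """Return the full text matched by pattern, or None if there is no match."""
    m = pattern.search(url)
    return m.group(0) if m else None


print(match_text(END_SEGMENT, "http://hsotname.com/2016/08/a-b-n-r-y-u"))
print(match_text(END_SEGMENT, "http://www.hostname.com/a/geniusbar/"))
print(match_text(END_SEGMENT_OPT_SLASH, "http://www.hostname.com/a/geniusbar/"))
print(match_text(END_SEGMENT_OPT_SLASH, "http://www.hsotname.com/"))
```

One thing to watch for: on a URL that is only a host plus a trailing slash, the second pattern ends up matching the host itself (/www.hsotname.com/), because the search starts at the second slash of the http:// prefix.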
Answer 3:
I'd go with something like this:
\/([^/]*)$
It'll match the last slash, then grab anything after it (if anything) that isn't a slash.
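A minimal sketch of using that pattern from Python, where the capture group is what separates the wanted text from the slash (the names LAST_PART and captured_tail are assumptions for illustration):

```python
import re

LAST_PART = re.compile(r"\/([^/]*)$")


def captured_tail(url):
    """Return group 1 (the text after the last slash), or None if no slash exists."""
    m = LAST_PART.search(url)
    if m is None:
        return None
    # group(0) would include the slash; group(1) is only what follows it.
    return m.group(1)


print(captured_tail("https://www.hostname.com/best-food-for-humans"))
print(repr(captured_tail("http://www.hsotname.com/m/")))
```

For URLs that end in a slash the group is an empty string rather than a failed match, so a caller wanting NULL would still need to convert "" itself.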
Answer 4:
Regex isn't the best tool in this case. Just use str.rfind:
[url[url.rfind('/') + 1:] for url in data]
This gives you everything after the last slash (an empty string when the URL ends with one).
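A short sketch of the str.rfind approach, written so that the slice starts one character past the last slash and the slash itself is left out. The function name after_last_slash and the no-slash input are illustrative assumptions:

```python
def after_last_slash(url):
    """Return everything after the last '/' in url."""
    # str.rfind returns -1 when there is no '/', so the slice then starts
    # at index 0 and the whole string comes back unchanged.
    return url[url.rfind("/") + 1:]


data = [
    "http://hsotname.com/2016/08/a-b-n-r-y-u",
    "http://www.hostname.com/wp-content/uploads/2014/07/a-w-w-2.jpg",
    "http://www.hsotname.com/m/",
]

print([after_last_slash(url) for url in data])
print(after_last_slash("no-slash"))
```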
Answer 5:
Possibly overkill for the example, but you may need to deal with URLs that consist of just a location name, where the last forward slash is part of the http:// prefix. Splitting http://hostname.com and taking what follows the last / will give you hostname.com, whereas urlsplit will give a path of ''. In that case you're probably safer using:
>>> from urllib.parse import urlsplit
>>> urls = ['http://hsotname.com/2016/08/a-b-n-r-y-u', 'https://www.hostname.com/best-food-for-humans', 'http://www.hostname.com/wp-content/uploads/2014/07/a-w-w-2.jpg', 'http://www.hostname.com/a/geniusbar/', 'http://www.hsotname.com/m/', 'http://www.hsotname.com/']
>>> [urlsplit(url).path.rpartition('/')[2] for url in urls]
['a-b-n-r-y-u', 'best-food-for-humans', 'a-w-w-2.jpg', '', '', '']
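To make the difference concrete, the sketch below compares a plain split with the urlsplit version on a bare host, and on a URL with a query string. The helper name path_tail and the query-string URL are hypothetical examples, not part of the question's data:

```python
from urllib.parse import urlsplit


def path_tail(url):
    """Return the last segment of the URL's path component only."""
    return urlsplit(url).path.rpartition("/")[2]


bare = "http://hostname.com"
with_query = "http://hostname.com/a/b?x=1/2"  # hypothetical URL for illustration

# Plain splitting treats the scheme separator and the query like path text.
print(repr(bare.split("/")[-1]), repr(path_tail(bare)))
print(repr(with_query.split("/")[-1]), repr(path_tail(with_query)))
```

Because urlsplit separates the scheme, host, path and query first, slashes outside the path no longer affect the result.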
Answer 6:
Match from the end of the URL, taking everything except /:
[^/]+?$
or
\b[^/]+?\b$
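Since [^/]+ needs at least one non-slash character before the end of the string, a URL ending in / produces no match at all, which lines up with the NULL rows in the desired output. A minimal sketch with re.search (the names TAIL and tail_or_none are illustrative):

```python
import re

TAIL = re.compile(r"[^/]+?$")


def tail_or_none(url):
    """Return the trailing run of non-slash characters, or None if the URL ends in '/'."""
    m = TAIL.search(url)
    return m.group(0) if m else None


print(tail_or_none("http://www.hostname.com/wp-content/uploads/2014/07/a-w-w-2.jpg"))
print(tail_or_none("http://www.hsotname.com/m/"))
```

As with plain splitting, a bare host without a trailing slash (for example http://hostname.com) would still match and return the host name.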
Source: https://stackoverflow.com/questions/39233526/regex-to-parse-out-a-part-of-url