I have url like:
sftp://user@host.net/some/random/path
I want to extract user, host and path from this string. Any part can be of random length.
I just needed to do the same and was curious whether it could be done in a single line. This is what I've got:
#!/bin/bash
parse_url() {
eval $(echo "$1" | sed -e "s#^\(\(.*\)://\)\?\(\([^:@]*\)\(:\(.*\)\)\?@\)\?\([^/?]*\)\(/\(.*\)\)\?#${PREFIX:-URL_}SCHEME='\2' ${PREFIX:-URL_}USER='\4' ${PREFIX:-URL_}PASSWORD='\6' ${PREFIX:-URL_}HOST='\7' ${PREFIX:-URL_}PATH='\9'#")
}
URL=${1:-"http://user:pass@example.com/path/somewhere"}
PREFIX="URL_" parse_url "$URL"
echo "$URL_SCHEME://$URL_USER:$URL_PASSWORD@$URL_HOST/$URL_PATH"
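How it works: sed rewrites the URL into a series of shell variable assignments, one per capture group, and eval then executes them. Note that \? is a GNU sed extension.
PS: be careful when using this on arbitrary input. Because the sed output is passed to eval, this code is vulnerable to script injection.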
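One possible way to avoid the injection problem is to skip eval entirely: match with bash's =~ operator and assign the results with printf -v, so the URL text is only ever stored, never executed. This is a sketch, not part of the answer above: the function name parse_url_safe and the regex are my own assumptions, and printf -v needs bash 3.1 or later.

```bash
#!/bin/bash
# Hypothetical eval-free variant of parse_url. Values are assigned with
# printf -v, so shell metacharacters in the URL are kept as plain data.
parse_url_safe() {
    local prefix=${PREFIX:-URL_}
    # groups: 2 scheme, 5 user, 7 password, 8 host, 10 path
    local re='^(([^:/?#]+)://)?((([^:@/]*)(:([^@/]*))?)@)?([^/?#]*)(/(.*))?$'
    [[ $1 =~ $re ]] || return 1
    printf -v "${prefix}SCHEME"   '%s' "${BASH_REMATCH[2]}"
    printf -v "${prefix}USER"     '%s' "${BASH_REMATCH[5]}"
    printf -v "${prefix}PASSWORD" '%s' "${BASH_REMATCH[7]}"
    printf -v "${prefix}HOST"     '%s' "${BASH_REMATCH[8]}"
    printf -v "${prefix}PATH"     '%s' "${BASH_REMATCH[10]}"
}

parse_url_safe 'http://user:pass@example.com/path/somewhere'
echo "$URL_SCHEME://$URL_USER:$URL_PASSWORD@$URL_HOST/$URL_PATH"
```

With this variant, a URL such as 'http://$(echo pwned)@host/x' simply ends up with the literal text $(echo pwned) in URL_USER.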
If you have access to Bash >= 3.0 you can do this in pure bash as well, thanks to the re-match operator =~:
pattern='^(([[:alnum:]]+)://)?(([[:alnum:]]+)@)?([^:^@]+)(:([[:digit:]]+))?$'
if [[ "http://us@cos.com:3142" =~ $pattern ]]; then
proto=${BASH_REMATCH[2]}
user=${BASH_REMATCH[4]}
host=${BASH_REMATCH[5]}
port=${BASH_REMATCH[7]}
fi
It should be faster and less resource-hungry than all the previous examples, because no external process is spawned.
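The pattern above stops at the port, so it does not capture the path that the question asks for. One way to extend it, as a rough sketch (the extra trailing group, the exclusion of / from the host class, and the path variable are my additions):

```bash
#!/bin/bash
# Sketch: same approach as above, plus an optional trailing (/path) group.
# The host class now also excludes "/" so the path is not swallowed by it.
pattern='^(([[:alnum:]]+)://)?(([[:alnum:]]+)@)?([^:@/]+)(:([[:digit:]]+))?(/.*)?$'
url='sftp://user@host.net/some/random/path'
if [[ $url =~ $pattern ]]; then
    proto=${BASH_REMATCH[2]}
    user=${BASH_REMATCH[4]}
    host=${BASH_REMATCH[5]}
    port=${BASH_REMATCH[7]}   # empty when the URL has no :port
    path=${BASH_REMATCH[8]}   # includes the leading slash
fi
echo "proto=$proto user=$user host=$host port=$port path=$path"
```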
I don't have enough reputation to comment, but I made a small modification to @patryk-obara's answer.
RFC3986 § 6.2.3. Scheme-Based Normalization treats
http://example.com
http://example.com/
as equivalent. But I found that his regex did not match a URL like http://example.com. http://example.com/ (with the trailing slash) does match.
I inserted group 11, which changed / to (/|$). This matches either / or the end of the string. Now http://example.com does match.
readonly URI_REGEX='^(([^:/?#]+):)?(//((([^:/?#]+)@)?([^:/?#]+)(:([0-9]+))?))?((/|$)([^?#]*))(\?([^#]*))?(#(.*))?$'
# Capture groups:
#    1 scheme:      2 scheme
#    3 //...        4 authority
#    5 userinfo@    6 userinfo
#    7 host         8 :...
#    9 port        10 path
#   11 / or end-of-string
#   12 rpath       13 ?...
#   14 query       15 #...
#   16 fragment
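A quick way to check the modified regex against both spellings of the URL, as a sketch (the loop and the sample URLs are mine, the regex is the one from this answer):

```bash
#!/bin/bash
URI_REGEX='^(([^:/?#]+):)?(//((([^:/?#]+)@)?([^:/?#]+)(:([0-9]+))?))?((/|$)([^?#]*))(\?([^#]*))?(#(.*))?$'

# Try the URL with and without the trailing slash, plus a fuller example.
for uri in 'http://example.com' 'http://example.com/' 'http://example.com:8080/a/b?x=1#top'; do
    if [[ $uri =~ $URI_REGEX ]]; then
        echo "$uri -> host=${BASH_REMATCH[7]} port=${BASH_REMATCH[9]} path=${BASH_REMATCH[10]}"
    else
        echo "$uri -> no match"
    fi
done
```

For the URL without the trailing slash, group 10 (path) comes back empty rather than /.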
In principle this solution works the same way as Adam Ryczkowski's in this thread, but it uses an improved regular expression based on RFC3986 (with some changes) and fixes some errors (e.g. userinfo can contain the '_' character). It can also handle relative URIs (e.g. to extract the query or fragment).
#!/bin/bash
# Following regex is based on https://tools.ietf.org/html/rfc3986#appendix-B with
# additional sub-expressions to split authority into userinfo, host and port
#
readonly URI_REGEX='^(([^:/?#]+):)?(//((([^:/?#]+)@)?([^:/?#]+)(:([0-9]+))?))?(/([^?#]*))(\?([^#]*))?(#(.*))?'
# Capture groups:
#    1 scheme:      2 scheme
#    3 //…          4 authority
#    5 userinfo@    6 userinfo
#    7 host         8 :…
#    9 port        10 path
#   11 rpath       12 ?…
#   13 query       14 #…
#   15 fragment
parse_scheme () {
[[ "$@" =~ $URI_REGEX ]] && echo "${BASH_REMATCH[2]}"
}
parse_authority () {
[[ "$@" =~ $URI_REGEX ]] && echo "${BASH_REMATCH[4]}"
}
parse_user () {
[[ "$@" =~ $URI_REGEX ]] && echo "${BASH_REMATCH[6]}"
}
parse_host () {
[[ "$@" =~ $URI_REGEX ]] && echo "${BASH_REMATCH[7]}"
}
parse_port () {
[[ "$@" =~ $URI_REGEX ]] && echo "${BASH_REMATCH[9]}"
}
parse_path () {
[[ "$@" =~ $URI_REGEX ]] && echo "${BASH_REMATCH[10]}"
}
parse_rpath () {
[[ "$@" =~ $URI_REGEX ]] && echo "${BASH_REMATCH[11]}"
}
parse_query () {
[[ "$@" =~ $URI_REGEX ]] && echo "${BASH_REMATCH[13]}"
}
parse_fragment () {
[[ "$@" =~ $URI_REGEX ]] && echo "${BASH_REMATCH[15]}"
}
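Each parse_* function re-runs the regex, which is fine for one-off lookups. If you need several components of the same URI, a possible variation is to match once and copy all groups out. This is only a sketch: the function name parse_uri and the URI_* variable names are made up here.

```bash
#!/bin/bash
URI_REGEX='^(([^:/?#]+):)?(//((([^:/?#]+)@)?([^:/?#]+)(:([0-9]+))?))?(/([^?#]*))(\?([^#]*))?(#(.*))?'

# Hypothetical helper: match once, then copy the groups into URI_* variables.
parse_uri () {
    [[ $1 =~ $URI_REGEX ]] || return 1
    URI_SCHEME=${BASH_REMATCH[2]}
    URI_USER=${BASH_REMATCH[6]}
    URI_HOST=${BASH_REMATCH[7]}
    URI_PORT=${BASH_REMATCH[9]}
    URI_PATH=${BASH_REMATCH[10]}
    URI_QUERY=${BASH_REMATCH[13]}
    URI_FRAGMENT=${BASH_REMATCH[15]}
}

if parse_uri 'sftp://user@host.net:2222/some/random/path?x=1#frag'; then
    echo "scheme=$URI_SCHEME user=$URI_USER host=$URI_HOST port=$URI_PORT"
    echo "path=$URI_PATH query=$URI_QUERY fragment=$URI_FRAGMENT"
fi
```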
If you really want to do it in shell, you can do something as simple as the following using awk. This requires knowing in advance how many fields you will actually be passed (for example, whether a password is present or not).
#!/bin/bash
FIELDS=($(echo "sftp://user@host.net/some/random/path" \
| awk '{n = split($0, arr, /[\/@:]+/); for (i = 1; i <= n; i++) print arr[i]}'))
# bash arrays are zero-indexed; iterating 1..n keeps the awk fields in order
proto=${FIELDS[0]}
user=${FIELDS[1]}
host=${FIELDS[2]}
path=$(echo ${FIELDS[@]:3} | sed 's/ /\//g')
If you don't have awk but do have grep, and you can require that each field is non-empty and reasonably predictable in format (here: lowercase letters, digits, dots and hyphens only), then you can do:
#!/bin/bash
FIELDS=($(echo "sftp://user@host.net/some/random/path" \
| grep -o "[a-z0-9.-][a-z0-9.-]*" | tr '\n' ' '))
# bash arrays are zero-indexed
proto=${FIELDS[0]}
user=${FIELDS[1]}
host=${FIELDS[2]}
path=$(echo ${FIELDS[@]:3} | sed 's/ /\//g')
[EDIT 2019] This answer is not meant to be a catch-all, works-for-everything solution. It was intended to provide a simple alternative to the Python-based version, and it ended up having more features than the original.
It answered the basic question in a bash-only way and was then modified several times by me to accommodate a handful of requests from commenters. At this point, however, I think adding even more complexity would make it unmaintainable. I know not everything is straightforward (checking for a valid port, for example, requires comparing hostport and host), but I would rather not add even more complexity.
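For those who do need the port check, here is a rough sketch of what comparing hostport and host could look like. The function name valid_port is hypothetical, and the host and port lines are the same sed commands used in the script in this answer. The point is that when there is no colon, the port extraction still picks up any digits in the host name, so the result has to be discarded.

```bash
#!/bin/bash
# Hypothetical helper: succeeds only if hostport really carried a ":port"
# suffix and the extracted port is a number in the range 1-65535.
valid_port() {
    local hostport=$1 host=$2 port=$3
    local digits='^[0-9]{1,5}$'
    [ "$hostport" != "$host" ] || return 1      # identical: there was no colon
    [[ $port =~ $digits ]] || return 1          # one to five digits only
    (( 10#$port >= 1 && 10#$port <= 65535 ))    # 10# avoids octal parsing
}

hostport='host2.net'
host="$(echo "$hostport" | sed -e 's,:.*,,g')"
port="$(echo "$hostport" | sed -e 's,^.*:,:,g' -e 's,.*:\([0-9]*\).*,\1,g' -e 's,[^0-9],,g')"
# Without the check, port would be "2" here, taken from the host name.
valid_port "$hostport" "$host" "$port" || port=''
echo "host: $host port: $port"
```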
[Original answer]
Assuming your URL is passed as first parameter to the script:
#!/bin/bash
# extract the protocol
proto="$(echo $1 | grep :// | sed -e's,^\(.*://\).*,\1,g')"
# remove the protocol
url="$(echo ${1/$proto/})"
# extract the user (if any)
user="$(echo $url | grep @ | cut -d@ -f1)"
# extract the host and port
hostport="$(echo ${url/$user@/} | cut -d/ -f1)"
# by request host without port
host="$(echo $hostport | sed -e 's,:.*,,g')"
# by request - try to extract the port
port="$(echo $hostport | sed -e 's,^.*:,:,g' -e 's,.*:\([0-9]*\).*,\1,g' -e 's,[^0-9],,g')"
# extract the path (if any)
path="$(echo $url | grep / | cut -d/ -f2-)"
echo "url: $url"
echo " proto: $proto"
echo " user: $user"
echo " host: $host"
echo " port: $port"
echo " path: $path"
I must admit this is not the cleanest solution but it doesn't rely on another scripting language like perl or python. (Providing a solution using one of them would produce cleaner results ;) )
Using your example the results are:
url: user@host.net/some/random/path
proto: sftp://
user: user
host: host.net
port:
path: some/random/path
This will also work for URLs without a protocol/username or path. In this case the respective variable will contain an empty string.
[EDIT]
If your bash version won't cope with the substitutions (${1/$proto/}) try this:
#!/bin/bash
# extract the protocol
proto="$(echo $1 | grep :// | sed -e's,^\(.*://\).*,\1,g')"
# remove the protocol -- updated
url=$(echo $1 | sed -e s,$proto,,g)
# extract the user (if any)
user="$(echo $url | grep @ | cut -d@ -f1)"
# extract the host and port -- updated
hostport=$(echo $url | sed -e s,$user@,,g | cut -d/ -f1)
# by request host without port
host="$(echo $hostport | sed -e 's,:.*,,g')"
# by request - try to extract the port
port="$(echo $hostport | sed -e 's,^.*:,:,g' -e 's,.*:\([0-9]*\).*,\1,g' -e 's,[^0-9],,g')"
# extract the path (if any)
path="$(echo $url | grep / | cut -d/ -f2-)"