In my bash script I need to extract just the path from a given URL. For example, from a variable containing the string:
http://login:password@example.com/one/more/dir/fi
I wrote a function that will extract any part of the URL. I've only tested it in bash. Usage:
url_parse <url> [url-part]
example:
$ url_parse "http://example.com:8080/home/index.html" path
home/index.html
code:
url_parse() {
    local -r url=$1 url_part=$2
    # Define the URL tokens and the URL regular expression.
    local -r protocol='^[^:]+' user='[^:@]+' password='[^@]+' host='[^:/?#]+' \
        port='[0-9]+' path='\/([^?#]*)' query='\?([^#]+)' fragment='#(.*)'
    local -r auth="($user)(:($password))?@"
    local -r connection="($auth)?($host)(:($port))?"
    local -r url_regex="($protocol):\/\/($connection)?($path)?($query)?($fragment)?$"
    # Parse the URL and create an array. The three-argument match() is a gawk
    # extension, so awk must be gawk here. The URL is quoted so that '?' and
    # '*' in it are not expanded as globs.
    local -a url_arr
    IFS=',' read -r -a url_arr <<< "$(printf '%s\n' "$url" | awk -v OFS=, \
        "{match(\$0,/$url_regex/,a);print a[1],a[4],a[6],a[7],a[9],a[11],a[13],a[15]}")"
    [[ ${url_arr[0]} ]] || { echo "Invalid URL: $url" >&2 ; return 1 ; }
    case $url_part in
        protocol)  echo "${url_arr[0]}" ;;
        auth)      echo "${url_arr[1]}:${url_arr[2]}" ;; # ex: john.doe:1234
        user)      echo "${url_arr[1]}" ;;
        password)  echo "${url_arr[2]}" ;;
        host-port) echo "${url_arr[3]}:${url_arr[4]}" ;; # ex: example.com:8080
        host)      echo "${url_arr[3]}" ;;
        port)      echo "${url_arr[4]}" ;;
        path)      echo "${url_arr[5]}" ;;
        query)     echo "${url_arr[6]}" ;;
        fragment)  echo "${url_arr[7]}" ;;
        info)      echo -e "protocol:${url_arr[0]}\nuser:${url_arr[1]}\npassword:${url_arr[2]}\nhost:${url_arr[3]}\nport:${url_arr[4]}\npath:${url_arr[5]}\nquery:${url_arr[6]}\nfragment:${url_arr[7]}" ;;
        "")        ;; # used to validate url
        *)         echo "Invalid URL part: $url_part" >&2 ; return 1 ;;
    esac
}
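If gawk is not available, the path part alone can probably be pulled out with bash's own [[ =~ ]] operator instead. This is only a minimal sketch of that swapped-in technique, not the function above: the name url_path and its simplified regex are my own assumptions, and it only handles the path, printed without the leading slash as in the example above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of url_parse): print the path of a URL
# without its leading slash, using only bash's regex matching.
url_path() {
    local -r url=$1
    # scheme://authority, then an optional /path that stops at '?' or '#'
    local -r re='^[^:/?#]+://[^/?#]*(/([^?#]*))?'
    [[ $url =~ $re ]] || { echo "Invalid URL: $url" >&2; return 1; }
    # Group 2 is the path without the leading slash (empty if there is none)
    printf '%s\n' "${BASH_REMATCH[2]}"
}

url_path "http://example.com:8080/home/index.html"
url_path "http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth"
```

The regex is kept in a variable and used unquoted on the right-hand side of =~, since quoting it there would make bash treat it as a literal string.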
This perl one-liner works for me on the command line, so could be added to your script.
echo 'http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth' | perl -n -e 'm{http://[^/]+(/[^?]+)};print $1'
Note that the capture stops at the first '?' character. If the URL has no query string, the match runs to the end of the line, trailing newline included.
Bash has built-in string pattern-matching operators (#, ##, %, %%) that can handle this. For example:
FILE=/home/user/src/prog.c
echo ${FILE#/*/} # ==> user/src/prog.c
echo ${FILE##/*/} # ==> prog.c
echo ${FILE%/*} # ==> /home/user/src
echo ${FILE%%/*} # ==> nil
echo ${FILE%.c} # ==> /home/user/src/prog
All this is from the excellent book "A Practical Guide to Linux Commands, Editors, and Shell Programming" by Mark G. Sobell (http://www.sobell.com/).
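Applied to the URL from the question, the same operators might be combined roughly like this. It is only a sketch (not from the book): the variable names rest and path are mine, and it assumes the URL has a scheme and at least one '/' after the host. A URL with no path at all would need an extra check.

```shell
#!/usr/bin/env bash
url='http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth'

rest=${url#*://}      # drop the scheme: login:password@example.com/one/...
path=/${rest#*/}      # drop everything up to the first '/', then put the slash back
path=${path%%\?*}     # drop the query string, if there is one
path=${path%%#*}      # drop the fragment, if there is one

echo "$path"          # /one/more/dir/file.exe
```

No external process is started, so this should also be the cheapest of the approaches in a loop.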
gawk
echo "http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth" | awk -F"/" '
{
$1=$2=$3=""
gsub(/\?.*/,"",$NF)
print substr($0,3)
}' OFS="/"
output
# ./test.sh
/one/more/dir/file.exe
url="http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth"
grep (GNU, with -P for Perl-compatible regular expressions)
$ grep -Po '\w\K/\w+[^?]+' <<<$url
/one/more/dir/file.exe
grep (without -P)
$ grep -o '\w/\w\+[^?]\+' <<<$url | tail -c+2
/one/more/dir/file.exe
ripgrep
$ rg -o '\w(/\w+[^?]+)' -r '$1' <<<$url
/one/more/dir/file.exe
To get other parts of a URL, see: Getting parts of a URL (Regex).
I agree that "cut" is a wonderful tool on the command line. However, a more purely bash solution is to use a powerful feature of variable expansion in bash. For example:
pass_first_last='password,firstname,lastname'
pass=${pass_first_last%%,*}
first_last=${pass_first_last#*,}
first=${first_last%,*}
last=${first_last#*,}
or, alternatively,
last=${pass_first_last##*,}
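The same expansions can be pointed at the login:password part of the URL in the question. This is a rough sketch under the assumption that the URL really does contain credentials (an '@' before the host). The variable names are made up for illustration, and a port, if present, would stay attached to host.

```shell
#!/usr/bin/env bash
url='http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth'

rest=${url#*://}           # strip the scheme
authority=${rest%%/*}      # login:password@example.com
userinfo=${authority%@*}   # login:password
host=${authority##*@}      # example.com
user=${userinfo%%:*}       # login
password=${userinfo#*:}    # password

echo "user=$user password=$password host=$host"
```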