Parse URL in shell script

一生所求 2020-12-02 20:55

I have a URL like:

sftp://user@host.net/some/random/path

I want to extract the user, host and path from this string. Any part can be of any length.

14 answers
  • 2020-12-02 21:30

    I just needed to do the same, so I was curious whether it's possible to do it in a single line, and this is what I've got:

    #!/bin/bash
    
    # Note: \? inside a basic regular expression is a GNU sed extension.
    parse_url() {
      eval "$(echo "$1" | sed -e "s#^\(\(.*\)://\)\?\(\([^:@]*\)\(:\(.*\)\)\?@\)\?\([^/?]*\)\(/\(.*\)\)\?#${PREFIX:-URL_}SCHEME='\2' ${PREFIX:-URL_}USER='\4' ${PREFIX:-URL_}PASSWORD='\6' ${PREFIX:-URL_}HOST='\7' ${PREFIX:-URL_}PATH='\9'#")"
    }
    
    URL=${1:-"http://user:pass@example.com/path/somewhere"}
    PREFIX="URL_" parse_url "$URL"
    echo "$URL_SCHEME://$URL_USER:$URL_PASSWORD@$URL_HOST/$URL_PATH"
    

    How it works:

    1. The crazy sed regex captures all the parts of the URL; every part is optional except the host name.
    2. Using those capture groups, sed outputs shell variable assignments for the relevant parts (like URL_SCHEME or URL_USER).
    3. eval executes that output, so those variables are set in the current shell and available to the rest of the script (they are not exported).
    4. Optionally, PREFIX can be passed to control the names of the output variables.

    PS: be careful when using this for arbitrary input: because the sed output is passed to eval, this code is vulnerable to script injection.
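
    One way to sidestep the injection risk is to drop eval entirely. The following is a minimal sketch, not the method from this answer: parse_url_noeval and its regex are hypothetical, the regex only approximates the sed expression above, and printf -v assumes Bash 3.1 or later.

```bash
#!/bin/bash

# Hypothetical eval-free variant: parse_url_noeval URL [PREFIX]
# Sets ${PREFIX}SCHEME, ${PREFIX}USER, ${PREFIX}PASSWORD, ${PREFIX}HOST, ${PREFIX}PATH.
parse_url_noeval() {
  local url=$1 prefix=${2:-URL_}
  local name_re='^[A-Za-z_][A-Za-z0-9_]*$'
  # Assumed regex, roughly mirroring the groups of the sed expression.
  local re='^(([^:/]+)://)?(([^:@/]*)(:([^@/]*))?@)?([^/?]*)(/(.*))?$'

  # Refuse prefixes that cannot start a variable name.
  [[ $prefix =~ $name_re ]] || return 2
  [[ $url =~ $re ]] || return 1

  # printf -v stores each capture as plain data; nothing from the URL is executed.
  printf -v "${prefix}SCHEME"   '%s' "${BASH_REMATCH[2]}"
  printf -v "${prefix}USER"     '%s' "${BASH_REMATCH[4]}"
  printf -v "${prefix}PASSWORD" '%s' "${BASH_REMATCH[6]}"
  printf -v "${prefix}HOST"     '%s' "${BASH_REMATCH[7]}"
  printf -v "${prefix}PATH"     '%s' "${BASH_REMATCH[9]}"
}

parse_url_noeval "http://user:pass@example.com/path/somewhere" URL_
printf '%s\n' "$URL_SCHEME" "$URL_USER" "$URL_PASSWORD" "$URL_HOST" "$URL_PATH"
```

    Because the captured text is only ever stored, a path such as $(some command) ends up in the variable literally instead of being run.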

  • 2020-12-02 21:32

    If you have access to Bash >= 3.0, you can do this in pure bash as well, thanks to the regex match operator =~:

    # Groups: 2 = protocol, 4 = user, 5 = host, 7 = port
    pattern='^(([[:alnum:]]+)://)?(([[:alnum:]]+)@)?([^:@]+)(:([[:digit:]]+))?$'
    if [[ "http://us@cos.com:3142" =~ $pattern ]]; then
        proto=${BASH_REMATCH[2]}
        user=${BASH_REMATCH[4]}
        host=${BASH_REMATCH[5]}
        port=${BASH_REMATCH[7]}
    fi
    

    It should be faster and less resource-hungry than all the previous examples, because no external process is spawned.
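
    The pattern as posted has no path group, so for the URL in the question the host capture would swallow everything after the @. A minimal sketch of one possible extension follows; the trailing (/.*)? group and the tightened character classes are assumptions added for illustration, not part of the original answer.

```bash
#!/bin/bash

# Assumed extension of the pattern: group 8 captures an optional path.
pattern='^(([[:alnum:]]+)://)?(([^:@/]+)@)?([^:/@]+)(:([[:digit:]]+))?(/.*)?$'
url='sftp://user@host.net/some/random/path'

if [[ $url =~ $pattern ]]; then
    proto=${BASH_REMATCH[2]}
    user=${BASH_REMATCH[4]}
    host=${BASH_REMATCH[5]}
    port=${BASH_REMATCH[7]}   # empty when the URL carries no port
    path=${BASH_REMATCH[8]}   # includes the leading slash
    printf 'proto=%s user=%s host=%s port=%s path=%s\n' \
        "$proto" "$user" "$host" "$port" "$path"
fi
```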

  • 2020-12-02 21:32

    I don't have enough reputation to comment, but I made a small modification to @patryk-obara's answer.

    RFC3986 § 6.2.3. Scheme-Based Normalization treats

    http://example.com
    http://example.com/
    

    as equivalent. But I found that his regex did not match a URL like http://example.com. http://example.com/ (with the trailing slash) does match.

    I inserted group 11, which changes / to (/|$). This matches either / or the end of the string. Now http://example.com does match.

    readonly URI_REGEX='^(([^:/?#]+):)?(//((([^:/?#]+)@)?([^:/?#]+)(:([0-9]+))?))?((/|$)([^?#]*))(\?([^#]*))?(#(.*))?$'
    #                    ↑↑            ↑  ↑↑↑            ↑         ↑ ↑            ↑↑    ↑        ↑  ↑        ↑ ↑
    #                    ||            |  |||            |         | |            ||    |        |  |        | |
    #                    |2 scheme     |  ||6 userinfo   7 host    | 9 port       ||    12 rpath |  14 query | 16 fragment
    #                    1 scheme:     |  |5 userinfo@             8 :...         ||             13 ?...     15 #...
    #                                  |  4 authority                             |11 / or end-of-string
    #                                  3  //...                                   10 path
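
    The difference can be checked with a small comparison. This is only a sketch: host_with is a hypothetical helper, and OLD_REGEX is assumed to be the unmodified pattern from the RFC3986-based answer in this thread. Bash may still report a match for the unmodified pattern, with the authority skipped and the whole //example.com taken as the path, so the sketch compares the extracted host (group 7) rather than the match status.

```bash
#!/bin/bash

# Unmodified pattern: the path must start with "/".
readonly OLD_REGEX='^(([^:/?#]+):)?(//((([^:/?#]+)@)?([^:/?#]+)(:([0-9]+))?))?(/([^?#]*))(\?([^#]*))?(#(.*))?'
# Modified pattern: the path starts with "/" or is empty at end-of-string.
readonly NEW_REGEX='^(([^:/?#]+):)?(//((([^:/?#]+)@)?([^:/?#]+)(:([0-9]+))?))?((/|$)([^?#]*))(\?([^#]*))?(#(.*))?$'

# Hypothetical helper: print the host (group 7) using regex $1 on URL $2.
host_with() {
    [[ $2 =~ $1 ]] && printf '%s\n' "${BASH_REMATCH[7]}"
}

for url in 'http://example.com' 'http://example.com/'; do
    printf '%-22s old=[%s] new=[%s]\n' "$url" \
        "$(host_with "$OLD_REGEX" "$url")" \
        "$(host_with "$NEW_REGEX" "$url")"
done
```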
    
  • 2020-12-02 21:34

    In principle, this solution works the same way as Adam Ryczkowski's in this thread, but it has an improved regular expression based on RFC3986 (with some changes) and fixes some errors (e.g. userinfo can contain the '_' character). It can also handle relative URIs (e.g. to extract the query or fragment).

    #!/bin/bash
    
    # Following regex is based on https://tools.ietf.org/html/rfc3986#appendix-B with
    # additional sub-expressions to split authority into userinfo, host and port
    #
    readonly URI_REGEX='^(([^:/?#]+):)?(//((([^:/?#]+)@)?([^:/?#]+)(:([0-9]+))?))?(/([^?#]*))(\?([^#]*))?(#(.*))?'
    #                    ↑↑            ↑  ↑↑↑            ↑         ↑ ↑            ↑ ↑        ↑  ↑        ↑ ↑
    #                    |2 scheme     |  ||6 userinfo   7 host    | 9 port       | 11 rpath |  13 query | 15 fragment
    #                    1 scheme:     |  |5 userinfo@             8 :…           10 path    12 ?…       14 #…
    #                                  |  4 authority
    #                                  3 //…
    
    # Each function takes the URI as its first argument and prints one component.
    parse_scheme () {
        [[ "$1" =~ $URI_REGEX ]] && echo "${BASH_REMATCH[2]}"
    }
    
    parse_authority () {
        [[ "$1" =~ $URI_REGEX ]] && echo "${BASH_REMATCH[4]}"
    }
    
    parse_user () {
        [[ "$1" =~ $URI_REGEX ]] && echo "${BASH_REMATCH[6]}"
    }
    
    parse_host () {
        [[ "$1" =~ $URI_REGEX ]] && echo "${BASH_REMATCH[7]}"
    }
    
    parse_port () {
        [[ "$1" =~ $URI_REGEX ]] && echo "${BASH_REMATCH[9]}"
    }
    
    parse_path () {
        [[ "$1" =~ $URI_REGEX ]] && echo "${BASH_REMATCH[10]}"
    }
    
    parse_rpath () {
        [[ "$1" =~ $URI_REGEX ]] && echo "${BASH_REMATCH[11]}"
    }
    
    parse_query () {
        [[ "$1" =~ $URI_REGEX ]] && echo "${BASH_REMATCH[13]}"
    }
    
    parse_fragment () {
        [[ "$1" =~ $URI_REGEX ]] && echo "${BASH_REMATCH[15]}"
    }
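
    As a usage sketch, the same group numbers can be read through a single generic helper. uri_group is hypothetical and not part of the answer; the regex is copied from above so that the snippet runs on its own, and the relative reference /search?q=bash#results is made up for illustration.

```bash
#!/bin/bash

readonly URI_REGEX='^(([^:/?#]+):)?(//((([^:/?#]+)@)?([^:/?#]+)(:([0-9]+))?))?(/([^?#]*))(\?([^#]*))?(#(.*))?'

# Hypothetical helper: print capture group $1 of URI_REGEX for the URI in $2.
uri_group() {
    [[ $2 =~ $URI_REGEX ]] && printf '%s\n' "${BASH_REMATCH[$1]}"
}

uri='sftp://user@host.net/some/random/path'
echo "user: $(uri_group 6 "$uri")"
echo "host: $(uri_group 7 "$uri")"
echo "path: $(uri_group 10 "$uri")"

# A relative reference has no scheme or authority, but query and fragment still parse.
rel='/search?q=bash#results'
echo "query:    $(uri_group 13 "$rel")"
echo "fragment: $(uri_group 15 "$rel")"
```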
    
  • 2020-12-02 21:36

    If you really want to do it in shell, you can do something as simple as the following using awk. This requires knowing how many fields you will actually be passed (e.g. a password may be present in some URLs and absent in others).

    #!/bin/bash
    
    # Split on runs of "/", "@" and ":" and print the pieces in order.
    FIELDS=($(echo "sftp://user@host.net/some/random/path" \
      | awk '{n = split($0, arr, /[\/@:]+/); for (i = 1; i <= n; i++) print arr[i]}'))
    # bash arrays are zero-indexed
    proto=${FIELDS[0]}
    user=${FIELDS[1]}
    host=${FIELDS[2]}
    path=$(echo "${FIELDS[@]:3}" | sed 's/ /\//g')
    

    If you don't have awk but do have grep, and you can require that each field is non-empty and reasonably predictable in format (here: lowercase letters, digits, dots and hyphens), then you can do:

    #!/bin/bash
    
    FIELDS=($(echo "sftp://user@host.net/some/random/path" \
       | grep -o "[a-z0-9.-][a-z0-9.-]*" | tr '\n' ' '))
    # bash arrays are zero-indexed
    proto=${FIELDS[0]}
    user=${FIELDS[1]}
    host=${FIELDS[2]}
    path=$(echo "${FIELDS[@]:3}" | sed 's/ /\//g')
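
    The caveat about knowing the number of fields can be seen with a short sketch. tokens is a hypothetical helper that splits a URL the same way as the grep example, and the password secret is made up: once a password is present, every later field shifts by one position.

```bash
#!/bin/bash

# Hypothetical helper: one token per line, split as in the grep example.
tokens() {
    printf '%s\n' "$1" | grep -o '[a-z0-9.-][a-z0-9.-]*'
}

without_pw=($(tokens 'sftp://user@host.net/some/random/path'))
with_pw=($(tokens 'sftp://user:secret@host.net/some/random/path'))

# Index 2 is the host only when the URL carries no password.
echo "third token without password: ${without_pw[2]}"
echo "third token with password:    ${with_pw[2]}"
```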
    
  • 2020-12-02 21:40

    [EDIT 2019] This answer is not meant to be a catch-all, works-for-everything solution. It was intended to provide a simple alternative to the Python-based version, and it ended up having more features than the original.


    It answered the basic question in a bash-only way and was then modified multiple times by me to accommodate a handful of requests from commenters. At this point, however, I think adding even more complexity would make it unmaintainable. I know not everything is straightforward (checking for a valid port, for example, requires comparing hostport and host), but I would rather not add even more complexity.


    [Original answer]

    Assuming your URL is passed as first parameter to the script:

    #!/bin/bash
    
    # extract the protocol
    proto="$(echo "$1" | grep :// | sed -e 's,^\(.*://\).*,\1,g')"
    # remove the protocol
    url="${1/$proto/}"
    # extract the user (if any)
    user="$(echo "$url" | grep @ | cut -d@ -f1)"
    # extract the host and port
    hostport="$(echo "${url/$user@/}" | cut -d/ -f1)"
    # by request: host without port
    host="$(echo "$hostport" | sed -e 's,:.*,,g')"
    # by request: try to extract the port
    port="$(echo "$hostport" | sed -e 's,^.*:,:,g' -e 's,.*:\([0-9]*\).*,\1,g' -e 's,[^0-9],,g')"
    # extract the path (if any)
    path="$(echo "$url" | grep / | cut -d/ -f2-)"
    
    echo "url: $url"
    echo "  proto: $proto"
    echo "  user: $user"
    echo "  host: $host"
    echo "  port: $port"
    echo "  path: $path"
    

    I must admit this is not the cleanest solution, but it doesn't rely on another scripting language like Perl or Python. (A solution using one of them would produce cleaner results ;) )

    Using your example the results are:

    url: user@host.net/some/random/path
      proto: sftp://
      user: user
      host: host.net
      port:
      path: some/random/path
    

    This will also work for URLs without a protocol/username or path. In this case the respective variable will contain an empty string.
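
    The port check mentioned in the 2019 edit (comparing hostport and host) could look roughly like the sketch below. port_of is a hypothetical helper, not part of the script above, and it ignores bracketed IPv6 hosts. With the sed pipeline as written, a colon-less hostport such as host1.net would yield 1 as the port, because the last expression strips every non-digit.

```bash
#!/bin/bash

# Hypothetical helper: print the port of a "host[:port]" string, or an empty line.
port_of() {
    local hostport=$1
    local host=${hostport%%:*}   # everything before the first colon
    local port=""

    # Only look for a port when removing it actually changed the string.
    if [ "$hostport" != "$host" ]; then
        port=${hostport#*:}
        # Discard anything that is empty or not purely numeric.
        case $port in
            ''|*[!0-9]*) port="" ;;
        esac
    fi
    printf '%s\n' "$port"
}

port_of 'host.net:2222'
port_of 'host1.net'
```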

    [EDIT]
    If your bash version won't cope with the substitutions (${1/$proto/}) try this:

    #!/bin/bash
    
    # extract the protocol
    proto="$(echo "$1" | grep :// | sed -e 's,^\(.*://\).*,\1,g')"
    
    # remove the protocol -- updated
    # (strips the same "...://" prefix directly; "s,$proto,,g" becomes an
    # invalid sed expression when $proto is empty)
    url="$(echo "$1" | sed -e 's,^.*://,,')"
    
    # extract the user (if any)
    user="$(echo "$url" | grep @ | cut -d@ -f1)"
    
    # extract the host and port -- updated
    hostport="$(echo "$url" | sed -e "s,$user@,,g" | cut -d/ -f1)"
    
    # by request: host without port
    host="$(echo "$hostport" | sed -e 's,:.*,,g')"
    # by request: try to extract the port
    port="$(echo "$hostport" | sed -e 's,^.*:,:,g' -e 's,.*:\([0-9]*\).*,\1,g' -e 's,[^0-9],,g')"
    
    # extract the path (if any)
    path="$(echo "$url" | grep / | cut -d/ -f2-)"
    