How can I get the base of a URL in Python?

情书的邮戳 2021-02-12 12:34

I'm trying to determine the base of a URL, or everything besides the page and parameters. I tried using split, but is there a better way than splitting it up into pieces? Is th

8 Answers
  • 2021-02-12 12:59

    The shortest solution for Python 3 uses urljoin from the standard library's urllib.parse module (I don't know whether it is the fastest):

    from urllib.parse import urljoin
    
    base_url = urljoin('http://127.0.0.1/asdf/login.php', '.')
    # output: http://127.0.0.1/asdf/
    

    Keep in mind that urljoin resolves URLs the way relative links are resolved in HTML, so a URL ending with '/' is treated differently from one without it, as explained here https://stackoverflow.com/a/1793282/7750840/:

    base_url = urljoin('http://127.0.0.1/asdf/', '.')
    # output: http://127.0.0.1/asdf/
    
    base_url = urljoin('http://127.0.0.1/asdf', '.')
    # output: http://127.0.0.1/
    

    Here is a tutorial on urllib for Python: https://pythonprogramming.net/urllib-tutorial-python-3/

  • 2021-02-12 13:06

    Well, for one, you could just use os.path.dirname:

    >>> import os
    >>> os.path.dirname('http://127.0.0.1/asdf/login.php')
    'http://127.0.0.1/asdf'
    

    It isn't designed for URLs, but it happens to work on them (even on Windows). It just doesn't keep the trailing slash, which you can add back yourself.

    You may also want to look at urllib.parse.urlparse for more fine-grained parsing; if the URL has a query string or hash involved, you'd want to parse it into pieces, trim the path component returned by parsing, then recombine, so the path is trimmed without losing query and hash info.
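
    A minimal sketch of that parse, trim, and recombine approach, assuming you want to keep the query string and fragment. The helper name strip_last_path_segment and the sample query are hypothetical, chosen only for this example:

```python
from urllib.parse import urlsplit, urlunsplit

def strip_last_path_segment(url):
    """Drop the final path segment but keep the query string and fragment."""
    parts = urlsplit(url)
    # rpartition('/') returns (head, '/', tail); keep the head plus the slash.
    head, slash, _tail = parts.path.rpartition('/')
    # SplitResult is a namedtuple, so _replace swaps in the trimmed path.
    return urlunsplit(parts._replace(path=head + slash))

print(strip_last_path_segment('http://127.0.0.1/asdf/login.php?q=abc#frag'))
# http://127.0.0.1/asdf/?q=abc#frag
```

    To drop the query and fragment as well, pass query='' and fragment='' to _replace.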

    Lastly, if you want to just split off the component after the last slash, you can do an rsplit with a maxsplit of 1, and keep the first component:

    >>> 'http://127.0.0.1/asdf/login.php'.rsplit('/', 1)[0]
    'http://127.0.0.1/asdf'
    
  • 2021-02-12 13:11

    No need to use a regex, you can just use rsplit():

    >>> url = 'http://127.0.0.1/asdf/login.php'
    >>> url.rsplit('/', 1)[0]
    'http://127.0.0.1/asdf'
    
  • 2021-02-12 13:14

    Find the right-most occurrence of a slash, then slice the original string up to that position. The +1 keeps the final slash at the end.

    link = "http://127.0.0.1/asdf/login.php"
    link[:link.rfind('/')+1]
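
    One caveat with pure string slicing, sketched below: if the URL has no path at all, the last slash is the one in '//' after the scheme. The helper names base_by_slicing and base_by_parsing are hypothetical. The parsing variant is only one possible guard, and it also discards any query string and fragment.

```python
from urllib.parse import urlsplit, urlunsplit

def base_by_slicing(link):
    # Plain string slicing: keep everything up to and including the last '/'.
    return link[:link.rfind('/') + 1]

def base_by_parsing(link):
    # Only the path component is trimmed, so the '//' after the scheme is safe.
    parts = urlsplit(link)
    path = parts.path[:parts.path.rfind('/') + 1] or '/'
    return urlunsplit((parts.scheme, parts.netloc, path, '', ''))

print(base_by_slicing('http://127.0.0.1/asdf/login.php'))  # http://127.0.0.1/asdf/
print(base_by_slicing('http://127.0.0.1'))                 # http://
print(base_by_parsing('http://127.0.0.1'))                 # http://127.0.0.1/
```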
    
  • 2021-02-12 13:16

    When you use urlsplit, it returns a SplitResult object:

    from urllib.parse import urlsplit
    split_url = urlsplit('http://127.0.0.1/asdf/login.php')
    print(split_url)
    
    # output: SplitResult(scheme='http', netloc='127.0.0.1', path='/asdf/login.php', query='', fragment='')
    

    You can build your own SplitResult object and pass it to urlunsplit. This code works for paths of any depth, as long as you know which path element should be the last one you keep.

    from urllib.parse import urlsplit, urlunsplit, SplitResult
    
    # splitting url:
    split_url = urlsplit('http://127.0.0.1/asdf/login.php')
    
    # editing the variables you want to change (in this case, path):    
    last_element = 'asdf'   # this can be any element in the path.
    path_array = split_url.path.split('/')
    
    # print(path_array)
    # >>> ['', 'asdf', 'login.php']
    
    path_array.remove('') 
    ind = path_array.index(last_element) 
    new_path = '/' + '/'.join(path_array[:ind+1]) + '/'
    
    # making SplitResult() object with edited data:
    new_url = SplitResult(scheme=split_url.scheme, netloc=split_url.netloc, path=new_path, query='', fragment='')
    
    # unsplitting:
    base_url = urlunsplit(new_url)
    # output: http://127.0.0.1/asdf/
    
  • 2021-02-12 13:19

    The best way to do this is to use urllib.parse.

    From the docs:

    The module has been designed to match the Internet RFC on Relative Uniform Resource Locators. It supports the following URL schemes: file, ftp, gopher, hdl, http, https, imap, mailto, mms, news, nntp, prospero, rsync, rtsp, rtspu, sftp, shttp, sip, sips, snews, svn, svn+ssh, telnet, wais, ws, wss.

    You'd want to do something like this using urlsplit and urlunsplit:

    from urllib.parse import urlsplit, urlunsplit
    
    split_url = urlsplit('http://127.0.0.1/asdf/login.php?q=abc#stackoverflow')
    
    # You now have:
    # split_url.scheme   "http"
    # split_url.netloc   "127.0.0.1" 
    # split_url.path     "/asdf/login.php"
    # split_url.query    "q=abc"
    # split_url.fragment "stackoverflow"
    
    # Keep the path up to and including the last '/'
    clean_path = "".join(split_url.path.rpartition("/")[:-1])
    
    # "/asdf/"
    
    # urlunsplit joins a urlsplit tuple; swap in the trimmed path and
    # drop the query and fragment to get the base URL
    clean_url = urlunsplit(split_url._replace(path=clean_path, query="", fragment=""))
    
    # "http://127.0.0.1/asdf/"
    
    
    # A more advanced example 
    advanced_split_url = urlsplit('http://foo:bar@127.0.0.1:5000/asdf/login.php?q=abc#stackoverflow')
    
    # You now have *in addition* to the above:
    # advanced_split_url.username   "foo"
    # advanced_split_url.password   "bar"
    # advanced_split_url.hostname   "127.0.0.1"
    # advanced_split_url.port       5000  (an int, not a string)
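
    As a rough sketch of how the advanced example could be turned into a base URL: netloc already bundles the credentials, host, and port, so reusing it unchanged keeps them. The variable names parts, directory, and base_url are chosen just for this example.

```python
from urllib.parse import urlsplit, urlunsplit

parts = urlsplit('http://foo:bar@127.0.0.1:5000/asdf/login.php?q=abc#stackoverflow')

# netloc keeps the credentials and the port together.
print(parts.netloc)    # foo:bar@127.0.0.1:5000
print(parts.hostname)  # 127.0.0.1
print(parts.port)      # 5000

# Trim the last path segment and drop the query string and fragment.
directory = parts.path.rpartition('/')[0] + '/'
base_url = urlunsplit((parts.scheme, parts.netloc, directory, '', ''))
print(base_url)        # http://foo:bar@127.0.0.1:5000/asdf/
```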
    