How to safely get the file extension from a URL?

北荒 2021-02-02 08:40

Consider the following URLs:

http://m3u.com/tunein.m3u
http://asxsomeurl.com/listen.asx:8024
http://www.plssomeotherurl.com/station.pls?id=111
http://22.198.133.16:802

9 Answers
  • 2021-02-02 09:00
    $ python3
    Python 3.1.2 (release31-maint, Sep 17 2010, 20:27:33) 
    [GCC 4.4.5] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> from os.path import splitext
    >>> from urllib.parse import urlparse 
    >>> 
    >>> urls = [
    ...     'http://m3u.com/tunein.m3u',
    ...     'http://asxsomeurl.com/listen.asx:8024',
    ...     'http://www.plssomeotherurl.com/station.pls?id=111',
    ...     'http://22.198.133.16:8024',
    ... ]
    >>> 
    >>> for url in urls:
    ...     path = urlparse(url).path
    ...     ext = splitext(path)[1]
    ...     print(ext)
    ... 
    .m3u
    .asx:8024
    .pls
    
    >>> 
    
  • 2021-02-02 09:04

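    You can try the rfc6266 module, which parses the Content-Disposition response header:

    import requests
    import rfc6266

    # downloadLink is the URL to inspect (not defined in this snippet).
    req = requests.head(downloadLink)
    # Raises KeyError if the server sends no Content-Disposition header.
    headersContent = req.headers['Content-Disposition']
    rfcFilename = rfc6266.parse_headers(headersContent, relaxed=True).filename_unsafe
    filename = requests.utils.unquote(rfcFilename)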
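
For simple Content-Disposition values, the standard library can also pull out the filename. The sketch below is an illustrative alternative, not the rfc6266 API: the helper name `extension_from_disposition` is hypothetical, and it relies on `email.message.Message.get_filename()`, which may cope less well with malformed headers than a dedicated parser.

```python
import posixpath
from email.message import Message


def extension_from_disposition(header_value):
    """Return the extension of the filename in a Content-Disposition value, or ''."""
    message = Message()
    message['Content-Disposition'] = header_value
    filename = message.get_filename()  # None when no filename parameter is present
    if not filename:
        return ''
    # The filename is supplied by the server, so treat it as untrusted:
    # keep only the final path component before splitting off the extension.
    return posixpath.splitext(posixpath.basename(filename))[1]
```

The header string would typically come from something like `req.headers.get('Content-Disposition', '')`, which also avoids a KeyError when the header is absent.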
    
  • 2021-02-02 09:05

    File extensions are basically meaningless in URLs. For example, if you go to http://code.google.com/p/unladen-swallow/source/browse/branches/release-2009Q1-maint/Lib/psyco/support.py?r=292 do you want the extension to be ".py" despite the fact that the page is HTML, not Python?

    Use the Content-Type header to determine the "type" of a URL.
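
If an extension is still needed once the Content-Type is known, one possible approach is to map the media type back to an extension with the standard `mimetypes` module. This is only a sketch: the helper name `extension_from_content_type` is hypothetical, and the mapping depends on the MIME tables available on the machine, so less common types (playlist formats, for example) may come back empty.

```python
import mimetypes


def extension_from_content_type(header_value):
    """Map a Content-Type header value to a file extension, or '' if unknown."""
    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = header_value.split(';', 1)[0].strip().lower()
    # guess_extension() returns None for types it does not know.
    return mimetypes.guess_extension(media_type) or ''
```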

  • 2021-02-02 09:08

    Use urlparse to parse the path out of the URL, then os.path.splitext to get the extension.

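    import os
    try:
        from urllib.parse import urlparse  # Python 3
    except ImportError:
        from urlparse import urlparse      # Python 2

    url = 'http://www.plssomeotherurl.com/station.pls?id=111'
    path = urlparse(url).path          # '/station.pls' (query string dropped)
    ext = os.path.splitext(path)[1]    # '.pls'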
    

    Note that the extension may not be a reliable indicator of the type of the file. The HTTP Content-Type header may be better.

  • To get the Content-Type, you can write a function like this one, which uses urllib2. If you need the page content anyway, you will probably be using urllib2 already, so there is no need to import os.

    import urllib2  # Python 2 only; Python 3 uses urllib.request and headers.get()

    def getContentType(pageUrl):
        # Open the URL and read the Content-Type response header.
        page = urllib2.urlopen(pageUrl)
        pageHeaders = page.headers
        contentType = pageHeaders.getheader('content-type')
        return contentType
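
On Python 3, where urllib2 no longer exists, a rough counterpart to the function above might look like the following. The name `get_content_type` and the `timeout` parameter are illustrative choices, and the HEAD request is an assumption about what the server accepts.

```python
from urllib.request import Request, urlopen


def get_content_type(page_url, timeout=10):
    """Return the media type from the response's Content-Type header."""
    # HEAD asks for headers only; some servers reject it, in which case
    # dropping method='HEAD' falls back to a normal GET.
    request = Request(page_url, method='HEAD')
    with urlopen(request, timeout=timeout) as response:
        # get_content_type() strips parameters such as "; charset=utf-8"
        # and falls back to 'text/plain' when the header is missing.
        return response.headers.get_content_type()
```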
    
  • 2021-02-02 09:16

    Use urlparse; that will handle most of the cases above:

    http://docs.python.org/library/urlparse.html

    Then split up the "path" component. You might be able to do this with os.path.split, but your second example, with :8024 on the end, needs manual handling. Are your file extensions always three letters? Or always letters and numbers? If so, use a regular expression.
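
The regular-expression idea could be sketched as follows. It assumes extensions consist only of letters and digits and that a trailing `:digits` is a misplaced port number, as in the second example; the helper name `url_extension` is hypothetical.

```python
import posixpath
import re
from urllib.parse import urlparse

# A dot, then letters/digits, optionally followed by ":<port>", at the end.
_EXT_RE = re.compile(r'\.([A-Za-z0-9]+)(?::\d+)?$')


def url_extension(url):
    """Return the lowercased extension of the URL's path (with dot), or ''."""
    path = urlparse(url).path            # query string and fragment are dropped
    name = posixpath.basename(path)      # last path segment only
    match = _EXT_RE.search(name)
    return '.' + match.group(1).lower() if match else ''
```

Applied to the four URLs listed in the first answer, this returns '.m3u', '.asx', '.pls', and an empty string.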
