How to safely get the file extension from a URL?

北荒 2021-02-02 08:40

Consider the following URLs:

http://m3u.com/tunein.m3u
http://asxsomeurl.com/listen.asx:8024
http://www.plssomeotherurl.com/station.pls?id=111
http://22.198.133.16:802

9 Answers
  • 2021-02-02 09:20

    The proper way is not to rely on file extensions at all. Send a GET (or HEAD) request to the URL in question and use the "Content-Type" HTTP header of the response to determine the content type. File extensions are unreliable.

    See MIME types (IANA media types) for more information and a list of useful MIME types.
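    As a rough illustration of the header-based approach, the sketch below uses only the standard library. The helper names `extension_from_content_type` and `extension_from_url` and the `timeout` parameter are made up for this example. It assumes the server answers HEAD requests and sends a meaningful Content-Type, which not every server does.

```python
import mimetypes
import urllib.request


def extension_from_content_type(header_value):
    """Map a Content-Type header value to a file extension, or None."""
    if not header_value:
        return None
    # Drop parameters such as "; charset=utf-8" and normalise the case.
    media_type = header_value.split(";", 1)[0].strip().lower()
    return mimetypes.guess_extension(media_type)


def extension_from_url(url, timeout=10):
    """Send a HEAD request and guess the extension from the response header."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extension_from_content_type(response.headers.get("Content-Type"))


print(extension_from_content_type("image/png; charset=binary"))  # .png
```

    Keeping the header parsing separate from the network call makes the mapping easy to check without a live server. If a server rejects HEAD, falling back to GET is a reasonable option.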

  • 2021-02-02 09:26

    This is easiest with requests and mimetypes:

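    import mimetypes
    import requests

    url = 'http://m3u.com/tunein.m3u'  # example URL taken from the question
    response = requests.get(url)
    # drop parameters such as "; charset=utf-8" before the lookup
    content_type = response.headers['content-type'].split(';')[0].strip()
    extension = mimetypes.guess_extension(content_type)
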

    The extension includes a dot prefix. For example, extension is '.png' for content type 'image/png'.

  • 2021-02-02 09:26

    A different approach that looks only at the file extension written in the URL itself and makes no request:

    import re

    # compile the regular expressions once, at import time
    reQuery = re.compile(r'\?.*$')
    rePort = re.compile(r':[0-9]+')
    reHost = re.compile(r'^[A-Za-z][A-Za-z0-9+.-]*://[^/]*')
    reExt = re.compile(r'(\.[A-Za-z0-9]+$)')

    def fileExt(url):
        # remove query string
        url = reQuery.sub("", url)

        # remove port
        url = rePort.sub("", url)

        # remove scheme and host, so a bare host such as 22.198.133.16
        # is not mistaken for a file name ending in ".16"
        url = reHost.sub("", url)

        # extract extension
        matches = reExt.search(url)
        if matches is not None:
            return matches.group(1)
        return None
    

    Edit: added handling of explicit ports such as :1234.
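    For comparison, the same path-only idea can be sketched with the standard library's URL parser instead of regular expressions. The function name `ext_from_path` is made up for this example. Treating everything after a colon in the path as a port-like suffix is an assumption based on the `listen.asx:8024` URL in the question, not general URL behaviour.

```python
import posixpath
from urllib.parse import urlsplit


def ext_from_path(url):
    """Return the extension found in the URL path (with its dot), or None."""
    # urlsplit separates host, port and query string from the path.
    path = urlsplit(url).path
    # Assumption: a colon inside the path starts a port-like suffix,
    # as in "listen.asx:8024", so everything from the colon on is dropped.
    path = path.split(":", 1)[0]
    ext = posixpath.splitext(path)[1]
    return ext or None


for url in (
    "http://m3u.com/tunein.m3u",
    "http://asxsomeurl.com/listen.asx:8024",
    "http://www.plssomeotherurl.com/station.pls?id=111",
    "http://22.198.133.16:802",
):
    print(url, ext_from_path(url))
```

    Because the host is never part of the parsed path, a bare IP address with a port yields None rather than a fragment of the address.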
