How to safely get the file extension from a URL?


Consider the following URLs:

http://m3u.com/tunein.m3u
http://asxsomeurl.com/listen.asx:8024
http://www.plssomeotherurl.com/station.pls?id=111
http://22.198.133.16:802


9 Answers

面向向阳花

2021-02-02 09:20

The proper way is not to rely on file extensions at all. Send a GET (or HEAD) request to the URL in question and use the "Content-Type" HTTP header of the response to determine the content type. File extensions are unreliable.

See MIME types (IANA media types) for more information and a list of useful MIME types.
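
As a rough sketch of the HEAD-request idea using only the standard library: the helper names `media_type_of` and `fetch_media_type` are made up for illustration, and since some servers reject HEAD or omit the header, a `None` result should be read as "unknown".

```python
from email.message import Message
from urllib.request import Request, urlopen


def media_type_of(headers):
    """Return the bare media type from a header object, or None if absent."""
    if headers.get("Content-Type") is None:
        return None
    # get_content_type() lowercases the value and drops parameters
    # such as "; charset=utf-8".
    return headers.get_content_type()


def fetch_media_type(url, timeout=10.0):
    """Send a HEAD request (needs network access) and return the media type."""
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=timeout) as response:
        return media_type_of(response.headers)


# Offline illustration with a hand-built header object:
demo = Message()
demo["Content-Type"] = "audio/x-mpegurl; charset=utf-8"
print(media_type_of(demo))
```

Calling `fetch_media_type(url)` on a live URL performs the actual request; the offline part only shows how the header value is reduced to a media type.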

粉色の甜心

2021-02-02 09:26
This is easiest with requests and mimetypes:
```
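import mimetypes
import requests

url = 'http://m3u.com/tunein.m3u'  # example URL from the question

response = requests.get(url, timeout=10)
# drop parameters such as "; charset=utf-8", which guess_extension cannot handle
content_type = response.headers.get('content-type', '').split(';')[0].strip()
extension = mimetypes.guess_extension(content_type)  # None for unknown types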
```
The extension includes a dot prefix. For example, extension is '.png' for content type 'image/png'.
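
To try the mapping without any network access, here is a minimal sketch. The helper name `extension_for` is hypothetical; it assumes the raw header value may carry parameters after the media type and that unknown types should come back as `None`.

```python
import mimetypes


def extension_for(content_type_header):
    """Map a raw Content-Type header value to an extension, or None."""
    if not content_type_header:
        return None
    # Keep only the media type; drop parameters like "; charset=utf-8".
    media_type = content_type_header.split(";", 1)[0].strip().lower()
    return mimetypes.guess_extension(media_type)


print(extension_for("image/png"))
print(extension_for("image/png; charset=binary"))
print(extension_for("application/x-not-a-real-type"))
```

The exact extension chosen for less common types can depend on the Python version and on the system's MIME database, so it is safer to handle `None` explicitly than to assume a particular result.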

后悔当初

2021-02-02 09:26

A different approach that looks only at the URL string itself and extracts the file extension without making any request:

import re

def fileExt(url):
    # compile regular expressions
    reOrigin = re.compile(r'^[a-z][a-z0-9+.-]*://[^/?#]*', re.IGNORECASE)
    reQuery = re.compile(r'[?#].*$')
    rePort = re.compile(r':[0-9]+')
    reExt = re.compile(r'(\.[A-Za-z0-9]+$)')

    # remove scheme and host, so "http://m3u.com" does not yield ".com"
    url = reOrigin.sub("", url)

    # remove query string and fragment
    url = reQuery.sub("", url)

    # remove port, including a trailing one as in "listen.asx:8024"
    url = rePort.sub("", url)

    # extract extension
    matches = reExt.search(url)
    if matches is not None:
        return matches.group(1)
    return None

Edit: added handling of explicit ports such as :1234.
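
For comparison, the same string-only idea can be sketched with the standard library's URL parser instead of hand-written patterns for the scheme, host and query. The function name `url_extension` is made up for this example, and stripping a trailing `:digits` from the path is an assumption made to cope with the `listen.asx:8024` URL in the question.

```python
import posixpath
import re
from urllib.parse import unquote, urlsplit

# A ":digits" suffix at the end of the path, as in "/listen.asx:8024".
_TRAILING_PORT = re.compile(r":\d+$")


def url_extension(url):
    """Return the extension of the URL's path (with the dot), or None."""
    # urlsplit separates host, query and fragment from the path.
    path = unquote(urlsplit(url).path)
    path = _TRAILING_PORT.sub("", path)
    ext = posixpath.splitext(path)[1]
    return ext or None


for u in (
    "http://m3u.com/tunein.m3u",
    "http://asxsomeurl.com/listen.asx:8024",
    "http://www.plssomeotherurl.com/station.pls?id=111",
    "http://22.198.133.16:802",
):
    print(u, url_extension(u))
```

Because only the path component is inspected, a bare host such as the last URL gives `None` rather than a piece of the IP address.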
