How to safely get the file extension from a URL?


Consider the following URLs

http://m3u.com/tunein.m3u
http://asxsomeurl.com/listen.asx:8024
http://www.plssomeotherurl.com/station.pls?id=111
http://22.198.133.16:802

9 answers

清歌不尽

2021-02-02 09:00

$ python3
Python 3.1.2 (release31-maint, Sep 17 2010, 20:27:33) 
[GCC 4.4.5] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from os.path import splitext
>>> from urllib.parse import urlparse 
>>> 
>>> urls = [
...     'http://m3u.com/tunein.m3u',
...     'http://asxsomeurl.com/listen.asx:8024',
...     'http://www.plssomeotherurl.com/station.pls?id=111',
...     'http://22.198.133.16:8024',
... ]
>>> 
>>> for url in urls:
...     path = urlparse(url).path
...     ext = splitext(path)[1]
...     print(ext)
... 
.m3u
.asx:8024
.pls

>>>


轻奢々

2021-02-02 09:04

You can try the rfc6266 module, for example:

import requests
import rfc6266

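# downloadLink is the URL to inspect (here, one of the URLs from the question)
downloadLink = 'http://www.plssomeotherurl.com/station.pls?id=111'
req = requests.head(downloadLink)
# Raises KeyError if the response has no Content-Disposition header
headersContent = req.headers['Content-Disposition']
rfcFilename = rfc6266.parse_headers(headersContent, relaxed=True).filename_unsafe
filename = requests.utils.unquote(rfcFilename)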
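
If installing a third-party module is not an option, the same idea can be sketched with the standard library alone. This is a different technique from rfc6266, not a drop-in replacement: it relies on `email.message.Message.get_filename()` to read the `filename` parameter. The function name and the header value below are made up for illustration.

```python
from email.message import Message
from posixpath import splitext


def filename_from_content_disposition(header_value):
    """Return the filename parameter of a Content-Disposition value, or None."""
    msg = Message()
    msg['Content-Disposition'] = header_value
    # get_filename() reads the "filename" parameter and strips surrounding quotes
    return msg.get_filename()


# Hypothetical header value, for illustration only
name = filename_from_content_disposition('attachment; filename="station.pls"')
ext = splitext(name)[1] if name else ''
```

The filename comes from the server, so treat it as untrusted input before using it on disk.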


既然无缘

2021-02-02 09:05

File extensions are basically meaningless in URLs. For example, if you go to http://code.google.com/p/unladen-swallow/source/browse/branches/release-2009Q1-maint/Lib/psyco/support.py?r=292 do you want the extension to be ".py" despite the fact that the page is HTML, not Python?

Use the Content-Type header to determine the "type" of a URL.
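
Once you have the Content-Type value, one possible way to turn it into an extension is the standard `mimetypes` module. This is only a sketch: the function name is made up here, and the result depends on the MIME tables available on your platform, so a given type may map to a different extension or to none.

```python
import mimetypes


def extension_for_content_type(content_type):
    """Guess a file extension from a Content-Type header value, or return ''."""
    # Drop parameters such as "; charset=utf-8" and normalise the case
    mime_type = content_type.split(';', 1)[0].strip().lower()
    # guess_extension() returns None for types it does not know
    return mimetypes.guess_extension(mime_type) or ''
```

For example, `extension_for_content_type('application/pdf')` should give `.pdf`, while an unknown type gives an empty string.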


庸人自扰

2021-02-02 09:08
Use urlparse to parse the path out of the URL, then os.path.splitext to get the extension.
```
# Python 2 module names; on Python 3, urlparse lives in urllib.parse
import urlparse, os

url = 'http://www.plssomeotherurl.com/station.pls?id=111'
path = urlparse.urlparse(url).path
ext = os.path.splitext(path)[1]
```
Note that the extension may not be a reliable indicator of the type of the file. The HTTP Content-Type header may be better.

不要未来只要你来

2021-02-02 09:09
To get the Content-Type, you can write a function like this one, which uses urllib2. If you need the page content anyway, you will probably be using urllib2 already, so there is no need to import os.
```
# Python 2 only; urllib2 became urllib.request in Python 3
import urllib2

def getContentType(pageUrl):
    page = urllib2.urlopen(pageUrl)
    pageHeaders = page.headers
    contentType = pageHeaders.getheader('content-type')
    return contentType
```
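
On Python 3, a rough equivalent might look like the sketch below. It is not the original author's code: the snake_case name and the use of a HEAD request (to avoid downloading the body) are choices made here. Note that `get_content_type()` falls back to `text/plain` when the server sends no Content-Type header.

```python
from urllib.request import Request, urlopen


def get_content_type(page_url):
    """Return the media type from the Content-Type header, without parameters."""
    # HEAD asks for headers only; some servers reject it, in which case use GET
    request = Request(page_url, method='HEAD')
    with urlopen(request) as response:
        return response.headers.get_content_type()
```

If you need the body as well, drop `method='HEAD'` and read from `response` inside the `with` block.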

[愿得一人]

2021-02-02 09:16

Use urlparse, which handles most of the cases above:

http://docs.python.org/library/urlparse.html

Then split up the "path" component. You might be able to do that with os.path.split, but your second example, with :8024 on the end, needs manual handling. Are your file extensions always three letters? Or always letters and numbers? If so, use a regular expression.
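
A minimal Python 3 sketch of that manual handling, assuming an extension consists only of ASCII letters and digits (that rule is an assumption, so adjust the pattern if your extensions differ). The names `url_extension` and `_EXT_RE` are made up for this example, and `posixpath.splitext` is used so that the path is always split with URL-style forward slashes.

```python
import re
from posixpath import splitext
from urllib.parse import urlparse

# Assumption: an extension is a dot followed by ASCII letters or digits only
_EXT_RE = re.compile(r'\.[A-Za-z0-9]+')


def url_extension(url):
    """Return the extension of the URL path, ignoring trailing junk such as ':8024'."""
    path = urlparse(url).path
    suffix = splitext(path)[1]      # may be '.asx:8024' or ''
    match = _EXT_RE.match(suffix)   # keep only the leading '.xyz' part
    return match.group(0) if match else ''
```

With the URLs from the question, this gives `.m3u`, `.asx` and `.pls`, and an empty string for the bare host with a port.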

