I am looking for a way to extract a filename and extension from a particular url using Python
lets say a URL looks as follows
picture_page = \"http://dis
>>> import re
>>> s = 'picture_page = "http://distilleryimage2.instagram.com/da4ca3509a7b11e19e4a12313813ffc0_7.jpg"'
>>> re.findall(r'\/([a-zA-Z0-9_]*)\.[a-zA-Z]*\"$',s)[0]
'da4ca3509a7b11e19e4a12313813ffc0_7'
>>> re.findall(r'([a-zA-Z]*)\"$',s)[0]
'jpg'
try:
# Python 3
from urllib.parse import urlparse
except ImportError:
# Python 2
from urlparse import urlparse
from os.path import splitext, basename
picture_page = "http://distilleryimage2.instagram.com/da4ca3509a7b11e19e4a12313813ffc0_7.jpg"
disassembled = urlparse(picture_page)
filename, file_ext = splitext(basename(disassembled.path))
Only downside with this is that your filename will contain a preceding / which you can always remove yourself.
filename = picture_page.split('/')[-1].split('.')[0]
file_ext = '.'+picture_page.split('.')[-1]
# Here's your link:
picture_page = "http://distilleryimage2.instagram.com/da4ca3509a7b11e19e4a12313813ffc0_7.jpg"
#Here's your filename and ext:
filename, ext = (picture_page.split('/')[-1].split('.'))
When you do picture_page.split('/'), it will return a list of strings from your url split by a /
.
If you know python list indexing well, you'd know that -1 will give you the last element or the first element from the end of the list.
In your case, it will be the filename: da4ca3509a7b11e19e4a12313813ffc0_7.jpg
Splitting that by delimeter .
, you get two values:
da4ca3509a7b11e19e4a12313813ffc0_7
and jpg
, as expected, because they are separated by a period which you used as a delimeter in your split() call.
Now, since the last split returns two values in the resulting list, you can tuplify it. Hence, basically, the result would be like:
filename,ext = ('da4ca3509a7b11e19e4a12313813ffc0_7', 'jpg')
os.path.splitext
will help you extract the filename and extension once you have extracted the relevant string from the URL using urlparse
:
fName, ext = os.path.splitext('yourImage.jpg')
This is the easiest way to find image name and extension using regular expression.
import re
import sys
picture_page = "http://distilleryimage2.instagram.com/da4ca3509a7b11e19e4a12313813ffc0_7.jpg"
regex = re.compile('(.*\/(?P<name>\w+)\.(?P<ext>\w+))')
print regex.search(picture_page).group('name')
print regex.search(picture_page).group('ext')