Read contents of .tar.gz file from website into a python 3.x object

Posted by 老子叫甜甜 on 2019-12-01 07:12:30

Question


I am new to Python. I can't figure out what I am doing wrong when trying to read the contents of a .tar.gz file into Python. The tarfile I would like to read is hosted at the following web address:

ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/b0/ac/Breast_Cancer_Res_2001_Nov_9_3(1)_61-65.tar.gz

More information about the file is available here (just so you can trust the contents): http://www.pubmedcentral.nih.gov/utils/oa/oa.fcgi?id=PMC13901

The tarfile contains .pdf and .nxml copies of the journal article, as well as a couple of image files.

If I paste the address into my browser, I can save the file to a location on my PC and open it with tarfile using the following commands (note: WinZip changes the extension from .tar.gz to simply .tar when I save it):

import tarfile
thetarfile = "C:/Users/dfcm/Documents/Breast_Cancer_Res_2001_Nov_9_3(1)_61-65.tar"
tfile = tarfile.open(thetarfile)
tfile

However, if I try to access the file directly using similar commands:

thetarfile = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/b0/ac/Breast_Cancer_Res_2001_Nov_9_3(1)_61-65.tar.gz"
bbb = tarfile.open(thetarfile)

That results in the following error:

Traceback (most recent call last):
  File "<pyshell#137>", line 1, in <module>
    bbb = tarfile.open(thetarfile)
  File "C:\Python30\lib\tarfile.py", line 1625, in open
    return func(name, "r", fileobj, **kwargs)
  File "C:\Python30\lib\tarfile.py", line 1687, in gzopen
    fileobj = bltn_open(name, mode + "b")
  File "C:\Python30\lib\io.py", line 278, in __new__
    return open(*args, **kwargs)
  File "C:\Python30\lib\io.py", line 222, in open
    closefd)
  File "C:\Python30\lib\io.py", line 615, in __init__
    _fileio._FileIO.__init__(self, name, mode, closefd)
IOError: [Errno 22] Invalid argument: 'ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/b0/ac/Breast_Cancer_Res_2001_Nov_9_3(1)_61-65.tar'

Can anyone explain what I am doing wrong when trying to read the .tar.gz file directly from the web address? Thanks in advance. Chris


Answer 1:


Unfortunately, tarfile.open cannot open a URL directly. It treats its first argument as a path on the local filesystem and hands it to the built-in open, which is why the call fails with "Invalid argument". You have to make the network request yourself and obtain an object that represents the open connection. This can be done using the urllib.request module.

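import tarfile
import urllib.request

# Address of the archive on the FTP server
url = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/b0/ac/Breast_Cancer_Res_2001_Nov_9_3(1)_61-65.tar.gz"
# urlopen returns a file-like object wrapping the connection
ftpstream = urllib.request.urlopen(url)
# "r|gz" reads the gzip-compressed archive as a forward-only stream
thetarfile = tarfile.open(fileobj=ftpstream, mode="r|gz")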

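The ftpstream object is a file-like object that represents the connection to the FTP server, and the tarfile module can read the archive from it. Since we do not pass a filename, we specify the compression in the mode parameter: "r|gz" means a gzip-compressed archive read as a sequential stream. Stream mode suits a network connection because the connection cannot seek backwards, but it also means the archive members have to be processed in the order they arrive.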
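
To actually get the contents into Python objects, the members can be read off the stream one by one. The following is only a minimal sketch of that idea, not part of the original answer: the helper names read_members, read_members_from_url and make_sample_archive are made up for illustration, and the demo uses a small archive built in memory (with invented file contents) so it runs without network access.

```python
import io
import tarfile
import urllib.request


def read_members(fileobj):
    """Return {member name: bytes} for each regular file in a gzip-compressed tar stream."""
    contents = {}
    # "r|gz" is forward-only, so each member is read before moving to the next.
    with tarfile.open(fileobj=fileobj, mode="r|gz") as archive:
        for member in archive:
            if member.isfile():
                contents[member.name] = archive.extractfile(member).read()
    return contents


def read_members_from_url(url):
    """Hypothetical wrapper: open the URL and read the archive straight off the connection."""
    with urllib.request.urlopen(url) as stream:
        return read_members(stream)


def make_sample_archive():
    """Build a tiny .tar.gz in memory; the names and contents are invented for the demo."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name, data in [("article.nxml", b"<article/>"), ("article.pdf", b"%PDF-1.4")]:
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    buffer.seek(0)
    return buffer


sample = read_members(make_sample_archive())
print(sorted(sample))
```

With the real archive, the call would be read_members_from_url(url) using the FTP address from the question. If random access to members is needed instead, one alternative is to read the whole response into an io.BytesIO object first and open that with mode "r:gz", at the cost of holding the compressed archive in memory.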



Source: https://stackoverflow.com/questions/18617244/read-contents-of-tar-gz-file-from-website-into-a-python-3-x-object
