Question
I want to use the Wiktionary dump for POS tagging, but my script fails at the download step. Here is my code:
import nltk
from urllib import urlopen
from collections import Counter
import gzip
url = 'http://dumps.wikimedia.org/enwiktionary/latest/enwiktionary-latest-all-titles-in-ns0.gz'
fStream = gzip.open(urlopen(url).read(), 'rb')
dictFile = fStream.read()
fStream.close()
text = nltk.Text(word.lower() for word in dictFile())
tokens = nltk.word_tokenize(text)
Here is the error I get:
Traceback (most recent call last):
File "~/dir1/dir1/wikt.py", line 15, in <module>
fStream = gzip.open(urlopen(url).read(), 'rb')
File "/usr/lib/python2.7/gzip.py", line 34, in open
return GzipFile(filename, mode, compresslevel)
File "/usr/lib/python2.7/gzip.py", line 89, in __init__
fileobj = self.myfileobj = __builtin__.open(filename, mode or 'rb')
TypeError: file() argument 1 must be encoded string without NULL bytes, not str
Process finished with exit code 1
Answer 1:
You are passing the downloaded data to gzip.open(), which expects a filename instead. The code then tries to open a file whose name is the gzipped data itself, and fails.
Either save the URL data to a file and then use gzip.open() on that, or decompress the gzipped data using the zlib module instead. 'Saving' the data can be as easy as using a StringIO.StringIO() in-memory file object:
from StringIO import StringIO
from urllib import urlopen
import gzip
url = 'http://dumps.wikimedia.org/enwiktionary/latest/enwiktionary-latest-all-titles-in-ns0.gz'
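# Download the compressed bytes and wrap them in an in-memory file object
inmemory = StringIO(urlopen(url).read())
# Pass the wrapper as fileobj so GzipFile reads from it instead of opening a path
fStream = gzip.GzipFile(fileobj=inmemory, mode='rb')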
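The answer mentions the zlib route without showing it, and the snippet above targets Python 2. As a rough sketch only, here is how both options might look on Python 3, assuming io.BytesIO as the in-memory file object and urllib.request.urlopen for the download. The function names gunzip_with_fileobj, gunzip_with_zlib and download_titles are made up for illustration and are not part of the original answer.

```python
import gzip
import io
import zlib
from urllib.request import urlopen  # Python 3 location of urlopen (assumption)


def gunzip_with_fileobj(data):
    """Decompress gzip bytes by wrapping them in an in-memory file object."""
    with gzip.GzipFile(fileobj=io.BytesIO(data), mode='rb') as stream:
        return stream.read()


def gunzip_with_zlib(data):
    """Decompress gzip bytes with zlib instead of the gzip module."""
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    return zlib.decompress(data, 16 + zlib.MAX_WBITS)


def download_titles(url):
    """Hypothetical helper: fetch a gzipped title list and return it as lines."""
    raw = urlopen(url).read()
    # Assumes the dump is UTF-8 text with one title per line.
    return gunzip_with_fileobj(raw).decode('utf-8').splitlines()
```

Both helpers take the already-downloaded bytes, so either one can be swapped into download_titles. Calling download_titles(url) with the url from the question should return the titles as a list of strings, provided the dump is still served at that address.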
Source: https://stackoverflow.com/questions/18146389/urlopen-trouble-while-trying-to-download-a-gzip-file