So, this is a seemingly simple question, but I'm apparently very very dull. I have a little script that downloads all the .bz2 files from a webpage, but for some reason the dec
You're opening and reading the compressed file as if it was a textfile made up of lines. DON'T! It's NOT.
uncompressedData = bz2.BZ2File(zipFile).read()
seems to be closer to what you're angling for.
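A minimal sketch of that one-liner in context, assuming Python 3. The helper name read_bz2 and the sample file are made up for illustration; the file is created on the spot so the snippet runs on its own:

```python
import bz2
import os
import tempfile


def read_bz2(path):
    """Return the decompressed contents of a .bz2 file as bytes."""
    # BZ2File handles the binary details itself; the with-block closes the file.
    with bz2.BZ2File(path) as compressed:
        return compressed.read()


# Build a small sample archive (hypothetical name) so the example is self-contained.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.txt.bz2")
with open(sample_path, "wb") as out:
    out.write(bz2.compress(b"first line\nsecond line\n"))

uncompressed_data = read_bz2(sample_path)
print(uncompressed_data)
```

The result is a bytes object; decode it yourself if you know the payload is text in a particular encoding.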
Edit: the OP has shown a few more things he's tried (though I don't see any notes about having tried the best method -- the one-liner I recommend above!) but they seem to all have one error in common, and I repeat the key bits from above:
opening ... the compressed file as if it was a textfile ... It's NOT.
Both open(filename) and the more explicit open(filename, 'r') open the file for reading as a text file. A compressed file is a binary file, so in order to read it correctly you must open it with open(filename, 'rb'). (My recommended bz2.BZ2File KNOWS it's dealing with a compressed file, of course, so there's no need to tell it anything more.)
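If you do want to go through the open built-in, a rough sketch of the 'rb' route might look like this. The function name decompress_file and the demo file are assumptions for the sake of the example, not anything from the OP's script:

```python
import bz2
import os
import tempfile


def decompress_file(path):
    """Read a .bz2 file as raw bytes, then decompress it in memory."""
    # 'rb' keeps every byte intact: no newline translation, no decoding.
    with open(path, "rb") as raw:
        compressed = raw.read()
    return bz2.decompress(compressed)


# Hypothetical sample file, created here so the sketch is self-contained.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.bz2")
with open(demo_path, "wb") as out:
    out.write(bz2.compress(b"hello, bz2\n"))

print(decompress_file(demo_path))
```

This reads the whole compressed file into memory first, which is fine for small downloads; for large files, bz2.BZ2File reads incrementally and is the simpler choice anyway.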
In Python 2.*, on Unix-y systems (i.e. every system except Windows), you could get away with a sloppy use of open (but in Python 3.* you can't, as text is Unicode, while binary is bytes: different types).
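To illustrate the Python 3 side of this, here is a small sketch showing that the bz2 functions work on bytes and refuse str outright. The helper rejects_text is a made-up name used only for this demonstration:

```python
import bz2


def rejects_text(value):
    """Return True if bz2.decompress refuses the value with a TypeError."""
    try:
        bz2.decompress(value)
    except TypeError:
        return True
    return False


payload = bz2.compress(b"some data")  # bytes in, bytes out

print(type(payload))               # compressed data is a bytes object
print(rejects_text("not bytes"))   # a str is not accepted
print(rejects_text(payload))       # real compressed bytes are fine
```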
In Windows (and before then in DOS) it's always been indispensable to distinguish, as Windows' text files, for historical reasons, are peculiar (they use two bytes rather than one to end lines and, at least in some cases, take a byte worth '\x1a' as meaning a logical end of file), and so the low-level reading and writing code must compensate.
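The line-ending half of this can be reproduced on any platform under Python 3, because its text mode translates '\r\n' to '\n' on read by default. The following is only a sketch of that effect (the file name and sample bytes are invented, and latin-1 is chosen purely so that every byte decodes to exactly one character):

```python
import os
import tempfile

# A CR LF pair like one that could turn up anywhere inside compressed data.
raw_bytes = b"AB\r\nCD"

path = os.path.join(tempfile.mkdtemp(), "blob.bin")
with open(path, "wb") as out:
    out.write(raw_bytes)

# Binary mode returns the bytes exactly as stored.
with open(path, "rb") as f:
    binary_read = f.read()

# Text mode silently rewrites the line ending, so one byte goes missing.
with open(path, "r", encoding="latin-1") as f:
    text_read = f.read()

print(len(raw_bytes), len(binary_read), len(text_read))
```

A single altered byte is enough to make a compressed stream undecodable, which is why the mode matters so much here.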
So I suspect the OP is using Windows and is paying the price for not carefully using the 'rb' option ("read binary") to the open built-in. (Though bz2.BZ2File is still simpler, whatever platform you're using!)