问题
I have some large base64 encoded data (stored in snappy files in the hadoop filesystem). This data was originally gzipped text data. I need to be able to read chunks of this encoded data, decode it, and then flush it to a GZIPOutputStream.
Any ideas on how I could do this instead of loading the whole base64 data into an array and calling Base64.decodeBase64(byte[]) ?
Am I right if I read the characters till the '\r\n' delimiter and decode it line by line? e.g. :
for (int i = 0; i < byteData.length; i++) {
if (byteData[i] == CARRIAGE_RETURN || byteData[i] == NEWLINE) {
if (i < byteData.length - 1 && byteData[i + 1] == NEWLINE)
i += 2;
else
i += 1;
byteBuffer.put(Base64.decodeBase64(record));
byteCounter = 0;
record = new byte[8192];
} else {
record[byteCounter++] = byteData[i];
}
}
Sadly, this approach doesn't give any human readable output. Ideally, I would like to stream read, decode, and stream out the data.
Right now, I'm trying to put in an inputstream and then copy to a gzipout
byteBuffer.get(bufferBytes);
InputStream inputStream = new ByteArrayInputStream(bufferBytes);
inputStream = new GZIPInputStream(inputStream);
IOUtils.copy(inputStream , gzipOutputStream);
And it gives me a java.io.IOException: Corrupt GZIP trailer
回答1:
Let's go step by step:
You need a
GZIPInputStream
to read zipped data (that and not aGZIPOutputStream
; the output stream is used to compress data). Having this stream you will be able to read the uncompressed, original binary data. This requires anInputStream
in the constructor.You need an input stream capable of reading the Base64 encoded data. I suggest the handy Base64InputStream from apache-commons-codec. With the constructor you can set the line length, the line separator and set
doEncode=false
to decode data. This in turn requires another input stream - the raw, Base64 encoded data.This stream depends on how you get your data; ideally the data should be available as
InputStream
- problem solved. If not, you may have to use theByteArrayInputStream
(if binary),StringBufferInputStream
(if string) etc.
Roughly this logic is:
InputStream fromHadoop = ...; // 3rd paragraph
Base64InputStream b64is = // 2nd paragraph
new Base64InputStream(fromHadoop, false, 80, "\n".getBytes("UTF-8"));
GZIPInputStream zis = new GZIPInputStream(b64is); // 1st paragraph
Please pay attention to the arguments of Base64InputStream
(line length and end-of-line byte array), you may need to tweak them.
回答2:
Thanks to Nikos for pointing me in the right direction. Specifically this is what I did:
private static final byte NEWLINE = (byte) '\n';
private static final byte CARRIAGE_RETURN = (byte) '\r';
byte[] lineSeparators = new byte[] {CARRIAGE_RETURN, NEWLINE};
Base64InputStream b64is = new Base64InputStream(inputStream, false, 76, lineSeparators);
GZIPInputStream zis = new GZIPInputStream(b64is);
Isn't 76 the length of the Base64 line? I didn't try with 80, though.
来源:https://stackoverflow.com/questions/19980307/stream-decoding-of-base64-data