Question
I'm trying to parse Wiktionary dumps on the fly in Java, reading directly from the URL. The dumps are distributed as BZIP2-compressed files, and I am using the following approach (via Apache Commons Compress) to parse them:
import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import org.apache.commons.compress.compressors.CompressorInputStream;
import org.apache.commons.compress.compressors.CompressorStreamFactory;
String fileURL = "https://dumps.wikimedia.org/cswiktionary/20171120/cswiktionary-20171120-pages-articles-multistream.xml.bz2";
URL bz2 = new URL(fileURL);
BufferedInputStream bis = new BufferedInputStream(bz2.openStream());
CompressorInputStream input = new CompressorStreamFactory().createCompressorInputStream(bis);
BufferedReader br2 = new BufferedReader(new InputStreamReader(input));
System.out.println(br2.lines().count());
However, the printed line count is only 36, a tiny fraction of the total, given that the file is over 20 MB even compressed. Printing the stream line by line confirmed that only a few lines of XML were actually being read:
String line = br2.readLine();
while (line != null) {
    System.out.println(line);
    line = br2.readLine();
}
Is there something I'm missing here? I copied my implementation almost line for line from code samples I found online, which others claimed worked for them. Why isn't the entire stream being read? Thanks in advance.
Answer 1:
So as it turns out, I was just being dumb. Wiktionary BZIP2 dumps are explicitly multistream (it even says so in the filename): the file is several independent BZIP2 streams concatenated together, and the vanilla Commons Compress classes stop after decoding the first stream. You need a multistream-aware reader to consume the whole file, and from the looks of things, you have to write one yourself. I came across the following implementation, which worked for me:
https://chaosinmotion.blog/2011/07/29/and-another-curiosity-multi-stream-bzip2-files/
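For what it's worth, Commons Compress can also be told to keep decoding across stream boundaries on its own: BZip2CompressorInputStream has a constructor that takes a decompressConcatenated flag. A minimal sketch of that approach (not the blog's implementation), assuming a reasonably recent Commons Compress version:

import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream;

public class MultistreamDumpReader {
    public static void main(String[] args) throws IOException {
        URL bz2 = new URL("https://dumps.wikimedia.org/cswiktionary/20171120/cswiktionary-20171120-pages-articles-multistream.xml.bz2");
        BufferedInputStream bis = new BufferedInputStream(bz2.openStream());
        // The second argument (decompressConcatenated = true) tells the decoder
        // to continue into subsequent BZIP2 streams instead of stopping at the
        // end of the first one.
        BZip2CompressorInputStream input = new BZip2CompressorInputStream(bis, true);
        try (BufferedReader br = new BufferedReader(new InputStreamReader(input))) {
            System.out.println(br.lines().count());
        }
    }
}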
Hope this helps someone in the future :)
Source: https://stackoverflow.com/questions/47490231/why-cant-i-seem-to-read-an-entire-compressed-file-from-a-url-stream