Question
I'm currently implementing an e-reader library (SkyEpub) that requires me to implement a method that checks whether a ZipEntry exists. In their demo version, the solution is simple:
public boolean isExists(String baseDirectory, String contentPath) {
    setupZipFile(baseDirectory, contentPath);
    if (this.isCustomFont(contentPath)) {
        String path = baseDirectory + "/" + contentPath;
        File file = new File(path);
        return file.exists();
    }
    ZipEntry entry = this.getZipEntry(contentPath);
    return entry != null;
}

// Entry name should start without / like META-INF/container.xml
private ZipEntry getZipEntry(String contentPath) {
    if (zipFile == null) return null;
    String[] subDirs = contentPath.split(Pattern.quote(File.separator));
    String corePath = contentPath.replace(subDirs[1], "");
    corePath = corePath.replace("//", "");
    return zipFile.getEntry(corePath.replace(File.separatorChar, '/'));
}
So as you can see, you can access the ZipEntry in question in O(1) time using getZipEntry(contentPath).
However, in my case I cannot read the zip file straight from the file system (for security reasons it must be read from memory). My isExists implementation therefore walks the zip file one entry at a time until it finds the entry in question. Here is the relevant part:
try {
    final InputStream stream = dbUtil.getBookStream(bookEditionID);
    if (stream == null) return null;
    final ZipInputStream zip = new ZipInputStream(stream);
    ZipEntry entry;
    do {
        entry = zip.getNextEntry();
        if (entry == null) {
            zip.close();
            return null;
        }
    } while (!entry.getName().equals(zipEntryName));
    // ... the matched entry's bytes are then read into data ...
} catch (IOException e) {
    Log.e("demo", "Can't get content data for " + contentPath);
    return null;
}
return data;
and so isExists returns true if data exists, false otherwise.
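For context, the linear scan above can be wrapped into a small self-contained existence check. This is only a sketch of the approach described in the question, using try-with-resources so the stream is always closed; the names ZipEntryExists, entryExists, and sampleZip are my own:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipEntryExists {

    // Linear scan over the stream: O(n) in the number of entries.
    // getNextEntry() advances past each entry's data for us.
    static boolean entryExists(InputStream stream, String zipEntryName) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(stream)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.getName().equals(zipEntryName)) {
                    return true;
                }
            }
            return false;
        }
    }

    // Builds a tiny in-memory archive for demonstration.
    static byte[] sampleZip() throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bos)) {
            zos.putNextEntry(new ZipEntry("META-INF/container.xml"));
            zos.write("<container/>".getBytes());
            zos.closeEntry();
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(entryExists(new ByteArrayInputStream(sampleZip()), "META-INF/container.xml")); // true
        System.out.println(entryExists(new ByteArrayInputStream(sampleZip()), "missing.xml"));            // false
    }
}
```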
Question
Is there a way I can find the zip entry in question from the entire ZipInputStream in O(1) time rather than O(n) time?
Related
See this question and this answer.
Answer 1:
An entry in a zip archive cannot really be loaded in O(1) time. If we look at the structure of a zip archive, it looks like this:
[local file header 1]
[encryption header 1]
[file data 1]
[data descriptor 1]
...
[local file header n]
[encryption header n]
[file data n]
[data descriptor n]
[archive decryption header]
[archive extra data record]
[central directory header 1]
...
[central directory header n]
[zip64 end of central directory record]
[zip64 end of central directory locator]
[end of central directory record]
Basically, there are compressed files with some headers, plus a "central directory" that contains all the metadata about the files (the central directory headers). The only valid way to locate an entry is by scanning the central directory (more info):
...must not scan for entries from the top of the ZIP file, because only the central directory specifies where a file chunk starts
Because there is no index over the central directory headers, you can only get an entry in O(n), where n is the number of files in the archive.
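Once the whole archive is in memory (as in the question), you can at least seek straight to the central directory instead of streaming every entry's data. A minimal stdlib-only sketch that locates the end of central directory record and reads the entry count from it; the class and helper names are my own, and a real parser must also handle ZIP64 records and archive comments:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class EocdPeek {

    // Scan backwards for the end-of-central-directory signature PK\05\06.
    // Caveat: a trailing archive comment that happens to contain the
    // signature bytes would fool this simplistic scan.
    static int findEocd(byte[] zip) {
        for (int i = zip.length - 22; i >= 0; i--) { // 22 = minimum EOCD size
            if (zip[i] == 0x50 && zip[i + 1] == 0x4b
                    && zip[i + 2] == 0x05 && zip[i + 3] == 0x06) {
                return i;
            }
        }
        return -1;
    }

    // The total entry count is a little-endian u16 at offset 10 of the EOCD.
    static int entryCount(byte[] zip) {
        int eocd = findEocd(zip);
        if (eocd < 0) throw new IllegalArgumentException("no EOCD record found");
        return (zip[eocd + 10] & 0xff) | ((zip[eocd + 11] & 0xff) << 8);
    }

    // Builds a small in-memory archive for demonstration.
    static byte[] sampleZip() throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bos)) {
            for (String name : new String[] {"a.txt", "b.txt", "c.txt"}) {
                zos.putNextEntry(new ZipEntry(name));
                zos.write(name.getBytes());
                zos.closeEntry();
            }
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(entryCount(sampleZip())); // 3
    }
}
```

From the EOCD you also get the central directory's offset and size, which is what a full lookup-by-name implementation would walk next.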
Update: Unfortunately, all zip libraries I know of that work with streams rather than files use the local file headers and scan the entire stream, contents included, and none of them is easily bent to do otherwise. The only way to avoid scanning the entire archive that I found is to adapt a library yourself.
Update 2: I have taken the liberty of modifying the aforementioned zip4j library for your purposes. Assuming you have read your zip file into a byte array and have added a dependency on zip4j version 1.3.2, you can use MemoryHeaderReader and RandomAccessStream like this:
String myZipFile = "...";
byte[] bytes = readFile();
MemoryHeaderReader headerReader = new MemoryHeaderReader(RandomAccessStream.fromBytes(bytes));
ZipModel zipModel = headerReader.readAllHeaders();
FileHeader myFile = Zip4jUtil.getFileHeader(zipModel, myZipFile);
boolean fileIsPresent = myFile != null;
It works in O(entryCount) without reading the entire archive, which should be reasonably fast. I haven't thoroughly tested it, but it should give you an idea of how you can adjust zip4j for your purposes.
Answer 2:
If the archive's content is in memory, then it is seekable, and you could search for the central directory and use it yourself. Neither ZipFile nor Apache Commons Compress' equivalent works on anything but Files right now, but other open source libraries might (not sure about zip4j).
The code inside of Apache Commons Compress' ZipFile that searches for the central directory and parses it should be pretty easy to adapt to a case where the archive is available as a byte[]. In fact, there is a patch that hasn't been applied that could help, as part of COMPRESS-327.
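For what it's worth, later Apache Commons Compress releases (1.13 and up) added exactly this capability: its ZipFile can be opened on a SeekableByteChannel, and SeekableInMemoryByteChannel wraps a byte[]. Assuming that dependency is on the classpath, a sketch (method names entryExists and sampleZip are my own):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

import org.apache.commons.compress.archivers.zip.ZipFile;
import org.apache.commons.compress.utils.SeekableInMemoryByteChannel;

public class InMemoryZipLookup {

    // Commons Compress parses only the central directory on open,
    // so lookups by name never stream the compressed file data.
    static boolean entryExists(byte[] zipBytes, String entryName) throws IOException {
        try (ZipFile zip = new ZipFile(new SeekableInMemoryByteChannel(zipBytes))) {
            return zip.getEntry(entryName) != null;
        }
    }

    // Builds a tiny in-memory archive for demonstration.
    static byte[] sampleZip() throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bos)) {
            zos.putNextEntry(new ZipEntry("OEBPS/chapter1.xhtml"));
            zos.write("<html/>".getBytes());
            zos.closeEntry();
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(entryExists(sampleZip(), "OEBPS/chapter1.xhtml")); // true
    }
}
```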
Answer 3:
Technically the search is always O(n), where n is the number of entries in the zip file, since you have to do a linear search either through the central directory or through the local headers.
You seem to imply that the zip file is loaded entirely into memory. In that case, the fastest thing to do is to search for the entry in the central directory. If you find it, that directory entry will then point to the local header.
If you are doing a lot of searches on the same zip file, then you can build a hash table of the names in the central directory once in O(n) time, and then use that to search for a given name in approximately O(1) time.
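A stdlib-only sketch of that idea: one O(n) pass over the stream collects the entry names into a HashSet, after which each lookup is an amortized O(1) hash probe (the class name ZipNameIndex is my own):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.HashSet;
import java.util.Set;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipNameIndex {

    private final Set<String> names = new HashSet<>();

    // Single O(n) pass: record every entry name; the entry data is not needed.
    public ZipNameIndex(InputStream stream) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(stream)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
    }

    // Amortized O(1) per lookup once the index is built.
    public boolean isExists(String entryName) {
        return names.contains(entryName);
    }

    // Builds a tiny in-memory archive for demonstration.
    static byte[] sampleZip() throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bos)) {
            for (String name : new String[] {"mimetype", "META-INF/container.xml"}) {
                zos.putNextEntry(new ZipEntry(name));
                zos.write(name.getBytes());
                zos.closeEntry();
            }
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        ZipNameIndex index = new ZipNameIndex(new ByteArrayInputStream(sampleZip()));
        System.out.println(index.isExists("mimetype")); // true
        System.out.println(index.isExists("missing"));  // false
    }
}
```

Build the index once per book, not per isExists call, so the O(n) scan cost is paid a single time.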
Source: https://stackoverflow.com/questions/36809245/how-to-access-a-zipentry-from-a-streamed-zip-file-in-memory