I need to be able to search for text in a large number of files (.txt) that are zipped. Compression may be changed to something else or even became proprietary. I want to avoid
Searching for text in compressed files can be faster than searching for the same thing in uncompressed text files.
One compression technique I've seen that sacrifices some space in order to do fast searches:
In particular, searching for a single word usually reduces to comparing the 16-bit index in the compressed text, which is faster than searching for that word in the original text, because
Some kinds of regular expressions can be translated to another regular expression that directly finds items in the compressed file (and also perhaps also finds a few false positives). Such a search also does fewer comparisons than using the original regular expression on the original text file, because the compressed file is shorter, but typically each regular expression comparison requires more work, so it may or may not be faster than the original regex operating on the original text.
(In principle you could replace the fixed-length 16-bit codes with variable-length Huffman prefix codes, as rwong mentioned -- the resulting compressed file would be smaller, but the software to deal with those files would be a little slower and more complicated).
For more sophisticated techniques, you might look at
This is possible, and can be done quite efficiently. There's a lot of exciting research on this topic, more formally known as a Succinct data structure. Some topics I would recommend looking into: Wavelet tree, FM-index/RRR, succinct suffix arrays. You can also efficiently search Huffman encoded strings, as a number of publications have demonstrated.
I may be completely wrong here, but I don't think there'd be a reliable way to search for a given string without decoding the files. My understanding of compressions algorithms is that the bit-stream corresponding to a given string would depend greatly on what comes before the string in the uncompressed file. You may be able to find a given encoding for a particular string in a given file, but I'm pretty sure it wouldn't be consistent between files.
Most text files are compressed with one of the LZ-family of algorithms, which combine a Dictionary Coder together with an Entropy Coder such as Huffman.
Because the Dictionary Coder relies on a continuously-updated "dictionary", its coding result is dependent on the history (all codes in the dictionary that is derived from the input data up to the current symbol), so it is not possible to jump into a certain location and start decoding, without first decoding all of the previous data.
In my opinion, you can just use a zlib stream decoder which returns decompressed data as it goes without waiting for the entire file to be decompressed. This will not save execution time but will save memory.
A second suggestion is to do Huffman coding on English words, and forget about the Dictionary Coder part. Each English word gets mapped to a unique prefix-free code.
Finally, @SHODAN gave the most sensible suggestion, which is to index the files, compress the index and bundle with the compressed text files. To do a search, decompress just the index file and look up the words. This is in fact an improvement over doing the Huffman coding on words - once you found the frequency of words (in order to assign the prefix code optimally), you have already built the index, so you can keep the index for searching.
It is unlikely you'll be able to search for uncompressed strings in a compressed file. I guess one for your best options is to index the files somehow. Using Lucene perhaps?