Fast search in compressed text files

I need to be able to search for text in a large number of files (.txt) that are zipped. The compression format may change to something else, or even become proprietary. I want to avoid

5 Answers

面向向阳花

2021-02-04 19:37
Searching for text in compressed files can be faster than searching for the same thing in uncompressed text files.

One compression technique I've seen that sacrifices some space in order to do fast searches:
- Maintain a dictionary with 2^16 entries, one for every word in the text. Reserve the first 256 entries for literal bytes, in case you come upon a word that isn't in the dictionary -- even though many large texts have fewer than 32,000 unique words, so they never need to use those literal bytes.
- Compress the original text by substituting the 16-bit dictionary index for each word.
- (optional) In the normal case that two words are separated by a single space character, discard that space character; otherwise put all the bytes in the string between words into the dictionary as a special "word" (for example, ". " and ", " and "\n") tagged with the "no default spaces" attribute, and then "compress" those strings by replacing them with the corresponding dictionary index.
- Search for words or phrases by compressing the phrase in the same way, and searching for the compressed string of bytes in the compressed text in exactly the same way you would search for the original string in the original text.
In particular, searching for a single word usually reduces to comparing the 16-bit index in the compressed text, which is faster than searching for that word in the original text, because
- each comparison requires comparing fewer bytes -- 2, rather than however many bytes were in that word, and
- we're doing fewer comparisons, because the compressed file is shorter.
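
The steps above can be sketched in Python roughly as follows. This is only an illustration under assumptions that are not part of the answer: the dictionary is built from the text itself (so the 256 reserved literal codes are never actually emitted), tokens are split with a simple regex, codes are stored big-endian, and all function names are made up for the example. It does not include a decompressor.

```python
import re
import struct

TOKEN = re.compile(r"\w+|\W+")  # a run of word chars, or a run of anything else

def tokenize(text):
    # A single space between two words is implied, so that token is dropped.
    # Any other separator (". ", ", ", "\n") is kept as a special "word".
    return [t for t in TOKEN.findall(text) if t != " "]

def build_dictionary(text):
    # Codes 0..255 are reserved for literal bytes; real entries start at 256.
    codes = {}
    for tok in tokenize(text):
        if tok not in codes:
            codes[tok] = 256 + len(codes)
    if len(codes) > 65536 - 256:
        raise ValueError("too many distinct tokens for 16-bit codes")
    return codes

def compress(text, codes):
    # Each token becomes one 16-bit big-endian dictionary index.
    return b"".join(struct.pack(">H", codes[tok]) for tok in tokenize(text))

def search(compressed, phrase, codes):
    """Return token positions where `phrase` occurs in the compressed text."""
    toks = tokenize(phrase)
    if not toks or any(t not in codes for t in toks):
        return []  # a word missing from the dictionary cannot occur
    needle = compress(phrase, codes)
    hits, start = [], 0
    while True:
        pos = compressed.find(needle, start)
        if pos < 0:
            return hits
        if pos % 2 == 0:  # ignore byte matches that straddle two codes
            hits.append(pos // 2)
        start = pos + 1

text = "the quick brown fox. The quick brown dog, the end"
codes = build_dictionary(text)
packed = compress(text, codes)
print(len(text), len(packed))                 # 49 24
print(search(packed, "quick brown", codes))   # [1, 6]
```

The search itself is an ordinary byte-string search (`bytes.find`) over the compressed buffer; the only extra step is the alignment check. A phrase that begins or ends with punctuation may tokenize differently from the same characters in context, so this sketch is only reliable for phrases made of whole words.
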
Some kinds of regular expressions can be translated into another regular expression that directly finds items in the compressed file (and perhaps also finds a few false positives). Such a search also does fewer comparisons than using the original regular expression on the original text file, because the compressed file is shorter, but typically each regular expression comparison requires more work, so it may or may not be faster than the original regex operating on the original text.

(In principle you could replace the fixed-length 16-bit codes with variable-length Huffman prefix codes, as rwong mentioned -- the resulting compressed file would be smaller, but the software to deal with those files would be a little slower and more complicated.)

For more sophisticated techniques, you might look at
- MG4J: Managing Gigabytes for Java
- "Managing Gigabytes: Compressing and Indexing Documents and Images" by Ian H. Witten, Alistair Moffat, and Timothy C. Bell

时光取名叫无心

2021-02-04 19:41

This is possible, and can be done quite efficiently. There's a lot of exciting research on this topic, which is more formally known as succinct data structures. Some topics I would recommend looking into: wavelet trees, FM-index/RRR, and succinct suffix arrays. You can also efficiently search Huffman-encoded strings, as a number of publications have demonstrated.
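
To give a feel for how an FM-index answers queries, here is a toy Python sketch of its backward search. It is an illustration only, with hypothetical function names: it builds the Burrows-Wheeler transform by sorting rotations and stores plain occurrence tables, so it shows the search step but none of the space savings that wavelet trees or RRR would provide. It also assumes the text never contains the `"\0"` sentinel character.

```python
from collections import Counter

def bwt(text):
    # Burrows-Wheeler transform via sorted rotations (slow; for demo only).
    s = text + "\0"  # sentinel, assumed absent from the text; sorts first
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

def build_fm(text):
    last = bwt(text)
    counts = Counter(last)
    # first[c] = number of characters in the text strictly smaller than c
    first, total = {}, 0
    for c in sorted(counts):
        first[c] = total
        total += counts[c]
    # occ[c][i] = number of occurrences of c in last[:i]
    occ = {c: [0] * (len(last) + 1) for c in counts}
    for i, ch in enumerate(last):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return first, occ, len(last)

def count_occurrences(fm, pattern):
    first, occ, n = fm
    lo, hi = 0, n  # half-open range of sorted rotations that match so far
    for c in reversed(pattern):  # process the pattern right to left
        if c not in first:
            return 0
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo

fm = build_fm("abracadabra")
print(count_occurrences(fm, "abra"))  # 2
```

Each pattern character costs two table lookups, so counting matches takes time proportional to the pattern length rather than the text length.
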

心在旅途

2021-02-04 19:44

I may be completely wrong here, but I don't think there'd be a reliable way to search for a given string without decoding the files. My understanding of compression algorithms is that the bit-stream corresponding to a given string would depend greatly on what comes before the string in the uncompressed file. You may be able to find a given encoding for a particular string in a given file, but I'm pretty sure it wouldn't be consistent between files.

一向

2021-02-04 19:48

Most text files are compressed with one of the LZ family of algorithms, which combine a Dictionary Coder with an Entropy Coder such as Huffman.

Because the Dictionary Coder relies on a continuously-updated "dictionary", its coding result is dependent on the history (all codes in the dictionary that is derived from the input data up to the current symbol), so it is not possible to jump into a certain location and start decoding, without first decoding all of the previous data.

In my opinion, you can just use a zlib stream decoder which returns decompressed data as it goes without waiting for the entire file to be decompressed. This will not save execution time but will save memory.
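
A minimal sketch of that streaming approach, using Python's standard `zlib` module. The function names and the chunk size are my own choices for illustration. It assumes the input is a zlib stream; for gzip data you would pass `wbits=zlib.MAX_WBITS | 16`, and for members of a .zip archive `zipfile.ZipFile.open` already gives a file-like object that decompresses incrementally.

```python
import zlib

def read_chunks(path, size=64 * 1024):
    # Yield the compressed file in fixed-size pieces.
    with open(path, "rb") as f:
        while chunk := f.read(size):
            yield chunk

def stream_contains(compressed_chunks, needle, wbits=zlib.MAX_WBITS):
    """Return True if `needle` (bytes) occurs in the decompressed stream."""
    decoder = zlib.decompressobj(wbits)
    keep = len(needle) - 1  # bytes to carry over so boundary matches are seen
    tail = b""
    for chunk in compressed_chunks:
        data = tail + decoder.decompress(chunk)
        if needle in data:
            return True
        tail = data[-keep:] if keep > 0 else b""
    return needle in tail + decoder.flush()

payload = b"alpha beta " * 1000 + b"needle in a haystack" + b" gamma" * 1000
blob = zlib.compress(payload)
chunks = [blob[i:i + 7] for i in range(0, len(blob), 7)]
print(stream_contains(chunks, b"needle in a haystack"))  # True
```

Only the current decompressed piece plus a short tail is held in memory at any time, and the search can stop as soon as the first match is found.
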

A second suggestion is to do Huffman coding on English words, and forget about the Dictionary Coder part. Each English word gets mapped to a unique prefix-free code.
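
As a rough illustration of word-level Huffman coding, the sketch below assigns each distinct word a prefix-free bit string based on its frequency. The function name is hypothetical, codes are kept as strings of "0" and "1" for readability rather than packed into bytes, and splitting on whitespace is an assumption about what counts as a word.

```python
import heapq
from collections import Counter
from itertools import count

def word_huffman_codes(words):
    """Map each distinct word to a prefix-free string of '0'/'1' characters."""
    freq = Counter(words)
    if not freq:
        return {}
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    tie = count()  # tie-breaker so the heap never has to compare dicts
    heap = [(n, next(tie), {w: ""}) for w, n in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees, prefixing their codes.
        n1, _, left = heapq.heappop(heap)
        n2, _, right = heapq.heappop(heap)
        merged = {w: "0" + c for w, c in left.items()}
        merged.update({w: "1" + c for w, c in right.items()})
        heapq.heappush(heap, (n1 + n2, next(tie), merged))
    return heap[0][2]

words = "to be or not to be that is the question to be".split()
codes = word_huffman_codes(words)
encoded = "".join(codes[w] for w in words)
```

Because no code is a prefix of another, the encoded bit string can be decoded unambiguously from its start, and frequent words such as "to" and "be" never get longer codes than rare ones.
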

Finally, @SHODAN gave the most sensible suggestion, which is to index the files, compress the index, and bundle it with the compressed text files. To do a search, decompress just the index file and look up the words. This is in fact an improvement over doing the Huffman coding on words - once you have found the frequency of words (in order to assign the prefix codes optimally), you have already built the index, so you can keep the index for searching.
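
A small sketch of the index-and-bundle idea, assuming a simple word-to-file-names inverted index serialized as JSON and compressed with `zlib`. The function names, the JSON format, and the lowercase `\w+` tokenization are illustrative choices, not something the suggestion prescribes.

```python
import json
import re
import zlib
from collections import defaultdict

def build_index(docs):
    """docs maps file name -> text; returns word -> sorted list of file names."""
    index = defaultdict(set)
    for name, text in docs.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(name)
    return {w: sorted(names) for w, names in index.items()}

def pack_index(index):
    # The compressed index is what gets stored next to the compressed texts.
    return zlib.compress(json.dumps(index).encode("utf-8"))

def lookup(packed, word):
    # Only the index is decompressed; the text files stay compressed.
    index = json.loads(zlib.decompress(packed).decode("utf-8"))
    return index.get(word.lower(), [])

docs = {"a.txt": "The quick brown fox", "b.txt": "the lazy dog"}
packed = pack_index(build_index(docs))
print(lookup(packed, "the"))  # ['a.txt', 'b.txt']
```

A lookup tells you which files to open; only those files then need to be decompressed to show the matching lines.
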

误落风尘

2021-02-04 19:56

It is unlikely you'll be able to search for uncompressed strings in a compressed file. I guess one of your best options is to index the files somehow. Using Lucene, perhaps?
